Use Tensorize to Leverage Hardware Intrinsics
This is an introductory tutorial on how to perform tensorization in TVM.
By using the schedule primitive tensorize, we can replace a unit of computation with the corresponding intrinsic, making it easy to leverage handcrafted micro-kernels and to extend TVM to support new hardware architectures.
The purpose of this tutorial is to show the functionality and usage of tensorize, not to provide an efficient solution.
from __future__ import absolute_import, print_function
import tvm
from tvm import te
import numpy as np
Define Matrix Multiplication
Take matrix multiplication as our example. Matmul first multiplies the corresponding elements of two matrices, then accumulates across a certain axis. The following lines describe the computation of A * B^T in TVM.
N, M, L = 1024, 512, 64
A = te.placeholder((N, L), name="A")
B = te.placeholder((M, L), name="B")
k = te.reduce_axis((0, L), name="k")
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[j, k], axis=k), name="C")
s = te.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
             B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, 1024) {
    for (j: int32, 0, 512) {
      C_2[((i*512) + j)] = 0f32
      for (k: int32, 0, 64) {
        C_2[((i*512) + j)] = ((float32*)C_2[((i*512) + j)] + ((float32*)A_2[((i*64) + k)]*(float32*)B_2[((j*64) + k)]))
      }
    }
  }
}
Schedule the Matmul
Assume we have an accelerator that supports matrix-vector multiplication (GEMV) as a hardware primitive, which can take an arbitrarily sized reduce axis, while the other axis must be no larger than 16. Thus we break down the matmul loops to make the innermost loops a (16x64) GEMV.
factor = 16
x, y = C.op.axis
(z,) = C.op.reduce_axis
yo, yi = s[C].split(y, factor=factor)
s[C].reorder(x, yo, yi, z)
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
             B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, 1024) {
    for (j.outer: int32, 0, 32) {
      for (j.inner: int32, 0, 16) {
        C_2[(((i*512) + (j.outer*16)) + j.inner)] = 0f32
        for (k: int32, 0, 64) {
          C_2[(((i*512) + (j.outer*16)) + j.inner)] = ((float32*)C_2[(((i*512) + (j.outer*16)) + j.inner)] + ((float32*)A_2[((i*64) + k)]*(float32*)B_2[(((j.outer*1024) + (j.inner*64)) + k)]))
        }
      }
    }
  }
}
As shown in the IR printed above, the inner loop j.inner together with k forms a computation of GEMV - within the innermost two loops, the index i is fixed, so the access to matrix A varies only with k, which makes the access pattern of A a "vector". We can therefore tensorize over j.inner with the GEMV instruction of our hypothetical hardware, as illustrated by the sketch below.
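To make the tile computation concrete, here is a minimal numpy sketch (our own illustration, not part of the schedule or the TVM API; the helper name gemv_tile is hypothetical) of what the two innermost loops compute for a fixed i and j.outer:

import numpy as np

def gemv_tile(a_vec, b_tile):
    # a_vec: one row of A, shape (64,); b_tile: a 16x64 slab of B.
    # The innermost j.inner/k loops compute exactly this product and
    # write the result into C[i, j.outer*16 : (j.outer+1)*16].
    assert a_vec.shape == (64,) and b_tile.shape == (16, 64)
    return b_tile @ a_vec  # shape (16,)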
Define GEMV Tensorization Intrinsic
Before scheduling the tensorization, let us first define the intrinsic function for GEMV. It consists of two parts: the first is the compute definition of GEMV, which TVM uses to match the computing pattern in the original matmul schedule; the second specifies how to execute GEMV on the device, which is done in intrin_func below.
def intrin_gemv(m, l):
    a = te.placeholder((l,), name="a")
    b = te.placeholder((m, l), name="b")
    k = te.reduce_axis((0, l), name="k")
    c = te.compute((m,), lambda i: te.sum(a[k] * b[i, k], axis=k), name="c")
    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="A", offset_factor=1, strides=[1])
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="B", offset_factor=1, strides=[te.var("s1"), 1])
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="C", offset_factor=1, strides=[1])

    def intrin_func(ins, outs):
        ib = tvm.tir.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        ib.emit(
            tvm.tir.call_extern(
                "int32",
                "gemv_update",
                cc.access_ptr("w"),
                aa.access_ptr("r"),
                bb.access_ptr("r"),
                m,
                l,
                bb.strides[0],
            )
        )
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})
Here te.decl_tensor_intrin declares how to execute the computation c.op. Our implementation simply takes the inputs and outputs, converts them to pointers, and emits an external function call. Note that tensorization requires the user to specify offset_factor; with this information about the alignment between the start address of the original data structure and the offset passed to tensorize, TVM has the chance to optimize with vectorized loading. For simplification, we set the factor to 1.
Buffers are also declared for the inputs and outputs. Although this is not required, we benefit from the extra information the buffers provide. For example, we pass bb.strides[0] as an argument to the external function gemv_update. For now bb.strides[0] == l, but later we will see how it can differ under more complicated schedules.
Note that we use te.var("s1") as the first stride of B. If the strides can be inferred - in this case TVM determines that tensor B is compact, so the strides are [L, 1] - such a placeholder lets TVM automatically bind the inferred value for us.
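As a side note, if we did not want the placeholder, a minimal alternative sketch (assuming we are still inside intrin_gemv, where b and l are in scope) would declare concrete strides instead; this only works here because B is compact, whereas te.var("s1") keeps the intrinsic reusable when the layout changes under tiling:

Bb = tvm.tir.decl_buffer(
    b.shape, b.dtype, name="B", offset_factor=1,
    strides=[l, 1],  # concrete strides; assumes B stays compact
)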
gemv = intrin_gemv(factor, L)
s[C].tensorize(yi, gemv)
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
             B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, 1024) {
    for (j.outer: int32, 0, 32) {
      @tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
    }
  }
}
By tensorizing over yi, the innermost two loops are now replaced by the intrinsic function we defined earlier. In order to build and run the module, let us define the external function gemv_update. It is a naive implementation of GEMV, just for demonstration.
def gemv_impl():
    cc_code = """
      extern "C" int gemv_update(float *cc, float *aa, float *bb, int m, int l, int stride) {
        for (int i = 0; i < m; ++i) {
          for (int j = 0; j < l; ++j) {
            cc[i] += aa[j] * bb[i * stride + j];
          }
        }
        return 0;
      }
    """
    from tvm.contrib import utils, clang

    temp = utils.tempdir()
    ll_path = temp.relpath("temp.ll")
    # Create LLVM ir from c source code
    ll_code = clang.create_llvm(cc_code, output=ll_path)
    return ll_code
Now we leverage the pragma attribute import_llvm to import the LLVM inline assembly. The import needs to happen before the tensorized GEMV is executed.
s[C].pragma(x, "import_llvm", gemv_impl())
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
             B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpinr5hwkd/input0.cc'
source_filename = "/tmp/tmpinr5hwkd/input0.cc"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {
%7 = alloca float*, align 8
%8 = alloca float*, align 8
%9 = alloca float*, align 8
%10 = alloca i32, align 4
%11 = alloca i32, align 4
%12 = alloca i32, align 4
%13 = alloca i32, align 4
%14 = alloca i32, align 4
store float* %0, float** %7, align 8
store float* %1, float** %8, align 8
store float* %2, float** %9, align 8
store i32 %3, i32* %10, align 4
store i32 %4, i32* %11, align 4
store i32 %5, i32* %12, align 4
store i32 0, i32* %13, align 4
br label %15
15: ; preds = %50, %6
%16 = load i32, i32* %13, align 4
%17 = load i32, i32* %10, align 4
%18 = icmp slt i32 %16, %17
br i1 %18, label %19, label %53
19: ; preds = %15
store i32 0, i32* %14, align 4
br label %20
20: ; preds = %46, %19
%21 = load i32, i32* %14, align 4
%22 = load i32, i32* %11, align 4
%23 = icmp slt i32 %21, %22
br i1 %23, label %24, label %49
24: ; preds = %20
%25 = load float*, float** %8, align 8
%26 = load i32, i32* %14, align 4
%27 = sext i32 %26 to i64
%28 = getelementptr inbounds float, float* %25, i64 %27
%29 = load float, float* %28, align 4
%30 = load float*, float** %9, align 8
%31 = load i32, i32* %13, align 4
%32 = load i32, i32* %12, align 4
%33 = mul nsw i32 %31, %32
%34 = load i32, i32* %14, align 4
%35 = add nsw i32 %33, %34
%36 = sext i32 %35 to i64
%37 = getelementptr inbounds float, float* %30, i64 %36
%38 = load float, float* %37, align 4
%39 = fmul float %29, %38
%40 = load float*, float** %7, align 8
%41 = load i32, i32* %13, align 4
%42 = sext i32 %41 to i64
%43 = getelementptr inbounds float, float* %40, i64 %42
%44 = load float, float* %43, align 4
%45 = fadd float %44, %39
store float %45, float* %43, align 4
br label %46
46: ; preds = %24
%47 = load i32, i32* %14, align 4
%48 = add nsw i32 %47, 1
store i32 %48, i32* %14, align 4
br label %20
49: ; preds = %20
br label %50
50: ; preds = %49
%51 = load i32, i32* %13, align 4
%52 = add nsw i32 %51, 1
store i32 %52, i32* %13, align 4
br label %15
53: ; preds = %15
ret i32 0
}
attributes #0 = { noinline nounwind optnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.module.flags = !{!0}
!llvm.ident = !{!1}
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 9.0.1-+20191211110317+c1a0a213378-1exp120191211221711.104 "}
";
  for (i, 0, 1024) {
    for (j.outer: int32, 0, 32) {
      @tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
    }
  }
}
Finally, we compare the tensorized version against the result produced by numpy.dot, to make sure our implementation is correct.
func = tvm.build(s, [A, B, C], target="llvm", name="gemv")
from tvm.topi.utils import get_const_tuple
dtype = A.dtype
ctx = tvm.context("cpu", 0)
a = np.random.uniform(size=get_const_tuple(A.shape)).astype(dtype)
b = np.random.uniform(size=get_const_tuple(B.shape)).astype(dtype)
c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=dtype), ctx)
func(tvm.nd.array(a, ctx), tvm.nd.array(b, ctx), c)
tvm.testing.assert_allclose(c.asnumpy(), np.dot(a, b.T), rtol=1e-3)
Reduce-update for Tensorize
You have now learned the basic notion of tensorize; let us move one step forward to a more complicated case.
Assume our accelerator can only multiply a vector by a square matrix, where the vector size must be no larger than 16. Given this hardware constraint, we need to split the reduce axis as follows.
zo, zi = s[C].split(z, factor=factor)
s[C].reorder(x, yo, zo, yi, zi)
However, since the tensor intrinsic now only covers a part of the reduce axis, instead of using one "body" function, TVM requires a reduce_reset function, which is invoked before the reduce for-loop, and a reduce_update function, which defines the "update" computing strategy.
def gemv_impl():
    cc_code = """
      extern "C" int gemv_update(float *cc, float *aa, float *bb, int m, int l, int stride) {
        for (int i = 0; i < m; ++i) {
          for (int j = 0; j < l; ++j) {
            cc[i] += aa[j] * bb[i * stride + j];
          }
        }
        return 0;
      }
      extern "C" int gemv_reset(float *cc, int m) {
        for (int i = 0; i < m; ++i) {
          cc[i] = 0.0;
        }
        return 0;
      }
    """
    from tvm.contrib import utils, clang

    temp = utils.tempdir()
    ll_path = temp.relpath("temp.ll")
    # Create LLVM ir from c source code
    ll_code = clang.create_llvm(cc_code, output=ll_path)
    return ll_code
def intrin_gemv(m, l):
    a = te.placeholder((l,), name="a")
    b = te.placeholder((m, l), name="b")
    k = te.reduce_axis((0, l), name="k")
    c = te.compute((m,), lambda i: te.sum(a[k] * b[i, k], axis=k), name="c")
    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="A", offset_factor=1, strides=[1])
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="B", offset_factor=1, strides=[te.var("s1"), 1])
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="C", offset_factor=1, strides=[1])

    def intrin_func(ins, outs):
        aa, bb = ins
        cc = outs[0]

        def _body():
            ib = tvm.tir.ir_builder.create()
            ib.emit(
                tvm.tir.call_extern(
                    "int32",
                    "gemv_update",
                    cc.access_ptr("w"),
                    aa.access_ptr("r"),
                    bb.access_ptr("r"),
                    m,
                    l,
                    bb.strides[0],
                )
            )
            return ib.get()

        def _reduce_reset():
            ib = tvm.tir.ir_builder.create()
            ib.emit(tvm.tir.call_extern("int32", "gemv_reset", cc.access_ptr("w"), m))
            return ib.get()

        def _reduce_update():
            return _body()

        return _body(), _reduce_reset(), _reduce_update()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})
Note that intrin_func now returns a triplet: (body, reduce_reset, reduce_update). If the tensorization includes all the reduce axes, the function body() will be invoked; otherwise reduce_reset() and reduce_update() together will be used. In our example body() and reduce_update() share the same implementation, while in other cases the hardware may have different instructions for these two functions. Moreover, we can see that bb.strides[0] now differs from l because of the tiling. A numpy mirror of this reset/update structure is sketched below.
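To see how the triplet maps onto the generated loop nest, here is a small, self-contained numpy mirror of the reset/update schedule; reference_tiled_gemv is a hypothetical helper of ours, not part of TVM:

import numpy as np

def reference_tiled_gemv(A, B, factor=16):
    # Mirrors the schedule: reduce_reset zeroes a 16-wide accumulator,
    # then each z.outer step performs one 16x16 gemv_update.
    N, L = A.shape
    M, _ = B.shape
    C = np.empty((N, M), dtype=A.dtype)
    for i in range(N):
        for jo in range(M // factor):
            acc = np.zeros(factor, dtype=A.dtype)  # reduce_reset
            for zo in range(L // factor):
                a_tile = A[i, zo * factor : (zo + 1) * factor]
                b_tile = B[jo * factor : (jo + 1) * factor,
                           zo * factor : (zo + 1) * factor]
                acc += b_tile @ a_tile  # reduce_update (gemv_update)
            C[i, jo * factor : (jo + 1) * factor] = acc
    return C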
We can now tensorize the squared GEMV, build, and check the results.
gemv = intrin_gemv(factor, factor)
s[C].tensorize(yi, gemv)
s[C].pragma(yo, "import_llvm", gemv_impl())
func = tvm.build(s, [A, B, C], target="llvm", name="gemv")
a = np.random.uniform(size=get_const_tuple(A.shape)).astype(dtype)
b = np.random.uniform(size=get_const_tuple(B.shape)).astype(dtype)
c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=dtype), ctx)
func(tvm.nd.array(a, ctx), tvm.nd.array(b, ctx), c)
tvm.testing.assert_allclose(c.asnumpy(), np.dot(a, b.T), rtol=1e-3)
Summary
This tutorial demonstrates the usage of tensorize intrinsics in TVM. Tensorize provides users a way to obtain fully optimized schedules via micro-kernels. For example, INT8 quantization on Intel CPUs uses tensorization to invoke AVX instructions directly. It also enables TVM to compile to ASICs - see VTA: Deep Learning Accelerator Stack for details. We also demonstrated the use of inline assembly importing, which helps users easily inject asm into the schedule.