How to optimize GEMM on CPU
(TL;DR) TVM provides abstract interfaces that allow users to describe an algorithm and the algorithm's implementation organization (the so-called schedule) separately. Typically, writing the algorithm directly in a high-performance schedule breaks its readability and modularity, and trying out various seemingly promising schedules is time-consuming. With the help of TVM, we can try these schedules efficiently and improve performance.
This tutorial demonstrates how to use TVM to optimize matrix multiplication and, by simply adding 18 extra lines of code, achieve roughly 200 times the speed of the baseline.
There are two important optimizations for dense computation applications executed on a CPU:
- Increase the cache hit rate of memory access. Both complex numerical computation and hot-spot memory access can be accelerated by a high cache hit rate. This requires transforming the original memory access pattern into one that fits the cache policy (see the small access-pattern sketch after this list).
- SIMD (single instruction multiple data), also known as the vector processing unit. Each step processes a small batch of data rather than a single element. This requires transforming the data access pattern in the loop body into a uniform pattern so that the LLVM backend can lower it to SIMD.
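As a rough illustration of the first point (this sketch is our addition, not part of the original tutorial), the snippet below compares a row-major traversal of a C-ordered array with a column-major one; the latter strides through memory and wastes most of each cache line.

import timeit
import numpy

X = numpy.random.rand(1024, 1024).astype("float32")  # C-ordered: rows are contiguous

def row_major_sum(mat):
    # Cache friendly: consecutive elements of a row share cache lines.
    return sum(mat[i, :].sum() for i in range(mat.shape[0]))

def col_major_sum(mat):
    # Cache unfriendly: elements of a column are 4KB apart in memory.
    return sum(mat[:, j].sum() for j in range(mat.shape[1]))

print("row-major:", timeit.timeit(lambda: row_major_sum(X), number=10))
print("col-major:", timeit.timeit(lambda: col_major_sum(X), number=10))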
Actually, all the methods used here are a subset of the tricks mentioned in the repo. Some of them have been applied automatically by the TVM abstraction, while some of them cannot be simply applied due to TVM constraints.
All the experiment results below were obtained on a 15-inch MacBook equipped with an Intel i7-4770HQ CPU. The cache line size should be 64 bytes for all x86 CPUs.
Preparation and Baseline
We will demonstrate how to use TVM to optimize matrix multiplication. Before the actual demo, we first define these variables, and then write a baseline implementation, the simplest way to write a matrix multiplication in TVM.
import tvm
import tvm.testing
from tvm import te
import numpy
import timeit
# The size of the matrix
# (M, K) x (K, N)
# You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL.
M = 1024
K = 1024
N = 1024
# The default tensor type in tvm
dtype = "float32"
# using Intel AVX2 (Advanced Vector Extensions) ISA for SIMD
# To get the best performance, please change the following line
# to llvm -mcpu=core-avx2, or specific type of CPU you use
target = "llvm"
dev = tvm.device(target, 0)
# Random generated tensor for testing
a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), dev)
b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), dev)
np_repeat = 100
np_running_time = timeit.timeit(
    setup="import numpy\n"
    "M = " + str(M) + "\n"
    "K = " + str(K) + "\n"
    "N = " + str(N) + "\n"
    'dtype = "float32"\n'
    "a = numpy.random.rand(M, K).astype(dtype)\n"
    "b = numpy.random.rand(K, N).astype(dtype)\n",
    stmt="answer = numpy.dot(a, b)",
    number=np_repeat,
)
print("Numpy running time: %f" % (np_running_time / np_repeat))
answer = numpy.dot(a.numpy(), b.numpy())
# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")
# Default schedule
s = te.create_schedule(C.op)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("Baseline: %f" % evaluator(a, b, c).mean)
Output:
Numpy running time: 0.009345
Baseline: 3.291115
In TVM, we can always inspect the lower-level IR to debug or optimize the schedule. Here is the generated IR using the baseline schedule.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m: int32, 0, 1024) {
    for (n: int32, 0, 1024) {
      C_2[((m*1024) + n)] = 0f32
      for (k: int32, 0, 1024) {
        C_2[((m*1024) + n)] = ((float32*)C_2[((m*1024) + n)] + ((float32*)A_2[((m*1024) + k)]*(float32*)B_2[((k*1024) + n)]))
      }
    }
  }
}
Blocking
An important trick to improve the cache hit rate is blocking: data is computed block by block, so that memory access inside a block is a small neighborhood with high memory locality. Here we choose 32 as the blocking factor, so one block fills 32 * 32 * sizeof(float), i.e. 4KB, of the 32KB L1 data cache.
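For intuition only (our sketch, not TVM output), this is roughly the loop nest that the tiling and reordering below aim to produce: mo/no walk over 32x32 blocks of C, and the reduction axis is split into ko/ki.

# Illustrative pure-Python loop nest for reorder(mo, no, ko, ki, mi, ni);
# far too slow to actually run at 1024x1024, it only shows the structure.
def blocked_matmul(A, B, C, bn=32, kfactor=4):
    M, K = A.shape
    _, N = B.shape
    for mo in range(M // bn):
        for no in range(N // bn):
            for ko in range(K // kfactor):
                for ki in range(kfactor):
                    k = ko * kfactor + ki
                    for mi in range(bn):
                        for ni in range(bn):
                            C[mo * bn + mi, no * bn + ni] += A[mo * bn + mi, k] * B[k, no * bn + ni]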
bn = 32
kfactor = 4
s = te.create_schedule(C.op)
# Blocking by loop tiling
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
# Hoist reduction domain outside the blocking loop
s[C].reorder(mo, no, ko, ki, mi, ni)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
# By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops,
# we can see big speedup compared with the baseline.
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt1: %f" % evaluator(a, b, c).mean)
Output:
Opt1: 0.310688
Here is the generated IR after blocking.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m.outer: int32, 0, 32) {
    for (n.outer: int32, 0, 32) {
      for (m.inner.init: int32, 0, 32) {
        for (n.inner.init: int32, 0, 32) {
          C_2[((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)) + n.inner.init)] = 0f32
        }
      }
      for (k.outer: int32, 0, 256) {
        for (k.inner: int32, 0, 4) {
          for (m.inner: int32, 0, 32) {
            for (n.inner: int32, 0, 32) {
              C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = ((float32*)C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] + ((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)]*(float32*)B_2[((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)) + n.inner)]))
            }
          }
        }
      }
    }
  }
}
Vectorization
Another important trick is vectorization. When the memory access pattern is uniform, the compiler can detect this pattern and pass the continuous memory to the vector processor. In TVM, we can use the vectorize interface to hint the compiler about this pattern, which brings a large speedup.
Here we choose to vectorize the inner-loop row data, which is cache friendly.
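As an analogy (our addition, not the tutorial's code), vectorizing ni means the scalar inner loop over 32 columns of a block becomes a single operation on a contiguous 32-element slice, which LLVM can lower to SIMD instructions:

# Before vectorization (scalar):
#   for ni in range(bn): C[m, no*bn + ni] += A[m, k] * B[k, no*bn + ni]
# After vectorization: one update of a contiguous 32-element slice.
def update_row_block(C, A, B, m, k, no, bn=32):
    C[m, no * bn:(no + 1) * bn] += A[m, k] * B[k, no * bn:(no + 1) * bn]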
s = te.create_schedule(C.op)
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
s[C].reorder(mo, no, ko, ki, mi, ni)
# Vectorization
s[C].vectorize(ni)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt2: %f" % evaluator(a, b, c).mean)
Output:
Opt2: 0.341067
Here is the generated IR after vectorization.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m.outer: int32, 0, 32) {
    for (n.outer: int32, 0, 32) {
      for (m.inner.init: int32, 0, 32) {
        C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
      }
      for (k.outer: int32, 0, 256) {
        for (k.inner: int32, 0, 4) {
          for (m.inner: int32, 0, 32) {
            C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)), 1, 32)]))
          }
        }
      }
    }
  }
}
Loop Permutation
Looking at the IR above, the inner-loop row data of both B and C is vectorized. Next, look at the access pattern of A. In the current schedule, A is accessed column by column, which is not cache friendly. If we change the nested loop order of ki and the inner axis mi, the access pattern of the A matrix becomes more cache friendly.
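To make this concrete (an illustrative plain-Python sketch, not TVM output), here is the permuted nest: for a fixed mi, the ki loop now walks along a row of A, so consecutive iterations read adjacent elements of A instead of jumping a full row each time.

# Loop order (mo, no, ko, mi, ki, ni); the ni loop is written as a slice
# update to mirror the vectorized inner axis.
def permuted_matmul(A, B, C, bn=32, kfactor=4):
    M, K = A.shape
    _, N = B.shape
    for mo in range(M // bn):
        for no in range(N // bn):
            for ko in range(K // kfactor):
                for mi in range(bn):
                    m = mo * bn + mi
                    for ki in range(kfactor):
                        k = ko * kfactor + ki  # A[m, k] moves along a row of A
                        C[m, no * bn:(no + 1) * bn] += A[m, k] * B[k, no * bn:(no + 1) * bn]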
s = te.create_schedule(C.op)
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
# re-ordering
s[C].reorder(mo, no, ko, mi, ki, ni)
s[C].vectorize(ni)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt3: %f" % evaluator(a, b, c).mean)
Output:
Opt3: 0.111449
Here is the generated IR after loop permutation.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m.outer: int32, 0, 32) {
    for (n.outer: int32, 0, 32) {
      for (m.inner.init: int32, 0, 32) {
        C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
      }
      for (k.outer: int32, 0, 256) {
        for (m.inner: int32, 0, 32) {
          for (k.inner: int32, 0, 4) {
            C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)), 1, 32)]))
          }
        }
      }
    }
  }
}
Array Packing
Another important trick is array packing: reorder the storage of a multi-dimensional array so that, after it is flattened and stored in one-dimensional memory, it is accessed sequentially.
We can use array packing to address the access pattern of B. Observe the flattened array access pattern of B: when iterating over the K dimension, it is not sequential. We can reorder B with dimensions [K][N] so that it has dimensions [N/bn][K][bn], where bn is the blocking factor and also the vector size of B in the inner loop. This reordering splits N into two dimensions, bigN (N/bn) and littleN (bn); the new dimensions [N/bn][K][bn] match B's indexing from outer to inner loops (no, ko, ki, ni), which after flattening yields a sequential access pattern of B.
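A small NumPy check (our illustration, with made-up variable names) shows the same packing: packedB[bigN, k, littleN] holds B[k, bigN * bn + littleN], so scanning (ko, ki, ni) for a fixed n.outer is sequential in the flattened buffer.

import numpy
bn_demo = 32
B_demo = numpy.random.rand(1024, 1024).astype("float32")
K_demo, N_demo = B_demo.shape
# [K][N] -> [N/bn][K][bn]
packedB_demo = B_demo.reshape(K_demo, N_demo // bn_demo, bn_demo).transpose(1, 0, 2).copy()
# packedB[bigN, k, littleN] == B[k, bigN*bn + littleN]
assert numpy.allclose(packedB_demo[3, 5, :], B_demo[5, 3 * bn_demo:4 * bn_demo])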
# We have to re-write the algorithm slightly.
packedB = te.compute(
    (N / bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name="packedB"
)
C = te.compute(
    (M, N),
    lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),
    name="C",
)
s = te.create_schedule(C.op)
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
s[C].reorder(mo, no, ko, mi, ki, ni)
s[C].vectorize(ni)
bigN, _, littleN = s[packedB].op.axis
s[packedB].vectorize(littleN)
s[packedB].parallel(bigN)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt4: %f" % evaluator(a, b, c).mean)
Output:
Opt4: 0.217310
Here is the generated IR after array packing.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (m.outer: int32, 0, 32) {
      for (n.outer: int32, 0, 32) {
        for (m.inner.init: int32, 0, 32) {
          C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (m.inner: int32, 0, 32) {
            for (k.inner: int32, 0, 4) {
              C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + (k.inner*32)), 1, 32)]))
            }
          }
        }
      }
    }
  }
}
Write cache for blocks
After blocking, the program writes results to C block by block, and that access pattern is not sequential. We can therefore use a sequential cache array to hold the block results and write them to C once all block results are ready.
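In plain-Python terms (a sketch only, with made-up names), cache_write gives each 32x32 block a small contiguous accumulator that is filled first and then copied to C in one pass, so the strided writes to C happen only once per block:

import numpy
def block_with_write_cache(A, B, C, mo, no, bn=32):
    K = A.shape[1]
    # Counterpart of C.global in the IR below: a contiguous per-block buffer.
    C_block = numpy.zeros((bn, bn), dtype=C.dtype)
    for k in range(K):
        C_block += numpy.outer(A[mo * bn:(mo + 1) * bn, k], B[k, no * bn:(no + 1) * bn])
    # Write the finished block back to C once.
    C[mo * bn:(mo + 1) * bn, no * bn:(no + 1) * bn] = C_block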
s = te.create_schedule(C.op)
# Allocate write cache
CC = s.cache_write(C, "global")
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
# Write cache is computed at no
s[CC].compute_at(s[C], no)
# New inner axes
mc, nc = s[CC].op.axis
(kaxis,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(kaxis, factor=kfactor)
s[CC].reorder(ko, mc, ki, nc)
s[CC].vectorize(nc)
# TODO: Add separate optimization step to discuss loop unrolling
# unrolling is a loop optimization strategy which can reduce branch
# prediction failures and increase the chance of concurrent execution
# unroll kfactor loops
s[CC].unroll(ki)
bigN, _, littleN = s[packedB].op.axis
s[packedB].vectorize(littleN)
s[packedB].parallel(bigN)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt5: %f" % evaluator(a, b, c).mean)
Output:
Opt5: 0.215912
Here is the generated IR after introducing the write cache.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global;
  allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (m.outer: int32, 0, 32) {
      for (n.outer: int32, 0, 32) {
        for (m.c.init: int32, 0, 32) {
          C.global[ramp((m.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (m.c: int32, 0, 32) {
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[(((m.outer*32768) + (m.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((n.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (m.inner: int32, 0, 32) {
          for (n.inner: int32, 0, 32) {
            C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = (float32*)C.global[((m.inner*32) + n.inner)]
          }
        }
      }
    }
  }
}
Parallelization
Furthermore, we can exploit multi-core processors for thread-level parallelization.
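s[C].parallel(mo) below spreads the outer row-block loop across TVM's runtime thread pool. As a side note (our addition; behavior may differ between TVM versions), the pool size can usually be capped with the TVM_NUM_THREADS environment variable, set before the thread pool is first used:

import os
# Hypothetical example: limit the TVM runtime to 4 worker threads.
os.environ["TVM_NUM_THREADS"] = "4"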
s = te.create_schedule(C.op)
CC = s.cache_write(C, "global")
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
s[CC].compute_at(s[C], no)
mc, nc = s[CC].op.axis
(kaxis,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(kaxis, factor=kfactor)
s[CC].reorder(ko, mc, ki, nc)
s[CC].vectorize(nc)
s[CC].unroll(ki)
# parallel
s[C].parallel(mo)
bigN, _, littleN = s[packedB].op.axis
s[packedB].vectorize(littleN)
s[packedB].parallel(bigN)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=50)
opt6_time = evaluator(a, b, c).mean
print("Opt6: %f" % opt6_time)
Output:
Opt6: 0.066558
Here is the generated IR after parallelization.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (m.outer: int32, 0, 32) "parallel" {
      allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global;
      for (n.outer: int32, 0, 32) {
        for (m.c.init: int32, 0, 32) {
          C.global[ramp((m.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (m.c: int32, 0, 32) {
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[(((m.outer*32768) + (m.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((n.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (m.inner: int32, 0, 32) {
          for (n.inner: int32, 0, 32) {
            C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = (float32*)C.global[((m.inner*32) + n.inner)]
          }
        }
      }
    }
  }
}
Summary
After applying the simple optimizations above, using only 18 lines of code, the generated code reaches about 60% of the performance of numpy with MKL. Note that the outputs above reflect running times on a non-exclusive Docker container and are therefore unreliable; it is highly recommended to run the tutorial yourself and observe the performance gain achieved by TVM.
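If you run the tutorial yourself, one way to put the numbers in perspective (our addition, reusing the variables defined above) is to convert the measured times into GFLOPS and a speedup over numpy:

# A GEMM of this shape performs 2 * M * N * K floating point operations.
gflops = 2.0 * M * N * K / opt6_time / 1e9
speedup_vs_numpy = (np_running_time / np_repeat) / opt6_time
print("Opt6: %.2f GFLOPS, %.2fx of numpy in this run" % (gflops, speedup_vs_numpy))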
Reference:
https://tvm.apache.org/docs/tutorials/optimize/opt_gemm.html