
How to Optimize GEMM on CPU (Part 2)

Array Packing

Another important trick is array packing. The idea is to reorder the storage dimensions of the array so that an access pattern that is regular but non-continuous along a certain dimension becomes sequential after flattening.

As shown in the figure in the original tutorial, after blocking the computation we can observe that the array access pattern of B (after flattening) is regular but discontinuous. We expect that after some transformation we can get a continuous access pattern. By reordering a [16][16] array into a [16/4][16][4] array, the access pattern of B will be sequential when we grab the corresponding values from the packed array.
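To make the layout change concrete, here is a minimal standalone NumPy sketch (separate from the TVM code below; the names and sizes simply mirror this tutorial) showing that the packed layout stores each bn-wide column block of B contiguously:

import numpy as np

N, bn = 1024, 32  # matrix width and block width used in this tutorial
B = np.random.rand(N, N).astype("float32")

# packedB[x, y, z] = B[y, x*bn + z]: each bn-wide column block of B
# becomes one contiguous [N, bn] slab in memory.
packedB = np.empty((N // bn, N, bn), dtype="float32")
for x in range(N // bn):
    packedB[x] = B[:, x * bn:(x + 1) * bn]

# Reading packedB[y // bn, k, y % bn] is equivalent to reading B[k, y],
# but consecutive y within one block are now adjacent in memory.
assert packedB[3, 5, 7] == B[5, 3 * bn + 7]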

We have to re-write the algorithm slightly.

packedB = te.compute((N / bn, K, bn), lambda x, y, z: B[y, x * bn + z], name="packedB")

C = te.compute(
    (M, N),
    lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k),
    name="C",
)

s = te.create_schedule(C.op)

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

(k,) = s[C].op.reduce_axis

ko, ki = s[C].split(k, factor=4)

s[C].reorder(xo, yo, ko, xi, ki, yi)

s[C].vectorize(yi)

x, y, z = s[packedB].op.axis

s[packedB].vectorize(z)

s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)

print("Opt4: %f" % evaluator(a, b, c).mean)

Out:

Opt4: 0.105409

Here is the generated IR after array packing.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  attr [packedB: Pointer(float32)] "storage_scope" = "global";
  allocate(packedB, float32x32, [32768]) {
    for (x: int32, 0, 32) "parallel" {
      for (y: int32, 0, 1024) {
        packedB[ramp(((x*32768) + (y*32)), 1, 32)] = (float32x32*)B_2[ramp(((y*1024) + (x*32)), 1, 32)]
      }
    }
    for (x.outer: int32, 0, 32) {
      for (y.outer: int32, 0, 32) {
        for (x.inner.init: int32, 0, 32) {
          C_2[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (x.inner: int32, 0, 32) {
            for (k.inner: int32, 0, 4) {
              C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + (k.inner*32)), 1, 32)]))
            }
          }
        }
      }
    }
  }
}

Write cache for blocks

After blocking, the program writes the result to C block by block, and the access pattern is not sequential. We can therefore use a sequential cache array to hold the block results, writing them to C only once all of a block's results are ready.
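The effect is easier to see in plain NumPy. The following standalone sketch (illustrative only; it reuses the names A, B, C for clarity and shrinks the sizes so the pure-Python loops finish quickly) accumulates each block in a small contiguous buffer and writes it back in one pass:

import numpy as np

M = N = K = 128  # small illustrative sizes; the tutorial uses 1024
bn = 32
A = np.random.rand(M, K).astype("float32")
B = np.random.rand(K, N).astype("float32")
C = np.zeros((M, N), dtype="float32")

for xo in range(M // bn):
    for yo in range(N // bn):
        # Accumulate into a small sequential block-local buffer ...
        CC = np.zeros((bn, bn), dtype="float32")
        for k in range(K):
            CC += np.outer(A[xo * bn:(xo + 1) * bn, k], B[k, yo * bn:(yo + 1) * bn])
        # ... and write the finished block back to C in one go.
        C[xo * bn:(xo + 1) * bn, yo * bn:(yo + 1) * bn] = CC

np.testing.assert_allclose(C, A @ B, rtol=1e-4)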

s = te.create_schedule(C.op)

# Allocate write cache

CC = s.cache_write(C, "global")

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

# Write cache is computed at yo

s[CC].compute_at(s[C], yo)

# New inner axes

xc, yc = s[CC].op.axis

(k,) = s[CC].op.reduce_axis

ko, ki = s[CC].split(k, factor=4)

s[CC].reorder(ko, xc, ki, yc)

s[CC].unroll(ki)

s[CC].vectorize(yc)

x, y, z = s[packedB].op.axis

s[packedB].vectorize(z)

s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)

print("Opt5: %f" % evaluator(a, b, c).mean)

Out:

Opt5: 0.098048

Here is the generated IR after blocking.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  attr [packedB: Pointer(float32)] "storage_scope" = "global";
  allocate(packedB, float32x32, [32768]);
  attr [C.global: Pointer(float32)] "storage_scope" = "global";
  allocate(C.global, float32, [1024]) {
    for (x: int32, 0, 32) "parallel" {
      for (y: int32, 0, 1024) {
        packedB[ramp(((x*32768) + (y*32)), 1, 32)] = (float32x32*)B_2[ramp(((y*1024) + (x*32)), 1, 32)]
      }
    }
    for (x.outer: int32, 0, 32) {
      for (y.outer: int32, 0, 32) {
        for (x.c.init: int32, 0, 32) {
          C.global[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (x.c: int32, 0, 32) {
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[(((x.outer*32768) + (x.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((y.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (x.inner: int32, 0, 32) {
          for (y.inner: int32, 0, 32) {
            C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]
          }
        }
      }
    }
  }
}

Parallel

Furthermore, we can also take advantage of multi-core processors to achieve thread-level parallelization.
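As a side note, the number of worker threads the TVM CPU runtime uses for such parallel loops can typically be controlled with the TVM_NUM_THREADS environment variable; a minimal sketch (the value 4 is only an example, and the variable must be set before the thread pool starts):

import os

# Assumption: TVM's CPU runtime reads TVM_NUM_THREADS when its thread
# pool initializes, so set it before the first parallel kernel runs.
os.environ["TVM_NUM_THREADS"] = "4"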

s = te.create_schedule(C.op)

CC = s.cache_write(C, "global")

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

s[CC].compute_at(s[C], yo)

xc, yc = s[CC].op.axis

(k,) = s[CC].op.reduce_axis

ko, ki = s[CC].split(k, factor=4)

s[CC].reorder(ko, xc, ki, yc)

s[CC].unroll(ki)

s[CC].vectorize(yc)

# parallel

s[C].parallel(xo)

x, y, z = s[packedB].op.axis

s[packedB].vectorize(z)

s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=50)

opt6_time = evaluator(a, b, c).mean

print("Opt6: %f" % opt6_time)

Out:

Opt6: 0.032347

Here is the generated IR after parallelization.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  attr [packedB: Pointer(float32)] "storage_scope" = "global";
  allocate(packedB, float32x32, [32768]) {
    for (x: int32, 0, 32) "parallel" {
      for (y: int32, 0, 1024) {
        packedB[ramp(((x*32768) + (y*32)), 1, 32)] = (float32x32*)B_2[ramp(((y*1024) + (x*32)), 1, 32)]
      }
    }
    for (x.outer: int32, 0, 32) "parallel" {
      attr [C.global: Pointer(float32)] "storage_scope" = "global";
      allocate(C.global, float32, [1024]);
      for (y.outer: int32, 0, 32) {
        for (x.c.init: int32, 0, 32) {
          C.global[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (x.c: int32, 0, 32) {
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[(((x.outer*32768) + (x.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((y.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (x.inner: int32, 0, 32) {
          for (y.inner: int32, 0, 32) {
            C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]
          }
        }
      }
    }
  }
}

Summary

After applying the simple optimizations above, using only 18 lines of code, the generated code achieves 60% of the numpy performance with MKL. Note that the outputs on the web page reflect run times on a non-exclusive Docker container, so they are not reliable. You are highly encouraged to run the tutorial yourself to observe the performance gain achieved by TVM.
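For reference, here is a sketch of how such a comparison could be measured, assuming a_np and b_np hold the same input matrices as NumPy arrays (both names are hypothetical here; the tutorial's first part creates the inputs) and opt6_time is the Opt6 mean measured above:

import timeit

# Hypothetical: a_np and b_np are the NumPy inputs from part one;
# average the MKL-backed matmul over 10 runs and compare with Opt6.
np_time = timeit.timeit(lambda: a_np.dot(b_np), number=10) / 10
print("numpy: %f, Opt6 reaches %.0f%% of numpy" % (np_time, 100 * np_time / opt6_time))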

https://tvm.apache.org/docs/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py
