
How to optimize GEMM on CPU (Part 1)

(TL;DR) TVM provides abstract interfaces that let users describe an algorithm and the algorithm's implementation strategy (the so-called schedule) separately. Writing the algorithm directly in a high-performance form usually destroys its readability and modularity, and trying out every seemingly promising schedule by hand is time-consuming. With TVM's help, these schedules can be tried out efficiently to improve performance.

This article demonstrates how to use TVM to optimize square matrix multiplication and, by adding just 18 extra lines of code, achieve a speedup of 200x over the baseline.

There are two important optimizations for compute-intensive applications running on a CPU:

Increase the cache hit rate of memory accesses. Both complex numerical computation and hot-spot memory accesses can be accelerated by a high cache hit rate. This requires us to transform the original memory access pattern into one that fits the cache policy (a small demonstration of this effect follows after the next item).

SIMD (single instruction, multiple data), also known as the vector processing unit. Each step processes a small batch of data rather than a single value. This requires us to transform the data access pattern in the loop body into a uniform pattern so that the LLVM backend can lower it to SIMD instructions.
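To build some intuition for the cache point before diving into TVM, here is a small self-contained NumPy sketch (illustration only, not part of the tutorial code). It times a row-by-row sweep of a matrix against a column-by-column sweep of the same data: on a row-major array the row sweep reads contiguous memory and reuses each cache line fully, while the column sweep touches a new cache line on almost every access.

import timeit
import numpy

n = 4096
X = numpy.random.rand(n, n).astype("float32")

def sum_rows():
    # Each X[i, :] is a contiguous 16 KB block, so cache lines are fully reused.
    total = 0.0
    for i in range(n):
        total += X[i, :].sum()
    return total

def sum_cols():
    # Consecutive elements of X[:, j] are 16 KB apart in memory, so nearly
    # every access lands on a different cache line.
    total = 0.0
    for j in range(n):
        total += X[:, j].sum()
    return total

print("row-wise   : %.3f s" % timeit.timeit(sum_rows, number=10))
print("column-wise: %.3f s" % timeit.timeit(sum_cols, number=10))

The exact ratio depends on the machine, but the column-wise sweep is typically several times slower; this is exactly the kind of access pattern the blocking and loop-permutation steps below are designed to avoid.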

In practice, all of the methods used here are a subset of the tricks described in this article. Some of them are applied automatically by TVM's abstractions, while others cannot be applied so simply because of TVM's constraints.

All of the experimental results mentioned below were obtained on a 2015 15-inch MacBook with an Intel i7-4770HQ CPU. The cache line size should be 64 bytes for all x86 CPUs.
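If you want to verify the cache line size on your own machine, here is a quick best-effort sketch (an assumption-laden helper: the sysctl name and sysfs path are OS-specific and only common macOS/Linux setups are covered):

import platform
import subprocess

system = platform.system()
if system == "Darwin":
    # macOS exposes the cache line size through sysctl.
    print(subprocess.check_output(["sysctl", "-n", "hw.cachelinesize"]).decode().strip(), "bytes")
elif system == "Linux":
    # Most Linux kernels expose it through sysfs.
    with open("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size") as f:
        print(f.read().strip(), "bytes")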

Preparation and Baseline

This article demonstrates how to use TVM to optimize matrix multiplication. Before the actual demonstration, we first define these variables and then write a baseline implementation, which is the simplest way to write a matrix multiplication in TVM.

import tvm
import tvm.testing
from tvm import te
import numpy
import timeit

# The size of the matrix
# (M, K) x (K, N)
# You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL.
M = 1024
K = 1024
N = 1024

# The default tensor type in tvm
dtype = "float32"

# using Intel AVX2 (Advanced Vector Extensions) ISA for SIMD
# To get the best performance, please change the following line
# to "llvm -mcpu=core-avx2", or the specific type of CPU you use
target = "llvm"
ctx = tvm.context(target, 0)

# Random generated tensor for testing
a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), ctx)
b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), ctx)

np_repeat = 100
np_runing_time = timeit.timeit(
    setup="import numpy\n"
    "M = " + str(M) + "\n"
    "K = " + str(K) + "\n"
    "N = " + str(N) + "\n"
    'dtype = "float32"\n'
    "a = numpy.random.rand(M, K).astype(dtype)\n"
    "b = numpy.random.rand(K, N).astype(dtype)\n",
    stmt="answer = numpy.dot(a, b)",
    number=np_repeat,
)
print("Numpy running time: %f" % (np_runing_time / np_repeat))

answer = numpy.dot(a.asnumpy(), b.asnumpy())

# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")

# Default schedule
s = te.create_schedule(C.op)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=1)
print("Baseline: %f" % evaluator(a, b, c).mean)

Out:

Numpy running time: 0.006963

Baseline: 3.516655

In TVM, we can always inspect the lower-level IR to debug or optimize our schedule. Here is the generated IR using our baseline schedule.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (x: int32, 0, 1024) {
    for (y: int32, 0, 1024) {
      C_2[((x*1024) + y)] = 0f32
      for (k: int32, 0, 1024) {
        C_2[((x*1024) + y)] = ((float32*)C_2[((x*1024) + y)] + ((float32*)A_2[((x*1024) + k)]*(float32*)B_2[((k*1024) + y)]))
      }
    }
  }
}

Blocking

An important trick to improve the cache hit rate is blocking: the data is computed block by block, so the memory accesses inside a block form a small neighborhood with high memory locality. In this article we choose 32 as the blocking factor, so one block fills 32 * 32 * sizeof(float), i.e. 4 KB, of the cache, whose total size is 32 KB (the L1 data cache).
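For intuition only, here is a pure-NumPy sketch of the same tiling idea written as explicit loops (a hypothetical helper, not the TVM schedule below, which additionally splits the reduction axis by a factor of 4). Every update touches only one 32x32 tile of each matrix, so the working set stays well inside the 32 KB L1 data cache.

import numpy

def blocked_matmul(lhs, rhs, bn=32):
    """Naive blocked matrix multiply, for illustrating loop tiling only."""
    M, K = lhs.shape
    K2, N = rhs.shape
    assert K == K2
    out = numpy.zeros((M, N), dtype=lhs.dtype)
    for xo in range(0, M, bn):          # 32-row blocks of the output
        for yo in range(0, N, bn):      # 32-column blocks of the output
            for ko in range(0, K, bn):  # the reduction is blocked as well
                # Each update touches one bn x bn tile of lhs, rhs and out:
                # 3 * 32 * 32 * 4 bytes = 12 KB, comfortably inside a 32 KB L1 cache.
                out[xo:xo + bn, yo:yo + bn] += (
                    lhs[xo:xo + bn, ko:ko + bn] @ rhs[ko:ko + bn, yo:yo + bn]
                )
    return out

# Quick correctness check against numpy on small random inputs.
A_np = numpy.random.rand(128, 128).astype("float32")
B_np = numpy.random.rand(128, 128).astype("float32")
numpy.testing.assert_allclose(blocked_matmul(A_np, B_np), A_np @ B_np, rtol=1e-4)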

bn = 32
s = te.create_schedule(C.op)

# Blocking by loop tiling
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(k,) = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)

# Hoist reduction domain outside the blocking loop
s[C].reorder(xo, yo, ko, ki, xi, yi)

func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

# By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops,
# we can see a big speedup compared with the baseline.
evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print("Opt1: %f" % evaluator(a, b, c).mean)

Out:

Opt1: 0.284967

Here is the generated IR after blocking.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (x.outer: int32, 0, 32) {
    for (y.outer: int32, 0, 32) {
      for (x.inner.init: int32, 0, 32) {
        for (y.inner.init: int32, 0, 32) {
          C_2[((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)) + y.inner.init)] = 0f32
        }
      }
      for (k.outer: int32, 0, 256) {
        for (k.inner: int32, 0, 4) {
          for (x.inner: int32, 0, 32) {
            for (y.inner: int32, 0, 32) {
              C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = ((float32*)C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] + ((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)]*(float32*)B_2[((((k.outer*4096) + (k.inner*1024)) + (y.outer*32)) + y.inner)]))
            }
          }
        }
      }
    }
  }
}

Vectorization

Another important trick is vectorization. When the memory access pattern is uniform, the compiler can detect this pattern and pass the contiguous memory to the vector processor. In TVM, we can use the vectorize interface to hint the compiler about this pattern, so that the computation can be greatly accelerated.

In this article we choose to vectorize the inner-loop row data, since it is cache friendly.
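As a rough sketch of what vectorize(yi) asks the compiler to do (illustration only, on a single 32x32 tile rather than the full schedule): the innermost scalar loop over 32 contiguous elements of a C row collapses into one 32-wide operation, which is the shape the LLVM backend can lower directly to SIMD instructions.

import numpy

bn = 32
A_blk = numpy.random.rand(bn, bn).astype("float32")   # a 32x32 tile of A
B_blk = numpy.random.rand(bn, bn).astype("float32")   # a 32x32 tile of B

# Scalar form: the yi loop updates one element of the C row at a time.
C_scalar = numpy.zeros((bn, bn), dtype="float32")
for xi in range(bn):
    for ki in range(bn):
        for yi in range(bn):
            C_scalar[xi, yi] += A_blk[xi, ki] * B_blk[ki, yi]

# Vectorized form: the whole contiguous 32-wide row update becomes a single
# operation, which is what s[C].vectorize(yi) hints the compiler to emit as SIMD.
C_vector = numpy.zeros((bn, bn), dtype="float32")
for xi in range(bn):
    for ki in range(bn):
        C_vector[xi, :] += A_blk[xi, ki] * B_blk[ki, :]

numpy.testing.assert_allclose(C_scalar, C_vector, rtol=1e-5)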

s = te.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(k,) = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)
s[C].reorder(xo, yo, ko, ki, xi, yi)

# Vectorization
s[C].vectorize(yi)

func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print("Opt2: %f" % evaluator(a, b, c).mean)

Out:

Opt2: 0.321595

Here is the generated IR after vectorization.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (x.outer: int32, 0, 32) {
    for (y.outer: int32, 0, 32) {
      for (x.inner.init: int32, 0, 32) {
        C_2[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)
      }
      for (k.outer: int32, 0, 256) {
        for (k.inner: int32, 0, 4) {
          for (x.inner: int32, 0, 32) {
            C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (y.outer*32)), 1, 32)]))
          }
        }
      }
    }
  }
}

Loop Permutation

In the IR above, we can see that the inner-loop row data is vectorized and B is transformed into PackedB, whose traversal is now sequential. So next we look at the access pattern of A. In the current schedule, A is accessed column by column, which is not cache friendly. If we change the nested loop order of ki and the inner axis xi, the access pattern for the A matrix becomes more cache friendly.
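Continuing the same tile-level sketch (illustration only, hypothetical helper code), the permutation amounts to swapping the ki and xi loops: before the swap, A_blk is read down a column for each fixed ki; after the swap, it is read along a row for each fixed xi, which is contiguous for a row-major array.

import numpy

bn = 32
A_blk = numpy.random.rand(bn, bn).astype("float32")
B_blk = numpy.random.rand(bn, bn).astype("float32")

# Before: ki outside xi -- for a fixed ki, A_blk[xi, ki] walks DOWN a column of A.
C_before = numpy.zeros((bn, bn), dtype="float32")
for ki in range(bn):
    for xi in range(bn):
        C_before[xi, :] += A_blk[xi, ki] * B_blk[ki, :]

# After: xi outside ki -- for a fixed xi, A_blk[xi, ki] walks ALONG a row of A,
# which is contiguous in memory for a row-major (C-order) array.
C_after = numpy.zeros((bn, bn), dtype="float32")
for xi in range(bn):
    for ki in range(bn):
        C_after[xi, :] += A_blk[xi, ki] * B_blk[ki, :]

numpy.testing.assert_allclose(C_before, C_after, rtol=1e-5)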

s = te.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(k,) = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)

# re-ordering
s[C].reorder(xo, yo, ko, xi, ki, yi)
s[C].vectorize(yi)

func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print("Opt3: %f" % evaluator(a, b, c).mean)

Out:

Opt3: 0.111657

Here is the generated IR after loop permutation.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (x.outer: int32, 0, 32) {
    for (y.outer: int32, 0, 32) {
      for (x.inner.init: int32, 0, 32) {
        C_2[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)
      }
      for (k.outer: int32, 0, 256) {
        for (x.inner: int32, 0, 32) {
          for (k.inner: int32, 0, 4) {
            C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (y.outer*32)), 1, 32)]))
          }
        }
      }
    }
  }
}
