當前位置：首頁 > 人文社科 > 生活经验 >内容正文

生活经验

TVM优化Deep Learning GPU算子

發布時間：2023/11/28 生活经验 48 豆豆

生活随笔收集整理的這篇文章主要介紹了 TVM优化Deep Learning GPU算子小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

TVM優化Deep Learning GPU算子
高效的深度學習算子是深度學習系統的核心。通常，這些算子很難優化，需要HPC專家付出巨大的努力。端到端張量IR / DSL堆棧TVM使這一過程變得更加容易。
如何在TVM的幫助下編寫高性能GPU運算符內核。本文以深度卷積（即topi.nn.depthwise_conv2d_nchw）為例，并演示如何在tensorflow中改進已經手工優化的CUDA內核。在不同的工作負載下，最終版本比tf-1.2中優化的內核快2到4倍，在啟用了算子融合的情況下，最終版本快3到7倍。以下是在GTX1080上測試的結果，filter size = [1, 256, 3, 3], stride = [1, 1], padding = ‘SAME’:

深度卷積簡介
深度卷積是現代體系結構的重要組成部分，例如Xception和MobileNet。這是降低深度神經網絡計算復雜度的有效方法。

在TVM中，深度卷積可以聲明為：

padding stage

PaddedInput = tvm.compute(
(batch, in_channel, height_after_pad, width_after_pad),
lambda b, c, i, j: tvm.select(
tvm.all(i >= pad_top, i - pad_top < in_height, j >= pad_left, j - pad_left < in_width),
Input[b, c, i - pad_top, j - pad_left], tvm.const(0.0)),
name=“PaddedInput”)

depthconv stage

di = tvm.reduce_axis((0, filter_height), name=‘di’)
dj = tvm.reduce_axis((0, filter_width), name=‘dj’)
Output = tvm.compute(
(batch, out_channel, out_height, out_width),
lambda b, c, i, j: tvm.sum(
PaddedInput[b, c/channel_multiplier, istride_h + di, jstride_w + dj] * Filter[c/channel_multiplier, c%channel_multiplier, di, dj],
axis=[di, dj]),
name=‘DepthwiseConv2d’)
通用GPU優化準則
本部分簡要討論了優化CUDA代碼時應了解的三個概念：數據重用，共享內存和存儲沖突。
數據重用
在現代計算體系結構中，從內存中加載數據的成本要比進行單個浮點計算高得多。因此，始終希望在將輸入數據加載到寄存器或共享內存（高速緩存）后重用。
深度卷積有兩種形式的數據重用：filter過濾器重用和輸入重用。當filter過濾器在輸入通道上滑動并多次計算時，會發生filter過濾器重用。輸入重用是通過平鋪實現的，以3x3深度轉換為例：
General GPU Optimization Guidelines

在不進行平鋪的情況下，每個線程計算1個輸出元素并加載3x3輸入數據。16個線程加在一起有9x16的負載。

通過平鋪，每個線程計算2x2輸出元素并加載4x4輸入數據。4個線程加在一起有16x4的負載。
共享內存和bank沖突
共享內存可以看作是GPU中的緩存。這是片上的，并且比全局存儲快得多。

共享內存按block塊分配。通常的做法是將數據從全局內存加載到共享內存中， block塊中的所有線程都從共享內存中讀取數據。
共享內存的大小是有限的（通常為48K），必須謹慎對待共享內存的溢出。此外，分配給一個block塊的共享內存過多，限制了每個多處理器的活動塊數。
共享內存的另一個性能問題是存儲區沖突。共享內存分為大小相等的內存模塊（可同時訪問），但是，如果多個線程訪問同一內存庫（導致內存庫沖突），則訪問將被序列化，從而降低了有效帶寬。
共享存儲體的組織方式是將連續的地址分配給連續的存儲體。為避免存儲區沖突，最好連續的線程，訪問連續的內存地址，如下圖所示（每種顏色代表一個共享存儲區）：

開始優化TVM中的深度卷積。
調度優化
計算PaddedInput內聯以節省內存分配
從第1部分中可以看到，padding填充被顯式聲明為一個單獨的階段。內聯計算以避免冗余的內存分配：
s = tvm.create_schedule(Output.op)
s[PaddedInput].compute_inline()
將一個大通道劃分為較小的塊
深度卷積的一個簡單明了的調度表是，一個cuda塊負責一個輸入通道和相應的filter過濾器，加載到共享內存中，然后進行計算：
IS = s.cache_read(PaddedInput, “shared”, [DepthwiseConv2d])
FS = s.cache_read(Filter, “shared”, [DepthwiseConv2d])
block_y = tvm.thread_axis(“blockIdx.y”)
block_x = tvm.thread_axis(“blockIdx.x”)

bind the dimension of batch (N in NCHW) with block_y

s[Output].bind(Output.op.axis[0], block_y)

bind the dimension of channel (C in NCHW) with block_x

s[Output].bind(Output.op.axis[1], block_x)
Here is the result: 測試了在GTX 1080上運行1000次的平均時間成本，并與tensorflow中的depthwise_conv2d進行了比較。結果如下：
Input Filter stride tf-1.2 SAME pad (us) TVM SAME pad (us)
[1, 256, 21, 21] [256, 1, 3, 3] [1, 1] 16.1 9.1
[1, 256, 32, 32] [256, 1, 3, 3] [1, 1] 34.8 14.5
[1, 256, 64, 64] [256, 1, 3, 3] [1, 1] 130.9 98.9
[1, 256, 96, 96] [256, 1, 3, 3] [1, 1] 251.6 387.4
As we can see, this schedule performs well with small channel size like 21 x 21 or 32 x 32, however, its performance drops seriously as the channel size increases to larger than 64 x 64. One main reason is that too much shared memory allocated to one block limits the number of active blocks per multiprocessor.
此調度在較小的通道大小（例如21 x 21或32 x 32）下表現良好，但是，當通道大小增加到大于64 x 64時，其性能會嚴重下降。一個主要原因是分配的共享內存過多分配到一塊，限制每個多處理器的活動塊數。
修改了調度表，將一個大頻道劃分為多個較小的塊。例如，一個通道（64 x 64或96 x 96）被分成32 x 32的塊，而一個cuda塊負責一個32 x 32的塊：

blocking_h = 32
blocking_w = 32

split the dimension of height (H in NCHW)

bx1, _ = s[Output].split(Output.op.axis[2], factor=blocking_h)

split the dimension of width (W in NCHW)

bx2, _ = s[Output].split(Output.op.axis[3], factor=blocking_w)

assign one 32 x 32 block to one cuda block

by = s[Output].fuse(Output.op.axis[0], Output.op.axis[1])
s[Output].bind(by, block_y)
bx = s[Output].fuse(bx1, bx2)
s[Output].bind(bx, block_x)
結果如下:
Input [blocking_h, blocking_w] tf-1.2 SAME pad (us) TVM SAME pad (us)
[1, 256, 64, 64] [32, 32] 130.9 63.4
[1, 256, 96, 96] [32, 32] 251.6 132.5
封鎖策略有效！對于64 x 64通道大小，帶來1.6倍加速（98.9us-> 63.4us）；對于96 x 96通道大小，帶來2.9倍加速（387.4us-> 132.5us）。
線程的調整參數
如何在一個cuda塊的線程之間調度工作負載（例如32x32）？直觀地，應該是這樣的：
num_thread_y = 8
num_thread_x = 8
thread_y = tvm.thread_axis((0, num_thread_y), “threadIdx.y”)
thread_x = tvm.thread_axis((0, num_thread_x), “threadIdx.x”)
ty, yi = s[Output].split(h_dim, nparts=num_thread_y)
tx, xi = s[Output].split(w_dim, nparts=num_thread_x)
s[Output].reorder(ty, tx, yi, xi)
s[Output].bind(ty, thread_y)
s[Output].bind(tx, thread_x)
調度表中有兩個參數：num_thread_y和num_thread_x。如何確定最佳組合？先做一些實驗。以下是Filter = [256，1，3，3]和stride = [1，1]的結果：
Case Input num_thread_y num_thread_x TVM SAME pad (us)
1 [1, 256, 32, 32] 8 32 9.7
2 [1, 256, 32, 32] 4 32 8.8
3 [1, 256, 32, 32] 1 32 17.7
4 [1, 256, 32, 32] 32 1 32.5
從以上結果中可以得到：
? 情況2比情況1快。在情況2中，每個線程在輸出中計算一個8x1的圖塊，對應于輸入中的10x3的圖塊。比情況1的4x1 tile具有更好的數據重用性。
? 情況3比情況2慢。這是因為在情況3中，每個線程的工作量太大，導致讀取本地內存的成本較高。
? 情況4比情況3慢。這是因為num_thread_x = 32確保沒有bank沖突，而num_thread_y = 32沒有。
總結一下：
? 大圖塊有利于數據重用，但不利于本地內存讀取。
? num_thread_y和num_thread_x對bank沖突的影響是不對稱的。
? 為了找到num_thread_y和num_thread_x的最佳組合，實現高效共享存儲器訪問（避免組沖突），數據復用，本地存儲器read的平衡。
如何才能找到最佳組合呢？答案是蠻力搜索?？梢詫um_thread_y和num_thread_x作為參數傳遞給schedule函數，并嘗試所有可能的組合以找到最佳組合。這可以在TVM中輕松完成：
def schedule_depthwise_conv2d(…, num_thread_y=8, num_thread_x=8):
num_thread_y = num_thread_y
num_thread_x = num_thread_x
do_schedule_as_usual
return schedule

min_time_cost = inf
for num_thread_y, num_thread_x in all_possible_combinations:
schedule = schedule_depthwise_conv2d(…, num_thread_y=num_thread_y, num_thread_x=num_thread_x)
time_cost = test_depthwise_conv2d(…, schedule)
if time_cost < min_time_cost:
min_time_cost = time_cost
optimal_combination = [num_thread_y, num_thread_x]
實際上，可以看作是一個簡單的自動調度程序。
Vthread和交叉模式
引入TVM中的Vthread（虛擬線程），支持跨步模式?？梢赃@樣使用：
num_vthread_y = 2
num_vthread_x = 2
num_thread_y = 8
num_thread_x = 8
thread_vy = tvm.thread_axis((0, num_vthread_y), “vthread”, name=“vy”)
thread_vx = tvm.thread_axis((0, num_vthread_x), “vthread”, name=“vx”)
thread_y = tvm.thread_axis((0, num_thread_y), “threadIdx.y”)
thread_x = tvm.thread_axis((0, num_thread_x), “threadIdx.x”)

split the dimension of height (H in NCHW) twice

tvy, vyi = s[Output].split(h_dim, nparts=num_vthread_y)
ty, yi = s[Output].split(vyi, nparts=num_thread_y)

split the dimension of width (W in NCHW) twice

tvx, vxi = s[Output].split(w_dim, nparts=num_vthread_x)
tx, xi = s[Output].split(vxi, nparts=num_thread_x)

bind thread and vthread respectively

s[Output].bind(tvy, thread_vy)
s[Output].bind(tvx, thread_vx)
s[Output].bind(ty, thread_y)
s[Output].bind(tx, thread_x)
s[Output].reorder(tvy, tvx, ty, tx, yi, xi)
Let’s print the IR to see what vthread does:
/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = ‘SAME’ */
produce DepthwiseConv2d {
// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1
// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8
// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8
for (i.inner.inner.inner, 0, 2) {
for (j.inner.inner.inner, 0, 2) {
DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner)] = 0.000000f
DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 512)] = 0.000000f
DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 16)] = 0.000000f
DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 528)] = 0.000000f
for (di, 0, 3) {
for (dj, 0, 3) {
DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner)] = (DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner)] + (tvm_if_then_else(((((((1 - di) - i.inner.inner.inner) <= (((blockIdx.x16) + threadIdx.y)2)) && ((((blockIdx.x16) + threadIdx.y)2) < ((33 - di) - i.inner.inner.inner))) && (((1 - dj) - j.inner.inner.inner) <= (threadIdx.x2))) && ((threadIdx.x2) < ((33 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + (di32)) + dj) + -33)], 0.000000f)Filter[((di3) + dj)]))
DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 512)] = (DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 512)] + (tvm_if_then_else(((((((-15 - di) - i.inner.inner.inner) <= (((blockIdx.x16) + threadIdx.y)2)) && ((((blockIdx.x16) + threadIdx.y)2) < ((17 - di) - i.inner.inner.inner))) && (((1 - dj) - j.inner.inner.inner) <= (threadIdx.x2))) && ((threadIdx.x2) < ((33 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + (di32)) + dj) + 479)], 0.000000f)Filter[((di3) + dj)]))
DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 16)] = (DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 16)] + (tvm_if_then_else(((((((1 - di) - i.inner.inner.inner) <= (((blockIdx.x16) + threadIdx.y)2)) && ((((blockIdx.x16) + threadIdx.y)2) < ((33 - di) - i.inner.inner.inner))) && (((-15 - dj) - j.inner.inner.inner) <= (threadIdx.x2))) && ((threadIdx.x2) < ((17 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + (di32)) + dj) + -17)], 0.000000f)Filter[((di3) + dj)]))
DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)*16) + threadIdx.y)*32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 528)] = (DepthwiseConv2d[(((((((((blockIdx.y + blockIdx.x)16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + 528)] + (tvm_if_then_else(((((((-15 - di) - i.inner.inner.inner) <= (((blockIdx.x16) + threadIdx.y)2)) && ((((blockIdx.x16) + threadIdx.y)2) < ((17 - di) - i.inner.inner.inner))) && (((-15 - dj) - j.inner.inner.inner) <= (threadIdx.x2))) && ((threadIdx.x2) < ((17 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)16) + threadIdx.y)32) + threadIdx.x)2) + (i.inner.inner.inner32)) + j.inner.inner.inner) + (di32)) + dj) + 495)], 0.000000f)Filter[((di3) + dj)]))
}
}
}
}
}
Without vthread (just set to 1), the IR is:
/ Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = ‘SAME’ */
produce DepthwiseConv2d {
// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1
// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8
// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8
for (i.inner.inner.inner, 0, 4) {
for (j.inner.inner.inner, 0, 4) {
DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)4) + (i.inner.inner.inner32)) + j.inner.inner.inner)] = 0.000000f
for (di, 0, 3) {
for (dj, 0, 3) {
DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)*32) + threadIdx.x)4) + (i.inner.inner.inner32)) + j.inner.inner.inner)] = (DepthwiseConv2d[((((((((blockIdx.y + blockIdx.x)8) + threadIdx.y)32) + threadIdx.x)4) + (i.inner.inner.inner32)) + j.inner.inner.inner)] + (tvm_if_then_else(((((((1 - di) - i.inner.inner.inner) <= (((blockIdx.x8) + threadIdx.y)4)) && ((((blockIdx.x8) + threadIdx.y)4) < ((33 - di) - i.inner.inner.inner))) && (((1 - dj) - j.inner.inner.inner) <= (threadIdx.x4))) && ((threadIdx.x4) < ((33 - dj) - j.inner.inner.inner))), Input[(((((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)32) + threadIdx.x)4) + (i.inner.inner.inner32)) + j.inner.inner.inner) + (di32)) + dj) + -33)], 0.000000f)Filter[((di3) + dj)]))
}
}
}
}
}
可以看到，當num_vthread_y = 2和時num_vthread_x = 2，將32 x 32通道分為四個16 x 16子通道。每個線程一次計算四個輸出元素，一個子通道中一個元素。
以下是Filter = [256，1，3，3]，stride = [1，1]，blocking_h = 32，blocking_w = 32的結果：
Case Input num_thread_y, num_thread_x num_vthread_y, num_vthread_x TVM SAME pad (us)
1 [1, 256, 96, 96] 8, 8 1, 1 132.5
2 [1, 256, 96, 96] 8, 8 1, 4 103.1
3 [1, 256, 96, 96] 4, 32 1, 1 95.9
4 [1, 256, 96, 96] 8, 16 1, 2 90.9
Case 2比Case 1快。在Case 2中，num_thread_x=8并且num_vthread_x=4一起確保連續的線程訪問連續的內存地址，從而避免了存儲區沖突，如下所示（每種顏色代表一個線程的工作量）：

從理論上講，case 3和case 4應該同樣很快，每個線程的工作量相同，并且都享有有效的共享內存訪問。不管怎樣，case 4快了一點。
還記得tensorflow的速度嗎？現在是251.6us，現在TVM快了2.8倍。387.4-> 132.5-> 95.9-> 90.9，封鎖最有幫助；調整線程數可節省37us；vthread可以節省額外的5us。
實際上，TVM可以比具有大內核大小或channel_multiplier的tensorflow快得多（因為更多的filter過濾器重用）：
Input Filter stride tf-1.2 SAME pad (us) TVM SAME pad (us) How faster is TVM
[1, 256, 96, 96] [256, 1, 3, 3] [1, 1] 251.6 90.9 2.8x
[1, 256, 96, 96] [256, 1, 5, 5] [1, 1] 597.6 128.9 4.6x
[1, 256, 96, 96] [256, 2, 3, 3] [1, 1] 659.9 143.7 4.6x
[1, 256, 96, 96] [256, 2, 5, 5] [1, 1] 1203.9 170.5 7.1x
Consider a common pattern in neural networks: depthwise_conv2d + scale_shift + relu. We can fuse the three operators into one, by slightly modifying the original schedule:
算子融合
算子融合是可以在深度學習中進行的一種典型優化，可以在單個內核中一起計算多個算子，無需將中間結果保存回全局內存中。TVM對此提供了開箱即用的支持。
神經網絡中的一個常見模式：depthwise_conv2d+ scale_shift+ relu。稍微修改原始調度表，可以將三個算子融合為一個：
DepthwiseConv2d = topi.nn.depthwise_conv2d(Input, Filter, stride, padding)
ScaleShift = topi.nn.scale_shift(DepthwiseConv2d, Scale, Shift)
Relu = topi.nn.relu(ScaleShift)

Output = Relu # is no longer DepthwiseConv2d
s[ScaleShift].compute_inline() # this line fuses ScaleShift, explicitly
s[DepthwiseConv2d].set_scope(“local”) # this line fuses DepthwiseConv2d, implicitly
schedule(Output) # schedule for Output the same way we schedule for DepthwiseConv2d as discussed above
s[DepthwiseConv2d].compute_at(s[Output], tx) # tx is the inner most axis, bound to threadIdx.x
生成IR，如下所示：
/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = ‘SAME’ /
produce Relu {
// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1
// attr [DepthwiseConv2d] storage_scope = “local”
allocate DepthwiseConv2d[float32 * 1 * 1 * 4 * 4]
// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8
// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8
produce DepthwiseConv2d {
for (i, 0, 4) {
for (j, 0, 4) {
DepthwiseConv2d[((i4) + j)] = 0.000000f
for (di, 0, 3) {
for (dj, 0, 3) {
DepthwiseConv2d[((i4) + j)] = (DepthwiseConv2d[((i4) + j)] + (tvm_if_then_else(((((((1 - di) - i) <= (((blockIdx.x8) + threadIdx.y)4)) && ((((blockIdx.x8) + threadIdx.y)4) < ((33 - di) - i))) && (((1 - dj) - j) <= (threadIdx.x4))) && ((threadIdx.x4) < ((33 - dj) - j))), Input[(((((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)32) + threadIdx.x)4) + (i32)) + j) + (di32)) + dj) + -33)], 0.000000f)Filter[((di3) + dj)]))
}
}
}
}
}
for (i2.inner.inner.inner, 0, 4) {
for (i3.inner.inner.inner, 0, 4) {
Relu[((((((((blockIdx.y + blockIdx.x)*8) + threadIdx.y)32) + threadIdx.x)4) + (i2.inner.inner.inner32)) + i3.inner.inner.inner)] = max(((DepthwiseConv2d[((i2.inner.inner.inner4) + i3.inner.inner.inner)]*Scale[0]) + Shift[0]), 0.000000f)
}
}
}
寫入depthwise_conv2d全局內存的結果之前，每個線程計算scale_shift和relu。融合算子的速度與single depthwise_conv2d一樣快。以下是輸入= [1、256、96、96]，filter過濾器= [256、1、3、3]，stride步幅= [1、1]，padding填充='SAME’的結果：
? tf-1.2 depthwise_conv2d: 251.6 us
? tf-1.2 depthwise_conv2d + scale_shift + relu (separate): 419.9 us
? TVM depthwise_conv2d: 90.9 us
? TVM depthwise_conv2d + scale_shift + relu (fused): 91.5 us
The advantage of operator fusion is obvious.
This is not the end, TVM can do operator fusion in a smarter way. You may refer to this and read the source code provided below.
Show me the code算子融合的優勢顯而易見的。
這不是終點，TVM可以以更智能的方式進行算子融合。參考鏈接：
? Declare: https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/depthwise_conv2d.py
? Schedule: https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/depthwise_conv2d.py
? Test: https://github.com/apache/incubator-tvm/blob/main/topi/recipe/conv/depthwise_conv2d_test.py

總結

以上是生活随笔為你收集整理的TVM优化Deep Learning GPU算子的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：端到端TVM编译器（下）
下一篇：用TVM在硬件平台上部署深度学习工作负载