算子扫描与递归核
算子掃描與遞歸核
這是關(guān)于如何在TVM中進(jìn)行循環(huán)計(jì)算的介紹資料。遞歸計(jì)算是神經(jīng)網(wǎng)絡(luò)的一種典型模式。
from future import absolute_import, print_function
import tvm
import tvm.testing
from tvm import te
import numpy as np
TVM支持掃描運(yùn)算符來(lái)描述符號(hào)循環(huán)。下面的掃描操作計(jì)算X列上的累計(jì)值。 掃描在張量的最高維上進(jìn)行。s_state是一個(gè)占位符,用于描述掃描的轉(zhuǎn)換狀態(tài)。s_init描述了如何初始化前k個(gè)時(shí)間步。在這里,由于s_init’s的第一個(gè)維度是1,它描述了如何在第一時(shí)間步初始化狀態(tài)。
s_update描述如何在時(shí)間步t更新值。更新值可以通過(guò)狀態(tài)占位符引用上一個(gè)時(shí)間步的值。雖然在當(dāng)前或以后的時(shí)間步驟中引用s_state是無(wú)效的。
掃描采用狀態(tài)占位符、初始值和更新描述。還建議(盡管不是必需的)列出掃描單元的輸入。掃描的結(jié)果是一個(gè)張量,給出在時(shí)域上更新后的s_state的結(jié)果。
m = te.var(“m”)
n = te.var(“n”)
X = te.placeholder((m, n), name=“X”)
s_state = te.placeholder((m, n))
s_init = te.compute((1, n), lambda _, i: X[0, i])
s_update = te.compute((m, n), lambda t, i: s_state[t - 1, i] + X[t, i])
s_scan = tvm.te.scan(s_init, s_update, s_state, inputs=[X])
Schedule the Scan Cell
可以通過(guò)分別調(diào)度更新和初始化部分來(lái)調(diào)度掃描主體。調(diào)度更新部件的第一個(gè)迭代維度是無(wú)效的。要在時(shí)間上拆分迭代,用戶(hù)可以調(diào)度scan_op.scan_axis。
s = te.create_schedule(s_scan.op)
num_thread = 256
block_x = te.thread_axis(“blockIdx.x”)
thread_x = te.thread_axis(“threadIdx.x”)
xo, xi = s[s_init].split(s_init.op.axis[1], factor=num_thread)
s[s_init].bind(xo, block_x)
s[s_init].bind(xi, thread_x)
xo, xi = s[s_update].split(s_update.op.axis[1], factor=num_thread)
s[s_update].bind(xo, block_x)
s[s_update].bind(xi, thread_x)
print(tvm.lower(s, [X, s_scan], simple_mode=True))
Out:
primfn(X_1: handle, scan_1: handle) -> ()
attr = {“global_symbol”: “main”, “tir.noalias”: True}
buffers = {scan: Buffer(scan_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type=“auto”),
X: Buffer(X_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=“auto”)}
buffer_map = {X_1: X, scan_1: scan} {
attr [IterVar(blockIdx.x: int32, (nullptr), “ThreadIndex”, “blockIdx.x”)] “thread_extent” = floordiv((n + 255), 256);
attr [IterVar(threadIdx.x: int32, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 256;
if @tir.likely((((blockIdx.x256) + threadIdx.x) < n), dtype=bool) {
scan_2[(((blockIdx.x256) + threadIdx.x)stride_1)] = (float32)X_2[(((blockIdx.x256) + threadIdx.x)stride_3)]
}
for (scan.idx: int32, 0, (m - 1)) {
attr [IterVar(blockIdx.x, (nullptr), “ThreadIndex”, “blockIdx.x”)] “thread_extent” = floordiv((n + 255), 256);
attr [IterVar(threadIdx.x, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 256;
if @tir.likely((((blockIdx.x256) + threadIdx.x) < n), dtype=bool) {
scan_2[(((scan.idx + 1)stride) + (((blockIdx.x256) + threadIdx.x)stride_1))] = ((float32)scan_2[((scan.idxstride) + (((blockIdx.x256) + threadIdx.x)stride_1))] + (float32)X_2[(((scan.idx + 1)stride_2) + (((blockIdx.x256) + threadIdx.x)stride_3))])
}
}
}
Build and Verify
可以像其他TVM內(nèi)核一樣構(gòu)建掃描內(nèi)核,使用numpy來(lái)驗(yàn)證結(jié)果的正確性。
fscan = tvm.build(s, [X, s_scan], “cuda”, name=“myscan”)
ctx = tvm.gpu(0)
n = 1024
m = 10
a_np = np.random.uniform(size=(m, n)).astype(s_scan.dtype)
a = tvm.nd.array(a_np, ctx)
b = tvm.nd.array(np.zeros((m, n), dtype=s_scan.dtype), ctx)
fscan(a, b)
tvm.testing.assert_allclose(b.asnumpy(), np.cumsum(a_np, axis=0))
Multi-Stage Scan Cell
在上面的例子中,描述了掃描單元使用一個(gè)張量計(jì)算階段在s_update。可以在掃描單元中使用多個(gè)張量級(jí)。
以下幾行顯示掃描單元中有兩個(gè)階段操作的掃描。
m = te.var(“m”)
n = te.var(“n”)
X = te.placeholder((m, n), name=“X”)
s_state = te.placeholder((m, n))
s_init = te.compute((1, n), lambda _, i: X[0, i])
s_update_s1 = te.compute((m, n), lambda t, i: s_state[t - 1, i] * 2, name=“s1”)
s_update_s2 = te.compute((m, n), lambda t, i: s_update_s1[t, i] + X[t, i], name=“s2”)
s_scan = tvm.te.scan(s_init, s_update_s2, s_state, inputs=[X])
這些中間張量也可以正常調(diào)度。為了確保正確性,TVM創(chuàng)建了一個(gè)組約束,禁止在掃描循環(huán)之外的位置compute_at掃描體。
s = te.create_schedule(s_scan.op)
xo, xi = s[s_update_s2].split(s_update_s2.op.axis[1], factor=32)
s[s_update_s1].compute_at(s[s_update_s2], xo)
print(tvm.lower(s, [X, s_scan], simple_mode=True))
Out:
primfn(X_1: handle, scan_1: handle) -> ()
attr = {“global_symbol”: “main”, “tir.noalias”: True}
buffers = {scan: Buffer(scan_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type=“auto”),
X: Buffer(X_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=“auto”)}
buffer_map = {X_1: X, scan_1: scan} {
attr [s1: Pointer(float32)] “storage_scope” = “global”;
allocate(s1, float32, [32]) {
for (i: int32, 0, n) {
scan_2[(istride_1)] = (float32)X_2[(istride_3)]
}
for (scan.idx: int32, 0, (m - 1)) {
for (i.outer: int32, 0, floordiv((n + 31), 32)) {
for (i_1: int32, 0, 32) {
if @tir.likely((((i.outer32) + i_1) < n), dtype=bool) {
s1[i_1] = ((float32*)scan_2[((scan.idxstride) + (((i.outer32) + i_1)stride_1))]2f32)
}
}
for (i.inner: int32, 0, 32) {
if @tir.likely((((i.outer32) + i.inner) < n), dtype=bool) {
scan_2[(((scan.idx + 1)stride) + (((i.outer32) + i.inner)stride_1))] = ((float32)s1[i.inner] + (float32)X_2[(((scan.idx + 1)stride_2) + (((i.outer32) + i.inner)stride_3))])
}
}
}
}
}
}
Multiple States
對(duì)于像RNN這樣的復(fù)雜應(yīng)用程序,可能需要不止一個(gè)遞歸狀態(tài)。掃描支持多種重復(fù)狀態(tài)。下面的示例演示了如何使用兩種狀態(tài)構(gòu)建遞歸。
m = te.var(“m”)
n = te.var(“n”)
l = te.var(“l(fā)”)
X = te.placeholder((m, n), name=“X”)
s_state1 = te.placeholder((m, n))
s_state2 = te.placeholder((m, l))
s_init1 = te.compute((1, n), lambda _, i: X[0, i])
s_init2 = te.compute((1, l), lambda _, i: 0.0)
s_update1 = te.compute((m, n), lambda t, i: s_state1[t - 1, i] + X[t, i])
s_update2 = te.compute((m, l), lambda t, i: s_state2[t - 1, i] + s_state1[t - 1, 0])
s_scan1, s_scan2 = tvm.te.scan(
[s_init1, s_init2], [s_update1, s_update2], [s_state1, s_state2], inputs=[X]
)
s = te.create_schedule(s_scan1.op)
print(tvm.lower(s, [X, s_scan1, s_scan2], simple_mode=True))
Out:
primfn(X_1: handle, scan.v0_1: handle, scan.v1_1: handle) -> ()
attr = {“global_symbol”: “main”, “tir.noalias”: True}
buffers = {scan.v1: Buffer(scan.v1_2: Pointer(float32), float32, [m: int32, l: int32], [stride: int32, stride_1: int32], type=“auto”),
scan.v0: Buffer(scan.v0_2: Pointer(float32), float32, [m, n: int32], [stride_2: int32, stride_3: int32], type=“auto”),
X: Buffer(X_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type=“auto”)}
buffer_map = {X_1: X, scan.v0_1: scan.v0, scan.v1_1: scan.v1} {
for (i: int32, 0, n) {
scan.v0_2[(istride_3)] = (float32*)X_2[(istride_5)]
}
for (i_1: int32, 0, l) {
scan.v1_2[(i_1stride_1)] = 0f32
}
for (scan.idx: int32, 0, (m - 1)) {
for (i_2: int32, 0, n) {
scan.v0_2[(((scan.idx + 1)stride_2) + (i_2stride_3))] = ((float32*)scan.v0_2[((scan.idxstride_2) + (i_2stride_3))] + (float32*)X_2[(((scan.idx + 1)stride_4) + (i_2stride_5))])
}
for (i_3: int32, 0, l) {
scan.v1_2[(((scan.idx + 1)stride) + (i_3stride_1))] = ((float32*)scan.v1_2[((scan.idxstride) + (i_3stride_1))] + (float32*)scan.v0_2[(scan.idx*stride_2)])
}
}
}
Summary
本文提供掃描原語(yǔ)的概況。
用init和update描述掃描。
按正常計(jì)劃安排掃描單元。
對(duì)于復(fù)雜的工作負(fù)載,在掃描單元中使用多個(gè)狀態(tài)和步驟。
https://tvm.apache.org/docs/tutorials/language/scan.html
下載Python源代碼:scan.py
下載Jupyter筆記:scan.ipynb
總結(jié)