Auto-scheduling a Convolution Layer for GPU

Published: 2023/11/28
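This is a tutorial on how to use the auto-scheduler for GPUs.
Unlike the template-based autotvm, which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration, without any schedule commands or templates. The auto-scheduler can automatically generate a large search space and find a good schedule in that space.
This tutorial uses a convolution layer as an example.
Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run there, wrap the body of this tutorial in an if __name__ == "__main__": block.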
import os

import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python
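Define the computation
To begin with, define the computation of a convolution layer. The function should return the list of input/output tensors. From these tensors, the auto-scheduler can obtain the whole computational graph.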
@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]
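To make the declared computation concrete, here is a minimal NumPy sketch of the same math (NCHW convolution, bias add, ReLU). It is only an illustration: the function name conv2d_bias_relu_nchw and its scalar stride/padding arguments are assumptions made for this example, and it is not part of TVM.

```python
import numpy as np


def conv2d_bias_relu_nchw(data, kernel, bias, stride=1, padding=1):
    """Naive NCHW convolution followed by a bias add and ReLU (reference only)."""
    n, ci, h, w = data.shape
    co, _, kh, kw = kernel.shape
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1
    # Zero-pad the two spatial axes only.
    padded = np.pad(data, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    out = np.zeros((n, co, out_h, out_w), dtype="float32")
    for y in range(out_h):
        for x in range(out_w):
            # Input window of shape (n, ci, kh, kw) under the kernel at (y, x).
            window = padded[:, :, y * stride : y * stride + kh, x * stride : x * stride + kw]
            # Contract over input channels and both kernel axes -> shape (n, co).
            out[:, :, y, x] = np.tensordot(window, kernel, axes=([1, 2, 3], [1, 2, 3]))
    # bias has shape (1, co, 1, 1) and broadcasts over batch and spatial axes.
    return np.maximum(out + bias, 0.0)
```

With H = W = 7, a 3x3 kernel, stride 1 and padding 1, the output keeps the 7x7 spatial size, which matches the shapes used for the search task in this tutorial.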
Create the search task
Next, create a search task for the last convolution layer in ResNet.
target = tvm.target.Target("cuda")

# Use the last layer in ResNet-50
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)

# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)
Output:
Computational DAG:
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
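The pad_temp stage in the DAG above zero-pads each 7x7 input plane by one element on every side. The following sketch spells out that rule for a single channel and checks it against NumPy's padding. The 9x9 padded size is an inference from H = W = 7 and padding 1, and the test data is made up for illustration.

```python
import numpy as np

# One 7x7 input plane with made-up values.
data = np.arange(49, dtype="float32").reshape(7, 7)

# Explicit form of the pad_temp rule: copy data[i2 - 1, i3 - 1] when both
# indices fall in [1, 8), otherwise keep 0.
pad_temp = np.zeros((9, 9), dtype="float32")
for i2 in range(9):
    for i3 in range(9):
        if 1 <= i2 < 8 and 1 <= i3 < 8:
            pad_temp[i2, i3] = data[i2 - 1, i3 - 1]

# The same result via NumPy's constant (zero) padding.
assert np.array_equal(pad_temp, np.pad(data, 1))
```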
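Next, set the parameters for the auto-scheduler. These parameters mainly specify how measurements are made during the search.
- measure_ctx launches a separate process for measurement to provide isolation. It protects the main process from GPU crashes during measurement and avoids other runtime conflicts.
- min_repeat_ms defines the minimum duration of one "repeat" in every measurement. This warms up the GPU, which is necessary to get accurate measurement results. Typically, a value >= 300 ms is recommended.
- num_measure_trials is the number of measurement trials that can be used during the search. Only 10 trials are used in this tutorial for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can run more trials according to your time budget.
- In addition, RecordToFile is used to dump measurement records into the file conv2d.json. The measurement records can be used to query the best result in the history, resume the search, and do more analysis later.
- For more parameters, see auto_scheduler.TuningOptions and auto_scheduler.LocalRPCMeasureContext.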
log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,  # change this to 1000 to achieve the best performance
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
Output:
Get devices for measurement successfully!
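The idea behind min_repeat_ms can be sketched in plain Python: keep increasing the number of calls in one repeat until the repeat lasts at least the minimum duration, then report the mean time per call. This is an illustrative sketch only. The helper time_one_repeat and its doubling strategy are assumptions made for this example, not TVM's actual measurement code.

```python
import time


def time_one_repeat(func, number=1, min_repeat_ms=300):
    """Return the mean seconds per call for one sufficiently long 'repeat'."""
    while True:
        start = time.perf_counter()
        for _ in range(number):
            func()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms >= min_repeat_ms:
            return elapsed_ms / 1000.0 / number
        # The repeat was too short to be trustworthy: run more calls next time.
        number *= 2
```

A very short kernel would otherwise be timed over a window dominated by launch overhead and clock ramp-up, so stretching each repeat to a few hundred milliseconds tends to give steadier numbers.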
Run the search
Now all inputs are ready. Start the search and let the auto-scheduler do its work. After some measurement trials, the best schedule can be loaded from the log file and applied.

# Run auto-tuning (search)
task.tune(tune_option)
# Apply the best schedule
sch, args = task.apply_best(log_file)

# Kill the measurement process
del measure_ctx
We can lower the schedule to see the IR after auto-scheduling. The auto-scheduler correctly performs optimizations including multi-level tiling, cooperative fetching, unrolling, and operator fusion.
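As a rough illustration of what tiling means, the sketch below applies a single level of tiling to a matrix multiply in NumPy. It is not the schedule found by the auto-scheduler, and the function name matmul_tiled and the tile size are assumptions made for this example.

```python
import numpy as np


def matmul_tiled(a, b, tile=4):
    """Matrix multiply computed tile by tile; the result equals a @ b."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):          # outer loop over row tiles
        for j0 in range(0, n, tile):      # outer loop over column tiles
            for k0 in range(0, k, tile):  # outer loop over reduction tiles
                # Each step touches only one small block of a and b, which is
                # the kind of working set that can be staged in fast memory.
                out[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile] @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return out
```

In the lowered IR, the same idea shows up as split loops such as rc.outer.outer and as the shared-memory buffers pad_temp.shared and kernel.shared, which hold one tile of the inputs at a time.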
print("Lowered TIR:")
print(tvm.lower(sch, args, simple_mode=True))
Output:
Lowered TIR:
primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
             bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
             kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], []),
             data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], [])}
  buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 28;
  attr [compute_3: Pointer(float32)] "storage_scope" = "local";
  allocate(compute_3, float32, [14]);
  attr [pad_temp.shared: Pointer(float32)] "storage_scope" = "shared";
  allocate(pad_temp.shared, float32, [72]);
  attr [kernel.shared: Pointer(float32)] "storage_scope" = "shared";
  allocate(kernel.shared, float32, [3072]);
  attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
    compute_3[0] = 0f32
    compute_3[1] = 0f32
    compute_3[2] = 0f32
    compute_3[3] = 0f32
    compute_3[4] = 0f32
    compute_3[5] = 0f32
    compute_3[6] = 0f32
    compute_3[7] = 0f32
    compute_3[8] = 0f32
    compute_3[9] = 0f32
    compute_3[10] = 0f32
    compute_3[11] = 0f32
    compute_3[12] = 0f32
    compute_3[13] = 0f32
    for (rc.outer.outer: int32, 0, 64) {
      for (ry.outer.outer: int32, 0, 3) {
        attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[(threadIdx.x_1*4)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv((threadIdx.x_1*4), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
          }
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 1), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32)
          }
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 2), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0f32, dtype=float32)
          }
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 3), 9))) && (floormod(((threadIdx.x_1*4) + 3), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 3), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)], 0f32, dtype=float32)
          }
        }
attr [IterVar(threadIdx.x_2: int32, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[threadIdx.x_2] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 64)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 64), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 128)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 128), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 192)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 36864)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 256)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 256), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 320)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 320), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 384)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 73728)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 448)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 448), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 512)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 512), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 576)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 110592)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 640)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 640), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 704)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 704), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 768)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 147456)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 832)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 832), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 896)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 896), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 960)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 184320)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1024)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1024), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1088)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1088), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1152)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 221184)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1216)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1216), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1280)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1280), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1344)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 258048)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1408)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1408), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1472)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1472), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1536)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 294912)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1600)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1600), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1664)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1664), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1728)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 331776)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1792)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1792), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1856)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1856), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1920)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 368640)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 1984)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 1984), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2048)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2048), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2112)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 405504)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2176)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2176), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2240)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2240), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2304)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 442368)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2368)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2368), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2432)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2432), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2496)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 479232)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2560)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2560), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2624)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2624), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2688)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 516096)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2752)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2752), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2816)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2816), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2880)] = (float32
)kernel_2[(((((((floordiv(blockIdx.x, 7)589824) + (floordiv(threadIdx.x_2, 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)9)) + (ry.outer.outer3)) + floormod(threadIdx.x_2, 3)) + 552960)]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 2944)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 2944), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 1), 3))]
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 64;
kernel.shared[(threadIdx.x_2 + 3008)] = (float32
)kernel_2[((((((floordiv(blockIdx.x, 7)589824) + (floordiv((threadIdx.x_2 + 3008), 24)4608)) + (rc.outer.outer72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)9)) + (ry.outer.outer3)) + floormod((threadIdx.x_2 + 2), 3))]
compute_3[0] = ((float32
)compute_3[0] + ((float32*)pad_temp.shared[0](float32)kernel.shared[(threadIdx.x48)]))
compute_3[0] = ((float32
)compute_3[0] + ((float32*)pad_temp.shared[9](float32)kernel.shared[((threadIdx.x48) + 3)]))
compute_3[1] = ((float32
)compute_3[1] + ((float32*)pad_temp.shared[1](float32)kernel.shared[(threadIdx.x48)]))
compute_3[1] = ((float32
)compute_3[1] + ((float32*)pad_temp.shared[10](float32)kernel.shared[((threadIdx.x48) + 3)]))
compute_3[2] = ((float32
)compute_3[2] + ((float32*)pad_temp.shared[2](float32)kernel.shared[(threadIdx.x48)]))
compute_3[2] = ((float32
)compute_3[2] + ((float32*)pad_temp.shared[11](float32)kernel.shared[((threadIdx.x48) + 3)]))
compute_3[3] = ((float32
)compute_3[3] + ((float32*)pad_temp.shared[3](float32)kernel.shared[(threadIdx.x48)]))
compute_3[3] = ((float32
)compute_3[3] + ((float32*)pad_temp.shared[12](float32)kernel.shared[((threadIdx.x48) + 3)]))
compute_3[4] = ((float32
)compute_3[4] + ((float32*)pad_temp.shared[4](float32)kernel.shared[(threadIdx.x48)]))
compute_3[4] = ((float32
)compute_3[4] + ((float32*)pad_temp.shared[13](float32)kernel.shared[((threadIdx.x48) + 3)]))
compute_3[5] = ((float32
)compute_3[5] + ((float32*)pad_temp.shared[5](float32)kernel.shared[(threadIdx.x48)]))
compute_3[5] = ((float32
)compute_3[5] + ((float32*)pad_temp.shared[14](float32)kernel.shared[((threadIdx.x48) + 3)]))
compute_3[6] = ((float32
)compute_3[6] + ((float32*)pad_temp.shared[6](float32)kernel.shared[(threadIdx.x48)]))
compute_3[6] = ((float32
)compute_3[6] + ((float32*)pad_temp.shared[15](float32)kernel.shared[((threadIdx.x48) + 3)]))
compute_3[7] = ((float32
)compute_3[7] + ((float32*)pad_temp.shared[0](float32)kernel.shared[((threadIdx.x48) + 24)]))
compute_3[7] = ((float32
)compute_3[7] + ((float32*)pad_temp.shared[9](float32)kernel.shared[((threadIdx.x48) + 27)]))
compute_3[8] = ((float32
)compute_3[8] + ((float32*)pad_temp.shared[1](float32)kernel.shared[((threadIdx.x48) + 24)]))
compute_3[8] = ((float32
)compute_3[8] + ((float32*)pad_temp.shared[10](float32)kernel.shared[((threadIdx.x48) + 27)]))
compute_3[9] = ((float32
)compute_3[9] + ((float32*)pad_temp.shared[2](float32)kernel.shared[((threadIdx.x48) + 24)]))
compute_3[9] = ((float32
)compute_3[9] + ((float32*)pad_temp.shared[11](float32)kernel.shared[((threadIdx.x48) + 27)]))
compute_3[10] = ((float32
)compute_3[10] + ((float32*)pad_temp.shared[3](float32)kernel.shared[((threadIdx.x48) + 24)]))
compute_3[10] = ((float32
)compute_3[10] + ((float32*)pad_temp.shared[12](float32)kernel.shared[((threadIdx.x48) + 27)]))
compute_3[11] = ((float32
)compute_3[11] + ((float32*)pad_temp.shared[4](float32)kernel.shared[((threadIdx.x48) + 24)]))
compute_3[11] = ((float32
)compute_3[11] + ((float32*)pad_temp.shared[13](float32)kernel.shared[((threadIdx.x48) + 27)]))
compute_3[12] = ((float32
)compute_3[12] + ((float32*)pad_temp.shared[5](float32)kernel.shared[((threadIdx.x48) + 24)]))
compute_3[12] = ((float32
)compute_3[12] + ((float32*)pad_temp.shared[14](float32)kernel.shared[((threadIdx.x48) + 27)]))
compute_3[13] = ((float32
)compute_3[13] + ((float32*)pad_temp.shared[6](float32)kernel.shared[((threadIdx.x48) + 24)]))
compute_3[13] = ((float32
)compute_3[13] + ((float32*)pad_temp.shared[15](float32)kernel.shared[((threadIdx.x48) + 27)]))
compute_3[0] = ((float32
)compute_3[0] + ((float32*)pad_temp.shared[1](float32)kernel.shared[((threadIdx.x48) + 1)]))
compute_3[0] = ((float32
)compute_3[0] + ((float32*)pad_temp.shared[10](float32)kernel.shared[((threadIdx.x48) + 4)]))
compute_3[1] = ((float32
)compute_3[1] + ((float32*)pad_temp.shared[2](float32)kernel.shared[((threadIdx.x48) + 1)]))
compute_3[1] = ((float32
)compute_3[1] + ((float32*)pad_temp.shared[11](float32)kernel.shared[((threadIdx.x48) + 4)]))
compute_3[2] = ((float32
)compute_3[2] + ((float32*)pad_temp.shared[3](float32)kernel.shared[((threadIdx.x48) + 1)]))
compute_3[2] = ((float32
)compute_3[2] + ((float32*)pad_temp.shared[12](float32)kernel.shared[((threadIdx.x48) + 4)]))
compute_3[3] = ((float32
)compute_3[3] + ((float32*)pad_temp.shared[4](float32)kernel.shared[((threadIdx.x48) + 1)]))
compute_3[3] = ((float32
)compute_3[3] + ((float32*)pad_temp.shared[13](float32)kernel.shared[((threadIdx.x48) + 4)]))
compute_3[4] = ((float32
)compute_3[4] + ((float32*)pad_temp.shared[5](float32)kernel.shared[((threadIdx.x48) + 1)]))
compute_3[4] = ((float32
)compute_3[4] + ((float32*)pad_temp.shared[14](float32)kernel.shared[((threadIdx.x48) + 4)]))
compute_3[5] = ((float32
)compute_3[5] + ((float32*)pad_temp.shared[6](float32)kernel.shared[((threadIdx.x48) + 1)]))
compute_3[5] = ((float32
)compute_3[5] + ((float32*)pad_temp.shared[15](float32)kernel.shared[((threadIdx.x48) + 4)]))
compute_3[6] = ((float32
)compute_3[6] + ((float32*)pad_temp.shared[7](float32)kernel.shared[((threadIdx.x48) + 1)]))
compute_3[6] = ((float32
)compute_3[6] + ((float32*)pad_temp.shared[16](float32)kernel.shared[((threadIdx.x48) + 4)]))
compute_3[7] = ((float32
)compute_3[7] + ((float32*)pad_temp.shared[1](float32)kernel.shared[((threadIdx.x48) + 25)]))
compute_3[7] = ((float32
)compute_3[7] + ((float32*)pad_temp.shared[10](float32)kernel.shared[((threadIdx.x48) + 28)]))
compute_3[8] = ((float32
)compute_3[8] + ((float32*)pad_temp.shared[2](float32)kernel.shared[((threadIdx.x48) + 25)]))
compute_3[8] = ((float32
)compute_3[8] + ((float32*)pad_temp.shared[11](float32)kernel.shared[((threadIdx.x48) + 28)]))
compute_3[9] = ((float32
)compute_3[9] + ((float32*)pad_temp.shared[3](float32)kernel.shared[((threadIdx.x48) + 25)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[8]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[17]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[8]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[17]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[18]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[27]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[18]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[27]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[26]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[35]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[26]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[35]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[36]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[45]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[36]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[45]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[44]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[53]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[44]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[53]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[54]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[63]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[54]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[63]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[62]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[71]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[62]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[71]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
}
}
for (i1.inner: int32, 0, 2) {
for (i3.inner: int32, 0, 7) {
compute_2[(((((floordiv(blockIdx.x, 7)*6272) + (threadIdx.x*98)) + (i1.inner*49)) + (floormod(blockIdx.x, 7)*7)) + i3.inner)] = max(((float32*)compute_3[((i1.inner*7) + i3.inner)] + (float32*)bias_2[(((floordiv(blockIdx.x, 7)*128) + (threadIdx.x*2)) + i1.inner)]), 0f32)
}
}
}
}
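The write-back loop that closes the lowered IR maps each (block, thread, i1.inner, i3.inner) combination to a flat offset in the NCHW output. As a sanity check, the plain-Python sketch below enumerates those offsets and confirms that they cover each of the 512 x 7 x 7 output elements exactly once. The launch extents (28 blocks, 64 threads per block) are assumptions inferred from the split factors of this schedule, and the helper name `output_offset` is made up for illustration.

```python
# Illustrative only: re-derives the output offsets of the write-back loop.
# BLOCKS and THREADS are assumed launch extents (28 = 4 channel groups * 7 rows,
# 64 threads per block); they are not read from TVM.
BLOCKS, THREADS = 28, 64
CO, H, W = 512, 7, 7


def output_offset(block, thread, i1_inner, i3_inner):
    """Flat NCHW offset written by one (block, thread, i1.inner, i3.inner)."""
    return (
        (block // 7) * 6272   # 128 output channels per block group * 49 pixels
        + thread * 98         # 2 output channels per thread * 49 pixels
        + i1_inner * 49       # which of the thread's 2 channels
        + (block % 7) * 7     # output row handled by this block
        + i3_inner            # output column
    )


offsets = [
    output_offset(b, t, i1, i3)
    for b in range(BLOCKS)
    for t in range(THREADS)
    for i1 in range(2)
    for i3 in range(7)
]

# True when every output element is written exactly once.
covered_once = sorted(offsets) == list(range(CO * H * W))
print(covered_once)
```

Under these assumptions each block handles one output row for a group of 128 output channels, and each thread produces 2 channels x 7 columns of that row.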
Check correctness and evaluate performance
Build the binary, then check its correctness and measure its performance.
func = tvm.build(sch, args, target)

# Check correctness
data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)
conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)
out_np = np.maximum(conv_np + bias_np, 0.0)

ctx = tvm.gpu()
data_tvm = tvm.nd.array(data_np, ctx=ctx)
weight_tvm = tvm.nd.array(weight_np, ctx=ctx)
bias_tvm = tvm.nd.array(bias_np, ctx=ctx)
out_tvm = tvm.nd.empty(out_np.shape, ctx=ctx)
func(data_tvm, weight_tvm, bias_tvm, out_tvm)

# Check results
np.testing.assert_allclose(out_np, out_tvm.asnumpy(), rtol=1e-3)

# Evaluate execution time
evaluator = func.time_evaluator(func.entry_name, ctx, min_repeat_ms=500)
print(
    "Execution time of this operator: %.3f ms"
    % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
)
Output:
Execution time of this operator: 0.417 ms
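To put the measured time in context, it can be converted into arithmetic throughput. The sketch below counts two floating-point operations (one multiply, one add) per multiply-accumulate of the convolution, ignores the bias add and the ReLU, and plugs in the 0.417 ms figure printed above. The helper name `conv2d_flops` is hypothetical, and the result is only a rough estimate for this particular run.

```python
def conv2d_flops(n, h_out, w_out, co, ci, kh, kw):
    """FLOPs of a dense conv2d, counting each multiply-accumulate as 2."""
    return 2 * n * co * h_out * w_out * ci * kh * kw


# With stride 1 and padding 1, the 3x3 convolution keeps the 7x7 spatial size.
N, H, W, CO, CI, KH, KW = 1, 7, 7, 512, 512, 3, 3
flops = conv2d_flops(N, H, W, CO, CI, KH, KW)

elapsed_ms = 0.417  # execution time reported above
gflops_per_s = flops / (elapsed_ms * 1e-3) / 1e9

print(f"{flops} FLOPs, about {gflops_per_s:.0f} GFLOP/s")
```

The time will differ on other GPUs and between tuning runs, so the throughput should be recomputed from the locally measured value.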
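Using the record file
During the search, all measurement records are dumped into the record file "conv2d.json". The measurement records can be used to re-apply search results, resume the search, and perform other analyses.
The following example loads the best schedule from the file and prints the equivalent Python schedule API calls and the CUDA source code. Both are useful for debugging and for learning how the auto-scheduler behaves.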
print("Equivalent python schedule:")
print(task.print_best(log_file, print_mode="schedule"))

print("CUDA source code:")
print(task.print_best(log_file, print_mode="cuda"))
Output:
Equivalent python schedule:
pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis)
compute_nn, compute_ff, compute_yy, compute_xx, compute_rc, compute_ry, compute_rx = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis)
compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
s[T_add].compute_inline()
compute_nn_o_i, compute_nn_i = s[compute].split(compute_nn, factor=1)
compute_nn_o_o_i, compute_nn_o_i = s[compute].split(compute_nn_o_i, factor=1)
compute_nn_o_o_o_i, compute_nn_o_o_i = s[compute].split(compute_nn_o_o_i, factor=1)
compute_nn_o_o_o_o, compute_nn_o_o_o_i = s[compute].split(compute_nn_o_o_o_i, factor=1)
compute_ff_o_i, compute_ff_i = s[compute].split(compute_ff, factor=1)
compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=2)
compute_ff_o_o_o_i, compute_ff_o_o_i = s[compute].split(compute_ff_o_o_i, factor=64)
compute_ff_o_o_o_o, compute_ff_o_o_o_i = s[compute].split(compute_ff_o_o_o_i, factor=1)
compute_yy_o_i, compute_yy_i = s[compute].split(compute_yy, factor=1)
compute_yy_o_o_i, compute_yy_o_i = s[compute].split(compute_yy_o_i, factor=1)
compute_yy_o_o_o_i, compute_yy_o_o_i = s[compute].split(compute_yy_o_o_i, factor=1)
compute_yy_o_o_o_o, compute_yy_o_o_o_i = s[compute].split(compute_yy_o_o_o_i, factor=1)
compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=1)
compute_xx_o_o_i, compute_xx_o_i = s[compute].split(compute_xx_o_i, factor=7)
compute_xx_o_o_o_i, compute_xx_o_o_i = s[compute].split(compute_xx_o_o_i, factor=1)
compute_xx_o_o_o_o, compute_xx_o_o_o_i = s[compute].split(compute_xx_o_o_o_i, factor=1)
compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=2)
compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=4)
compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=1)
compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=1)
compute_rx_o_i, compute_rx_i = s[compute].split(compute_rx, factor=1)
compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=3)
s[compute].reorder(compute_nn_o_o_o_o, compute_ff_o_o_o_o, compute_yy_o_o_o_o, compute_xx_o_o_o_o, compute_nn_o_o_o_i, compute_ff_o_o_o_i, compute_yy_o_o_o_i, compute_xx_o_o_o_i, compute_nn_o_o_i, compute_ff_o_o_i, compute_yy_o_o_i, compute_xx_o_o_i, compute_rc_o_o, compute_ry_o_o, compute_rx_o_o, compute_rc_o_i, compute_ry_o_i, compute_rx_o_i, compute_nn_o_i, compute_ff_o_i, compute_yy_o_i, compute_xx_o_i, compute_rc_i, compute_ry_i, compute_rx_i, compute_nn_i, compute_ff_i, compute_yy_i, compute_xx_i)
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=64)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[compute].compute_at(s[compute], compute_i3_o_i)
kernel_shared = s.cache_read(kernel, "shared", [compute])
kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis)
s[kernel_shared].compute_at(s[compute], compute_rx_o_o)
pad_temp_shared = s.cache_read(pad_temp, "shared", [compute])
pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis)
s[pad_temp_shared].compute_at(s[compute], compute_rx_o_o)
s[pad_temp].compute_inline()
compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o)
s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x"))
compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i)
s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread"))
compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
s[compute].pragma(compute_nn_o_o_o_o, "auto_unroll_max_step", 512)
s[compute].pragma(compute_nn_o_o_o_o, "unroll_explicit", True)
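
The split factors in the schedule above determine the CUDA launch configuration and the tile sizes. The sketch below is a back-of-the-envelope reading of those factors in plain Python, not a description of how TVM infers bounds internally. It assumes every split divides evenly, that the fused `*_o_o_o`, `*_o_o_i` and `*_o_i` axes map to `blockIdx.x`, `vthread` and `threadIdx.x` as in the `bind` calls, and that the shared-memory caches hold one (rc, ry, rx) reduction chunk at a time. The helper `split_extents` is hypothetical.

```python
from math import prod


def split_extents(extent, factors):
    """Apply successive split(axis, factor=f) calls to an axis of length `extent`.

    Each split peels an inner loop of length f off the current outer loop.
    Returns the loop extents ordered [outermost, ..., innermost].
    """
    inner = []
    outer = extent
    for f in factors:
        assert outer % f == 0, "this sketch only handles exact splits"
        inner.append(f)
        outer //= f
    return [outer] + inner[::-1]


# Split factors copied from the printed schedule (output stage, axes i0..i3).
i0 = split_extents(1, [1, 1, 1])      # batch
i1 = split_extents(512, [2, 64, 1])   # output channels
i2 = split_extents(7, [1, 1, 1])      # output rows
i3 = split_extents(7, [7, 1, 1])      # output columns
axes = [i0, i1, i2, i3]

blocks = prod(a[0] for a in axes)      # fused *_o_o_o -> blockIdx.x
vthreads = prod(a[1] for a in axes)    # fused *_o_o_i -> vthread
threads = prod(a[2] for a in axes)     # fused *_o_i   -> threadIdx.x
per_thread = prod(a[3] for a in axes)  # innermost *_i -> outputs per thread

# Reduction axes: the outermost extent is the loop count, the rest is the
# chunk that is resident in shared memory during one iteration.
rc = split_extents(512, [2, 4])
ry = split_extents(3, [1, 1])
rx = split_extents(3, [1, 3])
rc_chunk, ry_chunk, rx_chunk = (prod(r[1:]) for r in (rc, ry, rx))

channels_per_block = prod(i1[1:])
rows_per_block = prod(i2[1:])
cols_per_block = prod(i3[1:])

kernel_tile = channels_per_block * rc_chunk * ry_chunk * rx_chunk
pad_tile = (
    rc_chunk
    * (rows_per_block + ry_chunk - 1)
    * (cols_per_block + rx_chunk - 1)
)

print(blocks, threads, per_thread, kernel_tile, pad_tile)  # 28 64 14 3072 72
```

These values line up with the generated kernel: 14 accumulators per thread (`compute1[14]`), 72 floats in `pad_temp_shared`, 3072 floats in `kernel_shared`, and outer reduction loops of 64 (`rc_outer_outer`) and 3 (`ry_outer_outer`) iterations.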

CUDA source code:
extern "C" __global__ void default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
float compute1[14];
__shared__ float pad_temp_shared[72];
__shared__ float kernel_shared[3072];
compute1[(0)] = 0.000000e+00f;
compute1[(1)] = 0.000000e+00f;
compute1[(2)] = 0.000000e+00f;
compute1[(3)] = 0.000000e+00f;
compute1[(4)] = 0.000000e+00f;
compute1[(5)] = 0.000000e+00f;
compute1[(6)] = 0.000000e+00f;
compute1[(7)] = 0.000000e+00f;
compute1[(8)] = 0.000000e+00f;
compute1[(9)] = 0.000000e+00f;
compute1[(10)] = 0.000000e+00f;
compute1[(11)] = 0.000000e+00f;
compute1[(12)] = 0.000000e+00f;
compute1[(13)] = 0.000000e+00f;
for (int rc_outer_outer = 0; rc_outer_outer < 64; ++rc_outer_outer) {
for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) {
__syncthreads();
if (((int)threadIdx.x) < 18) {
pad_temp_shared[((((int)threadIdx.x) * 4))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= ((((int)threadIdx.x) * 4) % 9))) && (((((int)threadIdx.x) * 4) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + (((((int)threadIdx.x) * 4) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + ((((int)threadIdx.x) * 4) % 9)) - 8))] : 0.000000e+00f);
}
if (((int)threadIdx.x) < 18) {
pad_temp_shared[(((((int)threadIdx.x) * 4) + 1))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 1) % 9))) && ((((((int)threadIdx.x) * 4) + 1) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 1) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 1) % 9)) - 8))] : 0.000000e+00f);
}
if (((int)threadIdx.x) < 18) {
pad_temp_shared[(((((int)threadIdx.x) * 4) + 2))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 2) % 9))) && ((((((int)threadIdx.x) * 4) + 2) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 2) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 2) % 9)) - 8))] : 0.000000e+00f);
}
if (((int)threadIdx.x) < 18) {
pad_temp_shared[(((((int)threadIdx.x) * 4) + 3))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 3) % 9))) && ((((((int)threadIdx.x) * 4) + 3) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 3) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 3) % 9)) - 8))] : 0.000000e+00f);
}
kernel_shared[(((int)threadIdx.x))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)))];
kernel_shared[((((int)threadIdx.x) + 64))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 64) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 128))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 128) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 192))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36864))];
kernel_shared[((((int)threadIdx.x) + 256))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 256) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 320))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 320) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 384))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 73728))];
kernel_shared[((((int)threadIdx.x) + 448))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 448) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 512))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 512) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 576))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 110592))];
kernel_shared[((((int)threadIdx.x) + 640))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 640) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 704))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 704) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 768))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 147456))];
kernel_shared[((((int)threadIdx.x) + 832))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 832) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 896))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 896) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 960))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 184320))];
kernel_shared[((((int)threadIdx.x) + 1024))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1024) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1088))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1088) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1152))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 221184))];
kernel_shared[((((int)threadIdx.x) + 1216))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1216) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1280))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1280) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1344))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 258048))];
kernel_shared[((((int)threadIdx.x) + 1408))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1408) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1472))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1472) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1536))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 294912))];
kernel_shared[((((int)threadIdx.x) + 1600))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1600) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1664))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1664) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1728))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 331776))];
kernel_shared[((((int)threadIdx.x) + 1792))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1792) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1856))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1856) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 1920))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 368640))];
kernel_shared[((((int)threadIdx.x) + 1984))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1984) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2048))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2048) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2112))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 405504))];
kernel_shared[((((int)threadIdx.x) + 2176))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2176) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2240))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2240) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2304))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 442368))];
kernel_shared[((((int)threadIdx.x) + 2368))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2368) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2432))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2432) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2496))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 479232))];
kernel_shared[((((int)threadIdx.x) + 2560))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2560) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2624))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2624) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2688))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 516096))];
kernel_shared[((((int)threadIdx.x) + 2752))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2752) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2816))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2816) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
kernel_shared[((((int)threadIdx.x) + 2880))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 552960))];
kernel_shared[((((int)threadIdx.x) + 2944))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2944) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
kernel_shared[((((int)threadIdx.x) + 3008))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 3008) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
__syncthreads();
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(0)] * kernel_shared[((((int)threadIdx.x) * 48))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(9)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(1)] * kernel_shared[((((int)threadIdx.x) * 48))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(2)] * kernel_shared[((((int)threadIdx.x) * 48))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(3)] * kernel_shared[((((int)threadIdx.x) * 48))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(4)] * kernel_shared[((((int)threadIdx.x) * 48))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(5)] * kernel_shared[((((int)threadIdx.x) * 48))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(6)] * kernel_shared[((((int)threadIdx.x) * 48))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(0)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(9)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(8)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(17)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(8)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(17)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(18)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(27)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(18)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(27)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(26)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(35)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(26)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(35)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(36)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(45)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(36)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(45)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(44)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(53)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(44)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(53)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(54)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(63)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(54)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(63)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(62)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(71)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(62)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(71)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
}
}
for (int i1_inner = 0; i1_inner < 2; ++i1_inner) {
for (int i3_inner = 0; i3_inner < 7; ++i3_inner) {
compute[(((((((((int)blockIdx.x) / 7) * 6272) + (((int)threadIdx.x) * 98)) + (i1_inner * 49)) + ((((int)blockIdx.x) % 7) * 7)) + i3_inner))] = max((compute1[(((i1_inner * 7) + i3_inner))] + bias[(((((((int)blockIdx.x) / 7) * 128) + (((int)threadIdx.x) * 2)) + i1_inner))]), 0.000000e+00f);
}
}
}
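The index arithmetic in the kernel above can be sanity-checked outside the GPU. The following is a minimal Python sketch, not part of the original tutorial. It assumes that the unrolled multiply-accumulate statements follow a regular loop nest over `ff` (2 output channels per thread), `rc` (8 input channels per tile), `rx` (3 kernel columns) and `x` (7 output columns); these loop names are hypothetical labels chosen for illustration. It also assumes launch extents of 28 blocks and 64 threads, which are inferred from the index expressions and the 512 x 7 x 7 output shape rather than stated in this part of the listing.

```python
from itertools import product

CO, H, W = 512, 7, 7   # output shape of the search task (N = 1)
NUM_BLOCKS = 28        # assumption: 4 channel groups x 7 output rows
NUM_THREADS = 64       # assumption: each thread owns 2 output channels


def mac_operands():
    """(compute1 index, pad_temp_shared index, kernel_shared offset) per thread.

    The kernel_shared offset is relative to threadIdx.x * 48.
    """
    return {
        (ff * 7 + x, rc * 9 + x + rx, ff * 24 + rc * 3 + rx)
        for ff, rc, rx, x in product(range(2), range(8), range(3), range(7))
    }


def output_index(block, thread, i1_inner, i3_inner):
    """Flat index into `compute`, copied from the write-back statement."""
    return ((block // 7) * 6272 + thread * 98 + i1_inner * 49
            + (block % 7) * 7 + i3_inner)


def bias_index(block, thread, i1_inner):
    """Index into `bias`, copied from the write-back statement."""
    return (block // 7) * 128 + thread * 2 + i1_inner


operands = mac_operands()

written = []
for block, thread, i1, i3 in product(
    range(NUM_BLOCKS), range(NUM_THREADS), range(2), range(7)
):
    flat = output_index(block, thread, i1, i3)
    channel = flat // (H * W)
    # The bias added to an element must belong to that element's channel.
    assert channel == bias_index(block, thread, i1)
    written.append(flat)

# True when each of the 512 * 7 * 7 outputs is written exactly once.
covers_all = sorted(written) == list(range(CO * H * W))
print(len(operands), covers_all)
```

Under these assumptions, each thread performs 336 multiply-accumulates per shared-memory tile, and the write-back loop touches every output element exactly once. Statements from the listing, such as the one combining `compute1[(7)]`, `pad_temp_shared[(10)]` and offset 28, can be looked up in `operands` as `(7, 10, 28)`.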
A more involved example is resuming a search. In this case, you need to create the search policy and cost model yourself and restore their state from the log file. The example below restores the state and runs 5 more trials.
cost_model = auto_scheduler.XGBModel()
cost_model.update_from_file(log_file)
search_policy = auto_scheduler.SketchPolicy(
    task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
)
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=5,
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option, search_policy=search_policy)

# Kill the measurement process
del measure_ctx
Output:
Get devices for measurement successfully!
Total running time of the script: (1 minute 34.433 seconds)
https://tvm.apache.org/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py

日本精品视频在线播放 | 亚洲精品在线观看中文字幕 | 99色精品视频 | 久久精品国产99国产 | 福利一区视频 | 午夜国产福利视频 | 成人影音av | 成人a免费视频 | 亚洲黄色免费电影 | 国产精品美女视频网站 | 国产手机视频在线观看 | 91成人免费在线 | 美女在线免费观看视频 | 人人爽人人爽 | 亚洲精品videossex少妇 | 亚洲国产精品一区二区尤物区 | 视频二区 | 国产成人精品免高潮在线观看 | 精品99久久 | 亚洲黄色免费观看 | 亚洲va男人天堂 | 日本中文字幕免费观看 | 中文字幕一区二区三区久久蜜桃 | 日韩精品在线播放 | 久久草视频 | 黄色a在线| 五月婷婷综合久久 | 国产在线日本 | 久草免费看 | 一区二区三区在线视频111 | 二区三区视频 | 亚洲国产精彩中文乱码av | 久久综合九色九九 | 最新国产中文字幕 | 精品久久一区 | 精品免费久久久久 | av中文天堂 | 国产手机在线观看视频 | 欧美日韩高清一区二区 国产亚洲免费看 | 在线一区观看 | 国产精品成久久久久 | 久热电影 | 婷婷丁香社区 | 丁香在线视频 | 精品一区精品二区高清 | 在线观看视频国产一区 | 91chinesexxx| 亚洲少妇激情 | 国产精品久久久久久久婷婷 | 精品国产亚洲一区二区麻豆 | 超碰在线免费97 | 日韩毛片一区 | 久久久久亚洲精品成人网小说 | 中文字幕在线观看日本 | 日韩精品在线看 | 在线v片免费观看视频 | 少妇视频一区 | 毛片网免费 | 色欧美成人精品a∨在线观看 | 日韩超碰在线 | 国产视频在线免费观看 | 五月天六月色 | 久久国产精品免费看 | 天天碰天天操 | 97精品视频在线播放 | 日韩在线视频二区 | 精品亚洲视频在线 | 免费a视频在线 | 日韩国产欧美在线播放 | 精品久久九九 | 亚洲免费高清视频 | 免费看的av片 | 玖玖在线播放 | 日韩在线三级 | 99久久精品视频免费 | 国产精品综合久久久久久 | av在线电影播放 | 亚洲视频 中文字幕 | 在线观看亚洲精品 | 国产精品女人久久久 | 亚洲a免费| 婷婷丁香色综合狠狠色 | 中文字幕资源网 国产 | 日日草av| 视频一区二区精品 | 日日操网站 | 天堂av一区二区 | 久久久久久久久久久久国产精品 | 黄色网免费| 日日干干夜夜 | 亚洲每日更新 | 午夜精品99久久免费 | 欧美性色黄大片在线观看 | 久久天堂网站 | 日韩一区精品 | 91成人精品一区在线播放 | 99色视频在线 | 久久不卡国产精品一区二区 | 免费在线激情电影 | 91 中文字幕 | 99草在线视频 | 一本一本久久aa综合精品 | 亚洲精选久久 | 在线免费黄色av | 在线国产99 | 国产男男gay做爰 | 国产美女久久久 | 操操操日日 | 黄污视频网站大全 | 免费毛片aaaaaa | 国产又黄又爽无遮挡 | 日本婷婷色 | 欧美va天堂va视频va在线 | 91麻豆精品国产91久久久久久 | 高潮久久久久久久久 | 国产精品福利无圣光在线一区 | 波多野结衣资源 | 久久99国产精品久久99 | 久久久久久久综合色一本 | 人人艹人人 | 成人三级黄色 | 久久久久久久久国产 | www.亚洲精品视频 | 亚洲精品综合一二三区在线观看 | 欧美老女人xx | 成人在线免费视频 | 久久国产三级 | 国内精品久久久久影院优 | 黄色av电影 | 国产小视频在线观看免费 | 欧美做受高潮电影o | 超碰国产人人 | 精品自拍网 | 一区二区三区精品在线 | 在线看黄网站 | 欧美一级激情 | 久久精品中文字幕一区二区三区 | 成人久久久精品国产乱码一区二区 | 91色吧| 欧美一区二区伦理片 | 丁香花五月 | 麻豆免费在线视频 | 国产精品成人久久久 | 五月花婷婷 | 69av国产| 日韩av电影网站在线观看 | 国产乱对白刺激视频在线观看女王 | 一区二区三区在线不卡 | 在线视频观看你懂的 | 999久久国精品免费观看网站 | 中文乱幕日产无线码1区 | 热re99久久精品国产99热 | 久久伊人精品一区二区三区 | 99久久夜色精品国产亚洲96 | 美女网站免费福利视频 | 国产在线观看国语版免费 | 在线观看黄色免费视频 | 国产免费观看高清完整版 | 麻豆视频网址 | 日韩在线二区 | 又粗又长又大又爽又黄少妇毛片 | 日韩一级片大全 | 亚洲人人网 | 国产色在线视频 | www.久久成人 | 中文字幕 国产精品 | 综合色伊人 | 青青久草在线视频 | 久久视频精品在线 | 欧美日韩一二三四区 | 色先锋av资源中文字幕 | 
韩国精品一区二区三区六区色诱 | 欧美在线观看视频一区二区 | 欧美 亚洲 另类 激情 另类 | 亚洲综合小说 | 精品国产理论片 | 国内精品久久久久久久影视简单 | 亚洲综合色婷婷 | 国产免费亚洲 | 国产成人一区二区三区 | 色婷婷在线观看视频 | 免费黄色网址网站 | 国产三级视频在线 | 国产亚洲精品女人久久久久久 | 91视视频在线直接观看在线看网页在线看 | 亚洲无吗av| 久久夜色精品国产欧美乱 | 久草在线中文视频 | 五月天国产 | 免费精品久久久 | 日韩欧美99 | 欧美午夜性生活 | 国产69精品久久久久9999apgf | 97人人精品 | 狠狠色丁香久久综合网 | 久久国产精品99国产精 | 国产高清视频在线播放 | 99精品免费久久久久久久久日本 | 免费亚洲黄色 | 96看片 | 亚洲三区在线 | 国产精品成人免费一区久久羞羞 | 丝袜美女在线观看 | 天天射天天干天天爽 | 精品视频在线免费观看 | 免费色视频网站 | 国产日韩精品久久 | av大全在线免费观看 | 一级黄色免费 | 91香蕉视频污在线 | 日韩最新在线 | 精品免费久久久久久 | 免费进去里的视频 | av免费看网站 | 久久国产精品免费一区二区三区 | 国产一区二区网址 | 国产精品免费久久久久久久久久中文 | 日本中文字幕久久 | 99久久日韩精品视频免费在线观看 | 国产成人免费高清 | 超碰.com| 国产精品女人久久久久久 | 国产1区在线观看 | 色妞色视频一区二区三区四区 | 久久久精品成人 | 91热精品视频| 久久久精品在线观看 | 亚洲精品videossex少妇 | 18国产精品白浆在线观看免费 | 国产精品久久久久久久婷婷 | 高清一区二区 | 久久国产精品99久久久久久丝袜 | 麻豆视频在线播放 | 午夜视频黄| 精品国偷自产在线 | 黄色h在线观看 | 中文字幕在线专区 | 国产精品国产三级国产不产一地 | 91av九色 | www.97视频| 免费亚洲视频 | 国产中文字幕国产 | 国产一二三区在线观看 | 五月激情婷婷丁香 | 超碰免费av | 国产精品日韩高清 | 久久久午夜视频 | 欧美一区二区三区不卡 | av电影中文字幕在线观看 | 久草视频看看 | 国产一级大片在线观看 | 天堂av官网 | 超碰成人免费电影 | 91麻豆精品91久久久久同性 | 欧美性天天 | 亚洲精品午夜aaa久久久 | 成人一级视频在线观看 | 中文字幕在线播放av | 91精品视频在线观看免费 | 福利网址在线观看 | 韩国一区二区三区视频 | a级成人毛片 | 九月婷婷人人澡人人添人人爽 | 在线观看v片 | 亚洲h色精品| 黄网站色欧美视频 | 青青色影院 | 国产最新福利 | 日韩午夜电影网 | 在线观看一级片 | 麻豆影视在线观看 | av黄色在线观看 | 亚洲国产无| 精品免费久久久久 | 久久综合狠狠综合 | 在线观看视频97 | 99在线视频免费观看 | 国产不卡av在线播放 | 天天添夜夜操 | 久久久久久久久黄色 | 久久久高清 | 免费av电影网站 | 久久精品电影院 | 久久天天躁夜夜躁狠狠85麻豆 | 久久永久免费 | 日日夜夜av| 制服丝袜成人在线 | 三三级黄色片之日韩 | 日韩av在线网站 | 日韩高清毛片 | 91片黄在线观看动漫 | 精品国产乱码久久久久久久 | 国产录像在线观看 | 亚洲欧美在线视频免费 | 国产一级免费av | 婷婷av在线 | 久久五月婷婷丁香社区 | 五月激情站 | 久久超碰网 | 激情xxxx| 中文字幕在线免费观看 | 亚洲一区二区视频 | 五月天久久久久久 | 麻豆你懂的 | 不卡电影一区二区三区 | 成人在线播放免费观看 | 久久成人精品电影 | 91九色综合| zzijzzij亚洲成熟少妇 | 精品国产综合区久久久久久 | 黄色精品在线看 | 99精品免费久久久久久久久日本 | 午夜久久久久久久 | 成人一区二区在线 | 色综合天天狠天天透天天伊人 | 丁香网婷婷 | 香蕉在线观看 | 麻豆久久一区 | 99久久精品免费看国产四区 | www.在线观看视频 | 国产美女精品视频免费观看 | 中文字幕丝袜一区二区 | 国产理论片在线观看 | 992tv在线| 一区在线免费观看 | 亚洲天堂激情 | 亚洲一区二区三区毛片 | 成人app在线播放 | 成人免费在线观看av | 99久久夜色精品国产亚洲96 | 黄色av电影网 | 在线观看91精品视频 | 免费在线激情电影 | av大片免费在线观看 | 综合天天色 | 久久超碰免费 | 国产99久久九九精品 | 日韩.com| 久久久久久片 | 欧美国产高清 | 久久精品中文视频 | 
国产精品午夜久久 | 国产又黄又爽又猛视频日本 | 一本一本久久a久久精品综合 | 天天干,天天射,天天操,天天摸 | 99热精品国产一区二区在线观看 | 午夜精品久久久 | 天天性天天草 | www.狠狠干 | 亚洲视频大全 | 亚洲国产精品影院 | www黄色大片 | 欧美夫妻生活视频 | 色婷婷88av视频一二三区 | 国产色a在线观看 | 国产精品第三页 | 国产精品综合久久久久久 | 天天婷婷| 国产在线观看h | 日韩电影久久久 | 国产v在线播放 | 欧美性另类 | 久久精彩免费视频 | 伊人激情网| 少妇18xxxx性xxxx片 | 91男人影院 | 在线观看av国产 | 色干干 | 国产中文字幕精品 | 天天se天天cao天天干 | 久久草在线免费 | 热久久最新地址 | 日韩视频免费在线观看 | 最新国产福利 | 久久99精品国产91久久来源 | 精品欧美一区二区精品久久 | 日韩伦理一区二区三区av在线 | 国产精品久久久久三级 | 精品99999 | 国产97在线播放 | 国产日韩精品在线 | 97超碰成人在线 | 日韩视频中文字幕在线观看 | 一区二区三区播放 | 免费在线观看一区 | 在线看小早川怜子av | 中文字幕韩在线第一页 | 精品久久一区二区 | 久久精品一区二 | 69精品 | 在线视频你懂得 | 中文字幕资源在线观看 | 国产香蕉视频在线观看 | 国产一级片久久 | 午夜久久网站 | 婷婷在线网| 国产在线中文字幕 | 久久久久久久久久久久久9999 | 在线色吧| 成人免费观看电影 | 色妞色视频一区二区三区四区 | 97视频免费在线 | 91精品视频在线观看免费 | 日本护士三级少妇三级999 | 精品国内自产拍在线观看视频 | 香蕉视频在线播放 | 人人添人人澡 | 久久久.com | 99久久婷婷国产综合亚洲 | 国产精品久久久久一区二区国产 | av资源在线看 | 久久99国产视频 | 国产精品久久久久久久婷婷 | 国产在线一线 | 国产黄大片在线观看 | 美女黄视频免费看 | 在线观看一区视频 | 四虎国产精品免费 | 亚洲视频综合在线 | 五月婷婷免费 | 日韩在线免费电影 | 欧美福利在线播放 | 国产免费又粗又猛又爽 | 午夜久久影视 | 在线国产能看的 | 日韩黄色大片在线观看 | 在线视频 影院 | 国产精久久 | 美女久久久 | 国产精品久久久久久一二三四五 | 又色又爽又黄高潮的免费视频 | 欧美尹人| 尤物97国产精品久久精品国产 | 欧美视频www| 婷婷在线五月 | 国产又粗又猛又黄又爽视频 | 丁香电影小说免费视频观看 | 国产精品久久久免费看 | 亚洲免费永久精品国产 | 韩国av一区二区三区在线观看 | 日韩网| 9在线观看免费高清完整 | 免费av黄色 | www黄色软件 | 美女视频久久久 | 国产免费视频一区二区裸体 | 久久精品—区二区三区 | 亚洲精选视频在线 | 国产高清第一页 | 99精品一区| 国产探花| 在线精品亚洲一区二区 | 成年人在线观看免费视频 | 日韩欧美在线影院 | 狠狠色丁香婷婷综合最新地址 | 日韩免费在线 | 一区二区三区电影 | 91大神视频网站 | 国产99视频在线观看 | 国产精品午夜8888 | 日韩久久精品一区二区三区下载 | 中文 一区二区 | 国产成人在线免费观看 | 国产精品99精品久久免费 | 一区二区视频免费在线观看 | 激情av在线播放 | 亚洲人在线7777777精品 | 在线 你懂| 精品三级av | 久久精品99 | 999精品| 涩涩网站在线观看 | 九九精品久久 | 久久久99精品免费观看 | 日韩免费在线观看视频 | 中文字幕免费成人 | www.婷婷色 | 日韩免费视频网站 | 免费高清在线观看成人 | 99精品久久久 | 亚洲精品久久激情国产片 | 亚洲国产中文字幕 | 久久久久久久久久久久电影 | 四虎国产精品免费观看视频优播 | 久久久久高清 | 国产精品1区2区 | 99久久视频| 粉嫩av一区二区三区免费 | 美女av在线免费 | 久久成人一区二区 |