Auto-tuning a Convolutional Network for Mobile GPU
Auto-tuning for a specific device is critical for getting the best performance. This tutorial shows how to tune a whole convolutional network.
The operator implementations for Mobile GPU in TVM are written in template form. The templates have many tunable knobs (tile factor, vectorization, unrolling, etc.). We will tune all convolution, depthwise convolution and dense operators in the neural network. After tuning, we produce a log file which stores the best knob values for all required operators. When the TVM compiler compiles these operators, it will query this log file to get the best knob values.
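Conceptually, the knob values of a template span a search space that is the Cartesian product of each knob's choices, and the tuner searches this space for the fastest configuration. A minimal, TVM-free sketch of that idea (the knob names here are illustrative, not AutoTVM's actual knobs):

```python
import itertools

# Hypothetical knobs for a conv2d-like template: tiling factor, vectorize, unroll.
knobs = {
    "tile_x": [1, 2, 4, 8],
    "vectorize": [0, 1],
    "unroll": [0, 1],
}

# The config space is the Cartesian product of all knob choices.
space = [dict(zip(knobs, values)) for values in itertools.product(*knobs.values())]
print(len(space))  # 4 * 2 * 2 = 16 candidate configurations
```

The tuner's job is to find the best entry of `space` with as few on-device measurements as possible.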
We also released pre-tuned parameters for some ARM devices. See the Mobile GPU Benchmark for reference.
Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you will need to wrap the body of this tutorial in a if __name__ == "__main__": block.
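The wrapping mentioned above looks like the following sketch (here `main()` is a hypothetical function standing in for the tutorial body):

```python
def main():
    # ... the tutorial code goes here (imports can stay at module level) ...
    return "tuned"

# Required on Windows / recent macOS: multiprocessing spawns fresh
# interpreter processes that re-import this module, so the top-level
# code must not re-run on import.
if __name__ == "__main__":
    result = main()
```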
Install dependencies
To use the autotvm package in TVM, we need to install some extra dependencies (change "3" to "2" if you use python2):
pip3 install --user psutil xgboost tornado
To make TVM run faster during tuning, it is recommended to use cython as the FFI of TVM. In the root directory of TVM, execute (change "3" to "2" if you use python2):
pip3 install --user cython
sudo make cython3
Now return to the python code and import the packages.
import os
import numpy as np
import tvm
from tvm import relay, autotvm
import tvm.relay.testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.contrib.utils import tempdir
import tvm.contrib.graph_runtime as runtime
Define the network
First we need to define the network in the Relay frontend API. We can load some pre-defined networks from relay.testing. We can also load models from MXNet, ONNX and TensorFlow.
def get_network(name, batch_size):
    """Get the symbol definition and random weight of a network"""
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)

    if "resnet" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif "vgg" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.vgg.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "squeezenet_v1.1":
        mod, params = relay.testing.squeezenet.get_workload(
            batch_size=batch_size, version="1.1", dtype=dtype
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model

        block = get_model("resnet18_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)
    else:
        raise ValueError("Unsupported network: " + name)

    return mod, params, input_shape, output_shape
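The layer count for ResNet and VGG variants is parsed out of the network name string. A small standalone check of that naming convention:

```python
def parse_layer_count(name):
    # "resnet-18" -> 18, "vgg-16" -> 16
    return int(name.split("-")[1])

print(parse_layer_count("resnet-18"))  # 18
print(parse_layer_count("vgg-16"))     # 16
```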
Start RPC Tracker
TVM uses an RPC session to communicate with ARM boards. During tuning, the tuner will send the generated code to the board and measure the speed of the code on the board.
To scale up tuning, TVM uses an RPC Tracker to manage distributed devices. The RPC Tracker is a centralized controller node. We can register all devices to the tracker. For example, if we have 10 phones, we can register all of them to the tracker and run 10 measurements in parallel, accelerating the tuning process.
To start an RPC tracker, run this command on the host machine. The tracker is required during the whole tuning process, so we need to open a new terminal for this command:
python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
The expected output is
INFO:RPCTracker:bind to 0.0.0.0:9190
Register devices to the RPC Tracker
Now we can register our devices to the tracker. The first step is to build the TVM runtime for the ARM devices.
- For Linux: follow the section on building the TVM runtime on the device. Then register the device to the tracker by

  python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rk3399

  (replace [HOST_IP] with the IP address of your host machine)

- For Android: follow this readme page to install the TVM RPC APK on the android device. Make sure you can pass the android RPC test; then you have already registered your device. During tuning, you have to go to developer options and enable "Keep screen awake during changing" and charge your phone to make it stable.
After registering devices, we can confirm it by querying the rpc_tracker:
python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
For example, if we have 2 Huawei mate10 pro, 11 Raspberry Pi 3B and 2 rk3399, the output can be
Queue Status
----------------------------------
key          total  free  pending
----------------------------------
mate10pro    2      2     0
rk3399       2      2     0
rpi3b        11     11    0
----------------------------------
You can register multiple devices to the tracker to accelerate the measurement in tuning.
Set tuning options
Before tuning, we should apply some configurations. Here we use an RK3399 board as the example. In your setting, you should modify the target and device_key accordingly. Set use_android to True if you use an android phone.
#### DEVICE CONFIG ####
target = tvm.target.Target("opencl -device=mali")

# Replace "aarch64-linux-gnu" with the correct target of your board.
# This target host is used for cross compilation. You can query it by :code:`gcc -v` on your device.
target_host = "llvm -mtriple=aarch64-linux-gnu"

# Also replace this with the device key in your tracker
device_key = "rk3399"

# Set this to True if you use android phone
use_android = False

#### TUNING OPTION ####
network = "resnet-18"
log_file = "%s.%s.log" % (device_key, network)
dtype = "float32"
tuning_option = {
    "log_filename": log_file,
    "tuner": "xgb",
    "n_trial": 1000,
    "early_stopping": 450,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="ndk" if use_android else "default"),
        runner=autotvm.RPCRunner(
            device_key,
            host="0.0.0.0",
            port=9190,
            number=10,
            timeout=5,
        ),
    ),
}
Note
How to set tuning options
In general, the default values provided here work well. If you have enough time budget, you can set n_trial and early_stopping larger, which makes the tuning run for longer. If your device runs very slowly or your conv2d operators have many GFLOPs, consider setting the timeout larger.
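To see how `early_stopping` interacts with `n_trial`: tuning stops after `n_trial` measurements, or earlier if the best result has not improved for `early_stopping` consecutive trials. A simplified, TVM-free sketch of that stopping rule (not AutoTVM's exact implementation):

```python
def tune_loop(scores, n_trial=1000, early_stopping=450):
    """Scan at most n_trial scores; stop early after `early_stopping`
    trials without improvement. Returns (best score, trials used)."""
    best, best_at = float("-inf"), 0
    for i, s in enumerate(scores[:n_trial]):
        if s > best:
            best, best_at = s, i
        elif i - best_at >= early_stopping:
            break  # no improvement for `early_stopping` trials
    return best, i + 1

# Best found at trial 0, then no improvement: stops after ~450 trials,
# not the full 1000.
best, trials = tune_loop([5.0] + [1.0] * 999, early_stopping=450)
```

A larger `early_stopping` therefore trades extra tuning time for a smaller chance of stopping just before a better configuration would have been found.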
Begin tuning
Now we can extract tuning tasks from the network and begin tuning. Here we provide a simple utility function to tune a list of tasks. This function is just an initial implementation which tunes them in sequential order. We will introduce a more sophisticated tuning scheduler in the future.
You can skip the implementation of this function for this tutorial.
def tune_tasks(
    tasks,
    measure_option,
    tuner="xgb",
    n_trial=1000,
    early_stopping=None,
    log_filename="tuning.log",
    use_transfer_learning=True,
):
    # create tmp log file
    tmp_log_file = log_filename + ".tmp"
    if os.path.exists(tmp_log_file):
        os.remove(tmp_log_file)

    for i, tsk in enumerate(reversed(tasks)):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create tuner
        if tuner == "xgb" or tuner == "xgb-rank":
            tuner_obj = XGBTuner(tsk, loss_type="rank")
        elif tuner == "ga":
            tuner_obj = GATuner(tsk, pop_size=50)
        elif tuner == "random":
            tuner_obj = RandomTuner(tsk)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        if use_transfer_learning:
            if os.path.isfile(tmp_log_file):
                tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

        # do tuning
        tsk_trial = min(n_trial, len(tsk.config_space))
        tuner_obj.tune(
            n_trial=tsk_trial,
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
                autotvm.callback.log_to_file(tmp_log_file),
            ],
        )

    # pick best records to a cache file
    autotvm.record.pick_best(tmp_log_file, log_filename)
    os.remove(tmp_log_file)
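The final `autotvm.record.pick_best` step keeps only the best record per workload from the temporary log. A simplified stand-alone version of that idea (records here are plain dicts, not AutoTVM's actual log format):

```python
def pick_best(records):
    """Keep, per workload key, the record with the highest GFLOPS."""
    best = {}
    for r in records:
        k = r["workload"]
        if k not in best or r["gflops"] > best[k]["gflops"]:
            best[k] = r
    return list(best.values())

# Hypothetical measurements: two configs for conv2d_1, one for conv2d_2.
records = [
    {"workload": "conv2d_1", "gflops": 10.3},
    {"workload": "conv2d_1", "gflops": 25.9},
    {"workload": "conv2d_2", "gflops": 12.5},
]
kept = pick_best(records)  # conv2d_1 @ 25.9 and conv2d_2 @ 12.5
```

This is why the final log file is small even after a thousand trials per task: only one line per workload survives.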
Finally, we launch the tuning jobs and evaluate the end-to-end performance.
def tune_and_evaluate(tuning_opt):
    # extract workloads from relay program
    print("Extract tasks...")
    mod, params, input_shape, _ = get_network(network, batch_size=1)
    tasks = autotvm.task.extract_from_program(
        mod["main"],
        target=target,
        target_host=target_host,
        params=params,
        ops=(relay.op.get("nn.conv2d"),),
    )

    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_opt)

    # compile kernels with history best records
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(
                mod, target=target, params=params, target_host=target_host
            )

        # export library
        tmp = tempdir()
        if use_android:
            from tvm.contrib import ndk

            filename = "net.so"
            lib.export_library(tmp.relpath(filename), ndk.create_shared)
        else:
            filename = "net.tar"
            lib.export_library(tmp.relpath(filename))

        # upload module to device
        print("Upload...")
        remote = autotvm.measure.request_remote(device_key, "0.0.0.0", 9190, timeout=10000)
        remote.upload(tmp.relpath(filename))
        rlib = remote.load_module(filename)

        # upload parameters to device
        ctx = remote.context(str(target), 0)
        module = runtime.GraphModule(rlib["default"](ctx))
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input("data", data_tvm)

        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=30)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print(
            "Mean inference time (std dev): %.2f ms (%.2f ms)"
            % (np.mean(prof_res), np.std(prof_res))
        )


# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

# tune_and_evaluate(tuning_option)
Sample output
The tuning needs to compile many programs and extract features from them, so a high performance CPU is recommended. One sample output is listed below. It took about 3 hours on a 32T AMD Ryzen Threadripper.
Extract tasks…
Tuning…
[Task 1/17] Current/Best: 25.30/ 39.12 GFLOPS | Progress: (992/1000) | 751.22 s Done.
[Task 2/17] Current/Best: 40.70/ 45.50 GFLOPS | Progress: (736/1000) | 545.46 s Done.
[Task 3/17] Current/Best: 38.83/ 42.35 GFLOPS | Progress: (992/1000) | 1549.85 s Done.
[Task 4/17] Current/Best: 23.31/ 31.02 GFLOPS | Progress: (640/1000) | 1059.31 s Done.
[Task 5/17] Current/Best: 0.06/ 2.34 GFLOPS | Progress: (544/1000) | 305.45 s Done.
[Task 6/17] Current/Best: 10.97/ 17.20 GFLOPS | Progress: (992/1000) | 1050.00 s Done.
[Task 7/17] Current/Best: 8.98/ 10.94 GFLOPS | Progress: (928/1000) | 421.36 s Done.
[Task 8/17] Current/Best: 4.48/ 14.86 GFLOPS | Progress: (704/1000) | 582.60 s Done.
[Task 9/17] Current/Best: 10.30/ 25.99 GFLOPS | Progress: (864/1000) | 899.85 s Done.
[Task 10/17] Current/Best: 11.73/ 12.52 GFLOPS | Progress: (608/1000) | 304.85 s Done.
[Task 11/17] Current/Best: 15.26/ 18.68 GFLOPS | Progress: (800/1000) | 747.52 s Done.
[Task 12/17] Current/Best: 17.48/ 26.71 GFLOPS | Progress: (1000/1000) | 1166.40 s Done.
[Task 13/17] Current/Best: 0.96/ 11.43 GFLOPS | Progress: (960/1000) | 611.65 s Done.
[Task 14/17] Current/Best: 17.88/ 20.22 GFLOPS | Progress: (672/1000) | 670.29 s Done.
[Task 15/17] Current/Best: 11.62/ 13.98 GFLOPS | Progress: (736/1000) | 449.25 s Done.
[Task 16/17] Current/Best: 19.90/ 23.83 GFLOPS | Progress: (608/1000) | 708.64 s Done.
[Task 17/17] Current/Best: 17.98/ 22.75 GFLOPS | Progress: (736/1000) | 1122.60 s Done.
Compile…
Upload…
Evaluate inference time cost…
Mean inference time (std dev): 128.05 ms (7.74 ms)
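The final line above is produced by repeating the run 30 times and converting the measured seconds to milliseconds. The computation can be reproduced with numpy alone (the per-run times below are hypothetical, not from the log above):

```python
import numpy as np

# Hypothetical per-run times in seconds, as time_evaluator would return them.
results = [0.128, 0.131, 0.125]
prof_res = np.array(results) * 1000  # convert to millisecond
print(
    "Mean inference time (std dev): %.2f ms (%.2f ms)"
    % (np.mean(prof_res), np.std(prof_res))
)
# -> Mean inference time (std dev): 128.00 ms (2.45 ms)
```

Reporting the standard deviation alongside the mean matters on mobile GPUs, where thermal throttling can make individual runs noisy.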
Note
Experiencing difficulties?
The auto tuning module is error-prone. If you always see "0.00/ 0.00 GFLOPS", then there must be something wrong.
First, make sure you set the correct configuration of your device. Then, you can print debug information by adding these lines to the beginning of the script. It will print every measurement result, where you can find useful error messages.
import logging
logging.getLogger("autotvm").setLevel(logging.DEBUG)
Finally, always feel free to ask the community for help at https://discuss.tvm.apache.org