當前位置：首頁 > 人文社科 > 生活经验 >内容正文

生活经验

ARM CPU自动调度神经网络

發布時間：2023/11/28 生活经验 42 豆豆

生活随笔收集整理的這篇文章主要介紹了 ARM CPU自动调度神经网络小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

ARM CPU自動調度神經網絡
對特定設備和工作負載進行自動調度，對于獲得最佳性能至關重要。通過RPC使用自動調度器為ARM CPU調度整個神經網絡。
為了自動調度神經網絡，將網絡劃分為小的子圖，進行獨立調度。每個子圖被視為一個搜索任務。任務調度程序對時間進行分片，為這些任務動態分配時間資源。任務調度程序預測每個任務對端到端執行時間的影響，確定最大程度地減少執行時間的任務的優先級。
對于每個子圖，使用compute聲明tvm/python/topi，獲取張量表達式形式的計算DAG。使用自動調度器來構造此DAG的搜索空間，搜索良好的調度（低級優化）。
與手動模板定義搜索空間的基于模板的autotvm不同，自動調度程序不需要任何調度模板。換句話說，自動調度程序僅在其中使用計算聲明tvm/python/topi，而不使用現有的調度模板。
本文無法在Windows或最新版本的macOS上運行。要使其運行，需要將本文的內容包裝在一個if name == “main”:塊中。
import numpy as np

import tvm
from tvm import relay, auto_scheduler
import tvm.relay.testing
from tvm.contrib import graph_runtime
from tvm.contrib.utils import tempdir
定義網絡
首先，需要使用Relay中繼前端API定義網絡。從加載一些預定義的網絡tvm.relay.testing。還從MXNet，ONNX，PyTorch和TensorFlow加載模型。
對于卷積神經網絡，盡管自動調度程序在任何布局下正常工作，使用NHWC布局通常實現最佳性能。使用自動調度程序對NHWC布局實施了更多優化。建議將模型轉換為NHWC布局以使用自動調度程序。使用ConvertLayout傳遞在TVM中進行布局轉換。
def get_network(name, batch_size, layout=“NHWC”, dtype=“float32”):
“”“Get the symbol definition and random weight of a network”""

# auto-scheduler prefers NHWC layout
if layout == "NHWC":image_shape = (224, 224, 3)
elif layout == "NCHW":image_shape = (3, 224, 224)
else:raise ValueError("Invalid layout: " + layout)input_shape = (batch_size,) + image_shape
output_shape = (batch_size, 1000)if name.startswith("resnet-"):n_layer = int(name.split("-")[1])mod, params = relay.testing.resnet.get_workload(num_layers=n_layer,batch_size=batch_size,layout=layout,dtype=dtype,image_shape=image_shape,)
elif name.startswith("resnet3d-"):n_layer = int(name.split("-")[1])mod, params = relay.testing.resnet.get_workload(num_layers=n_layer,batch_size=batch_size,layout=layout,dtype=dtype,image_shape=image_shape,)
elif name == "mobilenet":mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape)
elif name == "squeezenet_v1.1":assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"mod, params = relay.testing.squeezenet.get_workload(version="1.1",batch_size=batch_size,dtype=dtype,image_shape=image_shape,)
elif name == "inception_v3":input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
elif name == "mxnet":# an example for mxnet modelfrom mxnet.gluon.model_zoo.vision import get_modelassert layout == "NCHW"block = get_model("resnet50_v1", pretrained=True)mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)net = mod["main"]net = relay.Function(net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs)mod = tvm.IRModule.from_expr(net)return mod, params, input_shape, output_shape

啟動RPC跟蹤器
TVM使用RPC會話與ARM板進行通信。在調度期間，調諧器會將生成的代碼發送到電路板上，測量電路板上的代碼速度。
為了擴大調度范圍，TVM使用RPC Tracker來管理分布式設備。RPC跟蹤器是一個集中式控制器節點。將所有設備注冊到跟蹤器。例如，如果有10部電話，全部注冊到跟蹤器，并行運行10次測量，從而加快了調諧過程。
要啟動RPC跟蹤器，在主機上運行此命令。在整個調度過程中都需要使用跟蹤器，因此需要為此命令打開一個新終端：
python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
預期的輸出是
INFO:RPCTracker:bind to 0.0.0.0:9190
將設備注冊到RPC跟蹤器
現在，將設備注冊到跟蹤器。第一步是為ARM設備構建TVM運行時。
? 對于Linux：遵循本節在設備上構建TVM運行時，以在設備上構建TVM運行時。然后通過以下方式將設備注冊到跟蹤器
? python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rasp4b-64
（用[HOST_IP]主機的IP地址代替）
? 對于Android：在Android設備上安裝TVM RPC APK。確保通過android rpc測試。已經注冊了設備。在調度過程中，必須轉到開發人員選項，啟用“更改時保持屏幕喚醒”并為手機充電以使其穩定。
注冊設備后，通過查詢rpc_tracker進行確認
python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
例如，如果有2個Huawei mate10 pro，11個具有64位操作系統的Raspberry Pi 4B和2個rk3399，則輸出是
Queue Status

key total free pending

mate10pro 2 2 0
rk3399 2 2 0
rasp4b-64 11 11 0

將多個設備注冊到跟蹤器，以加快調諧過程中的測量速度。
設置調度選項
調度之前，應該應用一些配置。以帶有64位操作系統（Ubuntu 20.04）的Raspberry Pi 4b 4GB主板為例。在設置中，應該相應地修改目標和device_key。如果使用的是Android手機，use_ndk設置為True。

DEVICE CONFIG

Replace “aarch64-linux-gnu” with the correct target of your board.

This target is used for cross compilation. You can query it by :code:`gcc -v` on your device.

FIXME(tmoreau89, merrymercy): We leave ‘-device=arm_cpu’ out of the target string

because we’re sharing x86 op strategy.

target = tvm.target.Target(“llvm -mtriple=aarch64-linux-gnu -mattr=+neon”)

Also replace this with the device key in your tracker

device_key = “rasp4b-64”

Set this to True if you use ndk tools for cross compiling

And also set the environment variable below to point to the cross compiler

use_ndk = False

os.environ[“TVM_NDK_CC”] = “/usr/bin/aarch64-linux-gnu-g++”

TUNING OPTION

network = “mobilenet”
batch_size = 1
layout = “NHWC”
dtype = “float32”
log_file = “%s-%s-B%d-%s.json” % (network, layout, batch_size, target.kind.name)
提取搜索任務
接下來，從網絡中提取搜索任務及其權重。任務的權重是該任務的子圖在整個網絡中的出現次數。通過使用權重，將網絡的端到端延遲近似為，sum(latency[t] * weight[t])其中是任務的延遲，latency[t]weight[t]是任務的權重。任務調度程序將僅優化此目標。

Extract tasks from the network

print(“Extract tasks…”)
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod[“main”], params, target)

for idx, task in enumerate(tasks):
print("========== Task %d (workload key: %s) ==========" % (idx, task.workload_key))
print(task.compute_dag)
出去：
Extract tasks…
========== Task 0 (workload key: [“d7b65649a4dd54becea0a52aabbc5af5”, 1, 1000, 1, 1000]) ==========
placeholder = PLACEHOLDER [1, 1000]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))
T_softmax_expsum(i0) += T_softmax_exp[i0, k]
T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])

========== Task 1 (workload key: [“9847f8cc0b305137f49f2c5c0c8ab25d”, 1, 1024, 1000, 1024, 1000, 1, 1000]) ==========
placeholder = PLACEHOLDER [1, 1024]
placeholder = PLACEHOLDER [1000, 1024]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [1000]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])

========== Task 2 (workload key: [“69115f188984ae34ede37c3b8ca40b43”, 1, 7, 7, 1024, 1, 1, 1, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 1024]
tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax17) + rv0), ((ax27) + rv1), ax3]
tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)7) + 1)) - (ax17)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)7) + 1)) - (ax27)))))

========== Task 3 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 7, 7, 1024, 1, 1, 1024, 1024, 1, 1, 1, 1024, 1, 7, 7, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 4 (workload key: [“06fce76bd84cb904eee50b905ca9449a”, 1, 7, 7, 1024, 3, 3, 1024, 1, 1, 1, 1, 1024, 1, 7, 7, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 1024]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 1024, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 5 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 7, 7, 512, 1, 1, 512, 1024, 1, 1, 1, 1024, 1, 7, 7, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 6 (workload key: [“c87ba68bc180312f5716af09a77ca15b”, 1, 14, 14, 512, 3, 3, 512, 1, 1, 1, 1, 512, 1, 7, 7, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 512]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 512, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i2) + di), ((j2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 7 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 14, 14, 512, 1, 1, 512, 512, 1, 1, 1, 512, 1, 14, 14, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 8 (workload key: [“06fce76bd84cb904eee50b905ca9449a”, 1, 14, 14, 512, 3, 3, 512, 1, 1, 1, 1, 512, 1, 14, 14, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 512]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 512, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 9 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 14, 14, 256, 1, 1, 256, 512, 1, 1, 1, 512, 1, 14, 14, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 10 (workload key: [“c87ba68bc180312f5716af09a77ca15b”, 1, 28, 28, 256, 3, 3, 256, 1, 1, 1, 1, 256, 1, 14, 14, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 256]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 256, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i2) + di), ((j2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 11 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 28, 28, 256, 1, 1, 256, 256, 1, 1, 1, 256, 1, 28, 28, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 12 (workload key: [“06fce76bd84cb904eee50b905ca9449a”, 1, 28, 28, 256, 3, 3, 256, 1, 1, 1, 1, 256, 1, 28, 28, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 256]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 256, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 13 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 28, 28, 128, 1, 1, 128, 256, 1, 1, 1, 256, 1, 28, 28, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 14 (workload key: [“c87ba68bc180312f5716af09a77ca15b”, 1, 56, 56, 128, 3, 3, 128, 1, 1, 1, 1, 128, 1, 28, 28, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 128]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 128, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i2) + di), ((j2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 15 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 56, 56, 128, 1, 1, 128, 128, 1, 1, 1, 128, 1, 56, 56, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 16 (workload key: [“06fce76bd84cb904eee50b905ca9449a”, 1, 56, 56, 128, 3, 3, 128, 1, 1, 1, 1, 128, 1, 56, 56, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 128]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 128, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 17 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 56, 56, 64, 1, 1, 64, 128, 1, 1, 1, 128, 1, 56, 56, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 18 (workload key: [“c87ba68bc180312f5716af09a77ca15b”, 1, 112, 112, 64, 3, 3, 64, 1, 1, 1, 1, 64, 1, 56, 56, 64]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 64]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 113)) && (i2 >= 1)) && (i2 < 113)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 64, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i2) + di), ((j2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 19 (workload key: [“6b7583cf23c7c37d3212cad9d06e58c1”, 1, 112, 112, 32, 1, 1, 32, 64, 1, 1, 1, 64, 1, 112, 112, 64]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 32]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 32, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 20 (workload key: [“06fce76bd84cb904eee50b905ca9449a”, 1, 112, 112, 32, 3, 3, 32, 1, 1, 1, 1, 32, 1, 112, 112, 32]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 32]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 113)) && (i2 >= 1)) && (i2 < 113)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 32, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 32]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 21 (workload key: [“98418eda02701ddd175ad50e364a0638”, 1, 224, 224, 3, 3, 3, 3, 32, 1, 112, 1, 1, 1, 112, 1, 1, 1, 112, 112, 32]) ==========
placeholder = PLACEHOLDER [1, 224, 224, 3]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 225)) && (i2 >= 1)) && (i2 < 225)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 3, 32]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy2) + ry), ((xx2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 112, 1, 1]
T_multiply(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3]*placeholder[ax0, ax1, 0, 0])
placeholder = PLACEHOLDER [1, 112, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, 0, 0])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
調優與評估
設置一些選項來優化和啟動搜索任務
? num_measure_trials是在調度期間使用的測量試驗的次數。將其設置為較小的數字（例如200）以進行快速演示。實際上，建議將其設置為800 * len(tasks)，通常足以使搜索收斂。例如，resnet-50中有29個任務，因此將其設置為20000。根據時間預算調度。
? 此外，還用RecordToFile將測量記錄轉儲到日志文件中，些測量記錄可用于最好地查詢歷史記錄，恢復搜索以及以后進行更多分析。
? 有關更多參數auto_scheduler.TuningOptions，參見auto_scheduler.LocalRunner。
自動調度后，使用發現的最佳調度來編譯網絡。在自動調度期間，所有測量記錄都將轉儲到日志文件中，因此讀取日志文件并加載最佳調度。
def tune_and_evaluate():
print(“Begin tuning…”)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
num_measure_trials=200, # change this to 20000 to achieve the best performance
runner=auto_scheduler.RPCRunner(
device_key,
host=“0.0.0.0”,
port=9191,
timeout=30,
repeat=1,
min_repeat_ms=200,
enable_cpu_cache_flush=True,
),
measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)

tuner.tune(tune_option)# Compile with the history best
print("Compile...")
with auto_scheduler.ApplyHistoryBest(log_file):with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):lib = relay.build(mod, target=target, params=params)# Export library
tmp = tempdir()
if use_ndk:from tvm.contrib import ndkfilename = "net.so"lib.export_library(tmp.relpath(filename), ndk.create_shared)
else:filename = "net.tar"lib.export_library(tmp.relpath(filename))# Upload module to device
print("Upload...")
remote = auto_scheduler.utils.request_remote(device_key, "0.0.0.0", 9191, timeout=10000)
remote.upload(tmp.relpath(filename))
rlib = remote.load_module(filename)# Create graph runtime
ctx = remote.cpu()
module = graph_runtime.GraphModule(rlib["default"](ctx))
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input("data", data_tvm)# Evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res))
)

We do not run the tuning in our webpage server since the server doesn’t have a Raspberry Pi,

or device tracker running.

Uncomment the following line to run it by yourself.

tune_and_evaluate()

筆記
調度期間解釋打印的信息
在調度期間，控制臺上會打印很多信息。用于調試目的。最重要的信息是任務調度程序的輸出。下表是示例輸出。

------------------------------ [ Task Scheduler ]

| ID | Latency (ms) | Speed (GFLOPS) | Trials |

| 0 | 0.013 | 0.31 | 64 |
| 1 | 0.845 | 2.43 | 448 |
| 2 | 0.046 | -0.00 | 64 |
| 3 | 4.194 | 24.53 | 2112 |
| 4 | 0.109 | 9.21 | 64 |
| 5 | 1.759 | 29.27 | 896 |
| 6 | 0.083 | 6.01 | 64 |
| 7 | 3.084 | 33.38 | 7680 |
| 8 | 0.136 | 14.78 | 384 |
| 9 | 1.349 | 38.23 | 768 |
| 10 | 0.133 | 7.55 | 128 |
| 11 | 2.747 | 37.56 | 1536 |
| 12 | 0.338 | 11.87 | 192 |
| 13 | 1.295 | 40.00 | 704 |
| 14 | 0.482 | 4.16 | 256 |
| 15 | 2.686 | 38.56 | 1344 |
| 16 | 0.884 | 9.08 | 448 |
| 17 | 1.332 | 39.18 | 704 |
| 18 | 1.045 | 3.84 | 576 |
| 19 | 1.391 | 38.09 | 704 |
| 20 | 0.777 | 10.34 | 448 |
| 21 | 0.739 | 30.97 | 448 |

Estimated total latency: 38.347 ms Trials: 19992 Used time : 19260 s Next ID: 3
下表列出了所有任務的延遲和（估計）速度。列出了所有任務的測量試驗分配。最后一行顯示這些任務的總加權延遲，粗略估計網絡的端到端執行時間。最后一行還顯示測量試驗的總數，自動調度所花費的總時間，以及要調度的下一個任務的ID。
將出現一些“ dmlc :: Error”錯誤，因為自動調度程序將嘗試一些無效的調度。如果繼續進行調度，則放心地忽略，因為這些錯誤與主要過程是隔離的。
筆記
提前終止調度
通過強制終止此過程來提前終止調度。只要在日志文件中為每個任務至少獲得一個有效的調度，就應該能夠進行編譯（下面的部分）。
其他技巧

在調度過程中，自動調度器需要編譯許多程序并從中提取功能。該部分占用大量CPU，建議使用具有多個內核的高性能CPU，以加快搜索速度。
用來提取大型日志文件，而僅保存最有用的記錄。python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json
從上一個日志文件繼續搜索。load_log_file在function中創建任務調度程序時，只需添加一個新參數run_tuning。
tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
如果有多個目標CPU，則全部用于測量以并行化測量。檢查本節以了解如何使用RPC跟蹤器和RPC服務器。要在自動調度使用RPC跟蹤，用auto_scheduler.RPCRunner更換轉輪TuningOptions。

總結

以上是生活随笔為你收集整理的ARM CPU自动调度神经网络的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。