tvm模型部署c++ 分析
tvm模型部署c++ 分析
tvm c++部署官方教程
https://github.com/apache/tvm/tree/main/apps/howto_deploy
https://tvm.apache.org/docs/how_to/deploy/cpp_deploy.html
官方說(shuō)執(zhí)行run_example.sh腳本就可以完成部署。
使用C++ API部署TVM模塊
提供了一個(gè)關(guān)于如何在apps/howto_deploy中,部署TVM模塊的示例。要運(yùn)行該示例,可以使用以下命令
cd apps/howto_deploy
./run_example.sh
獲取TVM運(yùn)行庫(kù)
唯一需要的是鏈接到目標(biāo)平臺(tái)中的TVM運(yùn)行時(shí)。TVM提供了一個(gè)最低運(yùn)行時(shí)間,根據(jù)使用的模塊數(shù)量,成本大約在300K到600K之間。在大多數(shù)情況下,可以使用libtvm_runtime.com,這是構(gòu)建時(shí)附帶的。
如果發(fā)現(xiàn)很難構(gòu)建libtvm_運(yùn)行時(shí),checkout tvm_runtime_pack.cc。這是一個(gè)提供TVM運(yùn)行時(shí)的示例。可以使用構(gòu)建系統(tǒng)編譯此文件,包含到項(xiàng)目中。
可以checkout應(yīng)用程序,如在iOS、Android和其它平臺(tái)上,使用TVM構(gòu)建的應(yīng)用程序。
動(dòng)態(tài)庫(kù)與系統(tǒng)模塊
TVM提供了兩種使用已編譯庫(kù)的方法。可以checkout prepare_test_libs.py(關(guān)于如何生成庫(kù))和cpp_deploy.cc(關(guān)于如何使用)。
將庫(kù)存儲(chǔ)為共享庫(kù),并將庫(kù)動(dòng)態(tài)加載到項(xiàng)目中。
以系統(tǒng)模塊模式將編譯后的庫(kù)綁定到項(xiàng)目中。
動(dòng)態(tài)加載更靈活,可以動(dòng)態(tài)加載新模塊。系統(tǒng)模塊是一種更靜態(tài)的方法。可以在禁止動(dòng)態(tài)庫(kù)加載的地方使用系統(tǒng)模塊。
c++部署代碼
https://github.com/apache/tvm/blob/main/apps/howto_deploy/cpp_deploy.cc
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* “License”); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*!* \brief Example code on load and run TVM module.s* \file cpp_deploy.cc*/
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>#include <cstdio>void Verify(tvm::runtime::Module mod, std::string fname) {// Get the function from the module.tvm::runtime::PackedFunc f = mod.GetFunction(fname);ICHECK(f != nullptr);// Allocate the DLPack data structures.//// Note that we use TVM runtime API to allocate the DLTensor in this example.// TVM accept DLPack compatible DLTensors, so function can be invoked// as long as we pass correct pointer to DLTensor array.//// For more information please refer to dlpack.// One thing to notice is that DLPack contains alignment requirement for// the data pointer and TVM takes advantage of that.// If you plan to use your customized data container, please// make sure the DLTensor you pass in meet the alignment requirement.//DLTensor* x;DLTensor* y;int ndim = 1;int dtype_code = kDLFloat;int dtype_bits = 32;int dtype_lanes = 1;int device_type = kDLCPU;int device_id = 0;int64_t shape[1] = {10};TVMArrayAlloc(shape, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);TVMArrayAlloc(shape, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);for (int i = 0; i < shape[0]; ++i) {static_cast<float*>(x->data)[i] = i;}// Invoke the function// PackedFunc is a function that can be invoked via positional argument.// The signature of the function is specified in tvm.buildf(x, y);// Print out the outputfor (int i = 0; i < shape[0]; ++i) {ICHECK_EQ(static_cast<float*>(y->data)[i], i + 1.0f);}LOG(INFO) << "Finish verification...";TVMArrayFree(x);TVMArrayFree(y);
}void DeploySingleOp() {// Normally we can directlytvm::runtime::Module mod_dylib = tvm::runtime::Module::LoadFromFile("lib/test_addone_dll.so");LOG(INFO) << "Verify dynamic loading from test_addone_dll.so";Verify(mod_dylib, "addone");// For libraries that are directly packed as system lib and linked together with the app// We can directly use GetSystemLib to get the system wide library.LOG(INFO) << "Verify load function from system lib";tvm::runtime::Module mod_syslib = (*tvm::runtime::Registry::Get("runtime.SystemLib"))();Verify(mod_syslib, "addonesys");
}void DeployGraphExecutor() {LOG(INFO) << "Running graph executor...";// load in the libraryDLDevice dev{kDLCPU, 0};tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("lib/test_relay_add.so");// create the graph executor moduletvm::runtime::Module gmod = mod_factory.GetFunction("default")(dev);tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");tvm::runtime::PackedFunc run = gmod.GetFunction("run");// Use the C++ APItvm::runtime::NDArray x = tvm::runtime::NDArray::Empty({2, 2}, DLDataType{kDLFloat, 32, 1}, dev);tvm::runtime::NDArray y = tvm::runtime::NDArray::Empty({2, 2}, DLDataType{kDLFloat, 32, 1}, dev);for (int i = 0; i < 2; ++i) {for (int j = 0; j < 2; ++j) {static_cast<float*>(x->data)[i * 2 + j] = i * 2 + j;}}// set the right inputset_input("x", x);// run the coderun();// get the outputget_output(0, y);for (int i = 0; i < 2; ++i) {for (int j = 0; j < 2; ++j) {ICHECK_EQ(static_cast<float*>(y->data)[i * 2 + j], i * 2 + j + 1);}}
}int main(void) {DeploySingleOp();DeployGraphExecutor();return 0;
}
Makefile文件
https://github.com/apache/tvm/blob/main/apps/howto_deploy/Makefile
Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.# Makefile Example to deploy TVM modules.
TVM_ROOT=$(shell cd ../..; pwd)
DMLC_CORE=${TVM_ROOT}/3rdparty/dmlc-corePKG_CFLAGS = -std=c++14 -O2 -fPIC\-I${TVM_ROOT}/include\-I${DMLC_CORE}/include\-I${TVM_ROOT}/3rdparty/dlpack/include\-DDMLC_USE_LOGGING_LIBRARY=\<tvm/runtime/logging.h\>PKG_LDFLAGS = -L${TVM_ROOT}/build -ldl -pthread.PHONY: clean allall: lib/cpp_deploy_pack lib/cpp_deploy_normal# Build rule for all in one TVM package library
lib/libtvm_runtime_pack.o: tvm_runtime_pack.cc@mkdir -p $(@D)$(CXX) -c $(PKG_CFLAGS) -o $@ $^# The code library built by TVM
lib/test_addone_sys.o: prepare_test_libs.py@mkdir -p $(@D)python3 prepare_test_libs.py# Deploy using the all in one TVM package library
lib/cpp_deploy_pack: cpp_deploy.cc lib/test_addone_sys.o lib/libtvm_runtime_pack.o@mkdir -p $(@D)$(CXX) $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)# Deploy using pre-built libtvm_runtime.so
lib/cpp_deploy_normal: cpp_deploy.cc lib/test_addone_sys.o@mkdir -p $(@D)$(CXX) $(PKG_CFLAGS) -o $@ $^ -ltvm_runtime $(PKG_LDFLAGS)
clean:
結(jié)合Makefile文件和run_example.sh腳本一起看
腳本先創(chuàng)建lib目錄,然后執(zhí)行sudo make命令,make操作的執(zhí)行要看Makefile文件
make命令會(huì)先在lib文件夾中編譯一個(gè)名為libtvm_runtime_pack.o的靜態(tài)鏈接庫(kù)
然后運(yùn)行prepare_test_lib.py文件生成將模型生成為test_addone_dll.so,test_addone_sys.o和test_relay_add.so三個(gè)庫(kù),給cpp_deploy.cc調(diào)用,生成兩個(gè)可執(zhí)行文件cpp_deploy_pack和cpp_deploy_normal
目標(biāo)是用其它框架寫(xiě)的深度學(xué)習(xí)網(wǎng)絡(luò),通過(guò)tvm轉(zhuǎn)換成so文件,使用c++部署,在gpu上進(jìn)行調(diào)用,下面是cpu上部署的代碼
/*
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- “License”); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
*/
/*!
- \brief Example code on load and run TVM module.s
- \file cpp_deploy.cc
*/
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include
#include
#include
#include
using namespace std;
template
Type stringToNum(const string& str)
{
istringstream iss(str);
Type num;
iss >> num;
return num;
}
void DeployGraphRuntime() {
ifstream in("/home/aiteam/tiwang/tvm-tfs-gpu-bkp/data.txt");
//int image[784];
string s;
int image_index=0;
/*
while(getline(in,s))
{
image[i]=stringToNum(s);
++i;
}*/
LOG(INFO) << “Running graph runtime…”;
// load in the library
DLContext ctx{kDLGPU, 0};
tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("/home/aiteam/tiwang/tvm-tfs-gpu-bkp/model.so");
// create the graph runtime module
tvm::runtime::Module gmod = mod_factory.GetFunction(“default”)(ctx);
tvm::runtime::PackedFunc set_input = gmod.GetFunction(“set_input”);
tvm::runtime::PackedFunc get_output = gmod.GetFunction(“get_output”);
tvm::runtime::PackedFunc run = gmod.GetFunction(“run”);
// Use the C++ API
tvm::runtime::NDArray x = tvm::runtime::NDArray::Empty({1,784}, DLDataType{kDLFloat, 32, 1}, ctx);
tvm::runtime::NDArray y = tvm::runtime::NDArray::Empty({1, 10}, DLDataType{kDLFloat, 32, 1}, ctx);
while(getline(in,s))
{
static_cast<float*>(x->data)[image_index]=((float)stringToNum(s))/255;
image_index++;
}
// set the right input
set_input(“x”, x);
// run the code
run();
// get the output
get_output(0, y);
for(int i=0;i<10;++i)
{
LOG(INFO)<<static_cast<float*>(y->data)[i];
}
/*
for (int i = 0; i < 2; ++i) {
for (int j = 0; j < 2; ++j) {
ICHECK_EQ(static_cast<float*>(y->data)[i * 2 + j], i * 2 + j + 1);
}
}*/
}
int main(void) {
//DeploySingleOp();
DeployGraphRuntime();
return 0;
}
思路很簡(jiǎn)單就是把數(shù)據(jù)讀進(jìn)來(lái),set_input,run然后get_output,修改了將target修改成cuda后并不能成功在gpu上運(yùn)行,會(huì)出現(xiàn)core dump的問(wèn)題
原因是想要讓模型在gpu上運(yùn)行,需要在gpu上開(kāi)辟內(nèi)存,然后將數(shù)據(jù)拷貝到gpu上運(yùn)行,這個(gè)代碼沒(méi)有這些操作所以運(yùn)行時(shí)會(huì)導(dǎo)致core崩潰。
下面是tvm c++部署調(diào)用gpu的完整過(guò)程,深度學(xué)習(xí)模型使用keras寫(xiě)的mnist手寫(xiě)體識(shí)別網(wǎng)絡(luò),保存成了pb格式,模型代碼就不放了,這里直接讀取pb文件進(jìn)行轉(zhuǎn)化,模型輸入是(1,784),輸出是(1,10)
導(dǎo)入頭文件
import tvm
from tvm import te
from tvm import relay
os and numpy
import numpy as np
import os.path
Tensorflow imports
import tensorflow as tf
try:
tf_compat_v1 = tf.compat.v1
except ImportError:
tf_compat_v1 = tf
Tensorflow utility functions
import tvm.relay.testing.tf as tf_testing
from tvm.contrib import graph_runtime
參數(shù)設(shè)置
#cpu
#target = “l(fā)lvm”
#target_host = “l(fā)lvm”
#layout = None
#ctx = tvm.cpu(0)
#gpu
target = “cuda”
target_host = ‘llvm’
layout = “NCHW”
ctx = tvm.gpu(0)
處理數(shù)據(jù)
from tensorflow.python.keras.datasets import mnist
from tensorflow.python.keras.utils import np_utils
(x_train,y_train),(x_test,y_test)=mnist.load_data()
x_test1=x_test.reshape(x_test.shape[0],x_test.shape[1]*x_test.shape[2])
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)
x_train=x_train.reshape(x_train.shape[0],x_train.shape[1]*x_train.shape[2])
x_test=x_test.reshape(x_test.shape[0],x_test.shape[1]*x_test.shape[2])
x_train=x_train/255
x_test=x_test/255
y_train=np_utils.to_categorical(y_train)
y_test=np_utils.to_categorical(y_test)
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)
with open(“data.txt”,‘w’) as wf:
for i in range(784):
wf.write(str(x_test1[12][i]))
wf.write(’\n’)
讀取模型
with tf_compat_v1.gfile.GFile(’./frozen_models/simple_frozen_graph.pb’, “rb”) as f:
graph_def = tf_compat_v1.GraphDef()
graph_def.ParseFromString(f.read())
graph = tf.import_graph_def(graph_def, name="")
# Call the utility to import the graph definition into default graph.
graph_def = tf_testing.ProcessGraphDefParam(graph_def)
# Add shapes to the graph.
config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
with tf_compat_v1.Session() as sess:graph_def = tf_testing.AddShapesToGraphDef(sess, "Identity")
tensor_name_list = [tensor.name for tensor in tf.compat.v1.get_default_graph().as_graph_def().node]
for tensor_name in tensor_name_list:
print(tensor_name,’\n’)
構(gòu)建
shape_dict = {“x”: x_train[0:1].shape}
print(shape_dict)
dtype_dict = {“x”: “uint8”}
mod, params = relay.frontend.from_tensorflow(graph_def, layout=layout, shape=shape_dict)
print(“Tensorflow protobuf imported to relay frontend.”)
編譯成tvm模型
with tvm.transform.PassContext(opt_level=3):
lib = relay.build(mod, target=target, target_host=target_host, params=params)
測(cè)試一下tvm模型能不能用
from tvm.contrib import graph_runtime
tt=np.zeros([1,784])
i=0
file=open(“data.txt”)
while 1:
line=file.readline()
if not line:
break
tt[0][i]=int(line)
i+=1
file.close()
dtype = “float32”
m = graph_runtime.GraphModule(lib"default")
set inputs
m.set_input(“x”, tvm.nd.array(tt.astype(dtype)))
execute
m.run()
get outputs
tvm_output = m.get_output(0, tvm.nd.empty(((1, 10)), “float32”))
print(tvm_output.shape,tvm_output)
保存模型
from tvm.contrib import utils
temp=utils.tempdir()
path_lib=temp.relpath("/home/aiteam/test_code/model.so")
print(path_lib)
lib.export_library(path_lib)
print(temp.listdir())
然后進(jìn)入到tvm/apps/howto_deploy目錄,修改tvm_runtime_pack.cc文件,加上頭文件
#include “…/…/src/runtime/cuda/cuda_device_api.cc”
#include “…/…/src/runtime/cuda/cuda_module.cc”
然后再寫(xiě)一個(gè)cc文件存放自己的部署代碼,修改Makefile文件進(jìn)行編譯
文件名是cpp_deploy_bkp.cc
修改后的Makefile文件
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
“License”); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
“AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Makefile Example to deploy TVM modules.
TVM_ROOT=(shellcd../..;pwd)DMLCCORE=(shell cd ../..; pwd) DMLC_CORE=(shellcd../..;pwd)DMLCC?ORE={TVM_ROOT}/3rdparty/dmlc-core
PKG_CFLAGS = -std=c++14 -g -fPIC
-IKaTeX parse error: Undefined control sequence: \ at position 19: …M_ROOT}/include\? ? -I{DMLC_CORE}/include
-I${TVM_ROOT}/3rdparty/dlpack/include
-I/usr/local/cuda/include
PKG_LDFLAGS = -L${TVM_ROOT}/build -ldl -pthread -L/usr/local/cuda/lib64 -lcudart -lcuda
.PHONY: clean all
all:lib/libtvm_runtime_pack.o lib/cpp_deploy_pack
#all: lib/cpp_deploy_pack lib/cpp_deploy_normal
Build rule for all in one TVM package library
.PHONY: lib/libtvm_runtime_pack.o
lib/libtvm_runtime_pack.o: tvm_runtime_pack.cc
@mkdir -p $(@D)
$(CXX) -c $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)
Deploy using the all in one TVM package library
.PHONY: lib/cpp_deploy_pack
lib/cpp_deploy_pack: cpp_deploy_bkp.cc lib/libtvm_runtime_pack.o
@mkdir -p $(@D)
$(CXX) $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)
里面要加上cuda頭文件的位置和動(dòng)態(tài)鏈接庫(kù)的位置
cpp_deploy_bkp.cc
/*
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- “License”); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
*/
/*!
- \brief Example code on load and run TVM module.s
- \file cpp_deploy.cc
*/
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include
#include
#include
#include
using namespace std;
template
Type stringToNum(const string& str)
{
istringstream iss(str);
Type num;
iss >> num;
return num;
}
void DeployGraphRuntime() {
constexpr int dtype_code= 2U;
constexpr int dtype_bits=32;
constexpr int dtype_lines=1;
constexpr int device_type= 2;
constexpr int device_id=0;
int ndim=2;
int64_t in_shape[2]={1,784};
int64_t out_shape[2]={1,10};
DLTensor* DLTX=nullptr;
DLTensor* DLTY=nullptr;
TVMArrayAlloc(in_shape,ndim,dtype_code,dtype_bits,dtype_lines,device_type,device_id,&DLTX);
TVMArrayAlloc(out_shape,ndim,dtype_code,dtype_bits,dtype_lines,device_type,device_id,&DLTY);
float img[784];
float rslt[10];
ifstream in("/home/aiteam/tiwang/data.txt");
//int image[784];
string s;
int image_index=0;
/*
while(getline(in,s))
{
image[i]=stringToNum(s);
++i;
}*/
bool enabled = tvm::runtime::RuntimeEnabled(“cuda”);
if (!enabled)
{
LOG(INFO) << “Skip heterogeneous test because cuda is not enabled.”<< “\n”;
return;
}
LOG(INFO) << “Running graph runtime…”;
// load in the library
DLContext ctx{kDLGPU, 0};
tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("/home/aiteam/test_code/model.so");
// create the graph runtime module
tvm::runtime::Module gmod = mod_factory.GetFunction(“default”)(ctx);
tvm::runtime::PackedFunc set_input = gmod.GetFunction(“set_input”);
tvm::runtime::PackedFunc get_output = gmod.GetFunction(“get_output”);
tvm::runtime::PackedFunc run = gmod.GetFunction(“run”);
// Use the C++ API
while(getline(in,s))
{
if(image_index%28==0)
printf("\n");
//static_cast<float*>(x->data)[image_index]=((float)stringToNum(s))/255;
img[image_index]=((float)stringToNum(s))/255;
int a=stringToNum<int>(s);printf("%4d",a);image_index++;
}
TVMArrayCopyFromBytes(DLTX,&img[0],image_indexsizeof(float));
// set the right input
set_input(“x”, DLTX);
// run the code
run();
// get the output
get_output(0, DLTY);
TVMArrayCopyToBytes(DLTY,&rslt[0],10sizeof(float));
for(int i=0;i<10;++i)
{
LOG(INFO)<<rslt[i];
//LOG(INFO)<<static_cast<float*>(y->data)[i];
}
}
int main(void) {
//DeploySingleOp();
DeployGraphRuntime();
return 0;
}
相比于之前cpu部署的代碼,gpu部署多了一個(gè)拷貝張量的過(guò)程
參照
https://discuss.tvm.apache.org/t/deploy-nnvm-module-using-c-on-gpu-using-opencl-target/229
最終結(jié)果
首先在tvm/apps/howto_deplpy目錄下執(zhí)行sudo make
編譯通過(guò),運(yùn)行可執(zhí)行文件 ./lib/cpp_deploy_pack
參考鏈接:
https://www.cnblogs.com/wangtianning1223/p/14662970.html
https://github.com/apache/tvm/tree/main/apps/howto_deploy
https://discuss.tvm.apache.org/t/deploy-nnvm-module-using-c-on-gpu-using-opencl-target/229/17
https://discuss.tvm.apache.org/t/performance-tvm-pytorch-bert-on-cpu/10200/10
[Performance] TVM - pytorch BERT on CPU
@masahi I add ONNX for the experiments in the following and it seems using ONNX-runtime can get the best performance no matter the sequence length is (without tuning). I use ONNX-runtime with GraphOptimizationLevel.ORT_ENABLE_ALL showing in this link 1. Besides, I plot the IR graph for ONNX, which is quite complicated.
Experiment 1 - Different Target
image1170×432 15.7 KB
Experiment 3 - Different Length
image1632×432 18.1 KB
? ONNX IR Graph: drive link
Also, I have some questions about AutoSchedule tuning.
? Q1: @merrymercy I was confused that when I use AutoSchedule to tune TVM, can I use target like llvm -libs=cblas or I should use only llvm. I found this will give different tasks to tune.
? Q2: @comaniac I think MXNet IR is more friendly than Pytorch IR for AutoSchedule tuning. I set the same parameters for tuning but Pytorch cannot get the results as MXNet (16ms for seq_len=128) The following are their tuning tasks and it seems quite different due to different IR graph. I still work on where the problem comes from, TVM front-end or original code implementation. But I think maybe TVM will have some transforms to generate similar IR graph even if from different framework.
o 1st difference: MXNet will use nn.bias_add() and Pytorch will use relay.add(), which cause the tuning tasks not include this operation. (task 0,1,2,6)
o 2nd difference: Their attention softmax operation have different shape, but I think this doesn’t cause too much latency difference (task 4)
Tasks for Pytorch AutoSchedule Tuning (Target = llvm)
========== Task 0 (workload key: [“61f56dfd63fda28bc8bcf85739c8e9e3”, 128, 3072, 768, 3072, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 3072]
placeholder = PLACEHOLDER [768, 3072]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 1 (workload key: [“61f56dfd63fda28bc8bcf85739c8e9e3”, 128, 768, 3072, 768, 128, 3072]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [3072, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 2 (workload key: [“61f56dfd63fda28bc8bcf85739c8e9e3”, 128, 768, 768, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 3 (workload key: [“d2a28fdf41e83222456f5a6e5bf8a24a”, 12, 128, 128, 12, 64, 128, 12, 128, 64]) ==========
placeholder = PLACEHOLDER [12, 128, 128]
placeholder = PLACEHOLDER [12, 64, 128]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 4 (workload key: [“868c2771b1610bdac0ac73167691f4eb”, 1, 12, 128, 128, 1, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [1, 12, 128, 128]
T_softmax_maxelem(i0, i1, i2) max= placeholder[i0, i1, i2, k]
T_softmax_delta(i0, i1, i2, i3) = (placeholder[i0, i1, i2, i3] - T_softmax_maxelem[i0, i1, i2])
T_fast_exp(ax0, ax1, ax2, ax3) = max((tir.reinterpret(tir.shift_left(int32((tir.floor(((max(min(T_softmax_delta[ax0, ax1, ax2, a …(OMITTED)… max_delta[ax0, ax1, ax2, ax3], 88.3763f), -88.3763f)*1.4427f) + 0.5f))*0.693147f))) + 1f)), T_softmax_delta[ax0, ax1, ax2, ax3])
T_softmax_expsum(i0, i1, i2) += T_fast_exp[i0, i1, i2, k]
T_softmax_norm(i0, i1, i2, i3) = (T_fast_exp[i0, i1, i2, i3]/T_softmax_expsum[i0, i1, i2])
========== Task 5 (workload key: [“d2a28fdf41e83222456f5a6e5bf8a24a”, 12, 128, 64, 12, 128, 64, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [12, 128, 64]
placeholder = PLACEHOLDER [12, 128, 64]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 6 (workload key: [“61f56dfd63fda28bc8bcf85739c8e9e3”, 128, 768, 2304, 768, 128, 2304]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [2304, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 7 (workload key: [“2dde9ffcbf97381c0f0307643e09dac5”, 1, 128, 768, 1, 128, 1]) ==========
placeholder = PLACEHOLDER [1, 128, 768]
placeholder_red(ax0, ax1, ax2) += placeholder[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (placeholder_red[ax0, ax1, ax2]/768f)
========== Task 8 (workload key: [“dde89265d3f1a59075cee648386eac1e”, 1, 128, 768, 1, 128, 1, 1, 128, 1]) ==========
placeholder = PLACEHOLDER [1, 128, 768]
placeholder = PLACEHOLDER [1, 128, 1]
T_subtract(ax0, ax1, ax2) = (placeholder[ax0, ax1, ax2] - placeholder[ax0, ax1, 0])
T_multiply(ax0, ax1, ax2) = (T_subtract[ax0, ax1, ax2]*T_subtract[ax0, ax1, ax2])
T_multiply_red(ax0, ax1, ax2) += T_multiply[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (T_multiply_red[ax0, ax1, ax2]/768f)
========== Task 9 (workload key: [“9e3bd222d4f8d250aeadf2fef0b15f2b”, 1, 768, 768, 768, 768, 1, 768]) ==========
placeholder = PLACEHOLDER [1, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
T_minimum(ax0, ax1) = min(T_add[ax0, ax1], 9f)
T_maximum(ax0, ax1) = max(T_minimum[ax0, ax1], -9f)
T_fast_tanh(ax0, ax1) = ((T_maximum[ax0, ax1](((T_maximum[ax0, ax1]T_maximum[ax0, ax1])(((T_maximum[ax0, ax1]*T_maximum[ax0, …(OMITTED)… T_maximum[ax0, ax1])(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*1.19826e-06f) + 0.000118535f)) + 0.00226843f)) + 0.00489353f))
Tasks for MXNet AutoSchedule Tuning (Target = llvm)
========== Task 0 (workload key: [“9847f8cc0b305137f49f2c5c0c8ab25d”, 128, 3072, 768, 3072, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 3072]
placeholder = PLACEHOLDER [768, 3072]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 1 (workload key: [“9847f8cc0b305137f49f2c5c0c8ab25d”, 128, 768, 3072, 768, 3072, 128, 3072]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [3072, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [3072]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 2 (workload key: [“9847f8cc0b305137f49f2c5c0c8ab25d”, 128, 768, 768, 768, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 3 (workload key: [“d2a28fdf41e83222456f5a6e5bf8a24a”, 12, 128, 128, 12, 64, 128, 12, 128, 64]) ==========
placeholder = PLACEHOLDER [12, 128, 128]
placeholder = PLACEHOLDER [12, 64, 128]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 4 (workload key: [“4b5e216f8244b4e8f7b6543c4a9087e5”, 1536, 128, 1536, 128]) ==========
placeholder = PLACEHOLDER [1536, 128]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_delta(i0, i1) = (placeholder[i0, i1] - T_softmax_maxelem[i0])
T_fast_exp(ax0, ax1) = max((tir.reinterpret(tir.shift_left(int32((tir.floor(((max(min(T_softmax_delta[ax0, ax1], 88.3763f), -88. …(OMITTED)… oor(((max(min(T_softmax_delta[ax0, ax1], 88.3763f), -88.3763f)*1.4427f) + 0.5f))*0.693147f))) + 1f)), T_softmax_delta[ax0, ax1])
T_softmax_expsum(i0) += T_fast_exp[i0, k]
T_softmax_norm(i0, i1) = (T_fast_exp[i0, i1]/T_softmax_expsum[i0])
========== Task 5 (workload key: [“d2a28fdf41e83222456f5a6e5bf8a24a”, 12, 128, 64, 12, 128, 64, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [12, 128, 64]
placeholder = PLACEHOLDER [12, 128, 64]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 6 (workload key: [“9847f8cc0b305137f49f2c5c0c8ab25d”, 128, 768, 2304, 768, 2304, 128, 2304]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [2304, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [2304]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 7 (workload key: [“2dde9ffcbf97381c0f0307643e09dac5”, 128, 1, 768, 128, 1, 1]) ==========
placeholder = PLACEHOLDER [128, 1, 768]
placeholder_red(ax0, ax1, ax2) += placeholder[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (placeholder_red[ax0, ax1, ax2]/768f)
========== Task 8 (workload key: [“dde89265d3f1a59075cee648386eac1e”, 128, 1, 768, 128, 1, 1, 128, 1, 1]) ==========
placeholder = PLACEHOLDER [128, 1, 768]
placeholder = PLACEHOLDER [128, 1, 1]
T_subtract(ax0, ax1, ax2) = (placeholder[ax0, ax1, ax2] - placeholder[ax0, ax1, 0])
T_multiply(ax0, ax1, ax2) = (T_subtract[ax0, ax1, ax2]*T_subtract[ax0, ax1, ax2])
T_multiply_red(ax0, ax1, ax2) += T_multiply[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (T_multiply_red[ax0, ax1, ax2]/768f)
========== Task 9 (workload key: [“9e3bd222d4f8d250aeadf2fef0b15f2b”, 1, 768, 768, 768, 768, 1, 768]) ==========
placeholder = PLACEHOLDER [1, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
T_minimum(ax0, ax1) = min(T_add[ax0, ax1], 9f)
T_maximum(ax0, ax1) = max(T_minimum[ax0, ax1], -9f)
T_fast_tanh(ax0, ax1) = ((T_maximum[ax0, ax1](((T_maximum[ax0, ax1]T_maximum[ax0, ax1])(((T_maximum[ax0, ax1]*T_maximum[ax0, …(OMITTED)… T_maximum[ax0, ax1])(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*1.19826e-06f) + 0.000118535f)) + 0.00226843f)) + 0.00489353f))
Sorry for my lots of questions. I’ll do my best to do more experiments and figure out the reasons why Pytorch AutoSchedule not working as MXNet and why TVM is not working as expected when sequence length increasing.
Reply
Thanks for the plentiful information.
For Q1, when you extract tasks with llvm -mcpu=skylake-avx512 -libs=cblas, some operators (i.e., dense) will be offloaded to cblas. It means those operators won’t be compiled by the TVM codegen, so AutoScheduler won’t see and tune them.
For Q2, the two differences you pointed out seem not really impactful. Maybe you can try to use debugger to compare the latency breakdown between two models: Debugger — tvm 0.8.dev0 documentation 4
Reply
@comaniac I follow your instructions to use debugger and compare the latency in the following IR graphs (latency > 100us with orange color). And I find maybe some operations are hard to tune, such as fused_nn_contrib_dense_pack. All my experiments are done with opt_level=3, required_pass=[“FastMath”]
Experiment 1 - Compare BERT on MXNet (68.4ms) and Pytorch (122.9ms)
? MXNet Debug IR Graph: drive link
? Pytorch Debug IR Graph: drive link 3
? From the above graphs, we can find MXNet use fused_nn_contrib_dense_pack_add while Pytorch use fused_nn_contrib_dense_pack operation. This is happened for all FC layers (4 in one transformer block) and I use the latency of first block as example.
o FC for Query, Key, and Value (M: 503us, P: 1314us)
o FC after self-attention (M: 202us, P: 651us)
o FC after layer normalization (M: 798us, P: 2578us)
o FC after GELU (M: 786us, P: 2572us)
image1043×480 60 KB
Experiment 2 - Compare BERT on MXNet (68.4ms) and MXNet-tune (autoschedule) (15.4ms)
? MXNet Debug IR Graph: drive link
? MXNet-tune Debug IR Graph: drive link
? From the above graphs, it is easy to find we can reduce most of dense and batch_matmul operations’ latency. I take the first transformer block as example.
o FC for Query, Key, and Value (M: 503us, M-tune: 229us)
o FC after self-attention (M: 202us, M-tune: 99us)
o FC after layer normalization (M: 798us, M-tune: 292us)
o FC after GELU (M: 786us, M-tune: 367us)
o batch_matmul for Quert and Key (M: 1828us, M-tune:41us)
o batch_matmul for Attention and Value (M: 1312us, M-tune:29us)
Experiment 3 - Compare BERT on Pytorch (122.9ms) and Pytorch-tune (autoschedule) (64.2ms)
? Pytorch Debug IR Graph: drive link 3
? Pytorch-tune Debug IR Graph: drive link 1
? From the above graphs, we can find we only reduce the latency of batch_matmul but not dense operation. I take the first transformer block as example.
o FC for Query, Key, and Value (P: 1314us, P-tune: 1276us)
o FC after self-attention (P: 651us, P-tune: 403us)
o FC after layer normalization (P: 2578us, P-tune: 1779us)
o FC after GELU (P: 2572us, P-tune: 1684us)
o batch_matmul for Quert and Key (P: 1624us, P-tune: 78us)
o batch_matmul for Attention and Value (P: 1265us, P-tune: 70us)
Therefore, I suspect the problem is come from the fused_nn_contrib_dense_pack operation, but I don’t know why this is slower than fused_nn_contrib_dense_pack_add. Even the latter has additional add operation for adding bias. If you want the whole debug files (json and logs), I provide in this drive link 1.
Reply
@comaniac I change the TVM version with commit id: 91e07e1f3a7 (Feb. 5, 2021) which is the same as this repo 1. And the problem is solved because we will use fused_nn_batch_matmul for all FC (dense) layers rather than fused_nn_contrib_dense_pack. I think the problem is coming from this PR 3 you provided, which also causes I cannot use -libs=mkl. I have the debug IR graphs in the following and now Pytorch can speed up from Pytorch script 26.63ms to 17.36ms after tuning with autoschedule.
? Pytorch Debug IR Graph: drive link 2
? Pytorch-tune Debug IR Graph: drive link 1
2 Replies
1
Reply
Hmm I don’t that PR is the root cause of using batch_matmul in BERT model tho. It might be due to this PR that uses dense instead of batch_matmul when the input shape is 2D and 3D:
github.com/apache/tvm 6
Fix PyTorch matmul conversion when given (2-dim, N-dim) input pair 6
apache:main ← yuchaoli:main
opened Apr 14, 2021
This PR change the matmul conversion in PyTorch when given (2-dim, N-dim) input … 6
Reply
There’s a known issue that TVM’s dense op and batch_matmul op with Y = X * W^T does have bad performance in some models.
There’re several matmul & batch_matmul ops in bert that takes data tensor as both input and weight(exp. those in multi-head attentions) rather than use const parameter as weight. In such situation, we would see some explicit transpose inserted when the model are imported from TensorFlow or Pytorch(they use X * W for matmul by default). For the MXNet, as far as I know, it uses X * W^T by default.
The PR you found looks like creating a special schedule for dense + transpose, I’m not sure if that’s the key of the performance improving you got because it is written for AutoTVM and AutoScheduler will never use these manual schedule. You can have a more detailed analyse among those dense/batch_matmul ops’ layout and shape.
I would agree with @comaniac that the miss conversion from dense to batch_matmul caused some waste of computation before.
Reply
3 MONTHS LATER
will auto-schedule work in case of “l(fā)lvm -libs-cblas”? e.g. a network may contain matmul ops(leverage cblas) and other ops(leverage auto-schedule to get performance)?
Deploy NNVM module using C++ on GPU using OpenCL target
2 MONTHS LATER
Jul '18
@masahi Thanks for the sample .
In your sample code is “tvm_input” the CPU byte array copied to “x” (GPU array) ?
means is TVMArrayCopyFromBytes(destination,source,size) ?
for (int i = 0; i < n_samples; ++i) {
TVMArrayCopyFromBytes(x, &tvm_input[i * in_size], in_size * sizeof(float));
set_input(input_name.c_str(), x);
run();
get_output(0, y);
TVMArrayCopyToBytes(y, &tvm_output[i * out_size], out_size * sizeof(float));
}
Reply
? C++ deploy example: switch to OpenCL7
Jul '18
yes, source is on cpu and x is on gpu. In my code, tvm_input should contain input data coming from your input image, for example.
Reply
Jul '18
@masahi
instead of tvm_input i created a DLTensor ‘z’ of device type = kdCPU , for storing input image .
I did copy byte array as shown below .
https://gist.github.com/rajh619/74538a7b3a7e1b89a2ae89db5ab24054 30
but the i couldnt find the copied bytes in the destination (x->data) .?
Reply
Jul '18
If you already have your data in DLtensor, you should use TVMArrayCopyFromTo.
Reply
Jul '18
@masahi
ok .now i tried with TVMArrayCopyFromTo
TVMArrayCopyFromTo(z, x, nullptr);
the same issue happens , i couldnt find the bytes copied to x->data .
I think x->data should be same as z->data(image data) . please correct me if am wrong ?
Reply
if x is on GPU, you should NEVER touch x->data. You either get segfault or complete junk.
If you want to access x->data, copy x to CPU first.
Reply
@masahi Thanks . got it working . i copied back from GPU to CPU(x->data to k->data) and validated the data.
After executing "run() " , i was able to get output to CPU in two ways :
- allocate tvm array to output tensor “y” with devicetype - CPU (1) , then tvm_output(0,y) . y->data contains output . ( i think internally tvm copies the output from device to cpu_host ?)
- allocate tvm array to output tensor “y” with devicetype - GPU (4) , tvm_output(0,y) ,then copy bytes from GPU to CPU ->out_vector[] . (similar to your sample code) .
Out of both which is the right way to extract output ?
Reply
The answer is 2.
See my sample code. The output y is GPU tensor. I copy y to tvm_output, which is on cpu.
1
Reply
Hi, @masahi I am still getting segfault even I use your sample 5 code also after allocating memory i am doing memeset to 0
DLTensor* x = nullptr;
DLTensor* y = nullptr;
const int in_ndim = 4;
const int out_ndim = in_ndim;
const int num_slice = 1;
const int num_class = 4;
const int shrink_size[] = { 256, 256 };
const int64_t in_shape[] = { num_slice, 1, shrink_size[0], shrink_size[1] };
const int64_t out_shape[] = { num_slice, num_class, shrink_size[0], shrink_size[1] };
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);
memset(x->data, 0, 4265265);
it is happing for Cuda and OpenCL for llvm it’s working fine.
Reply
you can’t use memset on GPU memory.
Reply
hi, @masahi still i am not able working still it throws memory error.
void FR_TVM_Deploy::forward(float* imgData)
{
int in_size = (1 * 64 * 64 * 3 * 4);
constexpr int dtype_code = kDLFloat;
constexpr int dtype_bits = 32;
constexpr int dtype_lanes = 1;
constexpr int device_type = kDLCPU;
constexpr int device_id = 0;
constexpr int in_ndim = 4;
const int64_t in_shape[in_ndim] = {1, 64, 64, 3};
//Allocating memeory to DLTensor object
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);//
TVMArrayCopyFromBytes(input, imgData, in_size);
//Get globl function module for graph runtime
tvm::runtime::Module* mod = (tvm::runtime::Module*)handle;
// get the function from the module(set input data)
tvm::runtime::PackedFunc set_input = mod->GetFunction(“set_input”);
set_input(“input”, input);
// get the function from the module(run it)
tvm::runtime::PackedFunc run = mod->GetFunction(“run”);
run();
int out_ndim = 2;
int64_t out_shape[2] = {1, 256};
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &output);
// get the function from the module(get output data)
tvm::runtime::PackedFunc get_output = mod->GetFunction(“get_output”);
get_output(0, output);
size_t out_size = out_shape[0] * out_shape[1];
std::vector tvm_output(out_size, 0);
TVMArrayCopyToBytes(output, &tvm_output[out_size], out_size);
TVMArrayFree(input);
TVMArrayFree(output);
}
when i print the tvm_output vector i am getting all 0’s means output is coming 0, in llvm case i am getting correct output. here i am printing vector tvm_output in loop, is there any othere way to check output?
參考鏈接:
https://www.cnblogs.com/wangtianning1223/p/14662970.html
https://github.com/apache/tvm/tree/main/apps/howto_deploy
https://discuss.tvm.apache.org/t/deploy-nnvm-module-using-c-on-gpu-using-opencl-target/229/17
https://discuss.tvm.apache.org/t/performance-tvm-pytorch-bert-on-cpu/10200/10
總結(jié)
以上是生活随笔為你收集整理的tvm模型部署c++ 分析的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 使用Tensor Expression张
- 下一篇: TVM实现hardware backen