tensorRT-lenet C++ Code Analysis [Code Included]
An earlier post presented a simple TensorRT demo, LeNet inference [tensorRT-lenet]: it converted a torch model to wts (a format that also exposes the detailed layer information of the network), converted that to an engine, and ran inference. This post builds on it by analyzing the C++ implementation.
We start from the main function and analyze what each function, or each block, does. The complete code is attached at the end.
My C++ is fairly weak and the write-up is rough; please bear with me.
Contents
1. Passing in arguments
2. Creating the model with the API -- wts to engine
IHostMemory
APIToModel
IBuilder
IBuilderConfig
ICudaEngine
Engine serialization
3. Engine inference
Deserialization
Running inference
Launching forward inference
The doInference function
Complete code
To summarize, the main steps covered in this post are:
wts --> engine:
        1. Use IHostMemory to create a modelStream that the API will later write the engine into.
        2. Create the model with the API:
                1) Create an IBuilder;
                2) Use the builder from step 1) to create an IBuilderConfig;
                3) Use the custom createLenetEngine to build the network, write the weights from the wts file into it, and return the engine;
                4) Serialize the engine and write it into the modelStream from step 1.
        3. Write the modelStream from step 2 into the engine file.
engine --> inference:
        1. Get the size of the engine file;
        2. Read the engine contents into a heap buffer trtModelStream (in effect this gives us the serialized TRT model);
        3. Deserialize the model from step 2 with deserializeCudaEngine to obtain the engine;
        4. Create an executable context with createExecutionContext();
        5. Run inference.
1. Passing in arguments
The following lines check whether the command-line arguments are valid: -s means convert the wts file to an engine file, and -d means run forward inference with the engine.
int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "arguments not right!" << std::endl;
        std::cerr << "./lenet -s   // serialize model to plan file" << std::endl;
        std::cerr << "./lenet -d   // deserialize plan file and run inference" << std::endl;
        return -1;
    }

2. Creating the model with the API -- wts to engine
The core C++ API lives in NvInfer.h, so that header must be included.
We create the model through the API and serialize it into a data stream.
    // create a model using the API directly and serialize it to a stream
    char *trtModelStream{nullptr};
    size_t size{0};

When the argument is "-s", the model is serialized.
IHostMemory
First, a word about the IHostMemory class.
It is a handle to host memory allocated by the TensorRT library (here, the serialized engine), and it is not meant to be subclassed by application code.
It exposes a few member functions: data (a pointer to the start of the data), size (the size of the data in bytes), type (the data type), and destroy (to release the memory).
    if (std::string(argv[1]) == "-s") {
        IHostMemory* modelStream{nullptr}; // binary model data stream
        APIToModel(1, &modelStream);
        assert(modelStream != nullptr);
        std::ofstream p("lenet5.engine", std::ios::binary);
        if (!p)
        {
            std::cerr << "could not open plan output file" << std::endl;
            return -1;
        }
        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
        modelStream->destroy();
        return 1;
    }

APIToModel
APIToModel creates the model through the API. Its complete code is below. It takes two arguments: the first is the batch size, and the second is the IHostMemory object we created above, a binary model stream that starts out as a null pointer.
The function works with objects of several types: IBuilder, IBuilderConfig, and ICudaEngine. Each is introduced in turn below.
void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)
{
    // Create builder
    IBuilder* builder = createInferBuilder(gLogger);
    IBuilderConfig* config = builder->createBuilderConfig();

    // Create model to populate the network, then set the outputs and create an engine
    ICudaEngine* engine = createLenetEngine(maxBatchSize, builder, config, DataType::kFLOAT);
    assert(engine != nullptr);

    // Serialize the engine
    (*modelStream) = engine->serialize();

    // Close everything down
    engine->destroy();
    builder->destroy();
}

IBuilder
Create the builder.
In the line below, gLogger is boilerplate code used to report what happens during execution (these are the messages printed later while the engine is being built).
gLogger is the previously defined static Logger gLogger (this requires the logging.h header).
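logging.h ships with the sample code, not with TensorRT itself. At its core it just subclasses nvinfer1::ILogger; below is a minimal sketch of the idea (the real header is more elaborate, and the severity filter shown here is my assumption):

#include "NvInfer.h"
#include <iostream>

class Logger : public nvinfer1::ILogger
{
    // Called by TensorRT for every diagnostic message; keep warnings and errors, drop info/verbose
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};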
IBuilder* builder = createInferBuilder(gLogger);

IBuilderConfig
The builder configuration holds the details used to create the engine. As the line below shows, config points to the IBuilderConfig object returned by the builder's member function createBuilderConfig().
IBuilderConfig* config = builder->createBuilderConfig();

ICudaEngine
This class is declared in the NvInferRuntime.h header. It is an engine for executing inference on a built network, with functionally unsafe features, and it is not meant to be subclassed.
In the code below, the createLenetEngine function builds the engine used for network inference.
    // Create model to populate the network, then set the outputs and create an engine
    ICudaEngine* engine = createLenetEngine(maxBatchSize, builder, config, DataType::kFLOAT);

The full code of createLenetEngine is below; this is the core of the program.
The main steps to create the engine are:
1. Define an empty network;
2. Create an input tensor;
3. Load the wts weights;
4. Build the convolution (and remaining) layers;
5. Set the output node's name and mark the network output.
// Create the engine using only the API and not any parser.
ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)
{
    INetworkDefinition* network = builder->createNetworkV2(0U);

    // Create input tensor of shape { 1, 32, 32 } with name INPUT_BLOB_NAME
    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
    assert(data);

    // Add convolution layer with 6 outputs and a 5x5 filter.
    std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");
    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
    assert(conv1);
    conv1->setStrideNd(DimsHW{1, 1});

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
    assert(relu1);

    // Add average pooling layer with stride of 2x2 and kernel size of 2x2.
    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
    assert(pool1);
    pool1->setStrideNd(DimsHW{2, 2});

    // Add second convolution layer with 16 outputs and a 5x5 filter.
    IConvolutionLayer* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);
    assert(conv2);
    conv2->setStrideNd(DimsHW{1, 1});

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);
    assert(relu2);

    // Add second average pooling layer with stride of 2x2 and kernel size of 2x2.
    IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
    assert(pool2);
    pool2->setStrideNd(DimsHW{2, 2});

    // Add fully connected layer
    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
    assert(fc1);

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
    assert(relu3);

    // Add second fully connected layer
    IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);
    assert(fc2);

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);
    assert(relu4);

    // Add third fully connected layer
    IFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);
    assert(fc3);

    // Add softmax layer to determine the probability.
    ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
    assert(prob);
    prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    network->markOutput(*prob->getOutput(0));

    // Build engine
    builder->setMaxBatchSize(maxBatchSize);
    config->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // Don't need the network any more
    network->destroy();

    // Release host memory
    for (auto& mem : weightMap)
    {
        free((void*) (mem.second.values));
    }

    return engine;
}

The function returns an ICudaEngine. It takes four arguments: the batch size, the builder, the builder config, and the DataType. (Note: the original sample's comments said "max pooling", but the code uses PoolingType::kAVERAGE, matching classic LeNet; the comments above have been corrected.)
There are two ways to create a network: build it layer by layer with the TRT API, or use a parser to convert an existing model into a Network. Here we take the former approach.
1. Defining the network
INetworkDefinition is used to define the network. We call the builder's member function createNetworkV2; createNetworkV2(0U) first creates an empty Network (the 0U flag selects the implicit-batch mode this demo relies on).
INetworkDefinition* network = builder->createNetworkV2(0U);
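For reference, here is a hedged sketch of the explicit-batch variant that newer TensorRT code favors. This demo does not use it, since the setMaxBatchSize and enqueue calls below rely on implicit batch:

// Explicit-batch variant (for reference only; this demo keeps implicit batch)
const uint32_t explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);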
2. Creating the input tensor
addInput creates a tensor. It takes three arguments: a const char*, a DataType, and a Dims. Here the call is addInput("data", dt, Dims3{1, 32, 32}).
    // Create input tensor of shape { 1, 32, 32 } with name INPUT_BLOB_NAME
    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
    assert(data);

3. Loading the wts weights
    // Add convolution layer with 6 outputs and a 5x5 filter.
    std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");

The full code of loadWeights is below; file is the path to the weight file:
std::map<std::string, Weights> loadWeights(const std::string file)
{
    std::cout << "Loading weights: " << file << std::endl;
    std::map<std::string, Weights> weightMap;

    // Open weights file
    std::ifstream input(file);
    assert(input.is_open() && "Unable to load weight file.");

    // Read number of weight blobs
    int32_t count;
    input >> count;
    assert(count > 0 && "Invalid weight map file.");

    while (count--)
    {
        Weights wt{DataType::kFLOAT, nullptr, 0};
        uint32_t size;

        // Read name and size of blob
        std::string name;
        input >> name >> std::dec >> size;
        wt.type = DataType::kFLOAT;

        // Load blob (note: the original used sizeof(val), i.e. the pointer size, which over-allocates)
        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));
        for (uint32_t x = 0, y = size; x < y; ++x)
        {
            input >> std::hex >> val[x];
        }
        wt.values = val;
        wt.count = size;
        weightMap[name] = wt;
    }

    return weightMap;
}
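For orientation, the file this parses is plain text: the first line is the number of weight blobs, and each following line is a blob name, its element count in decimal, then that many hex-encoded 32-bit float words. A hypothetical excerpt (the blob names and counts match LeNet; the hex values are made-up placeholders):

10
conv1.weight 150 3d8a2f1c bd01c4e6 3e12bb09 ... (150 hex words in total)
conv1.bias 6 3c9d70a4 bb84c2f0 ... (6 hex words in total)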
4. Building the convolution layer
Call addConvolutionNd: the first argument is the input tensor, the second is the number of output channels, the third is the kernel size (5 x 5 here), the fourth is the convolution's weights, and the fifth is its bias. The stride is then set to 1 with setStrideNd, which sets the stride in both height and width.
IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
assert(conv1);
conv1->setStrideNd(DimsHW{1, 1});
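LeNet needs neither padding nor dilation, but if your network did, IConvolutionLayer exposes setters in the same style as setStrideNd. A hedged sketch (the values here are illustrative, not part of this demo):

// Hypothetical extras, not needed for LeNet
conv1->setPaddingNd(DimsHW{0, 0});   // zero padding on height and width
conv1->setDilationNd(DimsHW{1, 1});  // no dilation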
4.1 Adding the activation
After defining the convolution we add an activation function; LeNet uses ReLU. addActivation() takes two arguments: the first is the input tensor, conv1->getOutput(0), which is the convolution layer's first (and only) output tensor; the second is the activation type, where kRELU means ReLU.
    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
    assert(relu1);

4.2 Adding the average pooling layer
Similarly, addPoolingNd adds a pooling layer: the first argument is the activation's output, used as this layer's input; the second is the pooling type; the third is the kernel size (2 x 2 here). The stride is likewise set to 2.
    // Add average pooling layer with stride of 2x2 and kernel size of 2x2.
    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
    assert(pool1);
    pool1->setStrideNd(DimsHW{2, 2});

4.3 Adding the fully connected layer
Call addFullyConnected to add a fully connected layer: the first argument is the input, the second is the number of output channels, and the third and fourth are the weight and bias.
    // Add fully connected layer
    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
    assert(fc1);

4.4 Adding softmax
Softmax is added with addSoftMax.
    // Add softmax layer to determine the probability.
    ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
    assert(prob);

5. Setting the output node's name and marking the output
    prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    network->markOutput(*prob->getOutput(0));

The steps above define the network structure by hand (you can simply follow the forward method of the PyTorch LeNet). All we are doing is assigning each layer's weights from the wts file to the corresponding layer of the network written in C++.
Building the engine:
builder is the builder we constructed earlier; setMaxBatchSize passes in the batch size.
config is the configuration from earlier; setMaxWorkspaceSize caps the GPU scratch memory the engine may use at execution time.
Build the engine by calling buildEngineWithConfig with the network defined above and the configuration.
Once built, the network itself can be destroyed. The engine is the built network model; it is serialized in the next step.
    // Build engine
    builder->setMaxBatchSize(maxBatchSize);
    config->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // Don't need the network any more
    network->destroy();
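Before buildEngineWithConfig, the config can also request reduced precision. A hedged sketch of the common FP16 switch (not part of this demo; whether it helps depends on your GPU):

// Optional: build FP16 kernels when the hardware has fast FP16 support
if (builder->platformHasFastFp16())
{
    config->setFlag(BuilderFlag::kFP16);
}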
After the above, release the host memory:

    // Release host memory
    for (auto& mem : weightMap)
    {
        free((void*) (mem.second.values));
    }
Engine serialization
    // Serialize the engine
    (*modelStream) = engine->serialize();

    // Close everything down
    engine->destroy();
    builder->destroy();
APIToModel above completes the engine build; main then writes the result out, leaving the engine file saved in the working directory. Next comes the inference stage.
3. Engine inference
Passing the -d argument selects inference mode.
Read the lenet5.engine file.
Get the size of the file contents.
Allocate a heap buffer trtModelStream of the same size and read the data stream into it with file.read.
    else if (std::string(argv[1]) == "-d") {
        std::ifstream file("lenet5.engine", std::ios::binary);
        if (file.good()) {
            file.seekg(0, file.end);
            size = file.tellg();
            file.seekg(0, file.beg);
            trtModelStream = new char[size];
            assert(trtModelStream);
            file.read(trtModelStream, size);
            file.close();
        }
    }
Create an all-ones image as a sample input (the original comment said "subtract mean", but the code simply fills the buffer with ones):

    // Create a dummy all-ones input image
    float data[INPUT_H * INPUT_W];
    for (int i = 0; i < INPUT_H * INPUT_W; i++)
        data[i] = 1.0;
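In a real application you would feed an actual image instead. A hedged sketch assuming OpenCV is available, replacing the all-ones block above (the file name and the simple divide-by-255 scaling are my assumptions; use whatever preprocessing the model was trained with):

#include <opencv2/opencv.hpp>

// Load a grayscale image, resize to the 32x32 network input, scale to [0, 1]
cv::Mat img = cv::imread("digit.png", cv::IMREAD_GRAYSCALE);
cv::resize(img, img, cv::Size(INPUT_W, INPUT_H));
float data[INPUT_H * INPUT_W];
for (int i = 0; i < INPUT_H * INPUT_W; i++)
    data[i] = img.data[i] / 255.0f;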
Create the runtime, passing in the logger to record messages:

    IRuntime* runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);

Deserialization
deserializeCudaEngine() performs the deserialization. It takes three arguments: the serialized model we placed on the heap, the model's size, and an IPluginFactory (nullptr here, since no plugins are used).
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);

Running inference
This step creates an execution context from the deserialized engine:
    IExecutionContext* context = engine->createExecutionContext();

Launching forward inference
OUTPUT_SIZE is 10 here, because LeNet's final output covers 10 classes.
    // Run inference
    float prob[OUTPUT_SIZE];
    for (int i = 0; i < 1000; i++) {
        auto start = std::chrono::system_clock::now();
        doInference(*context, data, prob, 1);
        auto end = std::chrono::system_clock::now();
        //std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
    }

    // Destroy the engine
    context->destroy();
    engine->destroy();
    runtime->destroy();
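After the loop, prob holds the 10 softmax scores from the last run. A small sketch for turning them into a prediction (the original sample only prints the raw values):

#include <algorithm>

// The index of the largest probability is the predicted digit class
int pred = std::max_element(prob, prob + OUTPUT_SIZE) - prob;
std::cout << "predicted class: " << pred << std::endl;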
The doInference function
The key function is doInference.
It takes four arguments: the execution context (created from the deserialized model), the input, the output, and the batch size.
Pointers to the input and output buffers are passed to the engine.
To bind the buffers we need the names of the input and output tensors, which we defined earlier: "data" for the input and "prob" for the output.
const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
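The buffer sizes below are hard-coded from INPUT_H, INPUT_W, and OUTPUT_SIZE. For a network whose shapes you don't know by heart, you can ask the engine instead; here is a sketch using the binding API this post already relies on:

// Element count of a binding, queried from the engine rather than hard-coded
size_t bindingVolume(const nvinfer1::ICudaEngine& engine, int bindingIndex)
{
    nvinfer1::Dims dims = engine.getBindingDimensions(bindingIndex);
    size_t volume = 1;
    for (int i = 0; i < dims.nbDims; i++)
        volume *= dims.d[i];
    return volume;
}
// e.g. CHECK(cudaMalloc(&buffers[inputIndex], batchSize * bindingVolume(engine, inputIndex) * sizeof(float)));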
Creating the buffers
Create two buffers whose sizes match the input and output.
    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));

Creating the stream
    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

Inference
DMA the batch input to the device, run inference on the batch asynchronously, and DMA the output back to the host.
    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);
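For comparison, the same round trip can be written synchronously without a stream. A sketch of the blocking variant using IExecutionContext::execute (implicit-batch API assumed, as in the rest of this post):

// Synchronous variant: blocking copies plus execute()
CHECK(cudaMemcpy(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice));
context.execute(batchSize, buffers);
CHECK(cudaMemcpy(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost));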
Releasing after inference

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
The complete doInference function:
void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
    const ICudaEngine& engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void* buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));

    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}
Complete code
#include "NvInfer.h" #include "cuda_runtime_api.h" #include "logging.h" #include <fstream> #include <map> #include <chrono>#define CHECK(status) \do\{\auto ret = (status);\if (ret != 0)\{\std::cerr << "Cuda failure: " << ret << std::endl;\abort();\}\} while (0)// stuff we know about the network and the input/output blobs static const int INPUT_H = 32; static const int INPUT_W = 32; static const int OUTPUT_SIZE = 10;const char* INPUT_BLOB_NAME = "data"; const char* OUTPUT_BLOB_NAME = "prob";using namespace nvinfer1;static Logger gLogger;// Load weights from files shared with TensorRT samples. // TensorRT weight files have a simple space delimited format: // [type] [size] <data x size in hex> std::map<std::string, Weights> loadWeights(const std::string file) {std::cout << "Loading weights: " << file << std::endl;std::map<std::string, Weights> weightMap;// Open weights filestd::ifstream input(file);assert(input.is_open() && "Unable to load weight file.");// Read number of weight blobsint32_t count;input >> count;assert(count > 0 && "Invalid weight map file.");while (count--){Weights wt{DataType::kFLOAT, nullptr, 0};uint32_t size;// Read name and type of blobstd::string name;input >> name >> std::dec >> size;wt.type = DataType::kFLOAT;// Load blobuint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));for (uint32_t x = 0, y = size; x < y; ++x){input >> std::hex >> val[x];}wt.values = val;wt.count = size;weightMap[name] = wt;}return weightMap; }// Creat the engine using only the API and not any parser. ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {INetworkDefinition* network = builder->createNetworkV2(0U);// Create input tensor of shape { 1, 32, 32 } with name INPUT_BLOB_NAMEITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});assert(data);// Add convolution layer with 6 outputs and a 5x5 filter.std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);assert(conv1);conv1->setStrideNd(DimsHW{1, 1});// Add activation layer using the ReLU algorithm.IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);assert(relu1);// Add max pooling layer with stride of 2x2 and kernel size of 2x2.IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});assert(pool1);pool1->setStrideNd(DimsHW{2, 2});// Add second convolution layer with 16 outputs and a 5x5 filter.IConvolutionLayer* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);assert(conv2);conv2->setStrideNd(DimsHW{1, 1});// Add activation layer using the ReLU algorithm.IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);assert(relu2);// Add second max pooling layer with stride of 2x2 and kernel size of 2x2>IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});assert(pool2);pool2->setStrideNd(DimsHW{2, 2});// Add fully connected layerIFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);assert(fc1);// Add activation layer using the ReLU algorithm.IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);assert(relu3);// Add second fully connected 
layerIFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);assert(fc2);// Add activation layer using the ReLU algorithm.IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);assert(relu4);// Add third fully connected layerIFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);assert(fc3);// Add softmax layer to determine the probability.ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));assert(prob);prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);network->markOutput(*prob->getOutput(0));// Build enginebuilder->setMaxBatchSize(maxBatchSize);config->setMaxWorkspaceSize(1 << 20);ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);// Don't need the network any morenetwork->destroy();// Release host memoryfor (auto& mem : weightMap){free((void*) (mem.second.values));}return engine; }void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {// Create builderIBuilder* builder = createInferBuilder(gLogger);IBuilderConfig* config = builder->createBuilderConfig();// Create model to populate the network, then set the outputs and create an engineICudaEngine* engine = createLenetEngine(maxBatchSize, builder, config, DataType::kFLOAT);assert(engine != nullptr);// Serialize the engine(*modelStream) = engine->serialize();// Close everything downengine->destroy();builder->destroy(); }void doInference(IExecutionContext& context, float* input, float* output, int batchSize) {const ICudaEngine& engine = context.getEngine();// Pointers to input and output device buffers to pass to engine.// Engine requires exactly IEngine::getNbBindings() number of buffers.assert(engine.getNbBindings() == 2);void* buffers[2];// In order to bind the buffers, we need to know the names of the input and output tensors.// Note that indices are guaranteed to be less than IEngine::getNbBindings()const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);// Create GPU buffers on deviceCHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));// Create streamcudaStream_t stream;CHECK(cudaStreamCreate(&stream));// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to hostCHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));context.enqueue(batchSize, buffers, stream, nullptr);CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));cudaStreamSynchronize(stream);// Release stream and bufferscudaStreamDestroy(stream);CHECK(cudaFree(buffers[inputIndex]));CHECK(cudaFree(buffers[outputIndex])); }int main(int argc, char** argv) {if (argc != 2) {std::cerr << "arguments not right!" 
<< std::endl;std::cerr << "./lenet -s // serialize model to plan file" << std::endl;std::cerr << "./lenet -d // deserialize plan file and run inference" << std::endl;return -1;}// create a model using the API directly and serialize it to a streamchar *trtModelStream{nullptr};size_t size{0};if (std::string(argv[1]) == "-s") {IHostMemory* modelStream{nullptr}; //模型二進制數據流APIToModel(1, &modelStream);assert(modelStream != nullptr);std::ofstream p("lenet5.engine", std::ios::binary);if (!p){std::cerr << "could not open plan output file" << std::endl;return -1;}p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());modelStream->destroy();return 1;} else if (std::string(argv[1]) == "-d") {std::ifstream file("lenet5.engine", std::ios::binary);if (file.good()) {file.seekg(0, file.end);size = file.tellg();file.seekg(0, file.beg);trtModelStream = new char[size];assert(trtModelStream);file.read(trtModelStream, size);file.close();}} else {return -1;}// Subtract mean from imagefloat data[INPUT_H * INPUT_W];for (int i = 0; i < INPUT_H * INPUT_W; i++)data[i] = 1.0;IRuntime* runtime = createInferRuntime(gLogger);assert(runtime != nullptr);ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);assert(engine != nullptr);IExecutionContext* context = engine->createExecutionContext();assert(context != nullptr);// Run inferencefloat prob[OUTPUT_SIZE];for (int i = 0; i < 1000; i++) {auto start = std::chrono::system_clock::now();doInference(*context, data, prob, 1);auto end = std::chrono::system_clock::now();std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;}// Destroy the enginecontext->destroy();engine->destroy();runtime->destroy();// Print histogram of the output distributionstd::cout << "\nOutput:\n\n";for (unsigned int i = 0; i < 10; i++){std::cout << prob[i] << ", ";}std::cout << std::endl;return 0; }總結