Deploying a yolov7_pose model with TensorRT and the C++ API

Published: 2023/11/30

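Although the title says yolov7_pose, the workflow below can be used to deploy other PyTorch models with TensorRT as well, as long as they can be exported to ONNX.

Repository: https://github.com/WongKinYiu/yolov7/tree/pose

OS: Ubuntu 18.04

CUDA version (as reported by the driver): 11.4

According to NVIDIA, TensorRT-based applications can run inference up to 40 times faster than CPU-only platforms. With TensorRT you can optimize neural network models trained in all major frameworks, calibrate for lower precision while preserving accuracy, and finally deploy the model to hyperscale data centers, embedded platforms, or automotive product platforms.

TensorRT is built on CUDA, NVIDIA's parallel programming model, and lets you use the libraries, development tools, and technologies in CUDA-X to optimize inference for all deep learning frameworks across artificial intelligence, autonomous machines, high-performance computing, and graphics.

TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, and natural language processing. Reduced-precision inference significantly lowers application latency, which is a requirement for many real-time services as well as autonomous and embedded applications.

Our deployment has two main steps: convert the PyTorch model to an ONNX model, then convert the ONNX model to a TensorRT engine.

The steps look simple, but there are quite a few pitfalls along the way.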

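1. Install TensorRT

First check your CUDA version: run nvidia-smi in cmd on Windows or in a terminal on Ubuntu. Note that nvidia-smi reports the highest CUDA version the installed driver supports. In general, pick the newest TensorRT release you can download for that CUDA version, which avoids running into operators that are not implemented yet. I lost a day to this.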
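If the version check needs to be scripted, the field can be pulled out of the nvidia-smi header with a regular expression. This is only a sketch: the helper name parse_cuda_version is made up for illustration, and the sample line is an illustrative header rather than output captured for this article.

```python
import re

def parse_cuda_version(smi_text):
    """Return the 'CUDA Version' field from nvidia-smi's header, or None."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_text)
    return match.group(1) if match else None

# Illustrative header line; real text would come from running `nvidia-smi`.
sample = "| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |"
print(parse_cuda_version(sample))
```

In practice the text could be obtained with subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.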

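Then download the matching release from the official site: click Download; if you do not have an NVIDIA account you need to register and log in. Official site: https://developer.nvidia.com/tensorrt

Accept the license agreement, then choose the release that matches your CUDA version. Mine is CUDA 11.4; the Tar package is the usual choice.

Next, extract the tar or zip package to wherever you want it installed. No further installation is needed after extraction; all we have to do is add its lib directory to the library search path and its bin directory to PATH.

Ubuntu: open ~/.bashrc with vim and append the following two lines at the end of the file.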

export LD_LIBRARY_PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH
export PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/bin:$PATH

Replace the TensorRT path with the location where you extracted it. Then source the file in the current terminal:

source ~/.bashrc

Then run trtexec directly. If it prints its help text without errors, TensorRT is installed correctly.

~/Downloads trtexec
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec -h
=== Model Options ===
  --uff=<file>                UFF model
  --onnx=<file>               ONNX model
  --model=<file>              Caffe model (default = no model, random weights used)
  --deploy=<file>             Caffe prototxt file
  --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is 
  ......
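The same check can be done from a script before attempting a conversion. A minimal sketch, assuming only that the executable is called trtexec as above; find_tool is a hypothetical helper name.

```python
import shutil

def find_tool(name="trtexec"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)

path = find_tool()
print(path if path else "trtexec not found on PATH; check ~/.bashrc and re-source it")
```

If this prints the "not found" message in a new terminal, the export lines were probably not added to ~/.bashrc or the file was not sourced.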

2. Convert the PyTorch model to ONNX

Starting with the YOLO project: the repository contains models/export.py.

Open the file and you will see the following argument settings.

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='./yolov5s.pt', help='weights path')
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='image size')  # height, width
    parser.add_argument('--batch-size', type=int, default=1, help='batch size')
    parser.add_argument('--grid', action='store_true', help='export Detect() layer grid')
    parser.add_argument('--device', default='cpu', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--dynamic', action='store_true', help='dynamic ONNX axes')  # ONNX-only
    parser.add_argument('--simplify', action='store_true', help='simplify ONNX model')  # ONNX-only
    parser.add_argument('--export-nms', action='store_true', help='export the nms part in ONNX model')  # ONNX-only, #opt.grid has to be set True for nms export to work
    opt = parser.parse_args()
    opt.img_size *= 2 if len(opt.img_size) == 1 else 1  # expand
    print(opt)
    set_logging()
    t = time.time()

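Set the arguments to suit your model. Note that if you changed the number of output classes or the number of keypoints, you have to edit the network code by hand when exporting the NMS layer: change the non_max_suppression arguments around line 361 of models/common.py.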

class NMS(nn.Module):
    # Non-Maximum Suppression (NMS) module
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.25, kpt_label=False):
        super(NMS, self).__init__()
        self.conf=conf
        self.kpt_label = kpt_label


    def forward(self, x):
        return non_max_suppression(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label,nc=2,nkpt=3)

class NMS_Export(nn.Module):
    # Non-Maximum Suppression (NMS) module used while exporting ONNX model
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.001, kpt_label=False):
        super(NMS_Export, self).__init__()
        self.conf = conf
        self.kpt_label = kpt_label

    def forward(self, x):
        return non_max_suppression_export(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label,nc=2)

We need to change nc and nkpt to our own settings; in my case there are 2 classes and 3 keypoints. Then export the model.

python models/export.py --img-size 960 --weights /home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.pt --grid --export-nms --simplify

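If all goes well we get a model in ONNX format. Open https://netron.app/ and load the ONNX model to view its graph.

What we care about are the model's inputs and outputs, and their shapes.

The graph shows that my model's input is named images with shape 1 * 3 * 960 * 960, and the output is named detections, whose shape is not yet known. When the shape is unclear, we can run the model with onnxruntime to find out.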

import onnxruntime
import numpy as np
import cv2
# Path to your ONNX model file
onnx_model_path = '/home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.onnx'
# Create the ONNX Runtime inference session
sess = onnxruntime.InferenceSession(onnx_model_path)

# Get the input name and shape
input_name = sess.get_inputs()[0].name
input_shape = sess.get_inputs()[0].shape

# Path to the test image
image_path = '/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg'

# Read the image with OpenCV
image = cv2.imread(image_path)
# Resize the image to the model's input size (width, height)
resized_image = cv2.resize(image, (input_shape[3], input_shape[2]))
# Convert to float and normalize to [0, 1]
input_data = resized_image.astype(np.float32) / 255.0
# Rearrange HWC -> CHW and add the batch dimension expected by the model
input_data = np.transpose(input_data, [2, 0, 1])
input_data = np.expand_dims(input_data, axis=0)

# Run inference
outputs = sess.run(None, {input_name: input_data})

# Print every output of the model, then the shape of the last one
for i, output_data in enumerate(outputs):
    print(f"Output {i + 1}: {output_data}")
print(f"Output  {output_data.shape}")

Output 1: [[8.01661621e+02 1.53809937e+02 9.72689453e+02 3.77949707e+02
  4.21597920e-02 0.00000000e+00 5.15294671e-02 8.84101624e+02
  2.51692810e+02 9.91612077e-01 9.03469177e+02 1.68072296e+02
  6.35425091e-01 8.85691345e+02 1.72709320e+02 7.30822206e-01]
 [7.85901917e+02 1.61294067e+02 9.64655701e+02 3.66809448e+02
  4.08335961e-02 1.00000000e+00 6.32926583e-01 8.77714966e+02
  2.57085205e+02 9.89280879e-01 8.91954224e+02 1.80863663e+02
  2.32283741e-01 8.78342041e+02 1.87161697e+02 5.20734370e-01]
 [7.05231201e+02 3.90309601e+02 7.51886230e+02 4.35935760e+02
  1.86153594e-02 1.00000000e+00 6.94520175e-01 7.35046814e+02
  4.11621490e+02 7.23196447e-01 7.14923584e+02 4.14582092e+02
  4.62090850e-01 7.09832214e+02 4.12042603e+02 2.80124098e-01]
 [4.01937828e+01 4.64705994e+02 1.51267151e+02 6.35167419e+02
  1.55489137e-02 1.00000000e+00 9.99976933e-01 8.51227875e+01
  5.72096252e+02 9.97074127e-01 8.59449158e+01 4.89000427e+02
  9.83235717e-01 8.48072968e+01 5.18143494e+02 9.95639443e-01]
 [4.67657043e+02 2.47014786e+02 6.09315125e+02 4.11179565e+02
  1.50994565e-02 0.00000000e+00 1.29642010e-01 5.45577820e+02
  3.71885773e+02 9.93896723e-01 5.56157104e+02 3.50972717e+02
  9.97142434e-01 5.54454590e+02 3.20836670e+02 9.76849675e-01]
 [3.69356445e+02 1.81159134e+01 4.91651611e+02 1.81579437e+02
  1.44530777e-02 1.00000000e+00 9.98439074e-01 4.16761169e+02
  1.16163483e+02 9.97292042e-01 4.29588745e+02 2.69206352e+01
  9.79286790e-01 4.28487366e+02 8.01969910e+01 9.97563720e-01]
 [7.12836548e+02 3.89805634e+02 7.66137817e+02 4.36001556e+02
  1.32421134e-02 0.00000000e+00 2.13130921e-01 7.40284363e+02
  4.09640594e+02 7.56286979e-01 7.18195129e+02 4.12563293e+02
  1.05279446e-01 7.11785156e+02 4.14483521e+02 1.00254148e-01]
 [7.01546204e+02 3.92902222e+02 7.31227966e+02 4.25415100e+02
  1.30005283e-02 1.00000000e+00 9.94012475e-01 7.22401733e+02
  4.12053406e+02 4.85429347e-01 7.12319214e+02 4.13364197e+02
  7.06610680e-01 7.13084656e+02 4.11362488e+02 4.67233360e-01]
 [6.80663696e+02 4.66796997e+02 7.09215454e+02 4.98112915e+02
  1.06324852e-02 0.00000000e+00 6.49383068e-02 6.97597473e+02
  4.87214142e+02 9.42029715e-01 6.90804749e+02 4.85028137e+02
  9.82081532e-01 6.85866089e+02 4.70633820e+02 9.92424369e-01]]
Output  (9, 16)

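The output here has shape 9 * 16. Because the number of boxes left after the NMS layer is not fixed, the general shape is x * 16. Looking closely at the 16 values in each row, the layout is

[x1, y1, x2, y2, conf, prob1, prob2, kpt1_x, kpt1_y, kpt1_conf, kpt2_x, kpt2_y, kpt2_conf, kpt3_x, kpt3_y, kpt3_conf]

The first four values are the bounding box, followed by the confidence, the class probabilities, and then each keypoint's coordinates and confidence.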
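To make the row layout concrete, here is a small sketch that computes the row width from the class and keypoint counts and splits one row into named fields. It follows the layout as described in this article; row_width and parse_row are illustrative names, not functions from the yolov7 repository.

```python
def row_width(nc, nkpt):
    # 4 box corners + 1 confidence + nc class scores + (x, y, conf) per keypoint
    return 4 + 1 + nc + 3 * nkpt

def parse_row(row, nc=2, nkpt=3):
    """Split one detection row into named fields, following the layout above."""
    if len(row) != row_width(nc, nkpt):
        raise ValueError("unexpected row length")
    kpt_start = 5 + nc
    keypoints = [tuple(row[kpt_start + 3 * k: kpt_start + 3 * k + 3])
                 for k in range(nkpt)]
    return {"box": tuple(row[0:4]), "conf": row[4],
            "probs": tuple(row[5:5 + nc]), "keypoints": keypoints}

# Fourth row of the output printed above, rounded for brevity.
row = [40.19, 464.71, 151.27, 635.17, 0.0155, 1.0, 0.99998,
       85.12, 572.10, 0.9971, 85.94, 489.00, 0.9832, 84.81, 518.14, 0.9956]
det = parse_row(row)
print(row_width(2, 3))        # 16
print(det["keypoints"][0])    # (85.12, 572.1, 0.9971)
```

With nc=2 and nkpt=3 the width is 4 + 1 + 2 + 9 = 16, which matches the second dimension of the printed shape.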

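3. Convert the ONNX model to a TensorRT .engine model

Run the command below (replace yolov7.onnx with the path of your exported model) and wait for the conversion to finish.

trtexec --onnx=yolov7.onnx --fp16 --saveEngine=yolov7.engine

If it fails, for example because an operator is not supported, try updating TensorRT to the latest version.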
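When several models need converting, the command above can be assembled in a script. A rough sketch using only the three flags shown in this article; trtexec_command is a hypothetical helper, and the actual conversion is left commented out.

```python
import shlex

def trtexec_command(onnx_path, engine_path, fp16=True):
    """Build the argument list for the trtexec call shown above."""
    args = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        args.append("--fp16")
    return args

cmd = trtexec_command("yolov7.onnx", "yolov7.engine")
print(shlex.join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the model path contains spaces.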

4. C++ deployment


#include <algorithm>
#include <cassert>
#include <cstring>
#include <iostream>
#include <fstream>
#include <vector>
#include <opencv2/opencv.hpp>
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#define INPUT_W 960
#define INPUT_H 960
#define DEVICE 0  // GPU id

#define CONF_THRESH 0.2

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char *msg) noexcept override {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;

#define CHECK(status) \
    do\
    {\
        auto ret = (status);\
        if (ret != 0)\
        {\
            std::cerr << "Cuda failure: " << ret << std::endl;\
            abort();\
        }\
    } while (0)

float *blobFromImage(cv::Mat &img) {
    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);

    float *blob = new float[img.total() * 3];
    int channels = 3;
    int img_h = img.rows;
    int img_w = img.cols;
    for (int c = 0; c < channels; c++) {
        for (int h = 0; h < img_h; h++) {
            for (int w = 0; w < img_w; w++) {
                blob[c * img_w * img_h + h * img_w + w] =
                        (((float) img.at<cv::Vec3b>(h, w)[c]) / 255.0f);
            }
        }
    }
    return blob;
}

cv::Mat static_resize(cv::Mat &img) {
    float r = std::min(INPUT_W / (img.cols * 1.0), INPUT_H / (img.rows * 1.0));
    int unpad_w = r * img.cols;
    int unpad_h = r * img.rows;
    cv::Mat re(unpad_h, unpad_w, CV_8UC3);
    cv::resize(img, re, re.size());
    cv::Mat out(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(114, 114, 114));  // cv::Mat takes (rows, cols)
    re.copyTo(out(cv::Rect(0, 0, re.cols, re.rows)));
    return out;
}

const char *INPUT_BLOB_NAME = "images";
const char *OUTPUT_BLOB_NAME = "detections";
static Logger gLogger;
static constexpr int MAX_OUTPUT_BBOX_COUNT = 100;
static constexpr int CLASS_NUM = 2;
static constexpr int LOCATIONS = 4;
static constexpr int KEY_POINTS_NUM = 3;
struct Keypoint {
    float x;
    float y;
    float kpt_conf;
};

struct alignas(float) Detection {
    // x1 y1 x2 y2 (top-left and bottom-right corners)
    float bbox[LOCATIONS];
    float conf;  // bbox_conf * cls_conf
    float prob[CLASS_NUM]; // Probabilities for each class
    // 3 keypoints
    Keypoint kpts[KEY_POINTS_NUM];
};


void
doInference(IExecutionContext &context, float *input, float *output, const int output_size, const int input_shape) {
    const ICudaEngine &engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void *buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);

    assert(engine.getBindingDataType(inputIndex) == nvinfer1::DataType::kFLOAT);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
    assert(engine.getBindingDataType(outputIndex) == nvinfer1::DataType::kFLOAT);
    // int mBatchSize = engine.getMaxBatchSize();

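    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], input_shape * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], output_size * sizeof(float)));
    // Zero the output buffer so rows the engine does not write have confidence 0
    CHECK(cudaMemset(buffers[outputIndex], 0, output_size * sizeof(float)));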

    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, input_shape * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueueV2(buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], output_size * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

static constexpr int DETECTION_SIZE = sizeof(Detection) / sizeof(float);

static void
postprocess_decode(float *feat_blob, float prob_threshold,std::vector<Detection> &objects_map) {
    for (int i = 0; i < MAX_OUTPUT_BBOX_COUNT; i++) {

        int base_index = i * DETECTION_SIZE;  // Calculate the base index for the current detection
        if (feat_blob[base_index + LOCATIONS] <= prob_threshold)
            continue;
        Detection det;
        // Copy the detection information from feat_blob to the Detection structure
        memcpy(&det, &feat_blob[base_index], DETECTION_SIZE * sizeof(float));
        objects_map.push_back(det);
    }
}

int main() {
    char *trtModelStream{nullptr};
    cudaSetDevice(DEVICE);
    size_t size{0};
    const char *engine_file_path = "/home/ubuntu/GITHUG/yolov7_pose/yolov7.engine";
    std::ifstream file(engine_file_path, std::ios::binary);
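    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    } else {
        // Without the engine file there is nothing to deserialize
        std::cerr << "Failed to open engine file: " << engine_file_path << std::endl;
        return -1;
    }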
    // create a model using the API directly and serialize it to a stream
    IRuntime *runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    IExecutionContext *context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream;
    // auto out_dims = engine->getBindingDimensions(1);


    int input_size = 1 * 3 * 960 * 960;
    int output_size = MAX_OUTPUT_BBOX_COUNT * 16 * 1;
    static float *prob = new float[output_size];
    const char *input_image_path = "/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg";
    cv::Mat img = cv::imread(input_image_path);
    cv::Mat pr_img = static_resize(img);
    float *blob;
//    cv::imshow("Image", pr_img);
    blob = blobFromImage(pr_img);

    cv::waitKey(200);
    // close any open windows
    cv::destroyAllWindows();
    doInference(*context, blob, prob, output_size, input_size);

    std::vector<Detection> objects_map;
    for (int i = 0; i < prob[0] && i < MAX_OUTPUT_BBOX_COUNT; i++) {
        std::cout << ": " << prob[i] << std::endl;
    }
    postprocess_decode(prob, CONF_THRESH, objects_map);
    float r_w = INPUT_W / (img.cols * 1.0);
    float r_h = INPUT_H / (img.rows * 1.0);
    cv::cvtColor(pr_img, pr_img, cv::COLOR_RGB2BGR);
    for (const auto &det: objects_map) {
            // Access other information in the Detection structure as needed
            // Example: Print bbox coordinates
            std::cout << "  Bbox: ";
            for (int i = 0; i < LOCATIONS; i++) {
                std::cout << det.bbox[i] << " ";
            }
            float r = 0.0;
            if (img.rows <= img.cols) {
                r = r_w;
            } else {
                r = r_h;
            }
            cv::Point pt1(det.bbox[0]/r, det.bbox[1]/r);
            cv::Point pt2(det.bbox[2]/r, det.bbox[3]/r);

            cv::rectangle(img, pt1, pt2, cv::Scalar(0, 255, 0), 2);

            cv::Point point1(det.kpts[0].x / r, det.kpts[0].y / r);
            cv::Point point2(det.kpts[1].x / r, det.kpts[1].y / r);
            cv::Point point3(det.kpts[2].x / r, det.kpts[2].y / r);
            // draw the line segments between keypoints
            cv::line(img, point1, point2, cv::Scalar(0, 0, 255), 2);  // Scalar is the color in (B, G, R) order; this one is red
            cv::line(img, point2, point3, cv::Scalar(255, 0, 0), 2);  // Scalar is the color in (B, G, R) order; this one is blue


            cv::imshow("Rectangle", img);
            cv::waitKey(0);
            std::cout << std::endl;
    }
    // Release the host buffers allocated with new[]
    delete[] blob;
    delete[] prob;
    return 0;
}

This is my demo and the final result.

The key part of the code is where the model output is parsed; feel free to use it as a reference.
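The parsing logic can be sketched in a few lines of Python for readers who want to check the arithmetic without building the C++ demo. This mirrors the demo's thresholding on the confidence at index 4 and its division by the resize ratio; the buffer contents and the 1920 x 1080 image size are made-up values for illustration.

```python
ROW = 16          # floats per detection, as in the output layout discussed earlier
MAX_ROWS = 100    # mirrors MAX_OUTPUT_BBOX_COUNT in the C++ demo

def decode(flat, conf_thresh=0.2):
    """Keep rows whose confidence (index 4) exceeds the threshold."""
    rows = []
    for i in range(min(MAX_ROWS, len(flat) // ROW)):
        row = flat[i * ROW:(i + 1) * ROW]
        if row[4] > conf_thresh:
            rows.append(row)
    return rows

def to_original(x, y, img_w, img_h, input_w=960, input_h=960):
    """Undo the top-left letterbox resize: divide by the resize ratio."""
    r = min(input_w / img_w, input_h / img_h)
    return x / r, y / r

# Hypothetical buffer: one confident row followed by one zero-filled (unused) row.
flat = [96.0, 48.0, 480.0, 240.0, 0.9] + [0.0] * 11 + [0.0] * ROW
kept = decode(flat)
print(len(kept))                                          # 1
print(to_original(kept[0][0], kept[0][1], 1920, 1080))    # (192.0, 96.0)
```

Because the resized image is pasted at the top-left corner of the padded input, no offset has to be subtracted before dividing by the ratio.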
