
Optimizing and Accelerating AI Inference with the TensorRT Container from NVIDIA NGC

Natural language processing (NLP) is one of the most challenging tasks for AI because it must understand context, phonics, and accent to convert human speech into text. Building this AI workflow starts with training a model that can understand and process spoken language into text.

BERT is one of the best models for this task. Instead of building a state-of-the-art model like BERT from scratch, you can fine-tune a pretrained BERT model for your specific use case and pair it with NVIDIA Triton Inference Server. Two BERT-based models are available:

· BERT-Base, with 12 layers, 12 attention heads, and 110 million parameters

· BERT-Large, with 24 layers, 16 attention heads, and 340 million parameters

Many of the parameters in these models are sparse, and the sheer number of parameters lowers inference throughput. This post uses BERT inference as an example to show how to use the TensorRT container from NVIDIA NGC to improve the inference performance of your AI models.

Prerequisites

This post uses the following resources:

· The TensorFlow container for GPU-accelerated training

· A system with up to eight NVIDIA GPUs, such as DGX-1

o Other NVIDIA GPUs can be used but the training time varies with the number and type of GPU.

o GPU-based instances are available on all major cloud service providers.

· NVIDIA Docker

· The latest CUDA driver

Get the assets from NGC

Before you can start the BERT optimization process, you must obtain a few assets from NGC:

· A fine-tuned BERT-large model

· Model scripts for running inference with the fine-tuned model, in TensorFlow

Fine-tuned BERT-Large model

If you followed our previous post, Jump-start AI Training with NGC Pretrained Models On-Premises and in the Cloud, you’ll see that we are using the same fine-tuned model for optimization.

If you didn’t get a chance to fine-tune your own model, make a directory and download the pretrained model files. You have several download options.

Option 1: Download from the command line using the following commands. In the terminal, use wget to download the fine-tuned model:

mkdir bert_model && cd bert_model

wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/bert_config.json

wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/model.ckpt-5474.data-00000-of-00001

wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/model.ckpt-5474.index

wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/model.ckpt-5474.meta

wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/vocab.txt

Option 2: Download from the NGC website.

  1. In your browser, navigate to the model repo page.

  2. In the top right corner, choose Download.

  3. After the zip file finishes downloading, unzip the files.

Refer to the directory where the fine-tuned model is saved as $MODEL_DIR. It can be the model that you saved from
our previous post, or the model that you just downloaded.

When you are in this directory, export it:

export MODEL_DIR=$PWD

cd ..

Model scripts for running inference with the fine-tuned model

Use the following scripts to see the performance of BERT inference in TensorFlow format. To download the model
scripts:

  1. In your browser, navigate to the model scripts page.

  2. At the top right, choose Download.

Figure 1. BERT inference model in TensorFlow from NGC.

Alternatively, the model script can be downloaded using git from the NVIDIA Deep Learning Examples on GitHub:

mkdir bert_tf && cd bert_tf

git clone https://github.com/NVIDIA/DeepLearningExamples.git

You are doing TensorFlow inference from the BERT directory. Whether you downloaded using the NGC webpage or GitHub, refer to this directory moving forward as $BERT_DIR.

Export this directory as follows:

export BERT_DIR=$PWD'/DeepLearningExamples/TensorFlow/LanguageModeling/BERT/'

cd ..

Before cloning the TensorRT GitHub repo,
run the following command:

mkdir bert_trt && cd bert_trt

To get the scripts required for converting the BERT TensorFlow model to TensorRT and running it, follow the steps in Downloading the TensorRT Components. Make sure that the directory locations are correct:

· $MODEL_DIR—Location of the BERT model checkpoint files.

· $BERT_DIR—Location of the BERT TF scripts.
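Before building anything, it helps to confirm that both directories contain what the later steps expect. The quick check below is optional; the checkpoint number shown matches the model downloaded above, and the exact file listing of $BERT_DIR may vary with the repository version.

ls $MODEL_DIR
# expected: bert_config.json  model.ckpt-5474.data-00000-of-00001  model.ckpt-5474.index  model.ckpt-5474.meta  vocab.txt

ls $BERT_DIR
# expected to include, among other files: data/  scripts/  and the Dockerfile used in the next section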

TensorFlow performance evaluation

In this section, you build, run, and evaluate the performance of BERT in TensorFlow.

Set up and run a Docker container

Build the Docker container by running the following command:

docker build $BERT_DIR -t bert

Launch the BERT container, with two mounted
volumes:

· One volume for the BERT model scripts code repo, mounted to /workspace/bert.

· One volume for the fine-tuned model that you either fine-tuned yourself or downloaded from NGC, mounted to /finetuned-model-bert.

docker run --gpus all -it \

-v $BERT_DIR:/workspace/bert \

-v $MODEL_DIR:/finetuned-model-bert \

bert

Prepare the dataset

You are evaluating the BERT model using the SQuAD dataset. For more information, see SQuAD1.1: The Stanford Question Answering Dataset.

export BERT_PREP_WORKING_DIR="/workspace/bert/data"

python3 /workspace/bert/data/bertPrep.py --action download --dataset squad

If the line import PubMedTextFormatting gives any errors in the bertPrep.py script, comment it out, as you don't need the PubMed dataset in this example.

This script downloads two folders in $BERT_PREP_WORKING_DIR/download/squad/: v2.0/ and v1.1/. For this post, use v1.1/.
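As an optional check, the v1.1/ folder should contain the standard SQuAD v1.1 distribution files (the names below are the usual SQuAD 1.1 contents, not output printed by bertPrep.py):

ls $BERT_PREP_WORKING_DIR/download/squad/v1.1
# dev-v1.1.json  evaluate-v1.1.py  train-v1.1.json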

Run evaluations with the TensorFlow model

Inside the container, navigate to the BERT workspace that contains the model scripts:

cd /workspace/bert/

You can run inference with a fine-tuned model in TensorFlow using scripts/run_squad.sh. For more information, see Jump-start AI Training with NGC Pretrained Models On-Premises and in the Cloud.

There are two modifications to this script.
First, set it to prediction-only mode:

· --do_train=False

· --do_predict=True

After you set --do_train=False in run_squad.sh, the training-related parameters that you pass to the script no longer have any effect in this scenario; a rough sketch of the edited invocation follows.
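As an illustration only, assuming the script wraps a python run_squad.py call (the exact argument list in your copy of scripts/run_squad.sh may differ), the edited invocation ends up carrying flags along these lines:

python run_squad.py --do_train=False --do_predict=True   # ...plus the script's existing arguments, unchanged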

Second, comment out the following block starting at line number 27:

#if [ "$bert_model" = "large" ] ; then
#    export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
#else
#    export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
#fi

Because you can get vocab.txt and bert_config.json from the mounted directory /finetuned-model-bert, you do not
need this block of code.

Now, export BERT_DIR inside the container:

export BERT_DIR=/finetuned-model-bert

After making the modifications, issue the following command:

bash scripts/run_squad.sh 1 5e-6 fp16 true 1 384 128 large 1.1 /finetuned-model-bert/model.ckpt<-num>

Substitute for <-num> the checkpoint number that is actually present in your model directory (for the fine-tuned model downloaded earlier, the checkpoint files are named model.ckpt-5474.*). The run ends with a throughput report:

INFO:tensorflow:Throughput Average (sentences/sec) = 106.56

We observed that inference speed is 106.56 sentences per second for running inference directly in TensorFlow on a system powered with a single NVIDIA T4 GPU. Performance may differ depending on the number of GPUs and the architecture of the GPUs.

This is good performance, but could it be better? Investigate by using the scripts in /workspace/bert/trt/ to convert the TF model into TensorRT 7.1, then run inference on the TensorRT BERT model engine. For that process, switch over to the TensorRT repo and build a Docker image to launch.

Issue the following command:

exit

TensorRT performance evaluation

In this section, you build, run, and evaluate the performance of BERT in TensorRT. Before proceeding, make sure that you have downloaded and set up the TensorRT GitHub repo.

Set up a Docker container

In this step, you build and launch the Docker image from Dockerfile for TensorRT.

On your host machine, navigate to the TensorRT directory:

cd TensorRT

The script docker/build.sh builds the TensorRT Docker container:

./docker/build.sh --file docker/ubuntu.Dockerfile --tag tensorrt-ubuntu --os 18.04 --cuda 11.0

After the container is built, launch it by executing the docker/launch.sh script. However, before launching the container, modify docker/launch.sh to add -v $MODEL_DIR:/finetuned-model-bert and -v $BERT_DIR/data/download/squad/v1.1:/data/squad to docker_args, to pass in your fine-tuned model and the SQuAD dataset, respectively.

The docker_args at line 49 should look like the following code:

docker_args="$extra_args -v
$MODEL_DIR:/finetuned-model-bert -v
$BERT_DIR/data/download/squad/v1.1:/data/squad -v $arg_trtrelease:/tensorrt -v
$arg_trtsource:/workspace/TensorRT -it $arg_imagename:latest"

Now build and launch the Docker image
locally:

./docker/launch.sh --tag tensorrt-ubuntu --gpus all --release $TRT_RELEASE --source $TRT_SOURCE

When you are in the container, you must
build the TensorRT plugins:

cd $TRT_SOURCE

export LD_LIBRARY_PATH=$(pwd)/build/out:$LD_LIBRARY_PATH:/tensorrt/lib

mkdir -p build && cd build

cmake .. -DTRT_LIB_DIR=$TRT_RELEASE/lib -DTRT_OUT_DIR=$(pwd)/out

make -j$(nproc)

pip3 install /tensorrt/python/tensorrt-7.1*-cp36-none-linux_x86_64.whl
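A quick, optional way to confirm that the wheel installed correctly is to import the module and print its version; any 7.1.x version is expected here:

python3 -c "import tensorrt as trt; print(trt.__version__)"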

Now you are ready to build the BERT TensorRT engine.

Build the TensorRT engine

Make a directory to store the TensorRT engine:

mkdir -p /workspace/TensorRT/engines

Optionally, explore /workspace/TensorRT/demo/BERT/scripts/download_model.sh to see how you can use the ngc registry model download-version command to download models from NGC.
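For reference, the NGC CLI equivalent of the wget commands used earlier would look roughly like the following. This assumes the ngc CLI is installed and configured; the model name and version are taken from the download URLs above.

ngc registry model download-version "nvidia/bert_tf_v1_1_large_fp16_384:2"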

Run the builder.py script, noting the following values:

· Path to the TensorFlow model: /finetuned-model-bert/model.ckpt-5474

· Output path for the engine to be built: /workspace/TensorRT/engines/bert_large_384.engine

· Batch size: 1

· Sequence length: 384

· Precision: fp16

· Checkpoint path: /finetuned-model-bert

cd /workspace/TensorRT/demo/BERT

python3 builder.py -m /finetuned-model-bert/model.ckpt-5474 -o /workspace/TensorRT/engines/bert_large_384.engine -b 1 -s 384 --fp16 -c /finetuned-model-bert/

Make sure that you provide the correct checkpoint model. The script takes ~1-2 mins to build the TensorRT engine.
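Optionally, you can sanity-check that the serialized engine deserializes before moving on. The short Python sketch below is an assumption-laden convenience, not part of the official workflow: it presumes the custom BERT plugins built earlier are discoverable through LD_LIBRARY_PATH, loads the engine, and prints one of its properties.

import ctypes
import tensorrt as trt

# Load the plugin library so the custom BERT plugins are registered (built earlier into $TRT_SOURCE/build/out).
ctypes.CDLL("libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with open("/workspace/TensorRT/engines/bert_large_384.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

print("number of bindings:", engine.num_bindings)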

Run the TensorRT inference

Now run the built TensorRT inference engine on 2,000 samples from the SQuAD v1.1 evaluation dataset. To run it and get throughput numbers, replace the code from line number 222 to line number 228 in inference.py, as shown in the following code block.

Be mindful of indentation. If the prompt asks for a password while you are installing vim in the container, use the password nvidia.

if squad_examples:
    eval_time_l = []
    all_predictions = collections.OrderedDict()
    for example_index, example in enumerate(squad_examples):
        print("Processing example {} of {}".format(example_index+1, len(squad_examples)), end="\r")
        features = question_features(example.doc_tokens, example.question_text)
        eval_time_elapsed, prediction, nbest_json = inference(features, example.doc_tokens)
        eval_time_l.append(1.0/eval_time_elapsed)
        all_predictions[example.id] = prediction
        if example_index+1 == 2000:
            break
    print("Throughput Average (sentences/sec) = ", np.mean(eval_time_l))

Now run the inference:

CUDA_VISIBLE_DEVICES=0 python3 inference.py -e /workspace/TensorRT/engines/bert_large_384.engine -b 1 -s 384 -sq /data/squad/dev-v1.1.json -v /finetuned-model-bert/vocab.txt

Throughput Average (sentences/sec) = 136.59

We observed that inference speed is 136.59 sentences per second for running inference with TensorRT 7.1 on a system
powered with a single NVIDIA T4 GPU. Performance may differ depending on the number of GPUs and the architecture of the GPUs, where the data is stored and
other factors. However, you’ll always observe a performance boost due to model optimization using TensorRT.

Figure 2 shows that the TensorRT BERT engine gives an average throughput of 136.59 sentences/sec compared to 106.56 sentences/sec for the BERT model in TensorFlow. This is a 28% boost in throughput.
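As a quick arithmetic check, 136.59 / 106.56 ≈ 1.28, so the TensorRT engine delivers roughly 1.28 times the TensorFlow throughput, which matches the 28% figure above.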

Figure 2. Performance gained when running BERT in TensorRT over TensorFlow.

Summary

Pull the TensorRT container from NGC to easily and quickly performance tune your models in all major frameworks, create novel low-latency inference applications, and deliver the best quality of service (QoS) to customers.
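For example, pulling the container from NGC looks like the following; the tag below is a placeholder, so substitute a current release tag from the NGC catalog:

docker pull nvcr.io/nvidia/tensorrt:<yy.mm>-py3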
