當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

NVIDIA-CUDA编程初探

發布時間：2023/12/20 编程问答 45 豆豆

生活随笔收集整理的這篇文章主要介紹了 NVIDIA-CUDA编程初探小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

CUDA的全稱是Compute Unified Device Architecture,是顯卡廠商NVIDIA推出的運算平臺，開發者可以使用C語言來編寫CUDA代碼，使用NVCC編譯器可以在支持CUDA的GPU處理器上以高速運行。雖然AMD也做顯卡，但是CUDA是老黃自家提出的標準，沒帶AMD一起玩兒，所以，提到基于CUDA的高性能計算，使用的都是Nvidia的顯卡。

首先安裝CUDA環境，具體方式參考博客：

FairMOT Cuda環境搭建并進行推理_tugouxp的專欄-CSDN博客環境準備1.PC Host Ubuntu 18.04.6,Linux Kernel 5.4，內核版本關系不大，記錄下來備查。2.安裝基礎工具,比如GCC，CMAKE，VIM，GIT等等，工具盡量完備, 如果做不到，遇到問題臨時下載也可。3.安裝python3發行版，我用的是anaconda發行版，具體版本是 Anaconda3-2020.11-Linux-x86_64.sh下載地址在如下鏈接，選擇對應的版本即可。https://repo.anaco...https://blog.csdn.net/tugouxp/article/details/121248457環境安裝后，可以用如下方法驗證環境：

nvidia-smi命令枚舉了系統中的所有顯卡支持信息

nvcc工具是CUDA編譯器，用nvcc -V 驗證編譯器是否可以工作:

cuda編程

編輯helloworld.cu文件，編碼內容:

#include <cuda_runtime.h> #include <stdio.h>int main(void) {printf("hellow world!\n");return 0; }

之后執行 nvcc helloworld.cu -o helloworld,并運行

可以看到，運行程序后打印除了helloworld.

但是，這個程序用到顯卡了嗎？很遺憾，沒有。如果非要用顯卡做點什么的化，可以改成這個樣子：

#include <cuda_runtime.h> #include <stdio.h>__global__ void kernel(void) {}int main(void) {kernel<<<1,1>>>();printf("hellow world!\n");return 0; }

我們定義了一個空函數送給GPU跑，函數是空函數，什么也沒做，白嫖一下GPU就退出，編譯并運行：

生成的helloworld文件是ELF格式的目標文件，與GCC產生的無異，可以通過objdump反編譯一把：

來看一下main函數的片段：

粗略一看，首先給人的印象是NVCC不是一個人在戰斗，畢竟我們的代碼才短短幾行，反編譯后卻有這么多條指令，而且貌似有些指令是沒有出現在源碼層面調用的。還能看出一點的就是源碼是按照C++編譯的，因為看到了明顯的名字改編。

那就是編譯器做的手腳咯,幸好我們有辦法確認這一點，方式就是在nvcc編譯的時候加上--verbose選項:

#$ _NVVM_BRANCH_=nvvm #$ _SPACE_= #$ _CUDART_=cudart #$ _HERE_=/usr/local/cuda-11.5/bin #$ _THERE_=/usr/local/cuda-11.5/bin #$ _TARGET_SIZE_= #$ _TARGET_DIR_= #$ _TARGET_DIR_=targets/x86_64-linux #$ TOP=/usr/local/cuda-11.5/bin/.. #$ NVVMIR_LIBRARY_DIR=/usr/local/cuda-11.5/bin/../nvvm/libdevice #$ LD_LIBRARY_PATH=/usr/local/cuda-11.5/bin/../lib::/usr/local/cuda-11.5/lib64 #$ PATH=/usr/local/cuda-11.5/bin/../nvvm/bin:/usr/local/cuda-11.5/bin:/home/caozilong/anaconda3/bin:/home/caozilong/anaconda3/condabin:/home/caozilong/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-11.5/bin #$ INCLUDES="-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" #$ LIBRARIES= "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib" #$ CUDAFE_FLAGS= #$ PTXAS_FLAGS= #$ gcc -D__CUDA_ARCH__=520 -D__CUDA_ARCH_LIST__=520 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=50 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "helloworld.cu" -o "/tmp/tmpxft_0000596f_00000000-9_helloworld.cpp1.ii" #$ cicc --c++14 --gnu_version=70500 --display_error_number --orig_src_file_name "helloworld.cu" --orig_src_path_name "/home/caozilong/cuda/helloworld.cu" --allow_managed -arch compute_52 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0000596f_00000000-3_helloworld.fatbin.c" -tused --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0000596f_00000000-4_helloworld.module_id" --gen_c_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.c" --stub_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.gpu" "/tmp/tmpxft_0000596f_00000000-9_helloworld.cpp1.ii" -o "/tmp/tmpxft_0000596f_00000000-6_helloworld.ptx" #$ ptxas -arch=sm_52 -m64 "/tmp/tmpxft_0000596f_00000000-6_helloworld.ptx" -o "/tmp/tmpxft_0000596f_00000000-10_helloworld.sm_52.cubin" #$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=52,file=/tmp/tmpxft_0000596f_00000000-10_helloworld.sm_52.cubin" "--image3=kind=ptx,sm=52,file=/tmp/tmpxft_0000596f_00000000-6_helloworld.ptx" --embedded-fatbin="/tmp/tmpxft_0000596f_00000000-3_helloworld.fatbin.c" #$ rm /tmp/tmpxft_0000596f_00000000-3_helloworld.fatbin #$ gcc -D__CUDA_ARCH_LIST__=520 -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=50 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "helloworld.cu" -o "/tmp/tmpxft_0000596f_00000000-5_helloworld.cpp4.ii" #$ cudafe++ --c++14 --gnu_version=70500 --display_error_number --orig_src_file_name "helloworld.cu" --orig_src_path_name "/home/caozilong/cuda/helloworld.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.cpp" --stub_file_name "tmpxft_0000596f_00000000-6_helloworld.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_0000596f_00000000-4_helloworld.module_id" "/tmp/tmpxft_0000596f_00000000-5_helloworld.cpp4.ii" #$ gcc -D__CUDA_ARCH__=520 -D__CUDA_ARCH_LIST__=520 -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.cpp" -o "/tmp/tmpxft_0000596f_00000000-11_helloworld.o" #$ nvlink -m64 --arch=sm_52 --register-link-binaries="/tmp/tmpxft_0000596f_00000000-7_helloworld_dlink.reg.c" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib" -cpu-arch=X86_64 "/tmp/tmpxft_0000596f_00000000-11_helloworld.o" -lcudadevrt -o "/tmp/tmpxft_0000596f_00000000-12_helloworld_dlink.sm_52.cubin" #$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -link "--image3=kind=elf,sm=52,file=/tmp/tmpxft_0000596f_00000000-12_helloworld_dlink.sm_52.cubin" --embedded-fatbin="/tmp/tmpxft_0000596f_00000000-8_helloworld_dlink.fatbin.c" #$ rm /tmp/tmpxft_0000596f_00000000-8_helloworld_dlink.fatbin #$ gcc -D__CUDA_ARCH_LIST__=520 -c -x c++ -DFATBINFILE="\"/tmp/tmpxft_0000596f_00000000-8_helloworld_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"/tmp/tmpxft_0000596f_00000000-7_helloworld_dlink.reg.c\"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__ "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=50 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -m64 "/usr/local/cuda-11.5/bin/crt/link.stub" -o "/tmp/tmpxft_0000596f_00000000-13_helloworld_dlink.o" #$ g++ -D__CUDA_ARCH_LIST__=520 -m64 -Wl,--start-group "/tmp/tmpxft_0000596f_00000000-13_helloworld_dlink.o" "/tmp/tmpxft_0000596f_00000000-11_helloworld.o" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group -o "helloworld"

現在總結一下CUDA編程的規則：

核函數，在GPU上執行的函數通常成為核函數，如上面程序中的kernel函數。
核函數一般通過標識符__global__修飾，通過<<<參數1，參數2>>>調用，用于說明內核函數中的線程數量，以及線程是如何組織的。
以線程格(Grid)的形式組織，每個線程格有若干個線程塊(block)組成，而每個線程塊又由若干個線程(thread)組成。
以Block為單位執行
能在主機端代碼中調用
調用時必須聲明內核函數的執行參數
在編程時，必須先為kernel函數中用到的數組或者變量分配好足夠的空間，再調用kernel函數，否則在GPU計算時會發生錯誤，例如越界或者報錯，甚至導致藍屏和死機。

CUDA的變成模型如下圖所示：

上面例子中,kernel函數恰好叫kernel是一種巧合，實際上你可以改成任何有意義的名字，只要按照CUDA要求的方式調用即可

#include <cuda_runtime.h> #include <stdio.h>__global__ void dummy(void) {}int main(void) {dummy<<<1,1>>>();printf("hellow world!\n");return 0; }

對于上如上的例子，我們探究一下它的控制流是如何進行的,首先我們看到反編譯文件中，首先main函數調用了_Z5dummyv

?不難看出這個函數名是經過C++名字改編的，我們用c++filt工具將其還原：

可以看到它就是dummy,我們繼續追蹤

可以看到dummy調用了_Z23__device_stub__Z5dummyvv函數,繼續追蹤，發現執行流最終調用了_Z16cudaLaunchKernelIcE9cudaErrorPKT_4dim3S4_PPvmP11CUstream_st，而經過反命名后，發現它是名為cudaLaunchKernel的cuda函數，這個函數并非代碼中顯示調用的，而是NVCC工具鏈生成的，所以，顧名思義，很可能就是這句調用發起了對GPU控制流的交接。

?global和device

__global__和__device__是函數修飾符，__global__表明被修飾的函數在設備上執行，但是在主機上調用，__device__，表示被修飾的函數在設備上執行，但是只能在其他__device__或者__global__函數中調用，說白了它只能在GPU中執行，并且被GPU中執行的函數調用。

新的例子

計算3+6等于幾的例子

#include <cuda_runtime.h> #include <stdio.h>__global__ void add(int a, int b, int *c) {*c = a + b; }int main(void) {int c;int *gpu_c;cudaMalloc((void **)&gpu_c, sizeof(int));add<<<1,1>>>(3,6,gpu_c);cudaMemcpy(&c, gpu_c, sizeof(int), cudaMemcpyDeviceToHost);cudaFree(gpu_c);printf("3 + 9 eques %d.\n", c);return 0; }

我們得到了正確的計算結果。

我們稍微改一下程序，將計算過程改為循環計算，然后用nvidia-smi工具監視一下GPU的資源使用情況:

#include <cuda_runtime.h> #include <stdio.h>__global__ void add(int a, int b, int *c) {*c = a + b; }int main(void) {int c;int *gpu_c;while(1){cudaMalloc((void **)&gpu_c, sizeof(int));add<<<1,1>>>(3,6,gpu_c);cudaMemcpy(&c, gpu_c, sizeof(int), cudaMemcpyDeviceToHost);cudaFree(gpu_c);printf("3 + 9 eques %d.\n", c);}return 0; }

編譯并運行：

運行過程中，使用watch -n 1 nvidia-smi命令監控GPU的資源變化情況,可以看到內存占用和GPU負載在不斷的發生變化：

復雜一些的例子：

下面的例子對兩個列向量求算術平方和，循環進行M次，分別用CPU和GPU計算，最后對統計到的計算速度進行對比：

#include <cuda_runtime.h> #include <device_launch_parameters.h> #include <stdio.h> #include <time.h>#define N (1024 * 1024) #define M (10000) #define THREADS_PER_BLOCK (1024)void cpu_vector_add(double *a, double *b, double *c, int n , int m) {int index, j;for(index = 0; index < n; index ++){for(j = 0; j < m; j ++){c[index] = a[index] * a[index] + b[index] * b[index];}}return; }__global__ void gpu_vector_add(double *a, double *b, double *c) {int j;int index = blockIdx.x * blockDim.x + threadIdx.x;for(j = 0; j < M; j ++){c[index] = a[index] * a[index] + b[index] * b[index];} }int main(void) {clock_t start, end;double *a, *b, *c;int size = N * sizeof(double);a = (double *)malloc(size);b = (double *)malloc(size);c = (double *)malloc(size);if(!a || !b || !c){printf("%s line %d, fatal error,malloc buffer failure.\n", __func__, __LINE__);return -1;}int j;for(j = 0; j < N; j ++){a[j] = b[j] = j;c[j] = 0;}start = clock();cpu_vector_add(a, b, c, N, M);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_cpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("CPU cost %f sectonds.\n", time_cpu_cost);start = clock();double *gpu_a, *gpu_b, *gpu_c;cudaMalloc((void**)&gpu_a, size);cudaMalloc((void**)&gpu_b, size);cudaMalloc((void**)&gpu_c, size);cudaMemcpy(gpu_a, a, size, cudaMemcpyHostToDevice);cudaMemcpy(gpu_b, b, size, cudaMemcpyHostToDevice);gpu_vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(gpu_a, gpu_b, gpu_c);cudaMemcpy(c, gpu_c, size, cudaMemcpyDeviceToHost);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_gpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("GPU cost %f sectonds.\n", time_gpu_cost);float faster_than = time_cpu_cost/time_gpu_cost;printf("GPU cost faster than CPU %f times.\n", faster_than);//printf("a = %p, b = %p, c = %p, gpu_a = %p, gpu_b = %p, gpu_c = %p.\n", a, b, c, gpu_a, gpu_b, gpu_c);free(a);free(b);free(c);cudaFree(gpu_a);cudaFree(gpu_b);cudaFree(gpu_c);return 0; }

編譯運行，查看計算結果：

nvcc helloworld.cu

經過實際測試,同樣的計算量，我的紅米本使用的MX250入門級顯卡要比Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz處理器快大約30倍，更別提那些用于數學計算的專業N卡了，這個數據讓我對顯卡的恐怖計算能力有了進一步的認識。

問題：

程序中存在兩類指針，malloc分配的主存指針和cudaMalloc分配的顯存指針，添加打印，打印出a,b,c,gpu_a,gpu_b,gpu_c的指針數值，根據打印來看，這些指針沒有明顯差別，難道cudaMalloc分配的并非是顯存上的地址？或者顯存和主存之間存在某種映射?不過可以確定的是，雖然指針的地址范圍相似，但是不可以在主機代碼中使用cudaMalloc()分配的指針進行主機內存讀寫操作（即不能進行解引用）。

#include <cuda_runtime.h> #include <device_launch_parameters.h> #include <stdio.h> #include <stdlib.h> #include <time.h>#define N (1024 * 1024) #define M (10000) #define THREADS_PER_BLOCK (1024)void cpu_vector_add(double *a, double *b, double *c, int n , int m) {int index, j;for(index = 0; index < n; index ++){for(j = 0; j < m; j ++){c[index] = a[index] * a[index] + b[index] * b[index];}}return; }__global__ void gpu_vector_add(double *a, double *b, double *c) {int j;int index = blockIdx.x * blockDim.x + threadIdx.x;for(j = 0; j < M; j ++){c[index] = a[index] * a[index] + b[index] * b[index];} }int main(void) {clock_t start, end;double *a, *b, *c;int size = N * sizeof(double);a = (double *)malloc(size);b = (double *)malloc(size);c = (double *)malloc(size);if(!a || !b || !c){printf("%s line %d, fatal error,malloc buffer failure.\n", __func__, __LINE__);return -1;}int j;for(j = 0; j < N; j ++){a[j] = b[j] = j;c[j] = 0;}start = clock();cpu_vector_add(a, b, c, N, M);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_cpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("CPU cost %f sectonds.\n", time_cpu_cost);start = clock();double *gpu_a, *gpu_b, *gpu_c;cudaMalloc((void**)&gpu_a, size);cudaMalloc((void**)&gpu_b, size);cudaMalloc((void**)&gpu_c, size);cudaMemcpy(gpu_a, a, size, cudaMemcpyHostToDevice);cudaMemcpy(gpu_b, b, size, cudaMemcpyHostToDevice);gpu_vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(gpu_a, gpu_b, gpu_c);cudaMemcpy(c, gpu_c, size, cudaMemcpyDeviceToHost);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_gpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("GPU cost %f sectonds.\n", time_gpu_cost);float faster_than = time_cpu_cost/time_gpu_cost;printf("GPU cost faster than CPU %f times.\n", faster_than);printf("a = %p, b = %p, c = %p, gpu_a = %p, gpu_b = %p, gpu_c = %p.\n", a, b, c, gpu_a, gpu_b, gpu_c);while(1){printf("%s line %d, *a = %f.\n", __func__, __LINE__, *a);printf("%s line %d, *gpu_c = %f.\n", __func__, __LINE__, *gpu_c);}free(a);free(b);free(c);cudaFree(gpu_a);cudaFree(gpu_b);cudaFree(gpu_c);return 0; }

和OPENCL的關系：

架構上，它們在同一層面，cuda和OpenCL都屬于一種并行計算開發語言，這從CUDA使用的編譯器和OpenCL雖然使用GCC編譯HOST側代碼，但是端冊代碼卻需編譯一個文本CL程序文件，交給OpenCL API執行在線編譯看出來，他們雖然都有吸取C的語法特點，但是異構加速核心這一塊和CPU端的編譯器是不共用的，關于OPENCL開發的例子，可以參考如下博客。

OpenCL編程之二_tugouxp的專欄-CSDN博客白嫖來的C端代碼:matrix.c：#include <stdio.h>#include <stdlib.h>#include <alloca.h>#include <CL/cl.h>#pragma warning( disable : 4996 )int main() {cl_int error;cl_platform_id platforms;cl_device_id devices;cl_context contehttps://blog.csdn.net/tugouxp/article/details/121844159

后來的測試：

半年后，重新安裝系統，使用DARKNET環境安裝CUDA和CUDANN之后:

?./setup.sh -InstallCUDA

再用老辦法安裝NVIDIA 驅動?

此時CUDA運行速率明顯加快，原因未知。

總結

NVCC是NVIDIA提供的用于編譯CUDA C程序的編譯器，它會自動將.cu文件分為帶有CUDA C語句的部分和不帶CUDA C語句的部分，并將后者交給本地的C/C++編譯器，當兩部分文件都編譯完畢，NVCC再將他們連接成可執行文件，默認的編譯命令不帶有任何參數，可以直接生成可執行文件。

KERNEL有三種編譯方式，分別為：

1.靜態編譯，也叫離線編譯，在執行前由C++、GCC或者VENDOR SPECIFIC的工具鏈進行編譯。

2.動態編譯，也叫運行時編譯，在CPU運行過程中編譯GPU KENREL代碼。

3.運行時IR編譯，執行前先i將KERNEL代碼編譯為IR，用戶驅動集成COMPILER的BACKEND代碼，在運行時再將IR編譯為GPU上運行的指令，是1，2，兩種情況的融合。

OPENCL可以看作是CUDA的一個保守發行版，對用戶來說，如果要使用NVIDIA產品的最新特性，則需要使用CUDA而不是OPENCL。

結束！?

總結

以上是生活随笔為你收集整理的NVIDIA-CUDA编程初探的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

nvidia
CUDA

上一篇：简单几何(极角排序) POJ 2007
下一篇：数据结构学习感悟