CUDA from Beginner to Expert (4): Getting to Know the Device

Published: 2025/3/15

The previous three installments gave a brief introduction to CUDA; starting with this one, we get into actual programming.

First, a beginner should build a solid understanding of the device they are working with; it helps greatly when learning to optimize parallel programs later. Detailed hardware parameters can be found in the books and official documentation introduced in the previous installment, but if that still feels too abstract, we can query the information ourselves programmatically.


Using the example program from Part 2 as a template, the slightly modified portion of the code looks like this:

```cpp
// Add vectors in parallel.
cudaError_t cudaStatus;
int num = 0;
cudaDeviceProp prop;
cudaStatus = cudaGetDeviceCount(&num);
for (int i = 0; i < num; i++)
{
    cudaGetDeviceProperties(&prop, i);
}
cudaStatus = addWithCuda(c, a, b, arraySize);
```

The purpose of this change is to have the program query the number of devices and their properties automatically through CUDA API calls. As the saying goes, "know yourself and know your enemy, and you will never be defeated."

cudaError_t is CUDA's error type, an enumeration whose values are integers (cudaSuccess is 0).

cudaDeviceProp is the device-properties struct. Its definition can be found in the CUDA Toolkit installation directory; on my machine the path is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\driver_types.h. The definition is:

```cpp
/**
 * CUDA device properties
 */
struct __device_builtin__ cudaDeviceProp
{
    char   name[256];                  /**< ASCII string identifying device */
    size_t totalGlobalMem;             /**< Global memory available on device in bytes */
    size_t sharedMemPerBlock;          /**< Shared memory available per block in bytes */
    int    regsPerBlock;               /**< 32-bit registers available per block */
    int    warpSize;                   /**< Warp size in threads */
    size_t memPitch;                   /**< Maximum pitch in bytes allowed by memory copies */
    int    maxThreadsPerBlock;         /**< Maximum number of threads per block */
    int    maxThreadsDim[3];           /**< Maximum size of each dimension of a block */
    int    maxGridSize[3];             /**< Maximum size of each dimension of a grid */
    int    clockRate;                  /**< Clock frequency in kilohertz */
    size_t totalConstMem;              /**< Constant memory available on device in bytes */
    int    major;                      /**< Major compute capability */
    int    minor;                      /**< Minor compute capability */
    size_t textureAlignment;           /**< Alignment requirement for textures */
    size_t texturePitchAlignment;      /**< Pitch alignment requirement for texture references bound to pitched memory */
    int    deviceOverlap;              /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */
    int    multiProcessorCount;        /**< Number of multiprocessors on device */
    int    kernelExecTimeoutEnabled;   /**< Specifies whether there is a run time limit on kernels */
    int    integrated;                 /**< Device is integrated as opposed to discrete */
    int    canMapHostMemory;           /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */
    int    computeMode;                /**< Compute mode (See ::cudaComputeMode) */
    int    maxTexture1D;               /**< Maximum 1D texture size */
    int    maxTexture1DMipmap;         /**< Maximum 1D mipmapped texture size */
    int    maxTexture1DLinear;         /**< Maximum size for 1D textures bound to linear memory */
    int    maxTexture2D[2];            /**< Maximum 2D texture dimensions */
    int    maxTexture2DMipmap[2];      /**< Maximum 2D mipmapped texture dimensions */
    int    maxTexture2DLinear[3];      /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */
    int    maxTexture2DGather[2];      /**< Maximum 2D texture dimensions if texture gather operations have to be performed */
    int    maxTexture3D[3];            /**< Maximum 3D texture dimensions */
    int    maxTextureCubemap;          /**< Maximum Cubemap texture dimensions */
    int    maxTexture1DLayered[2];     /**< Maximum 1D layered texture dimensions */
    int    maxTexture2DLayered[3];     /**< Maximum 2D layered texture dimensions */
    int    maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */
    int    maxSurface1D;               /**< Maximum 1D surface size */
    int    maxSurface2D[2];            /**< Maximum 2D surface dimensions */
    int    maxSurface3D[3];            /**< Maximum 3D surface dimensions */
    int    maxSurface1DLayered[2];     /**< Maximum 1D layered surface dimensions */
    int    maxSurface2DLayered[3];     /**< Maximum 2D layered surface dimensions */
    int    maxSurfaceCubemap;          /**< Maximum Cubemap surface dimensions */
    int    maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */
    size_t surfaceAlignment;           /**< Alignment requirements for surfaces */
    int    concurrentKernels;          /**< Device can possibly execute multiple kernels concurrently */
    int    ECCEnabled;                 /**< Device has ECC support enabled */
    int    pciBusID;                   /**< PCI bus ID of the device */
    int    pciDeviceID;                /**< PCI device ID of the device */
    int    pciDomainID;                /**< PCI domain ID of the device */
    int    tccDriver;                  /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */
    int    asyncEngineCount;           /**< Number of asynchronous engines */
    int    unifiedAddressing;          /**< Device shares a unified address space with the host */
    int    memoryClockRate;            /**< Peak memory clock frequency in kilohertz */
    int    memoryBusWidth;             /**< Global memory bus width in bits */
    int    l2CacheSize;                /**< Size of L2 cache in bytes */
    int    maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */
};
```

The comments already explain what each field represents. Some of the terminology may still be hard for beginners; that's fine. For now we only need to pay attention to the following fields:

name: the device name;

totalGlobalMem: the amount of global (video) memory;

major, minor: the CUDA compute capability, which comes in versions such as 1.1, 1.2, 1.3, 2.0, and 2.1;

clockRate: the GPU clock frequency;

multiProcessorCount: the number of "big cores"; one big core (properly called a streaming multiprocessor, or SM) contains multiple "small cores" (streaming processors, or SPs).


Compile and run. In the VS2008 project, set a breakpoint at the cudaGetDeviceProperties() call, step over that function, then open the Watch window, switch to the Auto tab, and expand the +. On my laptop I get the following result:

As you can see, the device is a GeForce 610M with 1 GB of video memory, compute capability 2.1 (fairly high-end, ha), and a clock rate of 950 MHz (note the reported 950000 is in kHz). It has a single SM. On some high-performance GPUs (such as the Tesla and Kepler lines) the SM count can reach several dozen or more, enabling much larger-scale parallel processing.

PS: While reading the SDK samples today I found a function in helper_cuda.h that maps a CUDA compute capability to the number of small cores (SPs) per big core (SM). It is very handy and worth borrowing in future programs, so I excerpt it here:

```cpp
// Beginning of GPU Architecture definitions
inline int _ConvertSMVer2Cores(int major, int minor)
{
    // Defines for GPU Architecture types (using the SM version to determine the # of cores per SM)
    typedef struct
    {
        int SM; // 0xMm (hexadecimal notation), M = SM Major version, and m = SM minor version
        int Cores;
    } sSMtoCores;

    sSMtoCores nGpuArchCoresPerSM[] =
    {
        { 0x10,   8 }, // Tesla Generation (SM 1.0) G80 class
        { 0x11,   8 }, // Tesla Generation (SM 1.1) G8x class
        { 0x12,   8 }, // Tesla Generation (SM 1.2) G9x class
        { 0x13,   8 }, // Tesla Generation (SM 1.3) GT200 class
        { 0x20,  32 }, // Fermi Generation (SM 2.0) GF100 class
        { 0x21,  48 }, // Fermi Generation (SM 2.1) GF10x class
        { 0x30, 192 }, // Kepler Generation (SM 3.0) GK10x class
        { 0x35, 192 }, // Kepler Generation (SM 3.5) GK11x class
        {   -1,  -1 }
    };

    int index = 0;

    while (nGpuArchCoresPerSM[index].SM != -1)
    {
        if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))
        {
            return nGpuArchCoresPerSM[index].Cores;
        }

        index++;
    }

    // If we don't find the values, we default use the previous one to run properly
    printf("MapSMtoCores for SM %d.%d is undefined.  Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[7].Cores);
    return nGpuArchCoresPerSM[7].Cores;
}
// end of GPU Architecture definitions
```

As you can see, a compute capability 2.1 device has 48 small cores per big core, while capability 3.0 and above has 192 per big core!


As mentioned earlier, when the machine has multiple CUDA-capable GPUs, how do we pick which one to run on? Let's look at how the addWithCuda function does it.

```cpp
cudaError_t cudaStatus;

// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
}
```

It calls cudaSetDevice(0), where 0 is the index of the first device found; with multiple devices, they are numbered 0, 1, 2, and so on.

Now look back at the code we added in this installment. The call cudaGetDeviceCount(&num) retrieves the total number of devices, so the valid device indices for running a CUDA program are 0, 1, ..., num-1. We can therefore enumerate the devices one by one, query each one's properties with cudaGetDeviceProperties(&prop, i), apply some ranking or filtering criteria to find the index opt of the device that best fits our application, and then call cudaSetDevice(opt) to select it. The criteria can be based on compute power, compute capability, device name, and so on. These APIs will come up again later when we cover concurrent streams.


For more hardware details, see http://www.geforce.cn/hardware.
