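CUDA Series (4): Parallel Task Types and Memory Allocation
This is the fourth lecture in the CUDA learning series. It first introduces several parallel communication patterns (map, gather, scatter, stencil, transpose), then reviews the CUDA memory model and analyzes at a high level how to write efficient code, and finally covers control flow and one important part of it: atomic operations. Reference: Udacity CS344.
(1). Parallel Communication Patterns
In an earlier chapter, CUDA Series (2): CUDA memory & variables, we introduced the different kinds of memory and variables. In this chapter we classify tasks by how they map onto memory: Map, Gather, Scatter, Stencil, Transpose.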
1.1 Map, Gather, Scatter
- Map: one input, one output
- Gather: several inputs, one output
  e.g. image blur by averaging
- Scatter: one input, several outputs
  e.g. adding a value to each of its neighbors
  (Each thread scatters its result to several memory locations, hence the name.)
Figure: schematic of Map, Gather and Scatter.
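To make the three patterns concrete, here is a minimal sketch rather than the course's code. The kernel names, the 3-point window and the host helper `run_patterns` are all assumptions made for illustration. The scatter kernel uses `atomicAdd` (introduced in section 3.3) because several threads may write to the same output element.

```cuda
#include <cuda_runtime.h>

// Map: out[i] depends only on in[i].
__global__ void map_scale(const float* in, float* out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Gather: each thread reads several inputs and writes one output
// (3-point average; indices are clamped at the edges).
__global__ void gather_avg3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int l = (i > 0) ? i - 1 : 0;
    int r = (i < n - 1) ? i + 1 : n - 1;
    out[i] = (in[l] + in[i] + in[r]) / 3.0f;
}

// Scatter: each thread reads one input and adds a share of it to several
// outputs. Neighboring threads target the same out[j], hence atomicAdd.
__global__ void scatter_add3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float share = in[i] / 3.0f;
    for (int j = i - 1; j <= i + 1; ++j)
        if (j >= 0 && j < n) atomicAdd(&out[j], share);
}

// Hypothetical host helper: runs all three kernels on the same input.
void run_patterns(const float* h_in, int n, float k,
                  float* h_map, float* h_gather, float* h_scatter) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    int threads = 128, blocks = (n + threads - 1) / threads;

    map_scale<<<blocks, threads>>>(d_in, d_out, n, k);
    cudaMemcpy(h_map, d_out, bytes, cudaMemcpyDeviceToHost);

    gather_avg3<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_gather, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaMemset(d_out, 0, bytes);  // scatter accumulates, so start from zero
    scatter_add3<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_scatter, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}
```

Note how the gather version needs no atomics: each output element has exactly one writer. That is one reason a scatter is often rewritten as a gather when the problem allows it.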
1.2 Stencil, Transpose
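- stencil: applied to every position of the input
  stencil input: the neighborhood of the point
  stencil output: the value at the point
  e.g. image blur by averaging
  This also shows that stencil is very similar to gather. In fact stencil is a special case of gather, with two extra requirements: the input must be a neighborhood, and the operation must be applied to every element of the input.
  Figures:
  - 2D stencil (two forms shown as examples)
  - 3D stencil
- transpose
  input: matrix M
  output: M^T
  Figures:
  - Matrix transpose
  - Transpose represented as a vector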
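A rough sketch of both patterns follows. It is not taken from the course: the 5-point neighborhood, the choice to copy border pixels unchanged, and the names `stencil5`, `transpose_naive` and `run_2d` are assumptions for illustration. Both kernels treat the input as a row-major image of `w` columns and `h` rows.

```cuda
#include <cuda_runtime.h>

// 2D stencil: every interior pixel becomes the average of its 5-point
// neighborhood (itself plus left, right, up, down). Border pixels are
// simply copied in this sketch.
__global__ void stencil5(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int c = y * w + x;
    if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
        out[c] = in[c];
        return;
    }
    out[c] = (in[c] + in[c - 1] + in[c + 1] + in[c - w] + in[c + w]) / 5.0f;
}

// Transpose: in has h rows and w columns; out has w rows and h columns.
__global__ void transpose_naive(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row in the input
    if (x < w && y < h) out[x * h + y] = in[y * w + x];
}

// Hypothetical host helper: copies data in, launches one kernel, copies back.
void run_2d(bool do_transpose, const float* h_in, float* h_out, int w, int h) {
    size_t bytes = (size_t)w * h * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    dim3 block(16, 16);
    dim3 grid((w + 15) / 16, (h + 15) / 16);
    if (do_transpose) transpose_naive<<<grid, block>>>(d_in, d_out, w, h);
    else              stencil5<<<grid, block>>>(d_in, d_out, w, h);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

In the transpose kernel the reads are contiguous along a row while the writes jump by `h` elements, so it is correct but not memory-friendly. That is why it is called naive here.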
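Exercise
Q: In the quiz figure, label each statement underlined in blue as map / gather / scatter / stencil / transpose.
A: The four statements are A, E, C, B respectively.
I got the last one wrong and chose B & D. Why not D? Look at the definition of stencil: if this were an averaging stencil, the average would have to be computed at every position, but the code in the question is guarded by the condition if(i%2).
So what should we pay attention to for the different parallel communication patterns?
1. How can threads access memory efficiently? How can data be reused?
2. How can threads exchange partial results (through shared memory), and is that safe?
In the next section we first review the memory model covered earlier, then use concrete problems to explain how to program.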
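(2). Programming Model and Memory Model
Lectures 1 and 3 covered the relationship between SMs and grids, blocks and threads: the thread organization of each grid and block (gridDim, blockDim, grid shape, block shape) can differ, and each organization is used to run a different kernel.
As discussed in chapter 1, different GPUs have different numbers of hardware SMs (streaming multiprocessors). The GPU is responsible for assigning blocks to SMs, and all SMs run independently and in parallel.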
2.1 Memory model
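Lecture 2 covered the different kinds of memory; here we first review the memory model.
Every thread can access:
1. local memory, private to that thread
2. shared memory, shared by the threads of one block
3. global memory, shared by all threads on the GPU (including threads on different SMs)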
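As a small illustration of the three memory spaces in one kernel, consider the sketch below. The kernel `reverse_tiles`, the tile size `BLOCK` and the host helper are hypothetical names chosen for this example, and the array length is assumed to be a multiple of `BLOCK`.

```cuda
#include <cuda_runtime.h>

#define BLOCK 128

// Each block reverses its own BLOCK-element tile of a global array.
__global__ void reverse_tiles(int* g) {      // g points to global memory
    __shared__ int tile[BLOCK];              // shared: one copy per block
    int t = threadIdx.x;                     // local: private to this thread
    int base = blockIdx.x * blockDim.x;      // local
    tile[t] = g[base + t];                   // global -> shared
    __syncthreads();                         // the whole tile is now written
    g[base + t] = tile[BLOCK - 1 - t];       // shared -> global
}

// Hypothetical host helper; h must hold num_blocks * BLOCK ints.
void run_reverse_tiles(int* h, int num_blocks) {
    size_t bytes = (size_t)num_blocks * BLOCK * sizeof(int);
    int* d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    reverse_tiles<<<num_blocks, BLOCK>>>(d);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

Threads of one block cooperate through `tile`, but block 0 never sees block 1's tile. Anything that has to cross block boundaries goes through global memory.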
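Let us review with two quizzes.
Quiz 1:
Ans: A, B, D
Explanation: by definition, a block can only run on one SM; threads of different blocks on the same SM cannot cooperate.
Quiz 2:
Ans: none of them.
Explanation: the execution time and order of blocks cannot be controlled; which SM a block is assigned to is decided by the GPU and cannot be specified by the programmer.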
2.2 Memory in Program
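How to write efficient programs, at a high level:
maximize arithmetic intensity
arithmetic intensity = calculation / memory
That is, maximize the computation per thread and minimize the memory per thread (the real goal is to minimize the time spent on memory accesses).
Method: keep frequently accessed data in memory that is fast to access (see chapter 2 for the hardware view of the different GPU memories). For the local, shared and global memory just discussed, access speed is ordered as
local > shared >> global >> CPU memory
So if, for example, a value in global memory is accessed frequently, the kernel can first copy it from global memory into a shared memory variable and then access that shared variable repeatedly.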
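The following sketch shows that staging idea under some assumptions: the kernel `rank_in_tile`, the tile size `TILE` and the host helper are invented for illustration, the input length is a multiple of `TILE`, and the kernel is launched with `TILE` threads per block. Each thread needs every value of its block's tile, so the tile is read from global memory once and then read `TILE` times from shared memory.

```cuda
#include <cuda_runtime.h>

#define TILE 128

// For each element, count how many values in the same tile are smaller.
__global__ void rank_in_tile(const float* in, int* rank) {
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    int i = blockIdx.x * TILE + t;
    tile[t] = in[i];                 // one global read per thread
    __syncthreads();                 // wait until the tile is complete
    float mine = tile[t];
    int r = 0;
    for (int j = 0; j < TILE; ++j)   // TILE shared-memory reads per thread
        r += (tile[j] < mine);
    rank[i] = r;                     // one global write per thread
}

// Hypothetical host helper; h_in and h_rank hold num_tiles * TILE elements.
void run_rank_in_tile(const float* h_in, int* h_rank, int num_tiles) {
    int n = num_tiles * TILE;
    float* d_in;
    int* d_rank;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_rank, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    rank_in_tile<<<num_tiles, TILE>>>(d_in, d_rank);
    cudaMemcpy(h_rank, d_rank, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_rank);
}
```

Without the shared tile, the inner loop would issue `TILE` global reads per thread instead of one, which is exactly the kind of traffic the speed ordering above says to avoid.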
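minimize memory access stride
As the coalesced memory access figure shows:
If the GPU's threads access adjacent memory locations, the access is called coalesced; if the threads access memory with a fixed step between them (hopping along), it is called strided; completely irregular access is called random. In terms of access speed:
coalesced > strided > random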
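A minimal sketch of the first two access patterns follows. Both kernels copy the same array and differ only in which element each thread touches. The names and the index mapping are assumptions for illustration, and the strided version assumes `n` is a multiple of `stride`.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so neighboring threads
// touch neighboring addresses.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighboring threads touch addresses `stride` elements apart.
// The mapping below is a permutation of 0..n-1 when n % stride == 0.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int rows = n / stride;
    int j = (i % rows) * stride + i / rows;
    out[j] = in[j];
}

// Hypothetical host helper: stride == 1 selects the coalesced kernel.
void run_copy(const float* h_in, float* h_out, int n, int stride) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);
    int threads = 256, blocks = (n + threads - 1) / threads;
    if (stride == 1) copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
    else             copy_strided<<<blocks, threads>>>(d_in, d_out, n, stride);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Both kernels produce the same output, so any difference in run time comes from the access pattern alone. Timing the two launches (for example with CUDA events) on a large array is a simple way to see the ordering above on your own hardware.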
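avoid thread divergence
This was explained in the previous two lectures.
Exercise:
Rank the statements on lines 5, 6, 7 and 8 of the following code by execution speed (1 = fastest, 4 = slowest):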
```cuda
__global__ void f(float* x, float* y, float* z) {
    float s, t, u;
    __shared__ float a, b, c;
    // ...
    s = *x;    // line 5
    t = s;     // line 6
    a = b;     // line 7
    *y = *z;   // line 8
}
```
Ans: lines 5, 6, 7, 8 rank 3, 1, 2, 4 in speed. Line 6 only touches local variables, line 7 only shared memory, line 5 reads global memory once, and line 8 both reads and writes global memory.
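The next section looks at control flow and synchronization in concrete programming problems.
(3). Control flow and synchronisation
3.1 Program execution order
Before discussing control flow, let us look at an example that tests the order in which different blocks run.
Demo code: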
```cuda
#include <stdio.h>

#define Num_block 16
#define Num_thread 1

__global__ void print() {
    printf("Num: %d\n", blockIdx.x);
}

int main() {
    // launch the kernel
    print<<<Num_block, Num_thread>>>();
    // What is this call for? It makes the host wait for the kernel and forces
    // the printf()s to flush; without it nothing is displayed at run time.
    cudaDeviceSynchronize();
    return 0;
}
```
Compile command:
nvcc -arch=sm_21 -I ~/NVIDIA_GPU_Computing_SDK/C/common/inc print.cu
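Output of two runs:
The output differs from run to run, which means the execution order of different blocks cannot be controlled, just as the quiz answer above said. What if we want to synchronize threads?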
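3.2 Synchronization
In chapter 2 we introduced and used the synchronization function __syncthreads() in an example. It sets up a barrier: every thread in the block stops and waits when it reaches the call, until all threads of the block have arrived. That raises a question.
Exercise:
Consider a program that moves the element at each position i to position i-1. How many __syncthreads() calls are needed?
Ans: 3. The relevant part of the kernel looks like this: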
```cuda
// …
int idx = threadIdx.x;
__shared__ int array[128];
array[idx] = idx;
__syncthreads();   // barrier 1: without it, array could be used before it is fully written
// The barriers stay outside the if so that all 128 threads reach them.
int tmp = 0;
if (idx < 127) tmp = array[idx + 1];   // element i+1 moves to position i
__syncthreads();   // barrier 2: without it, a write could land before a neighbor's read (data hazard)
if (idx < 127) array[idx] = tmp;
__syncthreads();   // barrier 3: without it, later code is not guaranteed to see the new data
// …
```
Quiz: Can collisions occur in the following program, and if so, where?
```cuda
__global__ void f() {
    __shared__ int s[1024];
    int i = threadIdx.x;
    __syncthreads();
    s[i] = s[i-1];                  // line 5
    __syncthreads();
    if (i % 2) s[i] = s[i-1];       // line 7
    __syncthreads();
    s[i] = (s[i-1] + s[i+1]) / 2;   // line 9
    printf("%d\n", s[i]);
}
```
Ans: Collisions occur at:
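1. Line 5. As in the previous exercise, it should be int tmp = s[i-1]; __syncthreads(); s[i] = tmp;
2. Line 9, for the same reason.
PS: Line 7 is fine, as a quick simulation shows: only odd-numbered threads run it, they write odd elements and read even elements, and no thread writes an even element.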
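The two fixes can be put together as in the sketch below. It goes beyond the quiz in a few places, all of which are assumptions of this example: the shared array is initialized with `i * i`, out-of-range neighbors are clamped instead of reading `s[-1]` or past the end, the result is written to a global array rather than printed, and a single block of at most 1024 threads is used.

```cuda
#include <cuda_runtime.h>

__global__ void shift_then_smooth(int* out) {
    __shared__ int s[1024];
    int i = threadIdx.x;
    int n = blockDim.x;
    s[i] = i * i;                  // arbitrary initial values for this sketch
    __syncthreads();

    // Line 5 fixed: everyone reads, barrier, everyone writes.
    int tmp = (i > 0) ? s[i - 1] : s[i];
    __syncthreads();
    s[i] = tmp;
    __syncthreads();

    // Line 9 fixed the same way: read both neighbors before anyone writes.
    int left  = s[(i > 0) ? i - 1 : 0];
    int right = s[(i < n - 1) ? i + 1 : n - 1];
    __syncthreads();
    s[i] = (left + right) / 2;
    out[i] = s[i];
}

// Hypothetical host helper; n must be between 1 and 1024.
void run_shift_then_smooth(int* h_out, int n) {
    int* d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    shift_then_smooth<<<1, n>>>(d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

The pattern is the same in both places: split a statement that reads and writes the same shared array into a read phase and a write phase, with a barrier in between.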
3.3 Atomic Memory Operation
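This section introduces atomic operations.
First consider a problem: use 1,000,000 threads to increment an array of 10 elements, where each thread adds 1 to one element, so that every element should end up at 100,000. Try writing this code yourself first; it is simple. Following the approach used so far, we get the code below:
Note: please download gputimer.h from my resources page.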
```cuda
#include <stdio.h>
#include "gputimer.h"

using namespace Gadgetron;

#define NUM_THREADS 1000000
#define ARRAY_SIZE  10
#define BLOCK_WIDTH 1000

void print_array(int *array, int size) {
    printf("{ ");
    for (int i = 0; i < size; i++) { printf("%d ", array[i]); }
    printf("}\n");
}

__global__ void increment_naive(int *g) {
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    g[i] = g[i] + 1;
}

__global__ void increment_atomic(int *g) {
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);
}

int main(int argc, char **argv) {
    GPUTimer timer;
    printf("%d total threads in %d blocks writing into %d array elements\n",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);

    // declare and allocate host memory
    int h_array[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare, allocate, and zero out GPU memory
    int *d_array;
    cudaMalloc((void **) &d_array, ARRAY_BYTES);
    cudaMemset((void *) d_array, 0, ARRAY_BYTES);

    // launch the kernel - comment out one of these
    timer.start();
    //increment_atomic<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.stop();

    // copy back the array of sums from GPU and print
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);

    // free GPU memory allocation and exit
    cudaFree(d_array);
    return 0;
}
```
Results of two runs:
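Every element in the result is 648 or 647, nowhere near the expected 100,000. Why?
Look at the kernel: each execution of g[i] = g[i] + 1 is a read-modify-write operation. Many threads read the same value of g[i], and slower threads then overwrite the results written by faster ones. How do we fix this? We use an atomic operation and change the kernel above to: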
```cuda
__global__ void increment_atomic(int *g) {
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);
}
```
This gives the result:
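The result is now correct. What mechanism does atomicAdd rely on? Atomic operations use special hardware built into the GPU to guarantee atomicity (at any moment only one thread can perform the read-modify-write on a given address).
Atomic operations have the following limitations:
1. only certain operations and data types (limited functionality)
2. still no ordering constraints (execution order remains undefined)
3. serializes access to memory (hence slow)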
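Limitation 1 is commonly worked around with a compare-and-swap loop built on `atomicCAS`, which the CUDA documentation describes as the basis for implementing other atomic operations. The sketch below applies that pattern to a float multiply. The function `atomic_mul`, the kernel `double_it` and the host helper are hypothetical names for this example, not library calls.

```cuda
#include <cuda_runtime.h>

// Atomically performs *addr = *addr * val and returns the previous value.
// The CAS loop retries whenever another thread changed *addr in between.
__device__ float atomic_mul(float* addr, float val) {
    int* addr_as_int = (int*)addr;
    int old = *addr_as_int;
    int assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) * val;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(updated));
    } while (assumed != old);
    return __int_as_float(old);
}

// Every thread doubles the same value.
__global__ void double_it(float* p) {
    atomic_mul(p, 2.0f);
}

// Hypothetical host helper: starts from 1.0 and returns the final value.
float run_atomic_mul(int num_threads) {
    float h = 1.0f;
    float* d;
    cudaMalloc(&d, sizeof(float));
    cudaMemcpy(d, &h, sizeof(float), cudaMemcpyHostToDevice);
    double_it<<<1, num_threads>>>(d);
    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With 10 threads the value ends at 2^10 regardless of which thread goes first, because multiplication by 2 is exact in floating point. For operations where order changes the rounding, limitation 2 means results can still differ slightly between runs, and limitation 3 applies to the CAS loop just as much as to atomicAdd.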
(4). Summary
This lesson covered the following:
- communication patterns
  - map
  - gather
  - scatter
  - stencil
  - transpose
- GPU hardware & programming model
  - SMs, threads, blocks ordering
  - synchronization
  - Memory model: local, global, shared memory
- efficient GPU programming
  - coalesced memory access
  - faster memory for commonly used variables
OK, that wraps up lesson 3. I will upload the exercises in a couple of days, so stay tuned!
from: http://blog.csdn.net/abcjennifer/article/details/43374009