

Lesson 2 - GPU Hardware and Parallel Communication Patterns

Published: 2023/12/3

Table of Contents

1.Welcome to Unit 2

2.Communication Patterns

3.Map and Gather

4.Exercise: Scatter

5.Exercise: Stencil

6.Transpose Part 1

7.Transpose Part 2

8.Exercise: What Kind of Communication Pattern

9.Parallel Communication Patterns Recap

10.Let Us Talk About GPU Hardware

11.Programmer's View of the GPU

12.Thread Blocks and GPU Hardware

13.Threads and Blocks

14.Another Quiz on Threads and Blocks

15.What Can the Programmer Specify

16.CUDA Makes Few Guarantees About Thread Blocks

17.Exercise: A Thread Block Programming Example

18.Exercise: Code for a Thread Block Programming Example

19.What Does CUDA Guarantee

20.GPU Memory Model

21.Exercise: A Quiz About the GPU Memory Model

22.Synchronization - Barrier

23.Exercise: The Need for Barriers

24.Programming Model

25.Exercise: A Quiz on Synchronization

26.Writing Efficient Programs

27.Minimize Time Spent On Memory

28.Global Memory

29.Shared Memory

30.Exercise: A Quiz on Memory Access

31.Coalesce Memory Access

32.Exercise: A Quiz on Coalescing Memory Access

33.A Related Problem Part 1

34.A Related Problem Part 2

35.Atomic Memory Operations

36.Limitations of Atomic Memory Operations

37.Exercise: Code for Timing Atomic Operations

38.Exercise: Let Us Time Some Code

39.High Arithmetic Intensity

40.Thread Divergence

41.Summary of Unit 2

42.Congratulations


1.Welcome to Unit 2

2.Communication Patterns

Parallel computing: many threads solving a problem by working together. Working together means communication!
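To make the idea concrete, here is a minimal sketch (not from the lecture; the kernel and variable names are my own) of the simplest pattern, map, where each thread reads one input element and writes the corresponding output element:

```cuda
// map: one-to-one correspondence between input and output elements
__global__ void map_square(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    out[i] = in[i] * in[i];                         // each thread touches exactly one element
}
```

The other patterns differ in where the indexing happens: a gather computes its read locations (many inputs funnel into one output), while a scatter computes its write locations.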

3.Map and Gather

4.Exercise: Scatter

5.Exercise: Stencil

6.Transpose Part 1

7.Transpose Part 2

8.Exercise: What Kind of Communication Pattern

9.Parallel Communication Patterns Recap

10.Let Us Talk About GPU Hardware

11.Programmer's View of the GPU

12.Thread Blocks and GPU Hardware

13.Threads and Blocks

14.Another Quiz on Threads and Blocks

15.What Can the Programmer Specify

16.CUDA Makes Few Guarantees About Thread Blocks

17.Exercise: A Thread Block Programming Example

18.Exercise: Code for a Thread Block Programming Example

#include <stdio.h>

#define NUM_BLOCKS 16
#define BLOCK_WIDTH 1

__global__ void hello()
{
    printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}

int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();

    // force the printf()s to flush
    cudaDeviceSynchronize();

    printf("That's all!\n");
    return 0;
}

19.What Does CUDA Guarantee

Later on we'll learn how a concept called 'streams' can relax this guarantee and overlap different kernels, when you as the programmer know it is safe to do so.
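As a preview, a hedged sketch of what that looks like (kernelA, kernelB, and the launch configuration are placeholders, not code from the course):

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// kernels launched into different streams may overlap on the GPU
kernelA<<<blocks, threads, 0, s1>>>(d_a);
kernelB<<<blocks, threads, 0, s2>>>(d_b);

cudaDeviceSynchronize();   // wait for work in both streams to finish
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```

Kernels launched into the same stream (or with no stream argument) still run one after another, preserving the default guarantee.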

20.GPU Memory Model

21.Exercise: A Quiz About the GPU Memory Model

22.Synchronization - Barrier

23.Exercise: The Need for Barriers

24.Programming Model

This compactly summarizes the core CUDA concepts: a hierarchy of computation, memory spaces, and synchronization.

25.Exercise: A Quiz on Synchronization

Students paying close attention will notice another bug in this code: an off-by-one array access. Thread 0 will try to write to location s[-1]. Oops!
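The quiz code itself is not reproduced in these notes, but a hedged sketch of a corrected shift-left (array and kernel names assumed) shows both fixes: a barrier between staging and reading, and a guard so thread 0 never writes index -1:

```cuda
// shift every element one slot to the left, staging through shared memory
__global__ void shift_left(int *array)
{
    int i = threadIdx.x;
    __shared__ int s[128];

    s[i] = array[i];
    __syncthreads();      // barrier: all of s[] is populated before any thread reads a neighbor

    if (i > 0) {          // guard: without it, thread 0 would write to index -1
        array[i - 1] = s[i];
    }
}
```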

26.Writing Efficient Programs

27.Minimize Time Spent On Memory

28.Global Memory

/*********************** using local memory ***********************/

// a __device__ or __global__ function runs on the GPU
__global__ void use_local_memory_GPU(float in)
{
    float f;    // variable "f" is in local memory and private to each thread
    f = in;     // parameter "in" is in local memory and private to each thread
    // ... real code would presumably do other stuff here ...
}

/*********************** using global memory ***********************/

// a __global__ function runs on the GPU & can be called from host
__global__ void use_global_memory_GPU(float *array)
{
    // "array" is a pointer into global memory on the device
    array[threadIdx.x] = 2.0f * (float) threadIdx.x;
}

Note that in this example we are shipping the data to the GPU, running only a single kernel, then copying it back. Often we will run several kernels on the GPU, one after another. When this happens there is no need to copy the intermediate results back to the host: you can run each kernel in sequence, leaving the intermediate result data on the GPU in global memory, and only copy the final result back to the host.
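Sketched on the host side (the kernel and buffer names are placeholders, not from the course code):

```cuda
// ship the input once, run the whole pipeline on the device, copy back once
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

stage1<<<blocks, threads>>>(d_data);   // writes its result to global memory
stage2<<<blocks, threads>>>(d_data);   // reads stage1's output directly from global memory

cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // only the final result returns
```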

29.Shared Memory

// Using different memory spaces in CUDA
#include <stdio.h>

/*********************** using local memory ***********************/

// a __device__ or __global__ function runs on the GPU
__global__ void use_local_memory_GPU(float in)
{
    float f;    // variable "f" is in local memory and private to each thread
    f = in;     // parameter "in" is in local memory and private to each thread
    // ... real code would presumably do other stuff here ...
}

/*********************** using global memory ***********************/

// a __global__ function runs on the GPU & can be called from host
__global__ void use_global_memory_GPU(float *array)
{
    // "array" is a pointer into global memory on the device
    array[threadIdx.x] = 2.0f * (float) threadIdx.x;
}

/*********************** using shared memory ***********************/

// (for clarity, hardcoding 128 threads/elements and omitting out-of-bounds checks)
__global__ void use_shared_memory_GPU(float *array) // the parameter is a pointer to pre-allocated global memory
{
    // local variables, private to each thread
    int i, index = threadIdx.x;
    float average, sum = 0.0f;

    // __shared__ variables are visible to all threads in the thread block
    // and have the same lifetime as the thread block
    __shared__ float sh_arr[128];

    // copy data from "array" in global memory to sh_arr in shared memory.
    // here, each thread is responsible for copying a single element.
    sh_arr[index] = array[index];
    __syncthreads();    // ensure all the writes to shared memory have completed

    // now, sh_arr is fully populated. Let's find the average of all previous elements.
    // shared memory is much faster than global memory, and each thread reads
    // a whole run of elements, so staging the data there pays off.
    for (i = 0; i < index; i++) { sum += sh_arr[i]; }
    average = sum / (index + 1.0f);

    // if array[index] is greater than the average of array[0..index-1], replace with average.
    // since array[] is in global memory, this change will be seen by the host (and potentially
    // other thread blocks, if any)
    if (array[index] > average) { array[index] = average; }

    // the following code has NO EFFECT: it modifies shared memory, but
    // the resulting modified data is never copied back to global memory
    // and vanishes when the thread block completes. shared memory lives
    // only as long as its thread block; once the block finishes, it is gone.
    sh_arr[index] = 3.14;
}

int main(int argc, char **argv)
{
    /*
     * First, call a kernel that shows using local memory
     */
    use_local_memory_GPU<<<1, 128>>>(2.0f);

    /*
     * Next, call a kernel that shows using global memory
     */
    float h_arr[128];   // convention: h_ variables live on host
    float *d_arr;       // convention: d_ variables live on device (GPU global mem)

    // allocate global memory on the device, place result in "d_arr"
    cudaMalloc((void **) &d_arr, sizeof(float) * 128);
    // now copy data from host memory "h_arr" to device memory "d_arr"
    cudaMemcpy((void *)d_arr, (void *)h_arr, sizeof(float) * 128, cudaMemcpyHostToDevice);
    // launch the kernel (1 block of 128 threads)
    use_global_memory_GPU<<<1, 128>>>(d_arr);  // modifies the contents of array at d_arr
    // copy the modified array back to the host, overwriting contents of h_arr
    cudaMemcpy((void *)h_arr, (void *)d_arr, sizeof(float) * 128, cudaMemcpyDeviceToHost);
    // ... do other stuff ...

    /*
     * Next, call a kernel that shows using shared memory
     */

    // as before, pass in a pointer to data in global memory
    use_shared_memory_GPU<<<1, 128>>>(d_arr);
    // copy the modified array back to the host
    cudaMemcpy((void *)h_arr, (void *)d_arr, sizeof(float) * 128, cudaMemcpyDeviceToHost);
    // ... do other stuff ...
    return 0;
}

There should be a __syncthreads() before the final line, to prevent threads that reach that line from overwriting sh_arr while other threads are still computing their averages. Thanks to all of you who have pointed this out.

30.Exercise: A Quiz on Memory Access

31.Coalesce Memory Access
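To illustrate the idea (a sketch, not the lecture's code): adjacent threads should access adjacent memory locations, so the hardware can service a warp's loads and stores with as few memory transactions as possible.

```cuda
// coalesced: thread k of a warp reads element k - neighboring threads,
// neighboring addresses, so the accesses combine into few transactions
__global__ void copy_coalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// strided: neighboring threads touch addresses `stride` elements apart,
// so a single warp's accesses scatter across many memory transactions
__global__ void copy_strided(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```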

32.Exercise: A Quiz on Coalescing Memory Access

33.A Related Problem Part 1

Problem: threads overwrite each other's results, so the output is effectively random.

34.A Related Problem Part 2

35.Atomic Memory Operations

Atomic operations use special hardware built into the GPU.

They resolve the conflict of multiple threads trying to read and write the same memory location at the same time, by serializing the different threads' accesses to that location.

#include <stdio.h>
#include "gputimer.h"

#define NUM_THREADS 1000000
#define ARRAY_SIZE  100
#define BLOCK_WIDTH 1000

void print_array(int *array, int size)
{
    printf("{ ");
    for (int i = 0; i < size; i++)  { printf("%d ", array[i]); }
    printf("}\n");
}

__global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    g[i] = g[i] + 1;
}

__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);
}

int main(int argc, char **argv)
{
    GpuTimer timer;
    printf("%d total threads in %d blocks writing into %d array elements\n",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);

    // declare and allocate host memory
    int h_array[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare, allocate, and zero out GPU memory
    int *d_array;
    cudaMalloc((void **) &d_array, ARRAY_BYTES);
    cudaMemset((void *) d_array, 0, ARRAY_BYTES);  // initialize the array contents to 0

    // launch the kernel - comment out one of these
    timer.Start();
    // increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    increment_atomic<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.Stop();

    // copy back the array of sums from GPU and print
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);
    printf("Time elapsed = %g ms\n", timer.Elapsed());

    // free GPU memory allocation and exit
    cudaFree(d_array);
    return 0;
}
wlsh@wlsh-ThinkStation:~/Desktop/cs344-master/Lesson Code Snippets/Lesson 2 Code Snippets$ ./a.out 
1000000 total threads in 1000 blocks writing into 100 array elements
{ 32 32 32 32 31 31 31 31 33 33 33 33 32 32 32 32 32 32 32 32 33 33 33 33 34 34 34 34 32 32 32 32 32 32 32 32 30 30 30 30 30 30 30 30 31 31 31 31 31 31 31 31 32 32 32 32 31 31 31 31 31 31 31 31 34 34 34 34 31 31 31 31 32 32 32 32 34 34 34 34 33 33 33 33 33 33 33 33 32 32 32 32 31 31 31 31 31 31 31 31 }
Time elapsed = 0.454272 ms
wlsh@wlsh-ThinkStation:~/Desktop/cs344-master/Lesson Code Snippets/Lesson 2 Code Snippets$ nvcc atomics.cu 
wlsh@wlsh-ThinkStation:~/Desktop/cs344-master/Lesson Code Snippets/Lesson 2 Code Snippets$ ./a.out 
1000000 total threads in 1000 blocks writing into 100 array elements
{ 33 33 33 33 31 31 31 31 32 32 32 32 33 33 33 33 31 31 31 31 34 34 34 34 31 31 31 31 32 32 32 32 33 33 33 33 32 32 32 32 32 32 32 32 31 31 31 31 34 34 34 34 32 32 32 32 33 33 33 33 32 32 32 32 31 31 31 31 33 33 33 33 32 32 32 32 32 32 32 32 31 31 31 31 33 33 33 33 32 32 32 32 31 31 31 31 32 32 32 32 }
Time elapsed = 0.466592 ms
wlsh@wlsh-ThinkStation:~/Desktop/cs344-master/Lesson Code Snippets/Lesson 2 Code Snippets$ nvcc atomics.cu 
wlsh@wlsh-ThinkStation:~/Desktop/cs344-master/Lesson Code Snippets/Lesson 2 Code Snippets$ ./a.out 
1000000 total threads in 1000 blocks writing into 100 array elements
{ 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 }
Time elapsed = 0.324992 ms

36.Limitations of Atomic Memory Operations

An example of floating point being non-associative: associative.cu

37.Exercise: Code for Timing Atomic Operations

The code is shown above.

38.Exercise: Let Us Time Some Code

39.High Arithmetic Intensity

threadIdx never changes for a given thread.

40.Thread Divergence

Thread divergence slows execution down.
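A minimal sketch of what divergence looks like (illustrative, not from the lecture's code):

```cuda
__global__ void divergent_kernel(float *out)
{
    int i = threadIdx.x;

    // threads in the same 32-thread warp take different branches here;
    // the hardware executes both paths one after the other, with half
    // the lanes idle each time - roughly doubling this section's cost
    if (i % 2 == 0) {
        out[i] = 2.0f * i;
    } else {
        out[i] = 0.5f * i;
    }
}
```

Branching on something uniform across the warp (e.g. blockIdx.x) does not diverge, because all threads of the warp take the same path.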

41.Summary of Unit 2

42.Congratulations

Summary.
