OpenCL memory objects: choosing a transfer path
Choosing the right transfer path for a memory object can significantly improve an application's performance.
Let's first look at some buffer bandwidth examples:
1. clEnqueueWriteBuffer() and clEnqueueReadBuffer()

If the application has already allocated its memory with malloc or mmap, CL_MEM_USE_HOST_PTR is an ideal choice. There are two ways to use this approach:
Option 1:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *pinnedMemory = clEnqueueMapBuffer( pinnedBuffer )
d. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )
e. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )
The pinning cost is incurred in step a; step d incurs no pinning cost. Typically, an application performs steps a, b, c, and e just once, then repeatedly reads and modifies the data in pinnedMemory around step d.
Option 2:
clEnqueueRead/WriteBuffer is used directly on the application's memory buffer. Before data can be copied (host->device), the pages must first be pinned (page-locked), and only then is the transfer performed. This path reaches roughly 2/3 of peak interconnect bandwidth.
2. clEnqueueCopyBuffer() on a pre-pinned host buffer

Similar to path 1, clEnqueueCopyBuffer transfers data at peak interconnect bandwidth when used with a pre-pinned buffer:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *memory = clEnqueueMapBuffer( pinnedBuffer )
d. Application writes or modifies memory.
e. clEnqueueUnmapMemObject( pinnedBuffer, memory )
f. clEnqueueCopyBuffer( pinnedBuffer, deviceBuffer )
or, for reading data back:
g. clEnqueueCopyBuffer( deviceBuffer, pinnedBuffer )
h. void *memory = clEnqueueMapBuffer( pinnedBuffer )
i. Application reads memory.
j. clEnqueueUnmapMemObject( pinnedBuffer, memory )
Because pinned memory resides in host memory, the clEnqueueMapBuffer() and clEnqueueUnmapMemObject() calls do not trigger any data transfer; the CPU can operate on these pinned buffers at host memory bandwidth.
3. clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on device buffers

For buffers allocated with malloc or mmap, the transfer cost includes, in addition to the interconnect transfer, a memcpy that copies the buffer into the mapped device buffer.
a. Data transfer from host to device buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
Because the buffer is mapped write-only, no data is transferred from the device to the host, so the map cost is low. A pointer to a pinned host buffer is returned.
2. The application fills the host buffer with memset( ptr ), memcpy( ptr, srcptr ), fread( ptr ), or direct CPU writes. These operations read and write at full host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
The pre-pinned buffer is transferred to the GPU device at peak interconnect speed.
b. Data transfer from device buffer to host.
1. ptr = clEnqueueMapBuffer(.., buf, .., CL_MAP_READ, .. )
This call initiates a device-to-host transfer. The data is moved at peak interconnect bandwidth into a pre-pinned temporary buffer, and a pointer to the pinned memory is returned.
2. The application reads and processes the data, or executes memcpy( dstptr, ptr ), fwrite( ptr ), or similar functions. Because the buffer resides in host memory, these operations run at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
Because the buffer was mapped read-only, no actual data transfer takes place, so the unmap is low cost.
4. Host directly accesses a zero copy device buffer

This access pattern allows data transfer and GPU computation to overlap, which is useful for sparse writes or updates.
a. A zero copy buffer on the device is created with the following command:
buf = clCreateBuffer ( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )
The CPU can access this buffer directly through an uncached write-combining (WC) path. A double-buffering scheme is typically used: while the GPU processes the data in one buffer, the CPU fills the other.

A zero copy device buffer can also be used for sparse updates, such as assembling sub-rows of a larger matrix into a smaller, contiguous block for GPU processing. Due to the WC path, it is a good design choice to align writes to the cache line size and to pick the write block size as large as possible.
b. Transfer from the host to the device.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
This operation is low cost because the zero copy device buffer is directly mapped into the host address space.
2. The application transfers data via memset( ptr ), memcpy( ptr, srcptr ), or direct CPU writes.
The CPU writes directly across the interconnect into the zero copy device buffer. Depending on the chipset, the bandwidth can be of the same order of magnitude as the interconnect bandwidth, although it typically is lower than peak.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is low cost because the buffer continues to reside on the device.
c. If the buffer content must be read back later, use clEnqueueReadBuffer( .., buf, .. ) or clEnqueueCopyBuffer( .., buf, zero copy host buffer, .. ). This bypasses slow host reads through the uncached path.
5. GPU directly accesses host zero copy memory
This option allows direct reads or writes of host memory by the GPU. A GPU kernel can import data from the host without explicit transfer, and write data directly back to host memory. An ideal use is to perform small I/Os straight from the kernel, or to integrate the transfer latency directly into the kernel execution time.
a. The application creates a zero copy host buffer:
buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )
b. Next, the application modifies or reads the zero copy host buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ | CL_MAP_WRITE, .. )
This operation is very low cost because it is a map of a buffer already residing in host memory.
2. The application modifies the data through memset( ptr ), memcpy (in either direction), or sparse or dense CPU reads or writes. Since the application is modifying a host buffer, these operations take place at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is very low cost because the buffer continues to reside in host memory.
c. The application runs clEnqueueNDRangeKernel(), using buffers of this type as input or output. GPU kernel reads and writes go across the interconnect to host memory, and the data transfer becomes part of the kernel execution.
The achievable bandwidth depends on the platform and chipset, but can be of the same order of magnitude as the peak interconnect bandwidth.
For discrete graphics cards, it is important to note that resulting GPU kernel bandwidth is an order of magnitude lower compared to a kernel accessing a regular device buffer located on the device.
d. Following kernel execution, the application can access data in the host buffer in the same manner as described above.
Summary

Pick the path that matches the access pattern: pre-pinned staging buffers (paths 1 and 2) give peak interconnect bandwidth for bulk transfers, mapping device buffers (path 3) avoids an extra memcpy for malloc'd data, and zero copy buffers (paths 4 and 5) let transfers overlap with computation or serve sparse accesses.