Choosing a Transfer Path for OpenCL Memory Objects

For an application, choosing the appropriate transfer path for a memory object can significantly improve performance.

First, let's look at several buffer-bandwidth scenarios:

1. clEnqueueWriteBuffer() and clEnqueueReadBuffer()

If the application has already allocated its memory with malloc or mmap, CL_MEM_USE_HOST_PTR is an ideal choice.

There are two ways to use this approach:

Method 1:

a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *pinnedMemory = clEnqueueMapBuffer( pinnedBuffer )
d. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )
e. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )

The pinning cost is incurred at step a; step d incurs no pinning cost. Typically, an application performs steps a, b, c, and e once, and then repeatedly reads or modifies the data in pinnedMemory together with step d.
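
As a concrete illustration of method 1, here is a minimal sketch in C. It assumes an already-created cl_context ctx and cl_command_queue queue, omits error checking, and uses illustrative names for everything other than the OpenCL API calls.

#include <CL/cl.h>
#include <string.h>

/* Hypothetical helper showing the a-e sequence above (upload direction). */
static void upload_via_pinned_staging(cl_context ctx, cl_command_queue queue,
                                      const void *src, size_t size)
{
    cl_int err;
    /* a. Pinned host-side staging buffer. */
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
    /* b. Regular device buffer. */
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
    /* c. Map the pinned buffer; the pointer stays usable until unmap. */
    void *pinnedMemory = clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE, CL_MAP_WRITE,
                                            0, size, 0, NULL, NULL, &err);
    memcpy(pinnedMemory, src, size);   /* fill at host memory bandwidth */
    /* d. Transfer from pre-pinned memory; no pinning cost on this call. */
    clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, size, pinnedMemory,
                         0, NULL, NULL);
    /* e. Unmap and release once the staging buffer is no longer needed. */
    clEnqueueUnmapMemObject(queue, pinnedBuffer, pinnedMemory, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(pinnedBuffer);
    clReleaseMemObject(deviceBuffer);
}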

Method 2:

clEnqueueRead/WriteBuffer is used directly on the user's memory buffer. Before the data can be copied (host->device), the pages must first be pinned (page-locked), and only then is the transfer performed. This path achieves roughly 2/3 of the peak interconnect bandwidth.
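
Method 2 reduces to a direct call on the user's pointer; a one-line sketch under the same assumptions as above (deviceBuffer and userPtr are illustrative):

/* The runtime pins (page-locks) userPtr internally before the copy, which is
   why this path typically reaches only about 2/3 of peak interconnect bandwidth. */
clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, size, userPtr, 0, NULL, NULL);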


2. Using clEnqueueCopyBuffer() on a pre-pinned host buffer

Similar to option 1, clEnqueueCopyBuffer transfers data at peak interconnect bandwidth when it operates on a pre-pinned buffer:

a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *memory = clEnqueueMapBuffer( pinnedBuffer )
d. Application writes or modifies memory.
e. clEnqueueUnmapMemObject( pinnedBuffer, memory )
f. clEnqueueCopyBuffer( pinnedBuffer, deviceBuffer )
Or, for the device-to-host direction:
g. clEnqueueCopyBuffer( deviceBuffer, pinnedBuffer )
h. void *memory = clEnqueueMapBuffer( pinnedBuffer )
i. Application reads memory.
j. clEnqueueUnmapMemObject( pinnedBuffer, memory )

Because the pinned memory resides in host memory, the clEnqueueMapBuffer() and clEnqueueUnmapMemObject() calls do not cause any data transfer. The CPU can operate on these pinned buffers at host memory bandwidth.

3. Using clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on a device buffer

For buffers already allocated with malloc or mmap, the transfer cost includes, in addition to the interconnect transfer, a memcpy that copies the data into the mapped device buffer.


a. Data transfer from host to device buffer.


1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
Because the buffer is mapped write-only, no data is transferred from the device to the host, and the map cost is low. A pointer to a pinned host buffer is returned.

2. The application fills the host buffer via memset( ptr ), memcpy( ptr, srcptr ), fread( ptr ), or direct CPU writes. These operations read and write at full host memory bandwidth.


3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
The pre-pinned buffer is transferred to the GPU device at peak interconnect speed.
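
Steps 1-3 above as a short sketch (buf is a regular device buffer created elsewhere; same assumptions as before):

static void fill_device_buffer_via_map(cl_command_queue queue, cl_mem buf,
                                       const void *src, size_t size)
{
    cl_int err;
    /* 1. Map for writing: low cost, returns a pointer to pinned host memory. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
    /* 2. Fill at host memory bandwidth. */
    memcpy(ptr, src, size);
    /* 3. Unmap: the data moves to the device at peak interconnect speed. */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);
}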

b. Data transfer from device buffer to host.


1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ, .. )
This command starts a device-to-host data transfer; the data is moved at peak interconnect bandwidth into a pre-pinned temporary buffer, and a pointer to that pinned memory is returned.
2. The application reads and processes the data, or calls memcpy( dstptr, ptr ), fwrite( ptr ), or similar functions. Because the buffer resides in host memory, these operations run at host memory bandwidth.

3. clEnqueueUnmapMemObject( .., buf, ptr, .. )

由于buffer被映射成只讀的,沒有實際數(shù)據(jù)傳輸,所以unmap操作的cost很低。


4. Direct host access to a device zero copy buffer

This access mode allows data transfer to overlap with GPU computation, which is useful for sparse writes or updates.

a. A zero copy buffer on the device is created with the following call:


buf = clCreateBuffer ( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )


The CPU can access this buffer directly through an uncached write-combine (WC) path. A double-buffering scheme is commonly used: while the GPU processes the data in one buffer, the CPU fills the other.

A zero copy device buffer can also be used for sparse updates, such as assembling sub-rows of a larger matrix into a smaller, contiguous block for GPU processing. Due to the WC path, it is a good design choice to align writes to the cache line size and to pick the write block size as large as possible.


b. Transfer from the host to the device.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
This operation is low cost because the zero copy device buffer is directly mapped into the host address space.


2. The application transfers data via memset( ptr ), memcpy( ptr, srcptr ), or direct CPU writes.
The CPU writes directly across the interconnect into the zero copy device buffer. Depending on the chipset, the bandwidth can be of the same order of magnitude as the interconnect bandwidth, although it typically is lower than peak.


3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is low cost because the buffer continues to reside on the device.
c. If the buffer content must be read back later, use clEnqueueReadBuffer( .., buf, .. ) or clEnqueueCopyBuffer( .., buf, zero copy host buffer, .. ).


This bypasses slow host reads through the uncached path.
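
A sketch of steps a-c, assuming an AMD platform where the extension flag CL_MEM_USE_PERSISTENT_MEM_AMD is available (declared in CL/cl_ext.h); the helper name is illustrative:

#include <CL/cl_ext.h>

static void fill_persistent_device_buffer(cl_context ctx, cl_command_queue queue,
                                          const void *src, size_t size)
{
    cl_int err;
    /* a. Zero copy device buffer (AMD-specific flag). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD, size, NULL, &err);
    /* b.1 Map: low cost, the device buffer is mapped into the host address space. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
    /* b.2 CPU writes cross the interconnect through the uncached WC path,
       so prefer large, cache-line-aligned writes. */
    memcpy(ptr, src, size);
    /* b.3 Unmap: low cost, the buffer stays resident on the device. */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);
    /* c. Read back later with clEnqueueReadBuffer()/clEnqueueCopyBuffer()
       instead of reading through the uncached path. */
    clReleaseMemObject(buf);
}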


5. Direct GPU access to host zero copy memory

This option allows direct reads or writes of host memory by the GPU. A GPU kernel can import data from the host without explicit transfer, and write data directly back to host memory. An ideal use is to perform small I/Os straight from the kernel, or to integrate the transfer latency directly into the kernel execution time.


a. The application creates a zero copy host buffer.
buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )
b. Next, the application modifies or reads the zero copy host buffer.


1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ | CL_MAP_WRITE, .. )

This operation is very low cost because it is a map of a buffer already residing in host memory.
2. The application modifies the data through memset( ptr ), memcpy (in either direction), or sparse or dense CPU reads and writes. Since the application is modifying a host buffer, these operations take place at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )

As with the preceding map, this operation is very low cost because the buffer continues to reside in host memory.
c. The application runs clEnqueueNDRangeKernel(), using buffers of this type as input or output. GPU kernel reads and writes go across the interconnect to host memory, and the data transfer becomes part of the kernel execution.


The achievable bandwidth depends on the platform and chipset, but can be of the same order of magnitude as the peak interconnect bandwidth.


For discrete graphics cards, it is important to note that resulting GPU kernel bandwidth is an order of magnitude lower compared to a kernel accessing a regular device buffer located on the device.


d. Following kernel execution, the application can access data in the host buffer in the same manner as described above.
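
A sketch of the whole flow, assuming an existing cl_kernel kernel whose argument 0 is a __global float* of n elements (the kernel and sizes are illustrative):

static void run_kernel_on_host_zero_copy(cl_context ctx, cl_command_queue queue,
                                         cl_kernel kernel, size_t n)
{
    cl_int err;
    size_t bytes = n * sizeof(float);
    /* a. Zero copy host buffer. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
    /* b. Initialize it; map/unmap are cheap because it already lives in host memory. */
    float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                             0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < n; ++i) ptr[i] = 0.0f;
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    /* c. Kernel reads/writes cross the interconnect; the transfer is part of
       the kernel's execution time. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    /* d. Access the results the same way afterwards. */
    ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                      0, bytes, 0, NULL, NULL, &err);
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buf);
}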

