OpenCL Study Notes (3): Installing OpenCL, a Programming Overview, and Hello World
Reprints are welcome; please credit the source: this article is from Bin's column, http://t.zoukankan.com/blog.csdn.net/xbinworld. Technical discussion QQ group: 433250724; anyone interested in algorithms, technology, and applications is welcome to join.
Installing OpenCL
I am not going to spend much space on installation, because OpenCL can be set up on far too many platform/environment combinations, across GPUs and FPGAs and various device support packages. Instead I will list some of the better experience posts that can be found online. My own focus is on FPGAs; for GPUs, a web search will turn up plenty:
Altera OpenCL SDK download:
https://www.altera.com.cn/products/design-software/embedded-software-developers/opencl/overview.html
Altera's installation guide, the "Altera SDK for OpenCL Getting Started Guide".
In principle the two resources above are enough. What you need to do:
Download the OpenCL SDK, or the Quartus II software (which includes the SDK); download the support package for your development board (Altera hosts some, but for other boards you may need to get it from the board vendor); you also need an OpenCL license, without which you cannot compile.
Some experience posts in Chinese:
"Getting Started with Altera OpenCL (beta)": http://wenku.baidu.com/link?url=bkIyo01jXeWfdGsrA_M0J1zomx6f0lYk0NPf-9-MNaC0OkWRmukDwY5yFz0I3Wrctqi5qD3jC8BhQQzjoqw1HXpUgIM68_blz5Cr3vxpaZC
[Altera SoC Hands-On] Officially Entering OpenCL Mode
http://home.eeworld.com.cn/my/space-uid-169743-blogid-247647.html
A Brief Introduction to OpenCL Programming
The figure below gives a simple picture of the OpenCL programming framework. It uses a GPU, but other devices are similar.
From the figure we can see (cf. "Introduction to OpenCL Programming"):
1. The heterogeneous compute device can be a CPU or a GPU; there are now also OpenCL-capable FPGA devices and Xeon Phi coprocessors (MIC).
2. The OpenCL APIs are tied together through a Context.
3. Running a device-side program goes through the steps compile -> set arguments -> run.
Key terms:
Platform: the host plus the devices managed under the OpenCL framework make up a platform; through it, an application can share resources with the devices and execute kernels on them. In practice one vendor usually corresponds to one platform; Intel and AMD both work this way.
Device: officially, a collection of Compute Units. The GPU is the typical device, but the multicore CPUs from Intel and AMD also expose an OpenCL interface, so they can act as devices too.
Context: the environment for sharing and using resources on an OpenCL platform, covering kernels, devices, memory objects, command queues, and so on. In practice one platform usually corresponds to one context.
Program: an OpenCL program, made up of kernel functions plus other functions and declarations.
Kernel: a function that is called from the host and runs on the device.
Memory Object: an object for passing data between host and device, generally mapped to global memory in the OpenCL program. There are two concrete types: Buffer Objects and Image Objects.
Command Queue: manages the commands issued to a given device. Commands in a queue may execute in order or out of order, and one device can have multiple command queues.
NDRange: the main interface through which the host launches a device-side kernel. There are other ways, but NDRange is the most common; it partitions the work into groups. The differences will become clear when we actually use them.
Host端來(lái)看,OpenCL的組要執(zhí)行流程是這樣的:
其實(shí)基本上大部分簡(jiǎn)單的程序HOST部分都是差不多的,不用改很多,具體下面看一個(gè)例子就知道了。
第一個(gè)程序
這里貼一個(gè)altera官方的vector add的實(shí)例code,基本就是helloworld級(jí)別了,不過(guò)它的host寫(xiě)的很通用(考慮到對(duì)多個(gè)device統(tǒng)一編程),可以過(guò)一遍看看是不是和上面的圖對(duì)的上。其實(shí)看過(guò)這個(gè)基本其他的也就差不多了。
Host部分:(Kernel在最后)
// Copyright (C) 2013-2014 Altera Corporation, San Jose, California, USA. All rights reserved.
// Permission is hereby granted, free of charge, to any person obtaining a copy of this
// software and associated documentation files (the "Software"), to deal in the Software
// without restriction, including without limitation the rights to use, copy, modify, merge,
// publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to
// whom the Software is furnished to do so, subject to the following conditions:
// The above copyright notice and this permission notice shall be included in all copies or
// substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
// OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
// WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.
//
// This agreement shall be governed in all respects by the laws of the State of California and
// by the laws of the United States of America.
///////////////////////////////////////////////////////////////////////////////////
// This host program executes a vector addition kernel to perform:
// C = A + B
// where A, B and C are vectors with N elements.
//
// This host program supports partitioning the problem across multiple OpenCL
// devices if available. If there are M available devices, the problem is
// divided so that each device operates on N/M points. The host program
// assumes that all devices are of the same type (that is, the same binary can
// be used), but the code can be generalized to support different device types
// easily.
//
// Verification is performed against the same computation on the host CPU.
///////////////////////////////////////////////////////////////////////////////////
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "CL/opencl.h"
#include "AOCL_Utils.h"
using namespace aocl_utils;
// OpenCL runtime configuration
cl_platform_id platform = NULL;
unsigned num_devices = 0;
scoped_array<cl_device_id> device; // num_devices elements
cl_context context = NULL;
scoped_array<cl_command_queue> queue; // num_devices elements
cl_program program = NULL;
scoped_array<cl_kernel> kernel; // num_devices elements
scoped_array<cl_mem> input_a_buf; // num_devices elements
scoped_array<cl_mem> input_b_buf; // num_devices elements
scoped_array<cl_mem> output_buf; // num_devices elements
// Problem data.
const unsigned N = 1000000; // problem size
scoped_array<scoped_aligned_ptr<float> > input_a, input_b; // num_devices elements
scoped_array<scoped_aligned_ptr<float> > output; // num_devices elements
scoped_array<scoped_array<float> > ref_output; // num_devices elements
scoped_array<unsigned> n_per_device; // num_devices elements
// Function prototypes
float rand_float();
bool init_opencl();
void init_problem();
void run();
void cleanup();
// Entry point.
int main() {
// Initialize OpenCL.
if(!init_opencl()) {
return -1;
}
// Initialize the problem data.
// Requires the number of devices to be known.
init_problem();
// Run the kernel.
run();
// Free the resources allocated
cleanup();
return 0;
}
/////// HELPER FUNCTIONS ///////
// Randomly generate a floating-point number between -10 and 10.
float rand_float() {
return float(rand()) / float(RAND_MAX) * 20.0f - 10.0f;
}
// Initializes the OpenCL objects.
bool init_opencl() {
cl_int status;
printf("Initializing OpenCL\n");
if(!setCwdToExeDir()) {
return false;
}
// Get the OpenCL platform.
platform = findPlatform("Altera");
if(platform == NULL) {
printf("ERROR: Unable to find Altera OpenCL platform.\n");
return false;
}
// Query the available OpenCL device.
device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));
printf("Platform: %s\n", getPlatformName(platform).c_str());
printf("Using %d device(s)\n", num_devices);
for(unsigned i = 0; i < num_devices; ++i) {
printf("  %s\n", getDeviceName(device[i]).c_str());
}
// Create the context.
context = clCreateContext(NULL, num_devices, device, NULL, NULL, &status);
checkError(status, "Failed to create context");
// Create the program for all devices. Use the first device as the
// representative device (assuming all devices are of the same type).
std::string binary_file = getBoardBinaryFile("vectorAdd", device[0]);
printf("Using AOCX: %s\n", binary_file.c_str());
program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);
// Build the program that was just created.
status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
checkError(status, "Failed to build program");
// Create per-device objects.
queue.reset(num_devices);
kernel.reset(num_devices);
n_per_device.reset(num_devices);
input_a_buf.reset(num_devices);
input_b_buf.reset(num_devices);
output_buf.reset(num_devices);
for(unsigned i = 0; i < num_devices; ++i) {
// Command queue.
queue[i] = clCreateCommandQueue(context, device[i], CL_QUEUE_PROFILING_ENABLE, &status);
checkError(status, "Failed to create command queue");
// Kernel.
const char *kernel_name = "vectorAdd";
kernel[i] = clCreateKernel(program, kernel_name, &status);
checkError(status, "Failed to create kernel");
// Determine the number of elements processed by this device.
n_per_device[i] = N / num_devices; // number of elements handled by this device
// Spread out the remainder of the elements over the first
// N % num_devices.
if(i < (N % num_devices)) {
n_per_device[i]++;
}
// Input buffers.
input_a_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input A");
input_b_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input B");
// Output buffer.
output_buf[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for output");
}
return true;
}
// Initialize the data for the problem. Requires num_devices to be known.
void init_problem() {
if(num_devices == 0) {
checkError(-1, "No devices");
}
input_a.reset(num_devices);
input_b.reset(num_devices);
output.reset(num_devices);
ref_output.reset(num_devices);
// Generate input vectors A and B and the reference output consisting
// of a total of N elements.
// We create separate arrays for each device so that each device has an
// aligned buffer.
for(unsigned i = 0; i < num_devices; ++i) {
input_a[i].reset(n_per_device[i]);
input_b[i].reset(n_per_device[i]);
output[i].reset(n_per_device[i]);
ref_output[i].reset(n_per_device[i]);
for(unsigned j = 0; j < n_per_device[i]; ++j) {
input_a[i][j] = rand_float();
input_b[i][j] = rand_float();
ref_output[i][j] = input_a[i][j] + input_b[i][j];
}
}
}
void run() {
cl_int status;
const double start_time = getCurrentTimestamp();
// Launch the problem for each device.
scoped_array<cl_event> kernel_event(num_devices);
scoped_array<cl_event> finish_event(num_devices);
for(unsigned i = 0; i < num_devices; ++i) {
// Transfer inputs to each device. Each of the host buffers supplied to
// clEnqueueWriteBuffer here is already aligned to ensure that DMA is used
// for the host-to-device transfer.
cl_event write_event[2];
status = clEnqueueWriteBuffer(queue[i], input_a_buf[i], CL_FALSE,
0, n_per_device[i] * sizeof(float), input_a[i], 0, NULL, &write_event[0]);
checkError(status, "Failed to transfer input A");
status = clEnqueueWriteBuffer(queue[i], input_b_buf[i], CL_FALSE,
0, n_per_device[i] * sizeof(float), input_b[i], 0, NULL, &write_event[1]);
checkError(status, "Failed to transfer input B");
// Set kernel arguments.
unsigned argi = 0;
status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_a_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);
status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_b_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);
status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &output_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);
// Enqueue kernel.
// Use a global work size corresponding to the number of elements to add
// for this device.
//
// We don't specify a local work size and let the runtime choose
// (it'll choose to use one work-group with the same size as the global
// work-size).
//
// Events are used to ensure that the kernel is not launched until
// the writes to the input buffers have completed.
const size_t global_work_size = n_per_device[i];
printf("Launching for device %d (%zu elements)\n", i, global_work_size);
status = clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,
&global_work_size, NULL, 2, write_event, &kernel_event[i]);
checkError(status, "Failed to launch kernel");
// Read the result. This is the final operation.
status = clEnqueueReadBuffer(queue[i], output_buf[i], CL_FALSE,
0, n_per_device[i] * sizeof(float), output[i], 1, &kernel_event[i], &finish_event[i]);
// Release local events.
clReleaseEvent(write_event[0]);
clReleaseEvent(write_event[1]);
}
// Wait for all devices to finish.
clWaitForEvents(num_devices, finish_event);
const double end_time = getCurrentTimestamp();
// Wall-clock time taken.
printf("\nTime: %0.3f ms\n", (end_time - start_time) * 1e3);
// Get kernel times using the OpenCL event profiling API.
for(unsigned i = 0; i < num_devices; ++i) {
cl_ulong time_ns = getStartEndTime(kernel_event[i]);
printf("Kernel time (device %d): %0.3f ms\n", i, double(time_ns) * 1e-6);
}
// Release all events.
for(unsigned i = 0; i < num_devices; ++i) {
clReleaseEvent(kernel_event[i]);
clReleaseEvent(finish_event[i]);
}
// Verify results.
bool pass = true;
for(unsigned i = 0; i < num_devices && pass; ++i) {
for(unsigned j = 0; j < n_per_device[i] && pass; ++j) {
if(fabsf(output[i][j] - ref_output[i][j]) > 1.0e-5f) {
printf("Failed verification @ device %d, index %d\nOutput: %f\nReference: %f\n",
i, j, output[i][j], ref_output[i][j]);
pass = false;
}
}
}
printf("\nVerification: %s\n", pass ? "PASS" : "FAIL");
}
// Free the resources allocated during initialization
void cleanup() {
for(unsigned i = 0; i < num_devices; ++i) {
if(kernel && kernel[i]) {
clReleaseKernel(kernel[i]);
}
if(queue && queue[i]) {
clReleaseCommandQueue(queue[i]);
}
if(input_a_buf && input_a_buf[i]) {
clReleaseMemObject(input_a_buf[i]);
}
if(input_b_buf && input_b_buf[i]) {
clReleaseMemObject(input_b_buf[i]);
}
if(output_buf && output_buf[i]) {
clReleaseMemObject(output_buf[i]);
}
}
if(program) {
clReleaseProgram(program);
}
if(context) {
clReleaseContext(context);
}
}
Kernel part:
// ACL kernel for adding two input vectors
__kernel void vectorAdd(__global const float *x,
__global const float *y,
__global float *restrict z)
{
// get index of the work item
int index = get_global_id(0);
// add the vector elements
z[index] = x[index] + y[index];
}
The kernel is just these few lines. __global is an address-space qualifier indicating that the buffer lives in external memory (e.g. DDR); the rest of the syntax is the same as standard C, so there is little more to say.
The most important piece of the code is get_global_id, the standard idiom in the multi-work-item execution model: each work-item uses its id to pick the element it operates on. All work-items execute the same code, which is why the familiar for() loop is absent from the add function. The kernel configuration can also be customized (number of compute units, SIMD mode, and so on) to control the program's parallelism; more parallelism usually brings higher performance but consumes more resources.
For the details of kernel functions, see "The OpenCL Specification 1.0" and Altera's OpenCL programming guide; I will go into specifics in later notes.
內(nèi)存模型
最后寫(xiě)一下Opencl的內(nèi)存模型,看下面的示意圖:
用核函數(shù)中的內(nèi)存變量來(lái)簡(jiǎn)單地解釋?zhuān)河胏lCreateBuffer 創(chuàng)建、用clSetKernelArg 傳遞的數(shù)據(jù)在global memory 和constant memory中;核函數(shù)中的寄存器變量在private memory 中;核函數(shù)的內(nèi)部變量、緩存等,在local memory 中。圖例中可以看到Device 并不直接訪問(wèn)global memory,而是通過(guò)Cache 來(lái)訪問(wèn)。可以想象當(dāng)同時(shí)運(yùn)行的work-item,使用的內(nèi)存都在同一塊cache 中,則內(nèi)存吞吐的效率最高。對(duì)應(yīng)到work group 中,就是在程序設(shè)計(jì)上盡量使同一個(gè)work group 中的work item 操作連續(xù)的內(nèi)存,以提高訪存效率。
本篇就到這里。