OpenCL Study Notes (3): Installing OpenCL, a Programming Overview, and Hello World
Reprints are welcome; please credit the source: this article is from Bin's column, http://t.zoukankan.com/blog.csdn.net/xbinworld. Technical discussion QQ group: 433250724; anyone interested in algorithms, technology, and applications is welcome to join.
Installing OpenCL
I am not going to spend much space on installation, because OpenCL can be set up on far too many platform/environment combinations, across GPUs and FPGAs and various device support packages. Instead I will list some of the better experience posts that can be found online. My own focus is on FPGAs; for GPUs, a web search will turn up plenty:
Altera OpenCL SDK download:
https://www.altera.com.cn/products/design-software/embedded-software-developers/opencl/overview.html
Altera's installation guide, the "Altera SDK for OpenCL Getting Started Guide".
In principle the two resources above are enough. What you need to do:
Download the OpenCL SDK, or the Quartus II software (which includes the SDK); download the support package for your development board (Altera hosts some, but for other boards you may need to get it from the board vendor); you also need an OpenCL license, without which you cannot compile.
Some experience posts in Chinese:
"Getting Started with Altera OpenCL (beta)": http://wenku.baidu.com/link?url=bkIyo01jXeWfdGsrA_M0J1zomx6f0lYk0NPf-9-MNaC0OkWRmukDwY5yFz0I3Wrctqi5qD3jC8BhQQzjoqw1HXpUgIM68_blz5Cr3vxpaZC
[Altera SoC Hands-On] Officially Entering OpenCL Mode
http://home.eeworld.com.cn/my/space-uid-169743-blogid-247647.html
A Brief Introduction to OpenCL Programming
The figure below gives a simple picture of the OpenCL programming framework. It uses a GPU, but other devices are similar.
From the figure we can see (cf. "Introduction to OpenCL Programming"):
1. The heterogeneous compute device can be a CPU or a GPU; there are now also OpenCL-capable FPGA devices and Xeon Phi coprocessors (MIC).
2. The OpenCL APIs are tied together through a Context.
3. Running a device-side program goes through the steps compile -> set arguments -> run.
Key terms:
Platform: the host plus the devices managed under the OpenCL framework make up a platform; through it, an application can share resources with the devices and execute kernels on them. In practice one vendor usually corresponds to one platform; Intel and AMD both work this way.
Device: officially, a collection of Compute Units. The GPU is the typical device, but the multicore CPUs from Intel and AMD also expose an OpenCL interface, so they can act as devices too.
Context: the environment for sharing and using resources on an OpenCL platform, covering kernels, devices, memory objects, command queues, and so on. In practice one platform usually corresponds to one context.
Program: an OpenCL program, made up of kernel functions plus other functions and declarations.
Kernel: a function that is called from the host and runs on the device.
Memory Object: an object for passing data between host and device, generally mapped to global memory in the OpenCL program. There are two concrete types: Buffer Objects and Image Objects.
Command Queue: manages the commands issued to a given device. Commands in a queue may execute in order or out of order, and one device can have multiple command queues.
NDRange: the main interface through which the host launches a device-side kernel. There are other ways, but NDRange is the most common; it partitions the work into groups. The differences will become clear when we actually use them.
Host端來(lái)看,OpenCL的組要執(zhí)行流程是這樣的:
其實(shí)基本上大部分簡(jiǎn)單的程序HOST部分都是差不多的,不用改很多,具體下面看一個(gè)例子就知道了。
第一個(gè)程序
這里貼一個(gè)altera官方的vector add的實(shí)例code,基本就是helloworld級(jí)別了,不過(guò)它的host寫(xiě)的很通用(考慮到對(duì)多個(gè)device統(tǒng)一編程),可以過(guò)一遍看看是不是和上面的圖對(duì)的上。其實(shí)看過(guò)這個(gè)基本其他的也就差不多了。
Host部分:(Kernel在最后)
// Copyright (C) 2013-2014 Altera Corporation, San Jose, California, USA. All rights reserved.
// Permission is hereby granted, free of charge, to any person obtaining a copy of this
// software and associated documentation files (the "Software"), to deal in the Software
// without restriction, including without limitation the rights to use, copy, modify, merge,
// publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to
// whom the Software is furnished to do so, subject to the following conditions:
// The above copyright notice and this permission notice shall be included in all copies or
// substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
// OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
// WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.
//
// This agreement shall be governed in all respects by the laws of the State of California and
// by the laws of the United States of America.
///////////////////////////////////////////////////////////////////////////////////
// This host program executes a vector addition kernel to perform:
// C = A + B
// where A, B and C are vectors with N elements.
//
// This host program supports partitioning the problem across multiple OpenCL
// devices if available. If there are M available devices, the problem is
// divided so that each device operates on N/M points. The host program
// assumes that all devices are of the same type (that is, the same binary can
// be used), but the code can be generalized to support different device types
// easily.
//
// Verification is performed against the same computation on the host CPU.
///////////////////////////////////////////////////////////////////////////////////
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "CL/opencl.h"
#include "AOCL_Utils.h"
using namespace aocl_utils;
// OpenCL runtime configuration
cl_platform_id platform = NULL;
unsigned num_devices = 0;
scoped_array<cl_device_id> device; // num_devices elements
cl_context context = NULL;
scoped_array<cl_command_queue> queue; // num_devices elements
cl_program program = NULL;
scoped_array<cl_kernel> kernel; // num_devices elements
scoped_array<cl_mem> input_a_buf; // num_devices elements
scoped_array<cl_mem> input_b_buf; // num_devices elements
scoped_array<cl_mem> output_buf; // num_devices elements
// Problem data.
const unsigned N = 1000000; // problem size
scoped_array<scoped_aligned_ptr<float> > input_a, input_b; // num_devices elements
scoped_array<scoped_aligned_ptr<float> > output; // num_devices elements
scoped_array<scoped_array<float> > ref_output; // num_devices elements
scoped_array<unsigned> n_per_device; // num_devices elements
// Function prototypes
float rand_float();
bool init_opencl();
void init_problem();
void run();
void cleanup();
// Entry point.
int main() {
// Initialize OpenCL.
if(!init_opencl()) {
return -1;
}
// Initialize the problem data.
// Requires the number of devices to be known.
init_problem();
// Run the kernel.
run();
// Free the resources allocated
cleanup();
return 0;
}
/////// HELPER FUNCTIONS ///////
// Randomly generate a floating-point number between -10 and 10.
float rand_float() {
return float(rand()) / float(RAND_MAX) * 20.0f - 10.0f;
}
// Initializes the OpenCL objects.
bool init_opencl() {
cl_int status;
printf("Initializing OpenCL\n");
if(!setCwdToExeDir()) {
return false;
}
// Get the OpenCL platform.
platform = findPlatform("Altera");
if(platform == NULL) {
printf("ERROR: Unable to find Altera OpenCL platform.\n");
return false;
}
// Query the available OpenCL device.
device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));
printf("Platform: %s\n", getPlatformName(platform).c_str());
printf("Using %d device(s)\n", num_devices);
for(unsigned i = 0; i < num_devices; ++i) {
printf("  %s\n", getDeviceName(device[i]).c_str());
}
// Create the context.
context = clCreateContext(NULL, num_devices, device, NULL, NULL, &status);
checkError(status, "Failed to create context");
// Create the program for all devices. Use the first device as the
// representative device (assuming all devices are of the same type).
std::string binary_file = getBoardBinaryFile("vectorAdd", device[0]);
printf("Using AOCX: %s\n", binary_file.c_str());
program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);
// Build the program that was just created.
status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
checkError(status, "Failed to build program");
// Create per-device objects.
queue.reset(num_devices);
kernel.reset(num_devices);
n_per_device.reset(num_devices);
input_a_buf.reset(num_devices);
input_b_buf.reset(num_devices);
output_buf.reset(num_devices);
for(unsigned i = 0; i < num_devices; ++i) {
// Command queue.
queue[i] = clCreateCommandQueue(context, device[i], CL_QUEUE_PROFILING_ENABLE, &status);
checkError(status, "Failed to create command queue");
// Kernel.
const char *kernel_name = "vectorAdd";
kernel[i] = clCreateKernel(program, kernel_name, &status);
checkError(status, "Failed to create kernel");
// Determine the number of elements processed by this device.
n_per_device[i] = N / num_devices; // number of elements handled by this device
// Spread out the remainder of the elements over the first
// N % num_devices.
if(i < (N % num_devices)) {
n_per_device[i]++;
}
// Input buffers.
input_a_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input A");
input_b_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input B");
// Output buffer.
output_buf[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for output");
}
return true;
}
// Initialize the data for the problem. Requires num_devices to be known.
void init_problem() {
if(num_devices == 0) {
checkError(-1, "No devices");
}
input_a.reset(num_devices);
input_b.reset(num_devices);
output.reset(num_devices);
ref_output.reset(num_devices);
// Generate input vectors A and B and the reference output consisting
// of a total of N elements.
// We create separate arrays for each device so that each device has an
// aligned buffer.
for(unsigned i = 0; i < num_devices; ++i) {
input_a[i].reset(n_per_device[i]);
input_b[i].reset(n_per_device[i]);
output[i].reset(n_per_device[i]);
ref_output[i].reset(n_per_device[i]);
for(unsigned j = 0; j < n_per_device[i]; ++j) {
input_a[i][j] = rand_float();
input_b[i][j] = rand_float();
ref_output[i][j] = input_a[i][j] + input_b[i][j];
}
}
}
void run() {
cl_int status;
const double start_time = getCurrentTimestamp();
// Launch the problem for each device.
scoped_array<cl_event> kernel_event(num_devices);
scoped_array<cl_event> finish_event(num_devices);
for(unsigned i = 0; i < num_devices; ++i) {
// Transfer inputs to each device. Each of the host buffers supplied to
// clEnqueueWriteBuffer here is already aligned to ensure that DMA is used
// for the host-to-device transfer.
cl_event write_event[2];
status = clEnqueueWriteBuffer(queue[i], input_a_buf[i], CL_FALSE,
0, n_per_device[i] * sizeof(float), input_a[i], 0, NULL, &write_event[0]);
checkError(status, "Failed to transfer input A");
status = clEnqueueWriteBuffer(queue[i], input_b_buf[i], CL_FALSE,
0, n_per_device[i] * sizeof(float), input_b[i], 0, NULL, &write_event[1]);
checkError(status, "Failed to transfer input B");
// Set kernel arguments.
unsigned argi = 0;
status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_a_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);
status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_b_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);
status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &output_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);
// Enqueue kernel.
// Use a global work size corresponding to the number of elements to add
// for this device.
//
// We don't specify a local work size and let the runtime choose
// (it'll choose to use one work-group with the same size as the global
// work-size).
//
// Events are used to ensure that the kernel is not launched until
// the writes to the input buffers have completed.
const size_t global_work_size = n_per_device[i];
printf("Launching for device %d (%zu elements)\n", i, global_work_size);
status = clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,
&global_work_size, NULL, 2, write_event, &kernel_event[i]);
checkError(status, "Failed to launch kernel");
// Read the result. This is the final operation.
status = clEnqueueReadBuffer(queue[i], output_buf[i], CL_FALSE,
0, n_per_device[i] * sizeof(float), output[i], 1, &kernel_event[i], &finish_event[i]);
// Release local events.
clReleaseEvent(write_event[0]);
clReleaseEvent(write_event[1]);
}
// Wait for all devices to finish.
clWaitForEvents(num_devices, finish_event);
const double end_time = getCurrentTimestamp();
// Wall-clock time taken.
printf("\nTime: %0.3f ms\n", (end_time - start_time) * 1e3);
// Get kernel times using the OpenCL event profiling API.
for(unsigned i = 0; i < num_devices; ++i) {
cl_ulong time_ns = getStartEndTime(kernel_event[i]);
printf("Kernel time (device %d): %0.3f ms\n", i, double(time_ns) * 1e-6);
}
// Release all events.
for(unsigned i = 0; i < num_devices; ++i) {
clReleaseEvent(kernel_event[i]);
clReleaseEvent(finish_event[i]);
}
// Verify results.
bool pass = true;
for(unsigned i = 0; i < num_devices && pass; ++i) {
for(unsigned j = 0; j < n_per_device[i] && pass; ++j) {
if(fabsf(output[i][j] - ref_output[i][j]) > 1.0e-5f) {
printf("Failed verification @ device %d, index %d\nOutput: %f\nReference: %f\n",
i, j, output[i][j], ref_output[i][j]);
pass = false;
}
}
}
printf("\nVerification: %s\n", pass ? "PASS" : "FAIL");
}
// Free the resources allocated during initialization
void cleanup() {
for(unsigned i = 0; i < num_devices; ++i) {
if(kernel && kernel[i]) {
clReleaseKernel(kernel[i]);
}
if(queue && queue[i]) {
clReleaseCommandQueue(queue[i]);
}
if(input_a_buf && input_a_buf[i]) {
clReleaseMemObject(input_a_buf[i]);
}
if(input_b_buf && input_b_buf[i]) {
clReleaseMemObject(input_b_buf[i]);
}
if(output_buf && output_buf[i]) {
clReleaseMemObject(output_buf[i]);
}
}
if(program) {
clReleaseProgram(program);
}
if(context) {
clReleaseContext(context);
}
}
Kernel part:
// ACL kernel for adding two input vectors
__kernel void vectorAdd(__global const float *x,
__global const float *y,
__global float *restrict z)
{
// get index of the work item
int index = get_global_id(0);
// add the vector elements
z[index] = x[index] + y[index];
}
The kernel is just these few lines. __global is an address-space qualifier indicating that the buffer lives in external memory (e.g. DDR); the rest of the syntax is the same as standard C, so there is little more to say.
The most important piece of the code is get_global_id, the standard idiom in the multi-work-item execution model: each work-item uses its id to pick the element it operates on. All work-items execute the same code, which is why the familiar for() loop is absent from the add function. The kernel configuration can also be customized (number of compute units, SIMD mode, and so on) to control the program's parallelism; more parallelism usually brings higher performance but consumes more resources.
For the details of kernel functions, see "The OpenCL Specification 1.0" and Altera's OpenCL programming guide; I will go into specifics in later notes.
內(nèi)存模型
最后寫(xiě)一下Opencl的內(nèi)存模型,看下面的示意圖:
用核函數(shù)中的內(nèi)存變量來(lái)簡(jiǎn)單地解釋?zhuān)河胏lCreateBuffer 創(chuàng)建、用clSetKernelArg 傳遞的數(shù)據(jù)在global memory 和constant memory中;核函數(shù)中的寄存器變量在private memory 中;核函數(shù)的內(nèi)部變量、緩存等,在local memory 中。圖例中可以看到Device 并不直接訪問(wèn)global memory,而是通過(guò)Cache 來(lái)訪問(wèn)。可以想象當(dāng)同時(shí)運(yùn)行的work-item,使用的內(nèi)存都在同一塊cache 中,則內(nèi)存吞吐的效率最高。對(duì)應(yīng)到work group 中,就是在程序設(shè)計(jì)上盡量使同一個(gè)work group 中的work item 操作連續(xù)的內(nèi)存,以提高訪存效率。
本篇就到這里。