通过 OpenCL™ 实现异构计算

Intel

0/5 (0投票)

2018 年 12 月 6 日

CPOL

6568

本文将分步介绍将工作负载分发到平台上的所有 OpenCL 设备，并使用相同的内核共同完成计算任务的方法。

立即免费获取 Intel® SDK for OpenCL™ Applications!

1. 摘要

OpenCL™ 是跨 CPU、GPU 和 FPGA 等多个计算设备进行编程的开放标准，是实现异构计算的理想编程语言。本文将分步介绍将工作负载分发到平台上的所有 OpenCL 设备，并使用相同的内核共同完成计算任务的方法。尽管本文仅关注 Intel 处理器、Intel® HD Graphics、Iris™ graphics 和 Iris™ Pro graphics，但理论上，它适用于所有符合 OpenCL 标准的计算设备。假定读者对 OpenCL 编程有基本了解。本文不讨论 OpenCL 框架、平台模型、执行模型和内存模型 [1]。

2. 异构计算实现的概念

在 OpenCL 平台中，主机包含一个或多个计算设备。每个设备有一个或多个计算单元，每个计算单元有一个或多个可以执行内核代码的处理元素（图 1）。

图 1：OpenCL™ 平台模型

[2].

从软件实现的角度来看，通常从查询平台开始 OpenCL 程序。然后可以检索设备列表，程序员可以从这些设备中选择设备。下一步是创建一个上下文。选定的设备与上下文关联，并为该设备创建命令队列。

由于一个上下文可以与多个设备关联，因此想法是将 CPU 和 GPU 都关联到上下文，并为每个目标设备创建命令队列（图 2）。

图 2：从编程角度看多设备拓扑。

工作负载被加入到上下文（以缓冲区或图像对象的形式）。因此，所有与该上下文关联的设备都可以访问它。主机程序可以将不同数量的工作负载分发给这些设备。

假设 XX% 的工作负载分配给 CPU，YY% 的工作负载分配给 GPU，只要 XX% + YY% = 100%（图 3），XX% 和 YY% 的值就可以任意选择。

图 3：示例实现的负载分发。

3. 结果

在一个示例性的 Lattice-Boltzman Method (LBM) OpenCL 异构计算实现中，工作负载为 100x100x130 个浮点数，使用不同的 XX%（分配给 CPU 的工作负载百分比）和 YY%（分配给 GPU 的工作负载百分比）组合进行了归一化性能统计，如图 4 所示。性能是在第五代 Intel® Core™ i7 处理器和 Iris™ Pro graphics 上评估的。请注意，尽管组合 (XX, YY) = (50, 50) 获得了最大的性能提升（约 30%），但这并非普遍情况。不同的内核可能更适合 CPU 或 GPU。最佳 (XX, YY) 组合必须逐案评估。

图 4：归一化的 (XX, YY) 组合性能统计。

4. 实现细节

为了更具说明性，以下讨论假定工作负载是一个 100x100x130 的浮点数三维数组，OpenCL 设备是 Intel 处理器和 Intel HD Graphics（或 Iris graphics 或 Iris Pro graphics）。由于实现只涉及主机端程序，因此不讨论 OpenCL 内核实现和优化。本节的伪代码忽略了错误检查。鼓励读者在改编时自行添加错误检查代码。

4.1 工作负载

工作负载假定为一个 100 × 100 x 130 的浮点数三维 (3D) 数组，声明形式如下：

const int iGridSize = 100 * 100 * 130;

float srcGrid [iGridSize], dstGrid [iGridSize];   // srcGrid and dstGrid represent the source and 
 					        //the destination of the workload respectively

尽管工作负载假定为 3D 浮点数组，但内存声明为一个一维数组，以便数据可以轻松地装入 cl_mem 对象，这使得数据操作更加容易。

4.2 表示 OpenCL 平台的结构体

为了在程序中实现图 2 中的概念，OpenCL 数据结构必须至少包含 cl_platform、cl_context 和 cl_program 对象。为了能传递给 OpenCL API 调用，cl_device_id、cl_command_queue 和 cl_kernel 对象以指针形式声明。可以根据使用的计算设备数量通过动态内存分配来实例化它们。

typedef struct {
	cl_platform_id clPlatform;		// OpenCL platform ID
	cl_context clContext;			// OpenCL Context
	cl_program clProgram;			// OpenCL kernel program source object
 	cl_int clNumDevices;			// The number of OpenCL devices to use
	cl_device_id* clDevices;			// OpenCL device IDs
 	cl_device_type* clDeviceTypes;		// OpenCL device types info. CPU, GPU, or 
						// ACCELERATOR
	cl_command_queue* clCommandQueues;	// Command queues for the OpenCL
 							// devices
	cl_kernel* clKernels;			// OpenCL kernel objects
} OpenCL_Param;

OpenCL_Param prm;

4.3 构建 OpenCL 设备

本文讨论的实现考虑了单台机器有两个设备（CPU 和 GPU）的情况，以便读者能够轻松理解该方法。

4.3.1 检测 OpenCL 设备

检测设备是 OpenCL 编程的第一步。可以通过以下代码片段检索设备。

clGetPlatformIDs ( 1, &(prm.clPlatform), NULL );
 	// Get the OpenCL platform ID and store it in prm.clPlatform.

clGetDeviceIDs ( prm.clPlatform, CL_DEVICE_TYPE_ALL, 0, NULL, &(prm.clNumDevices) );
prm.clDevices = (cl_device_id*)malloc ( sizeof(cl_device_id) * prm.clNumDevices );
 	// Query how many OpenCL devices are available in the platform; the number of 
 	// device is stored in prm.clNumDevices. Proper amount of memory is then 
 	// allocated for prm.clDevices according to prm.clNumDevices.

clGetDeviceIDs (prm.clPlatform, CL_DEVICE_TYPE_ALL, prm.clNumDevices, prm.clDevices,  \
 		NULL);
 	// Query the OpenCL device IDs and store it in prm.clDevices.

在异构计算使用中，了解哪个设备是哪个设备以便将正确数量的工作负载分配给指定的计算设备非常重要。可以使用 ClGetDeviceInfo() 查询设备类型信息。

cl_device_type DeviceType;

prm.clDeviceTypes = (cl_device_type*) malloc ( sizeof(cl_device_type) *  \
 						prm.clNumDevices );
 	// Allocate proper amount of memory for prm.clDeviceTypes.

for (int i = 0; i < prm.clNumDevices; i++) {

	clGetDeviceInfo ( prm.clDevices[i], CL_DEVICE_TYPE, \
 			sizeof(cl_device_type), &DeviceType, NULL );
  			// Query the device type of each OpenCL device and store it in
 			// prm.clDeviceType one by one.
	prm.clDeviceTypes[i] = DeviceType;
}

4.3.2 准备 OpenCL 上下文

一旦找到 OpenCL 设备，下一步就是准备 OpenCL 上下文，它为这些设备提供便利。这是一个直接的步骤，与创建上下文的任何其他 OpenCL 编程相同。

cl_context_properties clCPs[3] = { CL_CONTEXT_PLATFORM, prm.clPlatform, 0 };

prm.clContext = clCreateContext ( clCPs, 2, prm.clDevices, NULL, NULL, NULL );

4.3.3 创建命令队列

命令队列是将内核、内核参数和工作负载加载到 OpenCL 设备的通道。为一个 OpenCL 设备创建一个命令队列；在本例中，为 CPU 和 GPU 分别创建了两个命令队列。

prm.clCommandQueues = (cl_command_queue*)malloc ( prm.clNumDevices *  \
 							sizeof(cl_command_queue) );
 	// Allocate proper amount of memory for prm.clCommandQueues.

for (int i = 0; i < prm.clNumDevices; i++) {

prm.clCommandQueues[i] = clCreateCommandQueue ( prm.clContext, \
 				prm.clDevices[i], CL_QUEUE_PROFILING_ENABLE, NULL);
 		// Create command queue for each of the OpenCL device
}

4.4 编译 OpenCL 内核

到目前为止，图 2 中所示的拓扑已经实现。然后应该加载内核源代码文件，并为 OpenCL 设备进行构建以供执行。请注意，平台上存在两个 OpenCL 设备。必须将两个设备 ID 传递给 clBuildProgram() 调用，以便编译器可以为每个设备构建适当的二进制代码。以下源代码片段假定内核源代码已通过文件 I/O 调用加载到缓冲区 clSource 中，此处未详细介绍。

char* clSource;

// Insert kernel source file read code here. Following code assumes clSource buffer is 
// properly allocated and loaded with the kernel source.

prm.clProgram = clCreateProgramWithSource (prm.clContext, 1, clSource, NULL, NULL );
clBuildProgram (prm.clProgram, 2, prm.clDevices, NULL, NULL, NULL );
 	// Build the program executable for CPU and GPU via feeding clBuildProgram() with
 	// "2", which illustrates there are 2 target devices and the device ID list.

prm.clKernels = (cl_kernel*)malloc ( prm.clNumDevices * sizeof(cl_kernel) );
for (int i = 0; i < prm.clNumDevices; i++) {
 	prm.clKernels[i] = clCreateKernel (prm.clProgram, "<the kernel name>", NULL );
}

4.5 分发工作负载

内核构建完成后，就可以将工作负载分发给设备了。以下代码片段演示了如何将指定的工作负载分发给每个 OpenCL 设备。请注意，此处未演示设置 OpenCL 内核参数 clSetKernelArg() 的调用。不同的内核实现需要不同的参数。示例中的设置内核参数的代码在此处意义不大。

// Put kernel argument setting code, clSetKernelArg(), here. Note that, the same argument 
// must be set to the both kernel objects.

size_t dimBlock[3] = { 100, 1, 1 };		// Work-group dimension and size
size_t dimGrid[2][3] = { {100, 100, 130}, {100, 100, 130} };	// Work-item dimension 
 						// and size for each OpenCL device
dimGrid[0][0] = dimGrid[1][0] = (int)ceil ( dimGrid[0][0] / (double)dimBlock[0] ) *  \
 					dimBlock[0];
dimGrid[0][1] = dimGrid[1][1] = (int)ceil ( dimGrid[0][1] / (double)dimBlock[1] ) *  \
 					dimBlock[1];
 	// Make sure the work-item size is a factor of work-group size in each dimension
dimGrid[0][2] = (int)ceil ( round(dimGrid[0][2]* (double)<XX> /100.0) / (double)dimBlock[2] ) 
 			* dimBlock[2];				// Work-items for CPU
dimGrid[1][2] = (int)ceil ( round(dimGrid[1][2] * (double)<YY> /100.0) / 
 			(double)dimBlock[2] ) * dimBlock[2];	// Work-items for GPU
 	// Assume <XX>% of workload for CPU and <YY>% of workload to GPU

Size_t dimOffset[3] = { 0, 0, dimGrid[0][2] };	// The offset of the whole workload. It is 
						// the GPU workload starting point

for (int i = 0; i < 2; i++) {

	If ( CL_DEVICE_TYPE_CPU == prm.clDeviceTypes[i] ) 
		clEnqueueNDRangeKernel ( prm.clCommandQueues[i], prm.clKernels[i], \
					3, NULL, dimGrid[0], dimBlock, 0, NULL, NULL );
	else					// The other device is CL_DEVICE_TYPE_GPU
		clEnqueueNDRangeKernel ( prm.clCommandQueues[i], prm.clKernels[i], \
					3, dimOffset, dimGrid[1], dimBlock, 0, NULL, NULL );
		// Offload proper portion of workload to CPU and GPU respectively
}

5. 参考

[1] OpenCL 2.1 规范。 https://www.khronos.org/registry/cl/

[2] 图片由 Khronos 集团提供。