Device Discovery with SYCL

Henry A Gabb

May 9, 2023

CPOL

Finding out which accelerators are in your system

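Device discovery is an important aspect of SYCL, and of any cross-architecture, heterogeneous parallel programming approach. My previous articles on oneAPI focused on using SYCL and the oneMKL and oneDPL libraries to offload computation to accelerator devices; in other words, on controlling where code executes. This article focuses on device discovery, because writing portable code for heterogeneous systems requires the ability to query the system about the available hardware. For example, if we hard-code the SYCL device selector to use a GPU but the system has no GPU, the following statement fails with a runtime error: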

...
    sycl::queue Q(sycl::gpu_selector_v);
...

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
Aborted

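This code is not portable to systems without a GPU. Instantiating the SYCL queue with the default selector instead of the GPU selector is guaranteed to work, but we lose control over where the queue submits work. The SYCL runtime chooses the device, for example: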

...
    sycl::queue Q(sycl::default_selector_v);

    std::cout << "Running on: "
              << Q.get_device().get_info<sycl::info::device::name>()
              << std::endl;
...

Running on: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

To write more robust heterogeneous parallel programs, let's take a closer look at SYCL device discovery to answer the following questions:

What accelerator devices are available?
Which device is a SYCL queue using?
Which device is a oneDPL execution policy using?

Robust Device Discovery

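Let's run some examples on the Intel® DevCloud for oneAPI, because it offers a variety of Intel hardware options and has the latest Intel® oneAPI toolkits installed. The hardware is updated regularly, but the following compute nodes were available at the time of writing (December 2, 2022):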

     $ pbsnodes | grep properties | sort | uniq -c | sort -nr
     79      properties = xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch
     78      properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
     26      properties = xeon,skl,gold6128,ram192gb,net1gbe,jupyter,batch,fpga_compile
     25      properties = core,tgl,i9-11900kb,ram32gb,netgbe,gpu,gen11
     12      properties = xeon,skl,ram384gb,net1gbe,renderkit
     12      properties = xeon,skl,gold6128,ram192gb,net1gbe,fpga_runtime,fpga,arria10
      6      properties = xeon,icx,gold6348,ramgb,netgbe,jupyter,batch
      4      properties = xeon,icx,plat8380,ram2tb,net1gbe,batch
      4      properties = xeon,clx,ram192gb,net1gbe,batch,extended,fpga,stratix10,fpga_runtime

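As you can see, there are plenty of CPU, GPU, and FPGA options. (Users who have a nondisclosure agreement with Intel can access prerelease hardware in the NDA partition of the DevCloud.) Let's request a node and see which devices are available: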

$ qsub -I -l nodes=1:gen11:ppn=2

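This command requests interactive access to a single node with Intel® Processor Graphics Gen11. SYCL provides several built-in selectors, including the two you have already seen: default_selector_v, gpu_selector_v, cpu_selector_v, and accelerator_selector_v. Note that Intel also provides the fpga_selector and fpga_emulator_selector extensions for FPGA development. They are declared in the sycl/ext/intel/fpga_device_selector.hpp header. For more information about using SYCL on FPGAs, see the FPGA flow chapter of the Intel oneAPI Programming Guide.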

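The built-in selectors are mainly a convenience, but they can be combined with exception handling to make them more robust. For example, the following code falls back to the CPU when no GPU is available: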

    sycl::device d;
    try {
        d = sycl::device(sycl::gpu_selector_v);
    }
    catch (sycl::exception const &e) {
        d = sycl::device(sycl::cpu_selector_v);
    }

However, the SYCL runtime is still choosing the device. We may want more control, especially when several devices are available.

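The following program lists the platforms and devices available on our compute node. (Note that the sycl-ls command-line utility provides the same information.)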

#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    for (auto platform : sycl::platform::get_platforms())
    {
        std::cout << "Platform: "
                  << platform.get_info<sycl::info::platform::name>()
                  << std::endl;

        for (auto device : platform.get_devices())
        {
            std::cout << "\tDevice: "
                      << device.get_info<sycl::info::device::name>()
                      << std::endl;
        }
    }
}

$ icpx -fsycl show_platforms.cpp -o show_platforms
$ ./show_platforms
Platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
        Device: Intel(R) FPGA Emulation Device
Platform: Intel(R) OpenCL
        Device: 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz
Platform: Intel(R) OpenCL HD Graphics
        Device: Intel(R) UHD Graphics [0x9a60]
Platform: Intel(R) Level-Zero
        Device: Intel(R) UHD Graphics [0x9a60]

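SYCL platforms are based on the OpenCL platform model, in which a host is connected to accelerator devices. This is apparent in the example output above. This system exposes three OpenCL-based platforms (FPGA emulation, CPU, and GPU) and one oneAPI Level Zero platform. Each platform has one device to which a SYCL program can submit work. The GPU appears under two platforms, depending on whether we want to use the OpenCL or the oneAPI Level Zero backend. We also have CPU and FPGA emulation platforms. This information lets us create queues that submit work to any of these devices, for example: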

#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    auto platforms = sycl::platform::get_platforms();

    sycl::queue Q1(platforms[1].get_devices()[0]);
    sycl::queue Q2(platforms[3].get_devices()[0]);

    std::cout << "Q1 mapped to "
              << Q1.get_device().get_info<sycl::info::device::name>()
              << std::endl;

    std::cout << "Q2 mapped to "
              << Q2.get_device().get_info<sycl::info::device::name>()
              << std::endl;
}

$ icpx -fsycl map_queues.cpp -o map_queues
$ ./map_queues
Q1 mapped to 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz
Q2 mapped to Intel(R) UHD Graphics [0x9a60]

Note that the devices in the example above are hard-coded, so remember to update the platform indices if you try this program on your own system.

Querying SYCL Queues and Devices

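Queue creation is visible in the previous example code, but that may not always be the case. For example, SYCL queues are commonly passed to oneAPI library functions, so it may be necessary to query a queue for information. Which target device is the queue mapped to? What is the device's backend API? Is it an in-order queue (i.e., must kernels execute in the order in which they were submitted)? What is the vector width or the maximum number of work-item dimensions of the target device?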

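Library developers can use this information to choose the best code path. The SYCL queue class therefore provides several member functions for querying information: get_backend(), get_context(), get_device(), is_in_order(), and so on. Similarly, the SYCL device class provides member functions for querying device characteristics [e.g., the is_cpu(), is_gpu(), and get_info() functions]. The get_info() function in particular can gather detailed information about the target device: vendor, vector width, maximum work-item and image dimensions, memory characteristics, and so on. The SYCL 2020 specification contains the complete list of queryable device information descriptors and device aspects.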

Using such information to tune code is beyond the scope of this article, but it will be the subject of future articles.

Custom Selectors

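Every device selector we have seen so far is a built-in selector provided by the SYCL implementation. Behind the scenes, each device selector is implemented as a C++ callable that takes a device and returns a score. The SYCL implementation calls the device selector to score every available device in the system and ultimately selects the device with the highest score.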

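We can write our own custom device selector by writing a callable of the same form. As a simple first example, we can write a device selector that behaves like the built-in CPU selector by assigning a positive score to every CPU device and a negative score, which excludes a device from selection, to every other device.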

...
    auto my_cpu_selector = [](const sycl::device& d)
    {
        if (d.is_cpu())
        {
            return 1;
        }
        else
        {
            return -1;
        }
    };
    sycl::queue Q(my_cpu_selector);
...

Running on: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

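Because a device selector is just a function, we are free to score devices using any device property (e.g., aspects or device information descriptors) in combination with other variables in the program (e.g., command-line arguments). This is very powerful: it gives us complete control over how devices are scored and selected, and it lets us ensure that the selected device meets the requirements of our application. The example below shows one possible implementation, in which the device selector ignores devices that do not support double-precision floating-point arithmetic and uses a Boolean variable to prefer GPUs.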

...
    bool prefer_gpus = true;  // e.g., from command-line or configuration file
    auto my_selector = [=](const sycl::device& d)
    {
        // Ignore devices that do not support double-precision
        if (not d.has(sycl::aspect::fp64))
        {
            return -1;
        }

        // Optionally prefer GPUs
        if (prefer_gpus and d.is_gpu())
        {
            return 1;
        }
        else
        {
            return 0;
        }
    };
    sycl::queue Q(my_selector);
...

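SYCL recently added the aspect_selector function to help select devices that meet the programmer's requirements. For example, the following statement selects a GPU device that supports half precision while excluding emulated and fixed-function devices.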

  auto dev = sycl::device{sycl::aspect_selector(
    std::vector{sycl::aspect::fp16, sycl::aspect::gpu},        // required aspects
    std::vector{sycl::aspect::custom, sycl::aspect::emulated}  // denied aspects
  )};

At the time of writing, aspect_selector was not yet supported by the Intel® oneAPI DPC++/C++ Compiler, but it should be available soon.

Changing the oneDPL Execution Policy

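The article The Maxloc Reduction in oneAPI (The Parallel Universe, issue 48) showed how oneDPL uses execution policies to offload functions to accelerators. The code examples used only the oneapi::dpl::execution::dpcpp_default policy, so let's see how to use SYCL queues to modify the execution policy and explicitly control where oneDPL functions run.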

#include <oneapi/dpl/execution>
#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    sycl::queue Q1(sycl::gpu_selector_v);
    auto gpu_policy = oneapi::dpl::execution::make_device_policy(Q1);

    std::cout << "GPU execution policy runs oneDPL functions on "
              << gpu_policy.queue().get_device().
                                    get_info<sycl::info::device::name>()
              << std::endl;

    sycl::queue Q2(sycl::cpu_selector_v);
    auto cpu_policy = oneapi::dpl::execution::make_device_policy(Q2);

    std::cout << "CPU execution policy runs oneDPL functions on "
              << cpu_policy.queue().get_device().
                                    get_info<sycl::info::device::name>()
              << std::endl;
}

$ icpx -fsycl onedpl_policy_example.cpp -o onedpl_example
$ ./onedpl_example
GPU execution policy runs oneDPL functions on Intel(R) UHD Graphics [0x9a60]
CPU execution policy runs oneDPL functions on 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz

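The previous program creates queues with the built-in CPU and GPU selectors, then uses those queues to set oneDPL execution policies. We could just as easily query the platforms and devices and instantiate the queues as shown earlier.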

...
    auto platforms = sycl::platform::get_platforms();

    sycl::queue Q1(platforms[3].get_devices()[0]);
    auto gpu_policy = oneapi::dpl::execution::make_device_policy(Q1);

    sycl::queue Q2(platforms[1].get_devices()[0]);
    auto cpu_policy = oneapi::dpl::execution::make_device_policy(Q2);
...

Once again, the devices in the previous example are hard-coded, so remember to update the platform indices for your system.

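We have only scratched the surface of what SYCL offers for device discovery and of how programs can use platform and device information. Expect to see more on this topic in future issues of The Parallel Universe, especially as multidevice systems become more common and programmers begin designing algorithms for specific devices.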

Additional Resources

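Intel Developer Cloud: A free, one-stop service for experimenting with oneAPI on a variety of Intel hardware.
SYCL™ 2020 Reference Guide: Apart from the SYCL™ 2020 specification itself, this reference guide is the go-to source for easily digestible information. Its descriptions of device selection, platforms, contexts, and the device class figured prominently in this article.
Chapter 12, Device Information, of Data Parallel C++: Mastering DPC++ for Programming Heterogeneous Systems Using C++ and SYCL gives a good overview of device discovery, although the syntax of some code examples is outdated.
The oneAPI-samples repository contains hundreds of code examples that illustrate programming with SYCL and the oneAPI libraries.
The Intel® oneAPI Programming Guide provides a basic overview of the Intel oneAPI tools.
The oneAPI GPU Optimization Guide is a living document of coding recommendations for getting the best performance with oneAPI.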