高性能查询：GPU vs PLINQ vs LINQ

Nick Kopp

4.94/5 (102投票s)

2011 年 11 月 24 日

LGPL3

10分钟阅读

151423

5083

了解如何使用图形处理单元 (GPU) 而非 LINQ 和 PLINQ，将查询性能提升 30 倍。

Screenshot

引言

我们越来越多地听到图形处理单元在非图形方面的强大能力。NVIDIA 的 CUDA 使超级计算能力触手可及。虽然一款高端 Intel i7 可能提供 80GFlops（每秒 800 亿次浮点运算），但一款相对便宜的显卡可以达到 1TFlop（每秒万亿次运算）。然而，我们很快就会发现，能够利用 GPU 的应用程序通常相当晦涩——天气模拟、细胞繁殖、视频和图像渲染、金融预测、物理学等。大多数程序员在关注性能时，更可能从事更平凡的工作，例如在数据库中进行检索、解析和生成 XML 以及匹配记录。GPU 能否在这些任务中发挥作用？

背景

.NET 中引入 LINQ（语言集成查询）简化了与 XML、数据库和对象搜索相关的常见任务。典型的应用程序可能涉及一个记录列表，例如网站访问记录。该列表包含数百万条记录，详细说明了访问者的 IP 地址、浏览器类型、访问日期和时间、访问时长、访问页面等。我们可能需要运行一个查询，以返回特定国家/地区在特定日期的访问记录。LINQ 允许我们编写一个简单的查询，该查询适用于各种数据源——例如，列表可以是 XML、数据库或对象。我们获得结果的速度取决于多种因素，尤其是在处理数据库和 XML 文件时。如果我们仅限于对象，即我们已经在内存中以列表形式拥有源数据，那么我们可以更准确地衡量不同技术之间的性能差异。

LINQ 可以很好地处理日期和短字符串等简单比较。PLINQ——LINQ 的并行扩展——可以显著加速这些过程，并且使用起来非常简单。我们也会对此进行探讨。然而，如果需要更复杂的查询，查询性能可能会受到严重影响。在本文中，我们将考虑一个 GPS 轨迹点数据库。这是 GPS 记录器记录的所有路线的列表。此类功能通常用于公司卡车和汽车，用于监控车辆去向——这越来越受到税务机关的要求，以证明私人和工作相关的里程。在我们的示例中，我们想查询我们的 GPS 点数据库，以返回在给定开始和结束日期之间，距离多个目标点在特定半径内的所有点。以蛮力方式计算地球表面两点之间的距离是一项艰巨的挑战，这应该能很好地说明使用 LINQ、PLINQ 和 GPU 执行此操作的优缺点。

GPS Track Points

必备组件

我们将使用 CUDAfy.NET 来编程 GPU。CUDAfy 库包含在下载文件中，但是，如果您想更改源代码并重新编译，则需要从 NVIDIA 下载并安装 CUDA 5.5。在所有情况下，您都需要一块支持 CUDA 的 NVIDIA GPU。有关入门和运行的更多信息，请参阅我之前的文章。

请参阅这篇关于 GPU 上的 Base64 编码的文章，或者这篇关于 CUDAfy 和 CUDA 的基本介绍的文章。

使用代码

此应用程序会生成用户定义的随机 GPS 路线点，然后允许用户输入搜索条件，包括

目标数量。
日期和时间范围。
距离目标的半径（以米为单位）。
然后，可以使用 LINQ、PLINQ 和 GPU 执行相同的查询。

代码的基础是 GPS 轨迹点和 GPS 轨迹。它们定义为 TrackPoint 和 Track。如果主机代码和 GPU 代码可以共享，那就更方便了。

TrackPoint class diagram

在 GPU 上运行的代码称为设备代码，需要对其进行标记，以便 CUDAfy 在转换为 CUDA C 代码时能够识别它。为此，请添加 Cudafy 属性。

[Cudafy]
[StructLayout(LayoutKind.Sequential)]
public struct TrackPoint
{
    public TrackPoint(double lon, double lat, long timeStamp)
    {
        Longitude = lon;
        Latitude = lat;
        TimeStamp = timeStamp;
    }
    ...
}

Cudafy 结构体时，默认情况下所有成员都会被翻译。要忽略某个成员，请使用 CudafyIgnore 属性。在此示例中，我们有一个 Time 属性。这是一个 DateTime 类型。GPU 不支持属性和 DateTime，因此我们需要将其标记为忽略。

[CudafyIgnore]
public DateTime Time
{
    get { return new DateTime(TimeStamp);  }
    set { TimeStamp = value.Ticks; }
}

用于测试 GPS 点是否在目标点半径米范围内的方法定义为

public double DistanceTo(TrackPoint B)
{
    double dDistance = Single.MinValue;
    double dLat1InRad = Latitude * CONSTS.PI2;
    double dLong1InRad = Longitude * CONSTS.PI2;
    double dLat2InRad = B.Latitude * CONSTS.PI2;
    double dLong2InRad = B.Longitude * CONSTS.PI2;
    
    double dLongitude = dLong2InRad - dLong1InRad;
    double dLatitude = dLat2InRad - dLat1InRad;

    // Intermediate result a.
    double a = Math.Pow(Math.Sin(dLatitude / 2.0), 2.0) +
               Math.Cos(dLat1InRad) * Math.Cos(dLat2InRad) *
               Math.Pow(Math.Sin(dLongitude / 2.0), 2.0);

    // Intermediate result c (great circle distance in Radians).
    double c = 2.0 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1.0 - a));

    // Distance
    dDistance = CONSTS.EARTHRADIUS * c;
    return dDistance * 1000;
}

地球表面任意两点之间的最短距离由大圆距离给出。如果您查看机上杂志，您会看到标准的二维地图，显示您从欧洲到加州的航班会经过格陵兰岛。这比出发点或到达点都更向北，这样做不是为了走风景优美的路线，而是因为这是大圆航线。计算距离需要大量的双精度浮点数学运算。

其他方法包含在内是为了完整性，并且在此示例的上下文中不在 GPU 上运行，但请随时自行尝试。

在 CPU 上运行查询的代码如下所示。我们将目标点数组以及其他查询参数传递进去。然后，使用 _track 数组的 LINQ 扩展的 Where 方法，返回 GetNearestTargetIndex 方法返回值小于 255 的所有点。如果值为 255，则表示该点不在任何目标的半径范围内，否则返回离该点最近的目标的索引。标准的 LINQ 将在一个处理器核心上执行。要利用更多核心，只需在 Where 方法前插入 AsParallel 即可。

private IEnumerable<TrackPoint> GetPoints(TrackPoint[] targets, 
        double radius, DateTime startTime, DateTime endTime, bool parallel)
{
    long targetStartTime = startTime.Ticks;
    long targetEndTime = endTime.Ticks;
    // Running the query in parallel is as easy as adding AsParallel().
    // We're not concerned with the ordering, so it's ideal.
    if(parallel)
        return _track.AsParallel().Where(tp => GetNearestTargetIndex(tp, 
                targets, radius, targetStartTime, targetEndTime) < 255);
    else
        return _track.Where(tp => GetNearestTargetIndex(tp, targets, 
                radius, targetStartTime, targetEndTime) < 255);
}

private byte GetNearestTargetIndex(TrackPoint tp, TrackPoint[] targets, 
        double radius, long targetStartTime, long targetEndTime)
{
    double minDist = Double.MaxValue;
    byte index = 255; 
    // If we're not within the time range then no need to look further.
    if (tp.TimeStamp >= targetStartTime && tp.TimeStamp <= targetEndTime)
    {
        int noTargets = targets.Length;
        // Go through the targets and get the index of the closest one.
        for (int i = 0; i < noTargets; i++)
        {
            TrackPoint target = targets[i];
            double d = tp.DistanceTo(target);
            if (d <= radius && d < minDist)
            {
                minDist = d;
                index = (byte)i;
            }
        }
        }
    return index;
}

现在我们来看一下与 GPU 通信的代码。这在 Track 类中实现。构造函数负责 Cudafy。这个过程是生成 CUDA C 代码、编译它，然后将输出（PTX）存储到模块中。此模块可以被序列化，以便在将来的运行中节省时间，或者在没有安装 NVIDIA CUDA SDK 的终端用户系统上运行。在这里，我们希望 Cudafy 两个类型：Track 和 TrackPoint。

public Track()
{
    // Does an already serialized valid cudafy module xml file exist. If so
    // no need to re-generate it.
    var mod = CudafyModule.TryDeserialize(csTRACK);
    if (mod == null || !mod.TryVerifyChecksums())
    {
        mod = CudafyTranslator.Cudafy(typeof(Track), typeof(TrackPoint));
        mod.Serialize(csTRACK);
    }
    // Get the default GPU device and load the module.
    _gpu = CudafyHost.GetDevice();
    _gpu.LoadModule(mod);
}

private const string csTRACK = "track";

高效利用 GPU 进行通用编程的关键是最小化需要在主机和 GPU 设备之间传输的数据量。在这里，我们通过先加载轨迹来实现这一点。然后，我们可以运行多个查询，而无需重新加载轨迹。由于默认情况下我们将操作 10,000,000 个 GPS 点，这代表 240,000,000 字节（每个 GPS 点有两个 System.Double 和一个 System.Int64 字段：经度、纬度和时间戳），这一点很重要。轨迹作为点数组上传。为了清晰起见，驻留在 GPU 上的变量以 _dev 后缀表示。作为一种优化，如果 GPU 上已分配的内存大于要上传的新轨迹，我们不会释放它并重新分配，而只是使用所需量（_currentLength）。变量 _indexes 是一个字节数组，其长度等于轨迹的长度。对于每个对应的点，我们将填充 255（如果该点不在任何目标的半径和日期范围内），或者填充目标的索引（如果它在范围内）。

public void LoadTrack(TrackPoint[] trackPoints)
{
    _currentLength = trackPoints.Length;
    // If the current number of points on the GPU are less than the number of points
    // we want to load then we resize the allocated memory on the GPU. We simply free
    // the existing memory and allocate new memory. We need arrays on the GPU to hold
    // the track points and the selected indexes. We make an array on the host to hold
    // the returned indexes.
    if (_trackPoints.Length < trackPoints.Length)
    {
        if (_trackPoints_dev != null)
        {
            _gpu.Free(_trackPoints_dev);
            _gpu.Free(_indexes_dev);
        }
        _trackPoints_dev = _gpu.Allocate(trackPoints);
        _indexes_dev = _gpu.Allocate<byte>(trackPoints.Length);
        _indexes = new byte[trackPoints.Length];
    }
    _trackPoints = trackPoints;
    // Copy the GPS points to the GPU.
    _gpu.CopyToDevice(trackPoints, 0, _trackPoints_dev, 0, trackPoints.Length);            
}

我们将从应用程序调用的方法是 SelectPoints。GPU 有多种内存形式。CPU 系统也有，因为 CPU 有自己的缓存，但我们从不显式使用它。然而，在 GPU 上，建议使用不同类型的内存，因为它会对性能产生重大影响。最大但也是最慢的内存是全局内存。我们就是把轨迹存储在那里。现代 GPU 通常至少有 1GB 的这种内存。另一种内存是常量内存。它只能由主机写入。对 GPU 来说，它是只读的，并且速度非常快。我们将用它来存储我们的目标。请记住，常量内存的量非常有限（以 KB 为单位，而不是 MB），并且不能被释放。设备函数被称为启动。GPU 设备的 Launch 方法接受以下参数

网格中块的数量。
每个块的线程数（最多 1024 个）。
要启动的函数名称。
设备上的轨迹点数组。
轨迹点的数量。
距离目标的半径（以米为单位）。
开始时间（以 tick 为单位）。
结束时间（以 tick 为单位）。
结果的字节数组。
目标数量。

启动基本上是以并行方式启动 blocksPerGrid*ciTHREADSPERBLOCK 个线程。这个数字可能相当大，但如果 GPU 函数执行时间超过 0.5 秒左右，则建议将过程分成多个调用。NVIDIA 驱动程序不喜欢 GPU 被阻塞太长时间，并且启动会在某个点超时。完成后，我们将 _indexes_dev 数组复制回主机，然后搜索值小于 255 的项，返回相应的 TrackPoint。我们可以选择对索引做些事情，以便识别哪个目标是最近的。这可以使用 TrackPointResult 结构体来完成，该结构体包含在源代码中作为示例。

public IEnumerable<TrackPoint> SelectPoints(TrackPoint[] targets, 
       double radius, DateTime startTime, 
       DateTime endTime, bool naive = false)
{
    int blocksPerGrid;
    // Validate the parameters and calculate how
    // many blocks of threads we will need on the GPU.
    // Each block of threads will execute ciTHREADSPERBLOCK threads.
    Initialize(radius, startTime, endTime, out blocksPerGrid);
    // Copy the targets to constant memory.
    _gpu.CopyToConstantMemory(targets, 0, _targets, 0, targets.Length);
    // Launch blocksPerGrid*ciTHREADSPERBLOCK threads in parallel.
    // Each thread will test one GPS point against all targets.
    _gpu.Launch(blocksPerGrid, ciTHREADSPERBLOCK, "SelectPointsKernel", 
                _trackPoints_dev, _currentLength, radius, 
                startTime.Ticks, endTime.Ticks, _indexes_dev, targets.Length);
    // Copy the indexes array back from the GPU to the host
    // and search for all points that have an index < 255.
    // These correspond to GPS points lying within the search criteria.
    _gpu.CopyFromDevice(_indexes_dev, 0, _indexes, 0, _currentLength);
    for (int i = 0; i < _currentLength; i++)
    {
        byte index = _indexes[i];
        if (index < 255)
            yield return _trackPoints[i];
    }
}

每个线程执行以下设备代码，该代码对一个点进行测试。GThread 标识线程的唯一 ID，以便线程知道要访问哪个点。在本例中，我们将利用另一种形式的 GPU 内存：共享内存。这种内存由一个块中的所有线程共享。当例如以非顺序方式读取少量数据而导致全局内存使用效率低下时，性能会下降。在这种情况下，最好使用共享内存作为中间存储。话虽如此，新的 Fermi 架构似乎在处理这段代码方面做得非常好，简单的实现 SelectPointsKernelNaive 的速度与下面的共享内存实现一样快。可以尝试一下。

[Cudafy]
public static void SelectPointsKernel(GThread thread, TrackPoint[] track, 
       int noPoints, double radius, long startTime, long endTime, 
       byte[] indexes, int noTargets)
{
    // Here we use another form of GPU memory called shared memory.
    // This is shared between threads of a single block.
    // The size of the shared memory must be constant
    // and here we set it to the number of threads per block.
    // This can be more efficient than the naive implementation below,
    // however on the latest fermi architecture there is little
    // or no difference.
    byte[] cache = thread.AllocateShared<byte>("cache", ciTHREADSPERBLOCK);
    // Get the unique index of the thread: size of the
    // block * block index + thread index.
    int tid = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
    // Check we are not beyond the end of the points.
    if (tid < noPoints)
    {
        TrackPoint tp = track[tid];
        double minDist = Double.MaxValue;
        byte index = 255;
        if (tp.TimeStamp >= startTime && tp.TimeStamp <= endTime)
        {
            for (int i = 0; i < noTargets; i++)
            {
                // get the target from constant memory
                TrackPoint target = _targets[i];
                // Calculate distance to target and if less
                // than radius and current nearest target
                // set minDist and index.
                double d = tp.DistanceTo(target);
                if (d <= radius && d < minDist)
                {
                    minDist = d;
                    index = (byte)i;
                }
            }
        }
        // set the index.
        cache[thread.threadIdx.x] = index;
        // Synchronize all threads in block
        thread.SyncThreads();
        // Write the results into global memory array
        indexes[tid] = (byte)cache[thread.threadIdx.x];
    }
}

基准测试

Benchmarks on Core i5 laptop

Benchmarks on Core i7-980X workstation

结果非常令人印象深刻。我们看到 PLINQ 相较于 LINQ 带来了非常有价值的优势，但对于要求极高的应用程序，GPU 在此示例中能提供高达 30 倍的改进。即使轨迹需要重新上传到 GPU（在笔记本电脑上耗时 150 毫秒），优势仍然相当可观。这非常重要，意味着在更常规的业务应用程序中使用 GPU 应该被考虑。这里的代码可以轻松地适应其他用途。当集成到（Web）服务器应用程序中时，负载可以大大降低，从而在设备和能源方面带来可观的成本节省。

LINQ 和 PLINQ 的任务管理器

LINQ versus PLINQ

GPU 的任务管理器

GPU

获取 CUDAfy.NET

请访问 CUDAfy 网站，以确保您拥有最新版本的库。

许可证

Cudafy.NET SDK 包含两个大型示例项目，其中包含射线追踪、涟漪效果和分形等。它作为双重许可软件库提供。LGPL 版本适用于专有或开源应用程序的开发，前提是您能够遵守 GNU LGPL 版本 2.1 中包含的条款和条件。请访问 Cudafy 网站了解更多信息。

历史

首次发布。

2013 年 9 月 17 日

更新为使用 CUDAfy V1.26 和 CUDA 5.5