OpenGL / OpenCL Interoperability: A Case Study Using Autostereograms

martin_bisson

20 Nov 2013

CPOL

How to implement an algorithm on the GPU using OpenCL and OpenGL.

Introduction

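Over the past decade or so, GPUs (graphics processing units) have evolved from hardware specialized in, and dedicated to, rendering tasks into programmable high-performance computing devices that open new possibilities for numerical computation.

Many algorithms can use the massive processing power of the GPU to run faster and/or to handle larger data sets. Even in the days of the fixed-function, non-programmable graphics pipeline, graphics hardware could be used to accelerate algorithms, mostly those related to image processing. For example, fixed-pipeline functionality could be used to compute image differences, blur images, blend images, and even help compute the average of an image (or of an array of values).

Programmable pipeline stages then emerged, giving programmers more flexibility. With the new possibilities offered by the programmable stages, a wider range of algorithms could be ported to the GPU. Some ingenuity was often needed to convert an algorithm into a form suited to rendering something on screen, typically a screen-aligned quad.

Today, programmable GPUs support a higher-level programming paradigm known as GPGPU (general-purpose computing on graphics processing units). This paradigm gives more general, non-graphics-oriented access to the GPU's computing hardware, making it possible to implement algorithms unrelated to rendering.

This article uses autostereogram generation as a case study to explore the possibilities offered by GPGPU APIs, in which the programmable rendering pipeline is extended by interoperating with GPGPU functionality. The depth buffer of a scene rendered with OpenGL is used as input to generate an autostereogram on the GPU with an OpenCL (GPGPU) kernel and OpenGL GLSL (programmable rendering pipeline) shaders, without ever reading the depth data back to the CPU.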

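Case Study: Autostereogram Generation

This article only briefly covers the basics of autostereogram generation; it is not meant to be a detailed review of the subject. See the references for more information.

Stereograms regained popularity when autostereograms became more common. Autostereograms are single images that can be perceived as 3D scenes when the viewer focuses somewhere other than the image plane (usually behind it).

The 3D scene encoded in an autostereogram can be hard to see at first, but after a while, viewing the "secret" scenes hidden in autostereograms becomes easy and effortless.

The algorithm implemented in this article is one of the simplest autostereogram generation algorithms. It repeats a tileable pattern (either a tileable texture or a randomly generated one), varying the "repeat length" according to the z-depth of each pixel in the input depth map. In pseudocode: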

For each line of the output image:
    Copy an entire line of the repeating texture (the tile)
    For each pixel of the line in the input depth map:
        Copy the color of the pixel located at one-tile-width pixels to left, minus an offset X
            where X is 0 for maximum depth (furthest from eye)
              and X is the maximum number of offset pixels (~30 pixels) for minimum depth (closest to eye)

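Thus, the closer a pixel is to the viewer, the shorter the repeating pattern. This is what tricks the eyes and the brain into perceiving the image as three-dimensional. The width of the output image is the sum of the widths of the repeating image and of the input depth image, which leaves room for an initial unaltered copy of the repeating image.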
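
As a rough illustration of the relation between depth and repeat length, the sketch below computes how far to the left the copied pixel lies. The helper name RepeatDistance is hypothetical, and the depth is assumed to be normalized to [0, 1] with 1 being the farthest point from the eye.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of the article's source code).
// Returns the distance, in pixels, between a pixel and the pixel to its
// left that it copies, for a depth normalized to [0, 1] (1 = farthest).
inline int RepeatDistance( int pTileWidth , int pMaxOffset , float pDepth )
{
    assert( pMaxOffset >= 0 && pMaxOffset < pTileWidth );
    assert( pDepth >= 0.0f && pDepth <= 1.0f );

    // X is 0 at maximum depth and pMaxOffset at minimum depth.
    const int lOffsetX =
        static_cast< int >( std::lround( pMaxOffset * ( 1.0f - pDepth ) ) );

    // One tile width to the left, minus the depth-dependent offset X.
    return pTileWidth - lOffsetX;
}
```

With an 85-pixel tile and a 30-pixel maximum offset (illustrative values), the farthest pixels repeat every 85 pixels and the closest ones every 55 pixels.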

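Rather than giving a more formal description of the algorithm, we will examine the CPU implementation later: it remains fairly high-level while showing all the working details.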

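References

Description of stereograms: http://en.wikipedia.org/wiki/Stereogram
Description of autostereograms: http://en.wikipedia.org/wiki/Autostereogram
Tips for viewing autostereograms: http://www.hidden-3d.com/how_to_view_stereogram.php
Non-technical and technical discussion of stereograms: http://www.techmind.org/stereo/stereo.html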

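Overview of the Implementation

The following diagram represents the overall algorithm pipeline.

Rendering the 3D Scene

The 3D scene is rendered with the OpenGL Core Profile pipeline. The scene used in this article's example contains a simple animated ball bouncing off the walls of an open box. A dynamic scene was chosen to give a more "real-time" feel to the different implementations.

The scene is rendered to a texture using a framebuffer object, which makes the resulting data easier to manipulate. Using a texture as the render target instead of the main back buffer has several advantages:

The output dimensions (width and height) are easier to control
Problems that arise when the rendering window overlaps other windows are avoided
Textures usually fit more naturally in this kind of post-processing pipeline

However, it would also be possible to render to the standard back buffer and read directly from it.

Scene rendering usually outputs two buffers: a color buffer and a depth buffer. Only the latter is relevant to stereogram generation, so there is no need to store colors when rendering the scene. Consequently, no color texture needs to be attached when creating the framebuffer object to which the scene is rendered. The following code creates a framebuffer object with only a depth texture as its target.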

// Allocate a texture to which depth will be rendered.
// This texture will be used as an input for our stereogram generation algorithm.
glGenTextures( 1 , &mDepthTexture );
glBindTexture( GL_TEXTURE_2D , mDepthTexture );
glTexImage2D(
    GL_TEXTURE_2D ,
    0 ,
    GL_DEPTH_COMPONENT32 ,
    kSceneWidth ,
    kSceneHeight ,
    0 ,
    GL_DEPTH_COMPONENT ,
    GL_FLOAT ,
    0
    );

glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_MAG_FILTER , GL_LINEAR        );
glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_MIN_FILTER , GL_LINEAR        );
glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_WRAP_S     , GL_CLAMP_TO_EDGE );
glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_WRAP_T     , GL_CLAMP_TO_EDGE );

glBindTexture( GL_TEXTURE_2D , 0 );


// Create a framebuffer object to render directly to the depth texture.
glGenFramebuffers( 1 , &mDepthFramebufferObject );

// Attach only the depth texture: we don't even bother attaching a target
// for colors, because we don't care about it.
glBindFramebuffer( GL_FRAMEBUFFER , mDepthFramebufferObject );
glFramebufferTexture2D(
    GL_FRAMEBUFFER ,
    GL_DEPTH_ATTACHMENT ,
    GL_TEXTURE_2D ,
    mDepthTexture ,
    0
    );
glBindFramebuffer( GL_FRAMEBUFFER , 0 );

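Scene rendering requires other initialization tasks as well, in particular:

Creating, loading and compiling the rendering shaders
Creating the vertex buffers

In the main rendering loop, the following tasks must be performed:

Setting the framebuffer object as the render target
Setting the shader program as the active program
Rendering the scene

To keep the article at a reasonable length, these tasks are not explained here. They are fairly common, straightforward, and not specific to this algorithm.

Here is the scene rendering produced by this program.

Computing the Horizontal Coordinates

This is the interesting part of the algorithm, because it is where the CPU and GPU implementations differ the most.

Once this step is complete, a texture coordinate into the repeating tile has been computed for every pixel of the final output image. Computing the vertical texture coordinates is not much of a challenge, because the tile simply repeats vertically without any variation. They are not even computed in this step; they are easily determined in the next one. The core of this stereogram generation algorithm is really the computation of the horizontal texture coordinate in the repeating tile for each pixel of the output image.

The result of this step is a two-dimensional image of the same size as the final output stereogram, in which each pixel holds a floating-point value representing the horizontal texture coordinate. These values increase continuously from left to right: the fractional part is the actual coordinate (in the range 0 to 1) and the integer part is the number of times the pattern has repeated so far. This representation avoids blending problems when looking up values. For example, if the algorithm sampled halfway between the values 0.99 and 0.01, interpolation would yield about 0.5, which is completely wrong. With the values 0.99 and 1.01, interpolation yields about 1.0, which is coherent.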
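
The interpolation issue can be reproduced with a few lines of C++. This is only a sketch: Lerp and Fract are illustrative helpers, not functions from the article's source code.

```cpp
#include <cassert>
#include <cmath>

// Linear interpolation between two stored coordinates, pT in [0, 1].
inline float Lerp( float pA , float pB , float pT )
{
    assert( pT >= 0.0f && pT <= 1.0f );
    return pA + pT * ( pB - pA );
}

// Fractional part: the actual coordinate inside the tile, in [0, 1).
inline float Fract( float pValue )
{
    return pValue - std::floor( pValue );
}
```

Lerp( 0.99f , 0.01f , 0.5f ) yields about 0.5, the middle of the tile, although both samples sit next to the tile seam. Lerp( 0.99f , 1.01f , 0.5f ) yields about 1.0, which falls on the seam once the integer part is dropped.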

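The pseudocode listed above can be modified slightly to accommodate this intermediate step. After the first pixels of a row are set to the coordinates of one full repeating tile (that is, values increasing from 0 to 1 across the tile), the lookup step proceeds by looking one tile width of pixels to the left, minus a value that depends on depth. In pseudocode: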

For each line of the output coordinate image:
    Write the coordinates for the first line of the repeating tile
    For each pixel of the line in the input depth map:
        Sample the coordinate in the currently-written line one-tile-width pixels to left, minus an offset X
            where X is 0 for maximum depth (furthest from eye)
              and X is the maximum number of offset pixels (~30 pixels) for minimum depth (closest to eye)
        Add 1 to this value so that result is constantly increasing
        Store computed value in output coordinate image

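The CPU implementation will make the details clearer, since it remains fairly high-level while showing all the working details.

Rendering the Stereogram

The last step takes the coordinate "image" and the repeating tile image as inputs and renders the final image simply by sampling the tile image at the right location. The horizontal texture coordinate comes from the input coordinate "image". The vertical texture coordinate is computed from the output pixel coordinate (the tile simply repeats vertically).

This sampling is done on the GPU with a custom shader. A screen-aligned quad is rendered, and the following fragment shader computes the final color in the generated stereogram.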

#version 150
smooth in vec2 vTexCoord;
out vec4       outColor;

// Sampler for the generated offset texture.
uniform sampler2D uOffsetTexture;
// Sampler for the repeating pattern texture.
uniform sampler2D uPatternTexture;
// Scaling factor (i.e. ratio of height of two previous textures).
uniform float     uScaleFactor;

void main( )
{
    // The horizontal lookup coordinate comes directly from the
    // computed offsets stored in the offset texture.
    float lOffsetX = texture( uOffsetTexture, vTexCoord ).x;
    
    // The vertical coordinate is computed using a scaling factor
    // to map between the coordinates in the input height texture
    // (i.e. vTexCoord.y) and where to look up in the repeating pattern.
    // The scaling factor is the ratio of the two textures' height.
    float lOffsetY = ( vTexCoord.y * uScaleFactor );
    
    vec2  lCoords  = vec2( lOffsetX , lOffsetY );
    outColor = texture( uPatternTexture , lCoords );
}
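
The value passed as uScaleFactor is not shown in this article. One plausible choice, sketched below under that assumption, is the stereogram height divided by the pattern height, so that the pattern keeps its pixel size and repeats vertically. The sketch also assumes the pattern texture uses a repeating wrap mode (GL_REPEAT), which the emulation function mimics on the CPU. Both helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: ratio of the stereogram height to the pattern
// height, both in pixels (assumed meaning of uScaleFactor).
inline float ComputeScaleFactor( int pStereogramHeight , int pPatternHeight )
{
    assert( pStereogramHeight > 0 && pPatternHeight > 0 );
    return static_cast< float >( pStereogramHeight )
         / static_cast< float >( pPatternHeight );
}

// CPU emulation of the vertical lookup with a repeating wrap mode:
// the scaled coordinate is wrapped back into [0, 1).
inline float WrappedPatternV( float pTexCoordY , float pScaleFactor )
{
    const float lScaled = pTexCoordY * pScaleFactor;
    return lScaled - std::floor( lScaled );
}
```

For a 600-pixel-high stereogram and a 100-pixel-high pattern (illustrative sizes), the factor is 6, and a quarter of the way up the stereogram the lookup lands halfway up the second copy of the pattern.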

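This concludes the overview of the algorithm. The next section describes the CPU implementation of the coordinate generation stage.

CPU Implementation

The CPU implementation of the algorithm covers only the generation of the offsets (that is, the texture coordinates) from the input depths. It is a straightforward C++ translation of the pseudocode above. It still consists of three steps:

First, the depths are read back from the GPU to the CPU
Then, the offsets are generated from these depths
Finally, the offsets are written back from the CPU to the GPU

Step 1: Reading the Input Depths from the GPU to the CPU

After the scene is rendered, the depths are stored in a texture on the GPU. For the CPU implementation to access them, they must be fetched from the GPU and stored in CPU-accessible memory. A std::vector of floats stores these depths on the CPU side, as shown in the following code.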

// Read the depths data from the GPU.
glBindTexture( GL_TEXTURE_2D , mDepthTexture );
glGetTexImage(
    GL_TEXTURE_2D ,
    0 ,
    GL_DEPTH_COMPONENT ,
    GL_FLOAT ,
    mInputDepths.data()
    );
glBindTexture( GL_TEXTURE_2D , 0 );

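The depths are then stored in a vector of floating-point values.

Step 2: Processing

Generating the offsets is just a matter of applying the algorithm described in the pseudocode above and storing the results in a regular memory array. The following code shows this translation, reading from mInputDepths and writing to mOutputOffsets.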

const int lPatternWidth    = pPatternRenderer.GetPatternWidth();
const int lStereogramWidth = kSceneWidth + lPatternWidth;
for ( int j = 0; j < kSceneHeight; ++j )
{
    // First initialize the first band of lookups without depths offsets.
    for ( int i = 0, lCountI = lPatternWidth; i < lCountI; ++i )
    {
        float& lOutput = mOutputOffsets[ j * lStereogramWidth + i ];
        lOutput = i / static_cast< float >( lPatternWidth );
    }

    // Then compute offsets.
    for ( int i = lPatternWidth; i < lStereogramWidth; ++i )
    {
        float& lOutput = mOutputOffsets[ j * lStereogramWidth + i ];

        // Get the depth value associated with this pixel.
        const int   lInputI = i - lPatternWidth;
        const float lDepthValue = mInputDepths[ j * kSceneWidth + lInputI ];
        // Get where to look up for the offset value.
        const float lLookUpPos = static_cast< float >( lInputI ) + kMaxOffset * ( 1 - lDepthValue );

        // Lerp between pixel values.
        const int lPos1 = static_cast< int >( lLookUpPos );
        const int lPos2 = lPos1 + 1;
        const float lFrac = lLookUpPos - lPos1;
        const float lValue1 = mOutputOffsets[ j * lStereogramWidth + lPos1 ];
        const float lValue2 = mOutputOffsets[ j * lStereogramWidth + lPos2 ];

        // We add 1.0 to the lerp-ed so that offset values are always increasing
        // in a given row (to make sure interpolation between any offset value
        // still makes sense).
        const float lValue = 1.0f + ( lValue1 + lFrac * ( lValue2 - lValue1 ) );

        lOutput = lValue;
    }
}
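
The inner loop above reads lPos1 and lPos2 from the row it is currently writing, so both must lie strictly to the left of the current pixel. The following sketch states that precondition explicitly. AreLookupsValid is a hypothetical helper, and depths are assumed to be normalized to [0, 1].

```cpp
#include <cassert>

// Hypothetical helper (not part of the article's source code).
// For depth 0 the largest index read is
//   ( i - pPatternWidth ) + floor( pMaxOffset ) + 1
// and it must stay below i, the pixel being written.
inline bool AreLookupsValid( int pPatternWidth , float pMaxOffset )
{
    return pPatternWidth > 0
        && pMaxOffset >= 0.0f
        && pMaxOffset + 1.0f < static_cast< float >( pPatternWidth );
}
```

A check such as assert( AreLookupsValid( lPatternWidth , kMaxOffset ) ) before the loop would catch a pattern that is too narrow for the chosen maximum offset.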

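Step 3: Writing the Output Offsets from the CPU to the GPU

Once the offsets are generated, they must be sent back to the GPU for the final stereogram rendering. This operation is essentially the reverse of step 1, and is shown in the following code.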

glBindTexture( GL_TEXTURE_2D , mOffsetTexture );
glTexSubImage2D( 
    GL_TEXTURE_2D ,
    0 ,
    0 ,
    0 ,
    lStereogramWidth ,
    kSceneHeight ,
    GL_RED ,
    GL_FLOAT ,
    mOutputOffsets.data()
    );
glBindTexture( GL_TEXTURE_2D , 0 );

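The offsets are then written to GPU memory.

This concludes the CPU implementation of the algorithm. The biggest drawback of this approach is that a relatively large amount of data must be exchanged between the CPU and the GPU for every frame. Reading image data back from the GPU, processing it, and then writing it back to the GPU can be a serious performance bottleneck in a real-time application.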
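
To give an order of magnitude for this traffic, the sketch below computes the number of bytes read back and uploaded per frame with 32-bit floats. The helper and the 800 x 600 scene size used in the note are illustrative assumptions; the actual values of kSceneWidth and kSceneHeight are not given in this article.

```cpp
#include <cassert>
#include <cstddef>

// Bytes exchanged per frame by the CPU implementation (hypothetical helper).
struct TransferSizes
{
    std::size_t mReadBytes;   // depth texture, GPU to CPU
    std::size_t mWriteBytes;  // offset texture, CPU to GPU
};

inline TransferSizes ComputeTransferSizes(
    int pSceneWidth , int pSceneHeight , int pPatternWidth
    )
{
    assert( pSceneWidth > 0 && pSceneHeight > 0 && pPatternWidth > 0 );

    const std::size_t lHeight = static_cast< std::size_t >( pSceneHeight );
    TransferSizes lSizes;
    // One float per pixel of the depth map.
    lSizes.mReadBytes  = static_cast< std::size_t >( pSceneWidth ) * lHeight * sizeof( float );
    // One float per pixel of the stereogram, which is one pattern wider.
    lSizes.mWriteBytes = static_cast< std::size_t >( pSceneWidth + pPatternWidth ) * lHeight * sizeof( float );
    return lSizes;
}
```

For an 800 x 600 scene and an 85-pixel pattern, this amounts to roughly 1.9 MB read and 2.1 MB written for every frame.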

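To avoid this problem, processing will be performed directly in GPU memory, eliminating the round trip between the CPU and the GPU. The next section describes this approach.

GPU Implementation

To avoid unnecessary round trips between the CPU and the GPU, the depth data must be processed directly on the GPU. However, the stereogram generation algorithm needs to look up previously set values in the same row of the output image. Reading from and writing to the same texture/image buffer is very unfriendly to conventional GPU processing approaches, such as using fragment shaders.

A "band" approach could be used, in which vertical bands are rendered from left to right, each band being no wider than the minimum lookup distance to the left. In the example provided with the source code, the repeating pattern is 85 pixels wide and the maximum offset from that full lookup distance is 30 pixels (the value of kMaxOffset), resulting in a maximum band width of 55 pixels. Because it is impossible to read at random locations from the texture being rendered to, two copies of that texture would need to be kept: one to read from and one to write to. What was just written would then have to be copied to the other texture.

This approach requires two copies of the texture, which is not ideal. Moreover, the band width directly affects the number of rendering passes, which in turn directly affects performance. Yet this width depends on the width of the repeating pattern, which can vary from one generation to the next, and on the maximum offset, a parameter that could benefit from real-time adjustment. Having performance depend on such varying parameters is far from ideal.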
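
The dependence of the pass count on these parameters can be made concrete with a small sketch. Both helpers are hypothetical, and the 800-pixel scene width in the note is an illustrative assumption.

```cpp
#include <cassert>

// Widest band that never reads pixels written in the same pass,
// following the 85 - 30 = 55 reasoning above.
inline int MaxBandWidth( int pPatternWidth , int pMaxOffset )
{
    assert( pMaxOffset >= 0 && pMaxOffset < pPatternWidth );
    return pPatternWidth - pMaxOffset;
}

// Number of rendering passes needed to cover the depth-dependent part
// of the stereogram (integer ceiling division).
inline int BandPassCount( int pSceneWidth , int pBandWidth )
{
    assert( pSceneWidth > 0 && pBandWidth > 0 );
    return ( pSceneWidth + pBandWidth - 1 ) / pBandWidth;
}
```

With an 85-pixel pattern, an 800-pixel-wide scene needs 15 passes for a maximum offset of 30 pixels, but 32 passes if the maximum offset is raised to 60 pixels.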

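A more flexible approach than the regular programmable rendering pipeline is needed. Enter OpenCL. The "general-purpose" part of this GPGPU API is particularly important for this kind of application: it allows the GPU to be used for more generic, less rendering-oriented algorithms, and this flexibility allows the GPU to be used efficiently for stereogram generation.

First, the modifications required to the rendering part of the CPU implementation are shown. Then, the creation of an OpenCL context able to share resources with the OpenGL context is described. Finally, the OpenCL kernel used to generate the stereogram is presented, along with the elements needed to run it.

Modifications to Scene Rendering

The depth texture used by the CPU version of the algorithm cannot be used with the GPU implementation. This texture must be shared with the OpenCL context, and OpenCL restricts the OpenGL texture formats it can access directly. According to the documentation of clCreateFromGLTexture2D, which refers to the table of supported image channel order values, GL_DEPTH_COMPONENT32 is not a supported internal format for an OpenGL texture to be shared with OpenCL. This is unfortunate, because the internal representation of that format is most likely identical to the one that will be used instead, but this lack of support can be worked around.

To obtain the depths from the scene rendering step, a second texture is attached to the framebuffer object. Remember that in the CPU version only a depth texture was attached. That depth texture must still be attached so that it can act as the depth buffer and depth testing works properly. However, another texture is attached as a "color attachment"; instead of colors, it receives depth values. The following code shows how to create this texture and attach it to the framebuffer object.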

// Skipped code to allocate depth texture...

// *** DIFFERENCE FROM CPU IMPLEMENTATION ***
// However, because OpenCL can't bind itself to depth textures, we also create
// a "normal" floating point texture that will also hold depths.
// This texture will be the input for our stereogram generation algorithm.
glGenTextures( 1 , &mColorTexture );
glBindTexture( GL_TEXTURE_2D , mColorTexture );
glTexImage2D(
    GL_TEXTURE_2D ,
    0 ,
    GL_R32F ,
    kSceneWidth ,
    kSceneHeight ,
    0 ,
    GL_RED ,
    GL_FLOAT ,
    0
    );

glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_MAG_FILTER , GL_LINEAR        );
glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_MIN_FILTER , GL_LINEAR        );
glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_WRAP_S     , GL_CLAMP_TO_EDGE );
glTexParameteri( GL_TEXTURE_2D , GL_TEXTURE_WRAP_T     , GL_CLAMP_TO_EDGE );

glBindTexture( GL_TEXTURE_2D , 0 );


// Create a framebuffer object to render directly to the depth texture.
glGenFramebuffers( 1 , &mDepthFramebufferObject );

// Attach the depth texture and the color texture (to which depths will be output)
glBindFramebuffer( GL_FRAMEBUFFER , mDepthFramebufferObject );
glFramebufferTexture2D(
    GL_FRAMEBUFFER ,
    GL_DEPTH_ATTACHMENT ,
    GL_TEXTURE_2D ,
    mDepthTexture ,
    0
    );
glFramebufferTexture2D( 
    GL_FRAMEBUFFER ,
    GL_COLOR_ATTACHMENT0 ,
    GL_TEXTURE_2D ,
    mColorTexture ,
    0
    );
glBindFramebuffer( GL_FRAMEBUFFER , 0 );

A fragment shader is then needed to render the depths into this "color attachment". As the following code shows, it is quite simple:

#version 150
out vec4 outColor;

void main( )
{
    float lValue = gl_FragCoord.z;
    outColor = vec4( lValue , lValue , lValue , 1.0 );
}

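These modifications result in a texture that can be used with clCreateFromGLTexture2D() to be shared with the OpenCL context, as shown in the next section.

OpenCL Context Creation

The following steps are usually performed to create an OpenCL context: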

List OpenCL platforms and choose one (usually the first one).
List OpenCL devices on this platform and choose one (usually the first one).
Create an OpenCL context on this device.

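However, for the stereogram generation algorithm, care must be taken to allocate an OpenCL context that can access the resources of the existing OpenGL context. Additional parameters are passed to the OpenCL context creation routine to request a compatible context. This means that context creation can fail if, for example, the OpenGL context was created on a different device than the one on which we are trying to allocate the OpenCL context. The creation steps therefore need to be modified to enforce this compatibility requirement: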

List OpenCL platforms.
For each platform:
    List OpenCL devices on this platform
    For each device:
        Try to allocate a context
            on this device
            compatible with current OpenGL context
        if context successfully created:
            stop

Note that all platforms and devices may have to be traversed to find the right context. The following code performs this OpenCL context creation.

cl_int lError = CL_SUCCESS;
std::string lBuffer;

//
// Generic OpenCL creation.
//

// Get platforms.
cl_uint lNbPlatformId = 0;
clGetPlatformIDs( 0 , 0 , &lNbPlatformId );

if ( lNbPlatformId == 0 )
{
    std::cerr << "Unable to find an OpenCL platform." << std::endl;
    return false;
}


// Loop on all platforms.
std::vector< cl_platform_id > lPlatformIds( lNbPlatformId );
clGetPlatformIDs( lNbPlatformId , lPlatformIds.data() , 0 );

// Try to find the device with the compatible context.
cl_platform_id lPlatformId = 0;
cl_device_id lDeviceId = 0;
cl_context lContext = 0;

for ( size_t i = 0; i < lPlatformIds.size() && lContext == 0; ++i )
{
    const cl_platform_id lPlatformIdToTry = lPlatformIds[ i ];

    // Get devices.
    cl_uint lNbDeviceId = 0;
    clGetDeviceIDs( lPlatformIdToTry , CL_DEVICE_TYPE_GPU , 0 , 0 , &lNbDeviceId );

    if ( lNbDeviceId == 0 )
    {
        continue;
    }

    std::vector< cl_device_id > lDeviceIds( lNbDeviceId );
    clGetDeviceIDs( lPlatformIdToTry , CL_DEVICE_TYPE_GPU , lNbDeviceId , lDeviceIds.data() , 0 );


    // Create the properties for this context.
    cl_context_properties lContextProperties[] = {
        // We need to add information about the OpenGL context with
        // which we want to exchange information with the OpenCL context.
        #if defined (WIN32)
        // We should first check for cl_khr_gl_sharing extension.
        CL_GL_CONTEXT_KHR , (cl_context_properties) wglGetCurrentContext() ,
        CL_WGL_HDC_KHR , (cl_context_properties) wglGetCurrentDC() ,
        #elif defined (__linux__)
        // We should first check for cl_khr_gl_sharing extension.
        CL_GL_CONTEXT_KHR , (cl_context_properties) glXGetCurrentContext() ,
        CL_GLX_DISPLAY_KHR , (cl_context_properties) glXGetCurrentDisplay() ,
        #elif defined (__APPLE__)
        // We should first check for cl_APPLE_gl_sharing extension.
        #if 0
        // This doesn't work.
        CL_GL_CONTEXT_KHR , (cl_context_properties) CGLGetCurrentContext() ,
        CL_CGL_SHAREGROUP_KHR , (cl_context_properties) CGLGetShareGroup( CGLGetCurrentContext() ) ,
        #else
        CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE , (cl_context_properties) CGLGetShareGroup( CGLGetCurrentContext() ) ,
        #endif
        #endif
        CL_CONTEXT_PLATFORM , (cl_context_properties) lPlatformIdToTry ,
        0 , 0 ,
    };


    // Look for the compatible context.
    for ( size_t j = 0; j < lDeviceIds.size(); ++j )
    {
        cl_device_id lDeviceIdToTry = lDeviceIds[ j ];
        cl_context lContextToTry = 0;

        lContextToTry = clCreateContext(
            lContextProperties ,
            1 , &lDeviceIdToTry ,
            0 , 0 ,
            &lError
            );
        if ( lError == CL_SUCCESS )
        {
            // We found the context.
            lPlatformId = lPlatformIdToTry;
            lDeviceId = lDeviceIdToTry;
            lContext = lContextToTry;
            break;
        }
    }
}

if ( lDeviceId == 0 )
{
    std::cerr << "Unable to find a compatible OpenCL device." << std::endl;
    return false;
}


// Create a command queue.
cl_command_queue lCommandQueue = clCreateCommandQueue( lContext , lDeviceId , 0 , &lError );
if ( !CheckForError( lError ) )
{
    std::cerr << "Unable to create an OpenCL command queue." << std::endl;
    return false;
}
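
The comments in the listing above note that the sharing extension should be checked first. A minimal sketch of such a check follows. It assumes the space-separated extension string has already been obtained with clGetDeviceInfo and CL_DEVICE_EXTENSIONS; HasExtension is a hypothetical helper, not part of the article's source code.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if pName appears as a whole token in the space-separated
// extension list pExtensions.
inline bool HasExtension( const std::string& pExtensions , const std::string& pName )
{
    assert( !pName.empty() );

    std::istringstream lStream( pExtensions );
    std::string lToken;
    while ( lStream >> lToken )
    {
        // Compare whole tokens so that a longer name sharing the same
        // prefix does not produce a false positive.
        if ( lToken == pName )
            return true;
    }
    return false;
}
```

A device would then be skipped when neither cl_khr_gl_sharing nor, on OS X, cl_APPLE_gl_sharing is reported.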

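With the OpenCL context created, OpenCL memory objects (of type cl_mem) can now be created to represent the OpenGL textures to be shared. These objects do not allocate memory; they are merely aliases for the buffers underlying the OpenGL textures, allowing OpenCL to read from and write to them.

The clCreateFromGLTexture2D function is used to create these references to the OpenGL textures, as shown below: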

// OpenCL 1.2 deprecates clCreateFromGLTexture2D to use
// clCreateFromGLTexture, but we keep it to work with OpenCL 1.1.
mDepthImage = clCreateFromGLTexture2D(
    mContext ,
    CL_MEM_READ_ONLY ,
    GL_TEXTURE_2D ,
    0 ,
    pInputDepthTexture ,
    &lError
    );
if ( !CheckForError( lError ) )
    return false;

mOffsetImage = clCreateFromGLTexture2D(
    mContext ,
    CL_MEM_WRITE_ONLY ,
    GL_TEXTURE_2D ,
    0 ,
    pOutputOffsetTexture ,
    &lError
    );
if ( !CheckForError( lError ) )
    return false;

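Note that this function was deprecated in OpenCL 1.2 in favor of clCreateFromGLTexture, but clCreateFromGLTexture2D is still used so that this application can run on systems that only support OpenCL 1.1.

These objects can now be used like regular OpenCL image objects; they are processed by the OpenCL kernel described in the next section.

Kernel Design, Implementation and Execution

The goal of this section is not to dive into the concepts and syntax of OpenCL, but to present the elements relevant to this problem. In the context of this stereogram generation algorithm, two factors strongly influence the kernel design: first, there is a data dependency between pixels of the same row; second, a kernel cannot read from and write to the same image buffer within the same kernel execution.

How Much Data Should a Kernel Process?

Kernels are designed to process a subset of the data so that they can be run in parallel by the multiple compute units available in an OpenCL-capable device. When implementing image processing algorithms in OpenCL, a very popular design choice is to have each kernel instance process one pixel of the image, which allows a great deal of parallelism. However, the stereogram generation algorithm has a problematic data dependency: to process any given pixel, it needs pixels to its left on the same row. The proposed kernel therefore processes one entire row per instance rather than one pixel.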
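
With one work-item per row, the global work size is driven by the image height. When an explicit work-group size is passed to clEnqueueNDRangeKernel, OpenCL 1.x requires the global size to be a multiple of it, so the row count is typically rounded up and the surplus work-items exit early. The sketch below shows that rounding; RoundUpToMultiple is a hypothetical helper, and the way the article's source computes its own sizes is not shown here.

```cpp
#include <cassert>
#include <cstddef>

// Rounds pValue up to the next multiple of pMultiple.
inline std::size_t RoundUpToMultiple( std::size_t pValue , std::size_t pMultiple )
{
    assert( pMultiple > 0 );
    return ( ( pValue + pMultiple - 1 ) / pMultiple ) * pMultiple;
}
```

For a 600-row image and a work-group size of 64 (illustrative values), 640 work-items are launched and the last 40 have no row to process.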

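How to Avoid Reading from and Writing to the Same Buffer?

A related issue is that an OpenCL kernel cannot read from and write to the same image buffer, just as an OpenGL texture cannot be both sampled and written to in the same rendering pass. Yet the results for already-processed pixels are needed to compute upcoming pixels, so the algorithm must be adapted.

A simple observation helps: a lookup never needs to reach further back than the width of the repeating image. A local buffer of that width can therefore hold the most recently computed offsets. Using this local buffer as a circular buffer that can be read and written freely avoids the read/write problem. When an offset is computed, the kernel always reads from the local buffer and writes the result both to the local buffer and to the output image. The output image never needs to be read, which solves the problem.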
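
The circular-buffer idea can be checked on the CPU before porting it to a kernel. The sketch below, written in plain C++ under the assumption that depths are normalized to [0, 1], computes one row twice: once by reading earlier results from the output row itself, and once by reading only from a pattern-width ring buffer. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference version: reads earlier results directly from the output row.
inline std::vector< float > RowOffsetsDirect(
    const std::vector< float >& pDepths , int pPatternWidth , float pMaxOffset )
{
    assert( pPatternWidth > 0 && pMaxOffset >= 0.0f && pMaxOffset + 1.0f < pPatternWidth );
    const int lWidth = static_cast< int >( pDepths.size() ) + pPatternWidth;
    std::vector< float > lRow( lWidth );
    for ( int i = 0; i < pPatternWidth; ++i )
        lRow[ i ] = i / static_cast< float >( pPatternWidth );
    for ( int i = pPatternWidth; i < lWidth; ++i )
    {
        const int   lInputI = i - pPatternWidth;
        const float lPos    = lInputI + pMaxOffset * ( 1.0f - pDepths[ lInputI ] );
        const int   lPos1   = static_cast< int >( std::floor( lPos ) );
        const float lFrac   = lPos - lPos1;
        lRow[ i ] = 1.0f + lRow[ lPos1 ] + lFrac * ( lRow[ lPos1 + 1 ] - lRow[ lPos1 ] );
    }
    return lRow;
}

// Ring-buffer version: the output row is write-only; every read goes to
// a buffer holding the last pPatternWidth results.
inline std::vector< float > RowOffsetsRing(
    const std::vector< float >& pDepths , int pPatternWidth , float pMaxOffset )
{
    assert( pPatternWidth > 0 && pMaxOffset >= 0.0f && pMaxOffset + 1.0f < pPatternWidth );
    const int lWidth = static_cast< int >( pDepths.size() ) + pPatternWidth;
    std::vector< float > lOutput( lWidth );
    std::vector< float > lRing( pPatternWidth );
    for ( int i = 0; i < pPatternWidth; ++i )
        lOutput[ i ] = lRing[ i ] = i / static_cast< float >( pPatternWidth );
    for ( int i = pPatternWidth; i < lWidth; ++i )
    {
        // Index i and index i - pPatternWidth share the same ring slot.
        const float lPos   = i + pMaxOffset * ( 1.0f - pDepths[ i - pPatternWidth ] );
        const int   lPos1  = static_cast< int >( std::floor( lPos ) );
        const float lFrac  = lPos - lPos1;
        const float lV1    = lRing[ lPos1 % pPatternWidth ];
        const float lV2    = lRing[ ( lPos1 + 1 ) % pPatternWidth ];
        const float lValue = 1.0f + lV1 + lFrac * ( lV2 - lV1 );
        lRing[ i % pPatternWidth ] = lValue;  // overwrite the oldest entry
        lOutput[ i ] = lValue;
    }
    return lOutput;
}
```

The slot overwritten at the end of each iteration holds the result from exactly one pattern width earlier, which is the oldest value any later lookup could need.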

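This kind of adaptation is common when implementing algorithms with GPGPU APIs. Compared with CPU implementations, these APIs usually offer different capabilities and impose different restrictions, especially regarding synchronization primitives. Such modifications may be required to port an algorithm, as in this case. They can also be optimizations that let the kernel run faster, for example through more efficient memory access patterns. This must be kept in mind when porting an algorithm from the CPU to GPGPU: even for a problem as simple as this one, the translation is not always straightforward.

With these design issues addressed, the kernel can now be implemented. The following code puts the points above into practice.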

// We will sample using unnormalized (pixel) coordinates.
// Because we will sample exact values, we can choose nearest filtering.
const sampler_t kSampler =
    CLK_NORMALIZED_COORDS_FALSE
    | CLK_ADDRESS_CLAMP_TO_EDGE
    | CLK_FILTER_NEAREST;

// Stereogram-generating kernel.
__kernel void Stereogram(
    __write_only image2d_t pOffsetImage ,
    __read_only image2d_t pDepthImage
    )
{
    // Private buffer to hold last image offset.
    float lBuffer[ kPatternWidth ];
    
    const int2 lOutputDim = get_image_dim( pOffsetImage );
    const int  lRowPos = get_global_id( 0 );
    if ( lRowPos >= lOutputDim.y )
        return;
    
    // First copy direct values.
    for ( int i = 0 ; i < kPatternWidth; ++i )
    {
        const float lValue = ( i / (float) kPatternWidth );
        // We copy them in the temporary buffer from which we will fetch upcoming offsets.
        lBuffer[ i ] = lValue;
        
        // ... and we also output it in the first band of the image.
        const int2 lOutputPos = { i , lRowPos };
        write_imagef( pOffsetImage , lOutputPos , (float4) lValue );
    }
    
    // Then actually generate offsets based on depth.
    for ( int i = kPatternWidth ; i < lOutputDim.x; ++i )
    {
        const int2 lLookupPos = { i - kPatternWidth , lRowPos };
        const float4 lDepth = read_imagef( pDepthImage , kSampler , lLookupPos );
        const float  lOffset = kMaxOffset * ( 1 - lDepth.x );
        const float  lPos  = i + lOffset;
        const int    lPos1 = ( (int) ( lPos ) );
        const int    lPos2 = ( lPos1 + 1 );
        const float  lFrac = lPos - lPos1;
        const float  lValue1 = lBuffer[ lPos1 % kPatternWidth ];
        const float  lValue2 = lBuffer[ lPos2 % kPatternWidth ];
        const float  lValue = 1 + lValue1 + lFrac * ( lValue2 - lValue1 );
        
        // Update private buffer.
        lBuffer[ i % kPatternWidth ] = lValue;
        
        // Update output image.
        const int2 lOutputPos = { i , lRowPos };
        write_imagef( pOffsetImage , lOutputPos , (float4) lValue );
    }
}

This kernel code must be compiled by the OpenCL driver before it can run. This is done as for any OpenCL kernel:

// Create program.
const char* lCode = kKernelCode;

// We pass compilation parameters to define values that will be constant for
// all execution of the kernel.
std::ostringstream lParam;
lParam << "-D kPatternWidth=" << pPatternWidth << " -D kMaxOffset=" << kMaxOffset;

cl_program lProgram = clCreateProgramWithSource( mContext , 1 , &lCode , 0 , &lError );
if ( !CheckForError( lError ) )
    return false;

lError = clBuildProgram( lProgram , 1 , &mDeviceId , lParam.str().c_str() , 0 , 0 );
if ( lError == CL_BUILD_PROGRAM_FAILURE )
{
    // Determine the size of the log
    size_t lLogSize;
    clGetProgramBuildInfo(
        lProgram , mDeviceId , CL_PROGRAM_BUILD_LOG , 0 , 0 , &lLogSize
        );

    // Get the log
    std::string lLog;
    lLog.resize( lLogSize );
    clGetProgramBuildInfo(
        lProgram ,
        mDeviceId ,
        CL_PROGRAM_BUILD_LOG ,
        lLogSize ,
        const_cast< char* >( lLog.data() ) ,
        0
        );

    // Print the log
    std::cerr << "Kernel failed to compile.\n"
              << lLog.c_str() << "." << std::endl;
}
if ( !CheckForError( lError ) )
    return false;

cl_kernel lKernel = clCreateKernel( lProgram , "Stereogram" , &lError );
if ( !CheckForError( lError ) )
    return false;

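Some parameters are defined as constants that are used for every execution of this kernel. Another strategy could be chosen, for example to allow the kMaxOffset parameter to be adjusted at run time: its value could be passed as an argument to the kernel function. In this application it is kept constant, so it is defined as a kernel compile-time constant.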
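
As a sketch of that alternative strategy, only kPatternWidth would remain a compile-time definition, since it sizes the kernel's private array, while the maximum offset would become an extra float kernel parameter set with clSetKernelArg. The helper below, which is hypothetical, builds the reduced option string for such a variant.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Build options for a hypothetical kernel variant in which the maximum
// offset is a run-time kernel argument instead of a -D definition.
inline std::string MakeBuildOptions( int pPatternWidth )
{
    assert( pPatternWidth > 0 );

    std::ostringstream lParam;
    lParam << "-D kPatternWidth=" << pPatternWidth;
    return lParam.str();
}
```

The trade-off is that the offset can then be tuned every frame without rebuilding the program, at the cost of one more argument to keep in sync.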

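The only thing left before the kernel is ready to run is to bind the kernel function arguments, namely the input and output image buffers: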

// Now that we initialized the OpenCL texture buffer, we can set
// them as kernel parameters, they won't change, the kernel will
// always be executed on those buffers.
lError = clSetKernelArg( mKernel , 0 , sizeof( mOffsetImage ) , &mOffsetImage );
if ( !CheckForError( lError ) )
    return false;

lError = clSetKernelArg( mKernel , 1 , sizeof( mDepthImage ) , &mDepthImage );
if ( !CheckForError( lError ) )
    return false;

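These arguments can be set once and for all, because they never change. The kernel always runs on the same buffers, so the arguments can be set at initialization time rather than in the main loop.

Running the kernel in the main loop takes three simple steps:

Synchronize the OpenGL textures to make sure OpenGL is done writing to them before OpenCL uses them
Run the OpenCL kernel
Synchronize the OpenGL textures to make sure OpenCL is done writing to them before they are handed back to OpenGL

The following code shows how to perform these tasks: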

cl_mem lObjects[] = { mDepthImage , mOffsetImage };
cl_int lError = 0;

// We must make sure that OpenGL is done with the textures, so
// we ask to sync.
glFinish();
const int lNbObjects = sizeof( lObjects ) / sizeof( lObjects[0] );
lError = clEnqueueAcquireGLObjects(
    mCommandQueue , lNbObjects , lObjects , 0 , NULL , NULL
    );
CheckForError( lError );

// Perform computations.
lError = clEnqueueNDRangeKernel(
    mCommandQueue ,
    mKernel ,
    1 ,
    NULL ,
    &mSize ,
    &mWorkgroupSize ,
    0 ,
    NULL ,
    NULL
    );
CheckForError( lError );

// Before returning the objects to OpenGL, we sync to make sure OpenCL is done.
lError = clEnqueueReleaseGLObjects(
    mCommandQueue , lNbObjects , lObjects , 0 , NULL , NULL
    );
CheckForError( lError );
lError = clFinish( mCommandQueue );
CheckForError( lError );

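The offsets are thus computed on the GPU without transferring data from the GPU to the CPU and back again.

This concludes the GPU implementation of the algorithm. It shows that by combining OpenGL and OpenCL, expensive round trips between CPU memory and GPU memory can be avoided while retaining enough flexibility to implement non-trivial algorithms.

The Code

The code provided with this article implements the concepts presented here. It was not designed to be particularly reusable. It is meant to be as simple as possible, to stay as close as possible to the OpenGL and OpenCL API calls, and to have as few dependencies as possible, so as to clearly illustrate the point of the article. In fact, this demo application was first developed within a personal framework and then stripped down to the current minimal application.

The demo runs successfully on Intel and NVidia hardware. It was not tested on AMD hardware, but it should run as is or, at worst, with minor modifications. It runs on Windows Vista and 7 (compiled with Microsoft Visual Studio), Ubuntu Linux (compiled with GCC) and OS X Mountain Lion (compiled with GCC).

The application supports three modes, which can be cycled through by pressing the space bar. The first mode is a regular rendering of the scene with very basic lighting. The second is the CPU implementation of stereogram generation. The third is the GPU implementation of stereogram generation.

On Intel HD Graphics 4000 hardware, the first mode (regular rendering) runs at about 1180 frames per second, the second mode (CPU implementation) at about 11 frames per second, and the third mode (GPU implementation) at about 260 frames per second. Frames per second are not necessarily a precise performance metric, but they still give a feel for the results. Clearly, avoiding the round trip from the GPU to the CPU and exploiting the parallel computing power of the GPU yields a significant performance gain.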
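
Frame rates are easier to compare once converted to time per frame. The sketch below does that conversion and estimates the cost of the stereogram stage alone, under the simplifying assumption that frame times add up and that the first mode is a fair baseline for scene rendering. Both helpers are hypothetical.

```cpp
#include <cassert>

// Converts a frame rate into milliseconds per frame.
inline double FrameTimeMs( double pFps )
{
    assert( pFps > 0.0 );
    return 1000.0 / pFps;
}

// Rough per-frame cost of an extra stage, assuming frame times are additive.
inline double StageCostMs( double pFpsWithStage , double pFpsBaseline )
{
    return FrameTimeMs( pFpsWithStage ) - FrameTimeMs( pFpsBaseline );
}
```

With the figures above, the stereogram stage costs roughly 90 ms per frame on the CPU path and roughly 3 ms on the GPU path, a ratio of about 30 under these assumptions.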

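Conclusion

The stereogram generation algorithm presented in this article is a great opportunity to demonstrate the power of having GPGPU interact with the rendering pipeline. It was shown that a part of the algorithm that would be either impossible or very inefficient to implement with the programmable rendering pipeline alone (GLSL shaders) can be implemented fairly easily by using OpenCL to access OpenGL textures and process them in ways that GLSL does not handle well.

By providing flexible means to implement more complex algorithms directly on the GPU, the interaction between the rendering pipeline (OpenGL) and GPGPU APIs (OpenCL) offers elegant and efficient GPU data-processing solutions for interesting (that is, hard) problems. It gives developers tools to tackle such problems with fewer of the programming tricks that implementing algorithms on the GPU usually requires, and even opens the door to more possibilities than those offered by the regular programmable rendering pipeline.

That said, this implementation could still be improved considerably. OpenCL is not a magic wand that "automatically" delivers portable performance, and optimizing an OpenCL implementation can be a daunting task in itself. It would therefore be interesting to study this demo application further to see how its naive implementation could be improved to reach higher performance. OpenGL compute shaders are also an interesting avenue to explore for solving similar problems.

References

An interesting presentation on OpenGL / OpenCL interoperability: http://sa09.idav.ucdavis.edu/docs/SA09_GL_interop.pdf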