
Fast SIMD Prototyping

Shao Voon Wong


30 Oct 2016

Ms-PL


Prototype SIMD vectorized code with ease.

Download source code - 692 KB

Table of Contents

Introduction
Simple Example
Drawing a Circle
Circle Cover Effect
Hexagon Cover Effect
History
Reference

Introduction

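Converting computational code to SIMD with intrinsics often takes more time and effort than writing the scalar equivalent, and the result is harder to read and maintain. We must also keep a scalar version so that the application can fall back on it when running on systems without SIMD support. Often enough, we then find that the vectorized code actually performs worse. Using a highly optimized SIMD library is one thing; hand-writing your own library with SIMD is another. In this article, we look at the SIMD vector classes that ship with Visual C++, together with their overloaded operators (+, -, *, /), to simplify our SIMD code and reduce the effort of writing readable SIMD. These classes actually come from the Intel C++ library. Would you rather read and write z = x + y; or z = _mm_add_ps(x, y);? The answer is obvious. And if our hand-written vectorized code turns out to run slower, we can throw it away without much regret.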

Simple Example

I will use a simple example to show how the SSE (not SSE2) float vector class F32vec4 adds four pairs of floats. F32vec4 is defined in the fvec.h header.

F32vec4 vec1(1.0f, 2.0f, 3.0f, 4.0f);
F32vec4 vec2(5.0f, 6.0f, 7.0f, 8.0f);

F32vec4 result = vec1 + vec2;

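The same code written with plain SSE intrinsics, without the F32vec4 class, is shown below. As you can see, with intrinsics we either have to memorize their names or look them up in the documentation whenever we need them. For those who wish to delay the onset of dementia or Alzheimer's disease, memorizing these hard-to-remember intrinsic names is worth a try.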

__m128 vec1 = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 vec2 = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);

__m128 result = _mm_add_ps(vec1, vec2);
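To make the connection between the two snippets concrete, here is a minimal sketch of how a vector class can wrap the intrinsic in an overloaded operator. Vec4f and add_example are hypothetical names made up for illustration; this is not the real F32vec4 from fvec.h, only an assumption about the general shape of such a class.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_set_ps, _mm_add_ps, ...

// Hypothetical stand-in for a vector class such as F32vec4 (not the real one).
struct Vec4f
{
    __m128 v;

    explicit Vec4f(__m128 m) : v(m) {}

    // Same argument order as _mm_set_ps: element 3 first, element 0 last.
    Vec4f(float e3, float e2, float e1, float e0) : v(_mm_set_ps(e3, e2, e1, e0)) {}

    // Copies the four floats to memory, element 0 first.
    void store(float* out) const { _mm_storeu_ps(out, v); }
};

// The overloaded operator simply forwards to the intrinsic.
inline Vec4f operator+(const Vec4f& a, const Vec4f& b)
{
    return Vec4f(_mm_add_ps(a.v, b.v));
}

// Adds the two vectors from the example above and writes the sums to out[0..3].
inline void add_example(float* out)
{
    Vec4f vec1(1.0f, 2.0f, 3.0f, 4.0f);
    Vec4f vec2(5.0f, 6.0f, 7.0f, 8.0f);
    Vec4f result = vec1 + vec2;
    result.store(out);
}
```

Because the operator is a thin inline wrapper, an optimizing build can reduce vec1 + vec2 to the same instruction as the hand-written _mm_add_ps call.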

Below is a table of the vector classes and their header files, taken from the e-book Optimizing software in C++.

Vector class	Size of each element, bits	Number of elements	Type of elements	Total size of vector, bits	Header file
`Is8vec8`	8	8	`char`	64	ivec.h
`Iu8vec8`	8	8	`unsigned char`	64	ivec.h
`Is16vec4`	16	4	`short int`	64	ivec.h
`Iu16vec4`	16	4	`unsigned short int`	64	ivec.h
`Is32vec2`	32	2	`int`	64	ivec.h
`Iu32vec2`	32	2	`unsigned int`	64	ivec.h
`I64vec1`	64	1	`__int64`	64	ivec.h
`Is8vec16`	8	16	`char`	128	dvec.h
`Iu8vec16`	8	16	`unsigned char`	128	dvec.h
`Is16vec8`	16	8	`short int`	128	dvec.h
`Iu16vec8`	16	8	`unsigned short int`	128	dvec.h
`Is32vec4`	32	4	`int`	128	dvec.h
`Iu32vec4`	32	4	`unsigned int`	128	dvec.h
`I64vec2`	64	2	`__int64`	128	dvec.h
`F32vec4`	32	4	`float`	128	fvec.h
`F64vec2`	64	2	`double`	128	dvec.h
`F32vec8`	32	8	`float`	256	dvec.h
`F64vec4`	64	4	`double`	256	dvec.h

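Optimizing software in C++: It is not recommended to use the 64-bit vectors in ivec.h because they are incompatible with floating point code. If you do use the 64-bit vectors, you have to execute _mm_empty() after the 64-bit vector operations and before any floating point code. The 128-bit vectors do not have this problem. The 256-bit vectors require that the AVX instruction set is supported by the microprocessor and the operating system.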

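Most of the integer vector classes do not support vector multiplication, and none of them supports vector division, whereas the float and double vector classes support vector addition, subtraction, multiplication and division. Below is an example of implementing your own integer (32-bit) division operator with scalar division. Using scalar operations is, of course, slower.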

static inline Is32vec4 operator / (Is32vec4 const & a, Is32vec4 const & b) 
{
    Is32vec4 ans;
    ans[0] = a[0] / b[0];
    ans[1] = a[1] / b[1];
    ans[2] = a[2] / b[2];
    ans[3] = a[3] / b[3];
    return ans;
}
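For readers who work with raw __m128i values rather than Is32vec4, the same idea can be sketched with plain SSE2 intrinsics: spill both vectors to memory, divide lane by lane, and reload. The function name div_epi32_scalar is hypothetical, and the sketch assumes that no divisor lane is zero.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cstdint>

// Divides the four signed 32-bit lanes of a by the matching lanes of b.
// Assumption: every lane of b is non-zero (and no lane is INT32_MIN / -1).
inline __m128i div_epi32_scalar(__m128i a, __m128i b)
{
    alignas(16) std::int32_t x[4];
    alignas(16) std::int32_t y[4];

    // Spill both vectors to aligned scratch arrays.
    _mm_store_si128(reinterpret_cast<__m128i*>(x), a);
    _mm_store_si128(reinterpret_cast<__m128i*>(y), b);

    // Scalar division, one lane at a time.
    for (int i = 0; i < 4; ++i)
        x[i] /= y[i];

    // Reload the quotients into a vector.
    return _mm_load_si128(reinterpret_cast<const __m128i*>(x));
}
```

The round trip through memory is what makes this slower than a true vector operation, which matches the caveat above.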

Drawing a Circle

We will use the F32vec4 class (a 128-bit float vector) to draw a circle. I will first show the plain scalar C++ code that does it.

Draw a green circle

Pixel within the circle

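The code uses the Pythagorean theorem to decide whether a pixel falls inside the circle. Taking X and Y as the pixel's offsets from the circle center, we compute the hypotenuse as sqrt(X²+Y²). The hypotenuse is the distance from the pixel to the center. If it is less than or equal to the radius, the pixel is inside the circle and should be shown in green.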

// Draw Circle without optimization
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    if( !pixelsCanvas )
        return;

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 100.0f; // x coordinate of the circle center
    float fPointY = 100.0f; // y coordinate of the circle center
    float fRadius = 40.0f; // radius of the circle

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff00ff00;
    // row is declared before the loop so that the for-initializer assigns the
    // float fy above instead of declaring a second, UINT fy
    UINT row = 0;
    for(row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=1, fx+=1.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // Subtract center X from the pixel X coordinates
            float X = fx - fPointX;
            // Subtract center Y from the pixel Y coordinates
            float Y = fy - fPointY;

            // compute the square of X, that is X * X = X to power of 2
            X = X * X;
            // compute the square of Y, that is Y * Y = Y to power of 2
            Y = Y * Y;

            // Add up the X square and Y square
            float hypSq = X + Y;

            // Find the hypotenuse by computing square root
            float hyp = std::sqrt(hypSq);

            UINT origPixel = pixelsCanvas[index];
            if(hyp<=fRadius)
            {
                pixelsCanvas[index] = color;
            }
            else
            {
                pixelsCanvas[index] = origPixel;
            }
        }
    }

    graphics.DrawImage(m_pSrcBitmap,0,0,
      m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

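I must admit that this is an inefficient way to draw a circle. A better and faster way is to draw the outline of the circle with a stroke and then flood-fill it with a solid color. The edge of the circle is not anti-aliased either. However, our final demos need the circle to be drawn this way.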

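Below is the version written with intrinsics (not yet using F32vec4). I will not attempt to explain this intrinsic version, but I will explain the F32vec4 version; the two are similar.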

// Draw Circle with SIMD intrinsic
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 100.0f; // x coordinates of the circle center
    float fPointY = 100.0f; // y coordinates of the circle center
    float fRadius = 40.0f; // radius of the circle

    // vector of 4 x coordinates of the circle center
    __m128 vecPointX = _mm_set_ps1(fPointX);
    // vector of 4 y coordinates of the circle center
    __m128 vecPointY = _mm_set_ps1(fPointY);
    // vector of 4 copies of the circle radius
    __m128 vecRadius = _mm_set_ps1(fRadius);

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff00ff00;
    // row is declared before the loop so that the for-initializer assigns the
    // float fy above instead of declaring a second, UINT fy
    UINT row = 0;
    for(row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // vector of X coordinates of the 4 pixels; _mm_set_ps takes element 3 first and element 0 last
            __m128 vecX= _mm_set_ps(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
            // vector of Y coordinates of the 4 pixels
            __m128 vecY = _mm_set_ps1(fy);

            // Subtract center X from the pixel X coordinates
            vecX = _mm_sub_ps(vecX, vecPointX);

            // Subtract center Y from the pixel Y coordinates
            vecY = _mm_sub_ps(vecY, vecPointY);

            // compute the square of X, that is X * X = X to power of 2
            vecX = _mm_mul_ps(vecX, vecX);
            // compute the square of Y, that is Y * Y = Y to power of 2
            vecY = _mm_mul_ps(vecY, vecY);

            // Add up the X square and Y square
            __m128 vecHypSq = _mm_add_ps(vecX, vecY);

            // Find the hypotenuse by computing square root
            __m128 vecHyp = _mm_sqrt_ps(vecHypSq);

            // Generate the mask for condition of hypotenuse <= radius
            __m128 mask = _mm_cmple_ps(vecHyp, vecRadius);

            // all 4 pixels in the mask vector fall within the width
            if(col+3<bitmapDataCanvas.Width)
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                origPixel = pixelsCanvas[index+1];
                pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

                origPixel = pixelsCanvas[index+2];
                pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

                origPixel = pixelsCanvas[index+3];
                pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
            }
            else // not all 4 pixels in the mask vector fall within the width: have to test 1 by 1.
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                if(col+1<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+1];

                    pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
                }

                if(col+2<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+2];

                    pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
                }

                if(col+3<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+3];

                    pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
                }
            }
        }
    }

    graphics.DrawImage(m_pSrcBitmap,0,0, 
             m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

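Below is the code written with the F32vec4 class that ships with Microsoft Visual Studio. In fact, I wrote the vector class version first, and then the scalar and SIMD intrinsic versions. Converting the vector class version to the intrinsic version is easy: I just look up the intrinsic names in the overloaded operators of the vector class and substitute them accordingly.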

// Draw Circle with SIMD class
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 100.0f; // x coordinates of the circle center
    float fPointY = 100.0f; // y coordinates of the circle center
    float fRadius = 40.0f; // radius of the circle

    // vector of 4 x coordinates of the circle center
    F32vec4 vecPointX(fPointX);
    // vector of 4 y coordinates of the circle center
    F32vec4 vecPointY(fPointY);
    // vector of 4 copies of the circle radius
    F32vec4 vecRadius(fRadius);

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff00ff00;
    // row is declared before the loop so that the for-initializer assigns the
    // float fy above instead of declaring a second, UINT fy
    UINT row = 0;
    for(row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // vector of X coordinates of the 4 pixels; the constructor takes element 3 first and element 0 last
            F32vec4 vecX(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
            // vector of Y coordinates of the 4 pixels
            F32vec4 vecY((float)(fy));

            // Subtract center X from the pixel X coordinates
            vecX -= vecPointX;
            // Subtract center Y from the pixel Y coordinates
            vecY -= vecPointY;

            // compute the square of X, that is X * X = X to power of 2
            vecX = vecX * vecX;
            // compute the square of Y, that is Y * Y = Y to power of 2
            vecY = vecY * vecY;

            // Add up the X square and Y square
            F32vec4 vecHypSq = vecX + vecY;

            // Find the hypotenuse by computing square root
            F32vec4 vecHyp = sqrt(vecHypSq);

            // Generate the mask for condition of hypotenuse <= radius
            F32vec4 mask = cmple(vecHyp, vecRadius);

            // all 4 pixels in the mask vector fall within the width
            if(col+3<bitmapDataCanvas.Width)
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                origPixel = pixelsCanvas[index+1];
                pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

                origPixel = pixelsCanvas[index+2];
                pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

                origPixel = pixelsCanvas[index+3];
                pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
            }
            else // not all 4 pixels in the mask vector fall within the width: have to test 1 by 1.
            {
                UINT origPixel = pixelsCanvas[index+0];
                pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                if(col+1<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+1];

                    pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
                    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
                }

                if(col+2<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+2];

                    pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
                    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
                }

                if(col+3<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+3];

                    pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
                    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
                }
            }
        }
    }

    graphics.DrawImage(m_pSrcBitmap,0,0,m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

The benchmark results are as follows.

Intel i7 CPU (quad core)

Debug Mode

Scalar : 12ms
SSE Intrinsic : 5ms
SSE vector class : 38ms.

Release Mode

Scalar : 2ms
SSE Intrinsic : 1ms
SSE vector class : 1ms.

Core 2 Duo CPU (dual core)

Debug Mode

Scalar : 18ms
SSE Intrinsic : 5ms
SSE vector class : 57ms.

Release Mode

Scalar : 4ms
SSE Intrinsic : 1ms
SSE vector class : 1ms.

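As readers can see, the SSE vector class performs very poorly in debug mode, which is unacceptable for our demo applications that animate the expanding circle every 30 milliseconds. The likely reason is that the compiler does not inline the inline constructors and methods in debug builds. I work around this by always running the program in release mode. After all, this is a simple graphics program that does not need a debugger: I can figure out what is wrong just by looking at the output.

Now I will explain the preceding code snippet.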

float fPointX = 100.0f; // x coordinates of the circle center
float fPointY = 100.0f; // y coordinates of the circle center
float fRadius = 40.0f; // radius of the circle

// vector of 4 x coordinates of the circle center
F32vec4 vecPointX(fPointX);
// vector of 4 y coordinates of the circle center
F32vec4 vecPointY(fPointY);
// vector of 4 copies of the circle radius
F32vec4 vecRadius(fRadius);

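This code simply initializes two vectors that hold the X and Y coordinates of the circle center. The third vector holds the radius of the circle.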

float fy=0.0f;
float fx=0.0f;

UINT color = 0xff00ff00;
// row is declared before the loop so that the for-initializer assigns the
// float fy above instead of declaring a second, UINT fy
UINT row = 0;
for(row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
{
    for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
    {
        // calculate the index of destination pixel array
        int index = row * stride + col;

        // vector of X coordinates of the 4 pixels; the constructor takes element 3 first and element 0 last
        F32vec4 vecX(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
        // vector of Y coordinates of the 4 pixels
        F32vec4 vecY((float)(fy));

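This is a nested for loop. The outer loop iterates over the rows, and the inner loop steps through the columns 4 at a time. vecX and vecY hold the column and row coordinates of the 4 pixels. You may have noticed that the float arguments of the vecX constructor are in reverse order. That is because the constructor, like _mm_set_ps, takes its arguments from the highest element down to the lowest, so the last argument ends up in element 0, which sits at the lowest address on little-endian Intel processors. If the arguments were not reversed, the result would look like this.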

Circle with columns of 4 pixels inverted
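The argument order can be checked in isolation. The sketch below uses plain intrinsics rather than F32vec4 so that it also builds outside Visual C++; the helper names pixel_x_set and pixel_x_setr are made up for illustration. _mm_setr_ps is the companion intrinsic that takes its arguments in memory order.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Writes the X coordinates of four consecutive pixels starting at fx to
// out[0..3], built with _mm_set_ps (arguments run from element 3 to element 0).
inline void pixel_x_set(float fx, float* out)
{
    __m128 v = _mm_set_ps(fx + 3.0f, fx + 2.0f, fx + 1.0f, fx + 0.0f);
    _mm_storeu_ps(out, v);
}

// The same vector built with _mm_setr_ps ("reversed"), which takes element 0 first.
inline void pixel_x_setr(float fx, float* out)
{
    __m128 v = _mm_setr_ps(fx + 0.0f, fx + 1.0f, fx + 2.0f, fx + 3.0f);
    _mm_storeu_ps(out, v);
}
```

Both helpers leave fx in out[0] and fx + 3 in out[3], so either spelling keeps pixel i of the group in element i of the vector.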

// Subtract center X from the pixel X coordinates
vecX -= vecPointX;
// Subtract center Y from the pixel Y coordinates
vecY -= vecPointY;

// compute the square of X, that is X * X = X to power of 2
vecX = vecX * vecX;
// compute the square of Y, that is Y * Y = Y to power of 2
vecY = vecY * vecY;

// Add up the X square and Y square
F32vec4 vecHypSq = vecX + vecY;

// Find the hypotenuse by computing square root
F32vec4 vecHyp = sqrt(vecHypSq);

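Next we subtract the circle center from the X and Y coordinates of the pixels to get their positions relative to the center. The X or Y results may be negative, but there is no need to take their absolute values: the multiplication that follows computes the squares (X², Y²), and the product of two negative numbers is positive. The rest of the code computes the hypotenuse.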

// Generate the mask for condition of hypotenuse <= radius
F32vec4 mask = cmple(vecHyp, vecRadius);

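This statement generates a mask for the condition that the hypotenuse is less than or equal to the radius of the circle. For each of the 4 elements, the mask is all ones (0xffffffff) when the condition is true and all zeros otherwise.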
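As a rough way to see the mask at work, the sketch below repeats the distance computation with plain intrinsics and condenses the mask into four bits with _mm_movemask_ps. The function name inside_circle_bits is hypothetical and not part of the article's source code.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Returns a 4-bit value: bit i is set when pixel (fx + i, fy) lies within
// `radius` of the center (cx, cy).
inline int inside_circle_bits(float fx, float fy, float cx, float cy, float radius)
{
    // Offsets of the 4 pixels from the circle center.
    __m128 x = _mm_sub_ps(_mm_set_ps(fx + 3.0f, fx + 2.0f, fx + 1.0f, fx), _mm_set1_ps(cx));
    __m128 y = _mm_sub_ps(_mm_set1_ps(fy), _mm_set1_ps(cy));

    // Hypotenuse = sqrt(X*X + Y*Y) for all 4 pixels at once.
    __m128 hyp = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)));

    // Each lane becomes 0xffffffff where hyp <= radius, and 0 otherwise.
    __m128 mask = _mm_cmple_ps(hyp, _mm_set1_ps(radius));

    // _mm_movemask_ps collects the top bit of each lane into bits 0..3.
    return _mm_movemask_ps(mask);
}
```

With the center at (100, 100) and a radius of 40, the group of pixels starting at x = 58 on row 100 has distances 42, 41, 40 and 39, so only the last two lanes are set.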

// all 4 pixels in the mask vector fall within the width
if(col+3<bitmapDataCanvas.Width)
{
    UINT origPixel = pixelsCanvas[index+0];
    pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
    pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

    origPixel = pixelsCanvas[index+1];
    pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

    origPixel = pixelsCanvas[index+2];
    pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

    origPixel = pixelsCanvas[index+3];
    pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
}
else // not all 4 pixels in the mask vector fall within the width: have to test 1 by 1.
{
    UINT origPixel = pixelsCanvas[index+0];
    pixelsCanvas[index+0] = color & (__m128(mask)).m128_u32[0];
    pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

    if(col+1<bitmapDataCanvas.Width)
    {
        origPixel = pixelsCanvas[index+1];

        pixelsCanvas[index+1] = color & (__m128(mask)).m128_u32[1];
        pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
    }

    if(col+2<bitmapDataCanvas.Width)
    {
        origPixel = pixelsCanvas[index+2];

        pixelsCanvas[index+2] = color & (__m128(mask)).m128_u32[2];
        pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
    }

    if(col+3<bitmapDataCanvas.Width)
    {
        origPixel = pixelsCanvas[index+3];

        pixelsCanvas[index+3] = color & (__m128(mask)).m128_u32[3];
        pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
    }
}

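The first if condition tests whether column + 3 is less than the image width. If so, each pixelsCanvas[index+i] is assigned according to its mask: the new color where the mask is 0xffffffff, and the original color where the mask is all zeros. The else block executes when column + 3 is greater than or equal to the image width; there we must test that each pixel is within bounds before assigning its color to pixelsCanvas[index+i].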
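When all 4 pixels are within the width, the per-pixel select could also be done on the whole vector at once. The following is only a sketch under that assumption, using SSE2 integer intrinsics; blend4 is a hypothetical helper, and a float mask from the comparison would first be reinterpreted with _mm_castps_si128.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cstdint>

// Blends four 32-bit pixels at once: where a mask lane is all ones the result
// is `color`, where it is all zeros the original pixel is kept.
// Assumption: pixels[0..3] are all valid, i.e. within the image width.
inline void blend4(std::uint32_t* pixels, std::uint32_t color, __m128i mask)
{
    __m128i orig = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pixels));
    __m128i col = _mm_set1_epi32(static_cast<int>(color));

    // (color & mask) | (~mask & orig); _mm_andnot_si128(a, b) computes (~a) & b.
    __m128i out = _mm_or_si128(_mm_and_si128(col, mask), _mm_andnot_si128(mask, orig));

    _mm_storeu_si128(reinterpret_cast<__m128i*>(pixels), out);
}
```

This is the same color & mask, origPixel & ~mask arithmetic as in the snippet above, applied to four lanes in one step instead of one pixel at a time.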

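Out of curiosity, I modified the previous snippet to draw a circle with a gradient. You will notice that the gradient "pierces" into the center. That is fine, because our final demos do not involve gradients.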

Draw a circle with gradient

// Draw Circle with gradient
void CScratchPadDlg::OnPaint()
{
    // ... non-relevant GDI+ code is not shown here

    UINT col = 0;
    int stride = bitmapDataCanvas.Stride >> 2;

    float fPointX = 50.0f; // x coordinate of the circle center
    float fPointY = 50.0f; // y coordinate of the circle center
    float fRadius = 40.0f; // radius of the circle

    // vector of 4 x coordinates of the circle center
    F32vec4 vecPointX(fPointX, fPointX, fPointX, fPointX);
    // vector of 4 y coordinates of the circle center
    F32vec4 vecPointY(fPointY, fPointY, fPointY, fPointY);
    // vector of 4 copies of the circle radius
    F32vec4 vecRadius(fRadius, fRadius, fRadius, fRadius);

    float fy=0.0f;
    float fx=0.0f;

    UINT color = 0xff0000ff;
    UINT color2 = 0xff00ffff;

    UINT* arrGrad = GenGrad(color, color2, 40);

    // row is declared before the loop so that the for-initializer assigns the
    // float fy above instead of declaring a second, UINT fy
    UINT row = 0;
    for(row = 0, fy=0.0f; row < bitmapDataCanvas.Height; row+=1, fy+=1.0f)
    {
        for(col = 0, fx=0.0f; col < bitmapDataCanvas.Width; col+=4, fx+=4.0f)
        {
            // calculate the index of destination pixel array
            int index = row * stride + col;

            // vector of X coordinates of the 4 pixels; the constructor takes element 3 first and element 0 last
            F32vec4 vecX(fx+3.0f, fx+2.0f, fx+1.0f, fx+0.0f);
            // vector of Y coordinates of the 4 pixels
            F32vec4 vecY(fy, fy, fy, fy);

            // Subtract center X from the pixel X coordinates
            vecX -= vecPointX;
            // Subtract center Y from the pixel Y coordinates
            vecY -= vecPointY;

            // compute the square of X, that is X * X = X to power of 2
            vecX = vecX * vecX;
            // compute the square of Y, that is Y * Y = Y to power of 2
            vecY = vecY * vecY;

            // Add up the X square and Y square
            F32vec4 vecHypSq = vecX + vecY;

            // Find the hypotenuse by computing square root
            F32vec4 vecHyp = sqrt(vecHypSq);

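            // Generate the mask for condition of hypotenuse <= radius
            F32vec4 mask = cmple(vecHyp, vecRadius);

            // Clamp the distance to the radius so that the arrGrad lookups below
            // stay within its 41 entries, even for pixels outside the circle
            vecHyp = simd_min(vecHyp, vecRadius);

            // all 4 pixels in the mask vector fall within the width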
            if(col+3<bitmapDataCanvas.Width)
            {
                UINT origPixel = pixelsCanvas[index+0];

                pixelsCanvas[index+0] = 
                  arrGrad[ (int)((__m128(vecHyp)).m128_f32[0]) ] & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                origPixel = pixelsCanvas[index+1];
                pixelsCanvas[index+1] = 
                  arrGrad[ (int)((__m128(vecHyp)).m128_f32[1]) ] & (__m128(mask)).m128_u32[1];
                pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);

                origPixel = pixelsCanvas[index+2];
                pixelsCanvas[index+2] = 
                  arrGrad[ (int)((__m128(vecHyp)).m128_f32[2]) ] & (__m128(mask)).m128_u32[2];
                pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);

                origPixel = pixelsCanvas[index+3];
                pixelsCanvas[index+3] = 
                  arrGrad[ (int)((__m128(vecHyp)).m128_f32[3]) ] & (__m128(mask)).m128_u32[3];
                pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
            }
            else // not all 4 pixels in the mask vector fall within the width: have to test 1 by 1.
            {
                UINT origPixel = pixelsCanvas[index+0];

                pixelsCanvas[index+0] = 
                  arrGrad[ (int)((__m128(vecHyp)).m128_f32[0]) ] & (__m128(mask)).m128_u32[0];
                pixelsCanvas[index+0] |= origPixel & ~((__m128(mask)).m128_u32[0]);

                if(col+1<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+1];

                    pixelsCanvas[index+1] = 
                       arrGrad[ (int)((__m128(vecHyp)).m128_f32[1]) ] & (__m128(mask)).m128_u32[1];
                    pixelsCanvas[index+1] |= origPixel & ~((__m128(mask)).m128_u32[1]);
                }

                if(col+2<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+2];

                    pixelsCanvas[index+2] = 
                       arrGrad[ (int)((__m128(vecHyp)).m128_f32[2]) ] & (__m128(mask)).m128_u32[2];
                    pixelsCanvas[index+2] |= origPixel & ~((__m128(mask)).m128_u32[2]);
                }

                if(col+3<bitmapDataCanvas.Width)
                {
                    origPixel = pixelsCanvas[index+3];

                    pixelsCanvas[index+3] = 
                       arrGrad[ (int)((__m128(vecHyp)).m128_f32[3]) ] & (__m128(mask)).m128_u32[3];
                    pixelsCanvas[index+3] |= origPixel & ~((__m128(mask)).m128_u32[3]);
                }
            }
        }
    }

    delete [] arrGrad;

    graphics.DrawImage(m_pSrcBitmap,0,0,m_pSrcBitmap->GetWidth(),m_pSrcBitmap->GetHeight());
}

UINT* CDrawCircleDlg::GenGrad(UINT a, UINT b, UINT len)
{
    int r1=(a&0xff0000)>>16,g1=(a&0xff00)>>8,b1=(a&0xff); // start color
    int r2=(b&0xff0000)>>16,g2=(b&0xff00)>>8,b2=(b&0xff); // end color

    UINT* arr = new UINT[len+1];

    // i and n are signed so that a negative channel difference interpolates correctly
    const int n = (int)len;
    for(int i=0;i<n+1;i++)
    { 
        int r,g,b;
        r = r1 + (i * (r2-r1) / n);
        g = g1 + (i * (g2-g1) / n);
        b = b1 + (i * (b2-b1) / n);
        
        arr[i] = 0xff000000 | r << 16 | g << 8 | b;
    }

    return arr;
}

Circle Cover Effect

Circle Cover Effect

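In this section, we use the circle-drawing code developed earlier to implement a circle cover effect, in which a circle keeps expanding and reveals the image underneath. How does the code know when to stop expanding, that is, when the whole image has been covered or replaced? Look at the figure below. When you click in the client area, the distances from the click point to the 4 corners are computed, and the farthest distance (the red line) is chosen as the maximum radius of expansion.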

Distance from 4 corners
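The corner test described above can be sketched as follows. The function name max_corner_distance and its parameters are assumptions made for illustration, with the client area taken to start at (0, 0); the demo's actual source may organise this differently.

```cpp
#include <algorithm>
#include <cmath>

// Returns the distance from the click point (px, py) to the farthest corner
// of a width x height client area whose top-left corner is (0, 0).
inline float max_corner_distance(float px, float py, float width, float height)
{
    const float d0 = std::hypot(px, py);                   // top-left corner
    const float d1 = std::hypot(width - px, py);           // top-right corner
    const float d2 = std::hypot(px, height - py);          // bottom-left corner
    const float d3 = std::hypot(width - px, height - py);  // bottom-right corner
    return std::max(std::max(d0, d1), std::max(d2, d3));
}
```

Once the expanding radius reaches this value, every pixel of the client area lies inside the circle, so the animation can stop.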

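Tip: You can replace flickr1.jpg and flickr2.jpg in the res folder with your own pictures (just rename them to flickr1.jpg and flickr2.jpg; they must have the same dimensions). On startup, the CircleCover application automatically resizes its client area to the dimensions of the first embedded image.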

Hexagon Cover Effect

Hexagon Cover Effect

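The hexagon cover effect is similar to the circle cover application. The only difference is that the expanding edge is replaced by small hexagons. As with the CircleCover application, you can replace the pictures before compiling.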

Hexagon center within circle radius?

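This is how it works. The distance from each hexagon center to the mouse point is calculated. If the distance is greater than the hexagon, the hexagon is displayed. If the distance is smaller, a precomputed mask (which stores the distance of each pixel to the hexagon center) is used to determine whether the pixel underneath should be displayed. I did not convert this application to SSE simply because the speedup would be small: the distance between each hexagon center and the mouse point is not computed very often, and the distance of each pixel to the hexagon center is precomputed as a mask. I leave converting this code to SSE as an exercise for the reader. The source code is commented in detail.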

Conclusion

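We have seen how the SIMD vector classes provided with Visual Studio can improve both code clarity and speed. However, we have also seen that the speed penalty in debug mode is substantial (roughly 40 times slower than in release mode). The SIMD vector classes only offer simple operations such as addition, subtraction, multiplication, division and square root. There are no trigonometric functions such as sine and cosine, which is why I restricted the effects to ones that use the Pythagorean theorem. The source code is hosted at CodePlex.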

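History

5 Mar 2012: Added how to write an integer division operator using scalar division.
28 Feb 2012: Updated the article with a table of the vector classes and their header files.
19 Feb 2012: First release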

Reference

Optimizing software in C++ by Agner Fog