2D 向量类包装器 SSE 优化数学运算

Chesnokov Yuriy

4.93/5 (25投票s)

2007年10月17日

GPL3

6分钟阅读

83137

1193

本文介绍了一个 2D 向量包装器，该包装器经过 SSE 指令优化，可用于进行浮点精度的数学运算。

下载源代码 - 7.02 KB

引言

大家可能都知道，数组很糟糕。因此，我编写了一个漂亮的 1D/2D 向量包装器，支持矩阵运算，并针对 SSE 进行了优化以实现快速处理。它在涉及矩阵数学的应用程序中很有用，例如矩阵的卷积、加法、减法、乘法等，尤其是在处理浮点精度矩阵时。例如，我在神经网络中使用了它，以支持计算机视觉程序的向量机分类器。该类采用规范的编码风格编写，支持只读访问，并实现了 const 正确性。

背景

您应该熟悉矩阵及其背后的数学原理。可以快速了解一下 SOS math。同时，也可以查看 CodeProject 上提供的各种 SSE 编程文章。

使用代码

首先，请确保您已在 stdafx.h 中包含以下头文件：

#include <mm3dnow.h>
#include <float.h>
#include <time.h>
#include <math.h>
#include <windows.h>
#include <stdio.h>

向量包装器由 vec2D 类呈现。您可以拥有一个 N x M（N 行 M 列）的 2D 向量，或者一个简单的 1D 向量——无论是行向量 1xM 还是列向量 Nx1。

vec2D(const wchar_t* file);
vec2D(unsigned int ysize = 1, unsigned int xsize = 1, int yoffset = 0, int xoffset = 0, const float* data = 0);

使用第一个构造函数，您可以从磁盘上的文本格式文件中读取数组。

matrix.txt

2 3
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0

matrix.txt 文件提供了一个 2x3 的矩阵，所有元素都为零。

或者，您可以使用第二个构造函数来初始化向量。如果不提供任何参数，您将得到一个 1x1 的矩阵，即一个标量。ysize 表示行数，xsize 表示列数。使用 yoffset 和 xoffset，您可以选择负或正索引来访问矩阵中的第一个元素。因此，您可以构建卷积滤波器，例如用于平滑的高斯滤波器，或用于边缘检测的 Sobel、Prewitt 滤波器，其中您有一个 3x3 的矩阵，并且零索引元素正好位于矩阵的中心。您还可以提供一个外部缓冲区 data 来初始化矩阵；否则，它将用零填充。

vec2D v1(1, 10);    //is the 1x10 row vector
vec2D v2(10, 1);    //is the 10x1 column vector

//this one initializes gaus_filter matrix with read-only access
const float gblur[] = { 0.0625f,  0.125f,  0.0625f,
                        0.125f,   0.25f,   0.125f,
                        0.0625f,  0.125f,  0.0625f };
const vec2D gaus_filter(3, 3, -1, -1, gblur);
float val = gaus_filter(0, 0);    //val = 0.25f

使用以下函数，您可以访问矩阵行和列的第一个和最后一个可用索引：

inline int xfirst() const; //返回数组的第一个列索引
inline int yfirst() const; //返回数组的第一个行索引
inline int xlast() const; //返回数组的最后一个列索引
inline int ylast() const;//返回数组的最后一个行索引

您可以通过这种方式访问矩阵的所有元素：

//Let's print out our gaus_filter;
for (int y = gaus_filter.yfirst(); y <= gaus_filter.ylast(); y++) {
        for (int x = gaus_filter.xfirst(); x <= gaus_filter.xlast(); x++) 
                wprintf(L" %g", gaus_filter(y, x));        
        wprintf(L"\n");
}

或者您也可以通过以下方式实现相同的功能：

void print(const wchar_t* file = 0) const;

如果文件不是 NULL，则将矩阵保存到磁盘文件；否则，将其打印到标准输出。

要获取矩阵的大小，请使用以下函数：

inline unsigned int width() const; //数组宽度，列数
inline unsigned int height() const; //数组高度，行数
inline unsigned int length() const; //数组总大小，宽度*高度

为了访问矩阵的元素，存在四个重载的运算符，支持读写和只读访问：

inline float& operator()(int y, int x);
inline float operator()(int y, int x) const;
inline float& operator[](unsigned int i);
inline float operator[](unsigned int i) const;

使用 [] 运算符，您可以像 Matlab 一样访问所有元素。但是，矩阵的第一个行和列索引应从 0 开始。

//In Matlab we define the array like:
// v = zeros(3, 5);
//and we can print its contents with [] operator
// v[:]

//The same applies to [] overloaded operator
vec2D v(3, 5);   //we have xfirst() and yfirst() equals to 0 as ordinary C array
for (unsigned int i = 0; i < v.length(); i++)
        wprintf(L"%g\n", v[i]);

您还可以通过以下方式访问矩阵的元素：

inline float get(int y, int x) const;

这允许索引超出矩阵的维度。在这种情况下，您将获得周期性边界扩展。因此，如果您的第一个行和列索引都等于 0，并且尝试访问 vec(-1, -1)，您将获得 vec(1, 1) 元素。

通过重载的赋值运算符，您可以复制矩阵，或使用外部数据缓冲区进行初始化：

inline const vec2D& operator=(const vec2D& v);
inline const vec2D& operator=(const float* pf);

const float gblur[] = { 0.0625f,  0.125f,  0.0625f,
                        0.125f,   0.25f,   0.125f,
                        0.0625f,  0.125f,  0.0625f };
vec2D v1(3, 3);
v1 = gblur;
vec2D v2(3, 3);
vec2D v3(10, 5);
v2 = v1;
v3 = v2;    //v3 will be resized to 3 by 3 matrix

使用 set 元素函数，可以将矩阵内容初始化为特定值：

void set(float scalar); //将所有矩阵元素设置为标量
void set(float scalar, RECT& r); //将 RECT 中的矩阵元素设置为标量
void setrand(); //将矩阵元素设置为 -1.0 ... 1.0 范围内的随机值

RECT 只是一个 Windows 结构。

SSE 优化运算

以下函数处理 SSE 优化的数学运算：

void add(float scalar);
void sub(float scalar);
void mul(float scalar);
void div(float scalar);

您可以将矩阵的所有元素与标量进行加、减、乘或除。

vec2D v(3, 5);
v.print();
v.add(3.4f);
v.print();

矩阵算术运算的函数是：

void add(const vec2D& a, const vec2D& b); //c = a.+b sse 优化
void sub(const vec2D& a, const vec2D& b); //c = a.-b sse 优化
void mul(const vec2D& a, const vec2D& b); //c = a*b
void div(const vec2D& a, const vec2D& b); //c = a./b sse 优化

其中 c 是调用该函数的矩阵，其元素将是运算结果。add、sub 和 div 是逐元素运算，因此 a 和 b 矩阵以及 c 应该具有相同的大小。mul（乘法）运算不是 SSE 优化的，因为在乘法过程中，a 中的每一行都与 b 中的每一列相乘。

vec2D a(10, 10);
vec2D b = a;
vec2D c = a;
a.setrand();
b.setrand();
c.add(a, b);      //c = a + b
a.print();
b.print();
c.print();

vec2D a(3, 5);
vec2D b(5, 2);
vec2D c(3, 2);
a.setrand();
b.setrand();
c.mul(a, b);     //c = a * b

SSE 优化的矩阵乘法在此处实现：

void mult(const vec2D& a, const vec2D& b); //c = a*b' sse 优化
void mule(const vec2D& a, const vec2D& b); //c = a.*b sse 优化

mult 和 mul 的区别在于 b 矩阵被转置了，因此在乘法过程中，我们将 a 的每一行与 b 的每一行相乘，这很容易进行 SSE 优化。

//arrange transposed version of b for mult
vec2D a(3, 5);
vec2D b(2, 5);
vec2D c(3, 2);
a.setrand();
b.setrand();
c.mult(a, b);     //c = a * b';   SSE optimized
a.print();
b.print();
c.print();

而 mule 只是对大小相同的 a 和 b 矩阵进行逐元素乘法。

2D 卷积、归一化和直方图均衡化

另一个在信号处理中有用的函数是与滤波器进行卷积：

void conv2D(const vec2D& a, const vec2D& filter);
void conv2D(const vec2D& a, const vec2D& re, const vec2D& im);

第一个是与滤波器进行卷积，第二个是与由实部和虚部组成的复数滤波器进行卷积。

//gaussian smoothing
vec2D v_original(100, 100);
vec2D v_smoothed = v_original;
v_smoothed.conv2D(v_original, gaus_filter);


//edge detection
const float Re[] = { 0.5f, 0.0f, -0.5f,
                     0.5f, 0.0f, -0.5f,
                     0.5f, 0.0f, -0.5f };
const float Im[] = { 0.5f,  0.5f,  0.5f,
                     0.0f,  0.0f,  0.0f,
                    -0.5f, -0.5f, -0.5f};
vec2D re(3, 3, -1, -1, Re);
vec2D im(3, 3, -1, -1, Im);
vec2D v_edges(100, 100);

//set the rectangle in the matrix to 0.0 
//so there will be edges on the brim of it
RECT r(10, 10, 50, 50);
v_smoothed.set(0.0f, r);

v_edges.conv2D(v_smoothed, re, im);

归一化和直方图均衡化在图像处理中很有用，例如在人脸检测过程中：

void normalize(float a, float b); //归一化到 a...b 范围
void histeq(vec2D& hist); //直方图归一化

对于归一化，矩阵只是被归一化，使其最小值变为 a，最大值变为 b。

对于直方图均衡化，提供一个大小为 1x256 的外部行向量 hist。均衡化之前的矩阵应该包含 0...255 范围内的值，就像标准的位图图像一样。均衡化之后，其值将映射到 0...1.0 的范围。

vec2D hist(1, 256);
vec2D v_heq(100, 100);
v_heq.setrand();
v_heq.norm(0.0f, 255.0f);
v_heq.histeq(hist);