基于模板参数的循环展开

Kevin Drzycimski

4.90/5 (33投票s)

2010 年 4 月 24 日

CPOL

6分钟阅读

60211

391

在编译时展开循环，通过模板参数推导。

下载源代码 - 2.99 MB

引言

本文讨论循环展开，这是一种提高速度的技术。对于不同的问题大小，您可以使用整型模板参数。

背景

编写高性能代码时，循环展开是一项很棒的功能。想象一下，当循环次数很少且在编译时已知时，就像这样

for (int i = 0; i < 8; ++i) {
    a[i] = b[i] + c[i];
}

大多数时候，CPU（和 GPU）展开循环会更快，因为否则您每次都必须在 `}` 处跳转到循环的开头。当循环展开后，代码看起来是这样的

a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
a[2] = b[2] + c[2];
a[3] = b[3] + c[3];
a[4] = b[4] + c[4];
a[5] = b[5] + c[5];
a[6] = b[6] + c[6];
a[7] = b[7] + c[7];

由于流水线和其他问题，这比原始循环更快。处理器可以直接按顺序执行每个语句，并向前查找后续语句并预取值。（否则，在每一行之间都会有 `JNZ` 和 `INC` 以及其他一些汇编语句。）

实现方法

当然，像这样编写每个循环将是可怕的（这就是为什么存在循环语句！）。一些小的循环已经被编译器展开。Nvidia CUDA 编译器和 IBM 编译器提供了 `#pragma unroll` 或 `#pragma unroll N`（其中 N 是一个字面量常数）来建议编译器展开下面的循环，如果它可以预测其迭代次数。不幸的是，它们在模板参数方面存在问题

template <int N>
void doit() {
    #pragma unroll N
    for (int i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}

nvcc 不喜欢 N 在这个位置，并且在没有任何参数的情况下，它无法推导出迭代次数，因此循环不会被展开。CUDA 编程中的糟糕之处在于，如果数组访问是非确定性的（例如 `a[i]`，其中 `i` 不是固定的），编译器会将数组放在芯片外位置，其延迟比芯片内慢约 700 倍（您只会通过常量索引访问数组，例如 `a[0] = a[1] + a[2]`）。但是我在许多函数中都有一个可变的模板参数。我本可以用预处理器宏重构所有这些，但这将是很尴尬的。于是就有了 Unroller

//static unroller
template <typename Action, int Begin, int End, int Step = 1>
struct UnrollerS {
    static void step() {
        Action::action(Begin);
        UnrollerS<Action, Begin+Step, End, Step>::step();
    }
};
//end of static unroller
template <typename Action, int End, int Step>
struct UnrollerS<Action, End, End, Step> {
    static void step() { }
};

//dynamic unroller
template <typename Action, int Begin, int End, int Step = 1>
struct UnrollerD {
    static void step(Action& a) {
        a.action(Begin);
        UnrollerD<Action, Begin+Step, End>::step(a);
    }
};
//end of dynamic unroller
template <typename Action, int End, int Step>
struct UnrollerD<Action, End, End, Step> {
    static void step(Action& a) { }
};

`UnrollerD` 和 `UnrollerS` 调用它们的 `Action`（静态函数或成员函数），然后以 `Begin` 递增的方式自身循环，直到它们达到部分特化，此时 `Begin == End`，它们什么也不做。

使用代码

因此，如果您有一个不需要任何额外信息的动作要执行，您可以将其声明为一个带有静态函数 `action(i)` 的结构体，并将其传递给 `UnrollerS`

struct Printer {
    static void action(int i) {
        printf("%d\n", i);
    }
};

//later...

UnrollerS<Printer, 10, 20>::step();

如果您必须将一些值传递给 Action，请使用 `UnrollerD`

struct Aassign {
    int* rhs;
    int *m;
    void action(int i) {
        m[i] = rhs[i];
    }
};

nbase<M>& operator=(int* rhs) {
    Aassign assign;
    assign.m = m_array;
    assign.rhs = rhs;
    //m_array and rhs are both of (templated) size M
    UnrollerD<Aassign, 0, M>::step(assign);
    return *this;
}

这取决于您编译器的内联能力，它会在多大程度上内联您的 Action。Visual C++ 编译器即使有 `inline` 声明，也只内联 2 或 4 个步骤。您必须在 `action` 和 `step` 函数上使用 `__forceinline` 来强制完全展开循环。

成员函数指针

如前一版本所述，这是 `UnrollerM` 的代码

//member function unroller
template<typename Action, void (Action::*Function)(int), int Begin, int End, int Step = 1>
struct UnrollerM {
    static void step(Action& action) {
        (action.*Function)(Begin);
        UnrollerM<Action, Function, Begin+Step, End, Step>::step(action);
    }
};
//end of member function unroller
template<typename Action, void (Action::*Function)(int), int End, int Step>
struct UnrollerM<Action, Function, End, End, Step> {
    static void step(Action& action) {
    }
};

如果您有不同的方法并且不想为每个方法创建一个 `Action`（同时也启用了数据共享，但此处未显示），那么它会很有用。

struct MultiFunctor {
    void twice(int i) {
        cout << 2*i << ' ';
    }
        void quad(int i) {
        cout << i*i << ' ';
    }
};

//...
MultiFunctor mfunc;
//call quad(int) from mfunc from 0 to 10
UnrollerM<MultiFunctor, &MultiFunctor::quad, 0, 10>::step(mfunc);
cout << endl;
//call twice(int) from mfunc from 20 to 0 step -2 (note, 0 is not evaluated)
UnrollerM<MultiFunctor, &MultiFunctor::twice, 20, 0, -1>::step(mfunc);

Lambda 函数

下一步将是使用新的 lambda 函数，而不是“难以输入”的成员函数指针。在 Visual Studio 2010 中，它们似乎有一个匿名类型，位于一个未知的命名空间中，因此很难将其硬编码为模板参数。幸运的是，作为成员函数模板，编译器本身可以推导出 lambda 的正确类型。

template<int Begin, int End, int Step = 1>
//lambda unroller
struct UnrollerL {
    template<typename Lambda>
    static void step(Lambda& func) {
        func(Begin);
        UnrollerL<Begin+Step, End, Step>::step(func);
    }
};
//end of lambda unroller
template<int End, int Step>
struct UnrollerL<End, End, Step> {
    template<typename Lambda>
    static void step(Lambda& func) {
    }
};

当您熟悉新的 lambda 语法后，这将是最容易使用的 unroller。

//use it as local variable
auto lambda = [](int i){ cout << i << ' '; };
//means: from 30 to 20 backward
UnrollerL<30, 20, -1>::step(lambda);
cout << endl;

int numbers[] = {1, 22, 333, 4444, 55555};
//capture the array, and print it out (defined lambda on the fly)
UnrollerL<0, 5>::step( [&numbers] (int i) {
    cout << numbers[i] << ' ';
}
);

性能测试

测试在 Intel P8600（双核，2.4 GHz）上使用 Visual Studio 2010 C++ 编译器以 Release 模式和一些优化进行了。`__forceinline` 已添加到 `UnrollerL` 的 `step` 函数中。对于此测试，我创建了一个随机直方图，并想创建一个前缀和（例如，在统计应用程序或基数排序中）。您有一个 `int` 数组 `histogram`，并对其进行转换，使每个条目都包含其前面值的总和。核心循环如下所示

tmp = histogram[i] + sum;
histogram[i] = sum;
sum = tmp;

我测量了三种不同的计算方法。第一种是“正常”的，就像使用标准的 `for` 循环一样

int sum = 0;
int tmp;
for (int i = 0; i < N; ++i) {
    tmp = histogram[i] + sum;
    histogram[i] = sum;
    sum = tmp;
}

第二种，循环完全展开

int sum = 0;
int tmp;
int* hist = histogram;
UnrollerL<0, N>::step( [&] (int i) {
    tmp = hist[i] + sum;
    hist[i] = sum;
    sum = tmp;
}
);

（注意：`histogram` 超出了作用域，所以我别名了一个指针并将其捕获在 lambda 中。）最后，有点像混合展开循环，或部分展开。整个循环没有被展开，但 `H` 次迭代被硬编码，因此循环只需要循环 `N/H` 次。

template<int H>
//...
int sum = 0;
int tmp;
int *hist = histogram;
for (int i = 0; i < N; i += H) {
    UnrollerL<0, H>::step( [&] (int j) {
        tmp = hist[i+j] + sum;
        hist[i+j] = sum;
        sum = tmp;
    }
    );
}
//Note: lokes like a nested loop

在第一张图中，依次测量了正常、完全展开和混合展开。数组大小分别为 16、32、64、128、256；混合展开因子为 8

对于较小的迭代次数，完全展开效果很好。对于 128 个元素，混合展开已经稍好一些。对于 256 个元素，完全展开就爆炸了。这是因为代码膨胀得太厉害，以至于无法适应程序缓存，并且太大而无法快速排序。

由于 256 次完全展开的迭代花费了太长时间，因此在下面的图中省略了这种类型。左侧是正常循环，右侧是部分展开循环。前三个是 1024 个元素，混合展开因子为 8、16、32；接下来的三个是 2048 个元素，使用相同的混合因子；然后是 4096、8192 和 16384。然后，列大小减半，并测量了 32768 个元素，混合展开因子为 8、16、32 和 64；最后，测量了 65536 个元素，内部混合展开因子为 8、16、32、64 和 128。

混合展开总是比正常循环好。在最后一种情况下，原始数字从 3086 到 2463。

部分展开器

在使用 unroller 一段时间后，我看到了创建部分 unroller 的必要性。如前所述，具有 256 次迭代的展开循环是极端的。混合解决方案更稳定，因为它展开了循环的内部部分。它也太严格了，您必须在编译时知道总迭代次数。因此解决方案是 `UnrollerP`

//UnrollerP: loops over given size, partial unrolled
template<int InnerUnroll = 8, int Begin = 0>
public:
struct UnrollerP {
    template<typename Lambda>
    static void step(size_t N, Lambda& func) {
        size_t i = Begin;
        for (; i < N - InnerUnroll; i += InnerUnroll) {
            UnrollerInternal<>::step(func, i);
        }
        for (; i < N; ++i) {
            func(i);
        }
        
    }
private:
    //start of UnrollerInternal
    template<size_t Offset = 0>
    struct UnrollerInternal {
        template<typename Lambda>
        static void step(Lambda& func, size_t i) {
            func(i + Offset);
            UnrollerInternal<Offset + 1>::step(func, i);
        }
    };

    //end of UnrollerInternal
    template<>
    struct UnrollerInternal<InnerUnroll> {
        template<typename Lambda>
        static void step(Lambda& func, size_t i) {
        }
    };

};

`UnrollerP` 作为启动器，实际展开由 `UnrollerInternal` 完成（它像普通的一样工作）。`UnrollerP` 本身有一个带有 `InnerUnroll` 作为增量的循环，并且在此循环体内，lambda 以该因子展开。对于奇数大小，最后的步骤正常迭代。请注意，总迭代次数 `N` 是动态演变的，因此这更灵活。像这样使用它

int numbers;
//get 'numbers' at runtime

int *arr = new int[numbers];
//assign some values

int sum = 0, tmp;

//unroll the loop 8 times, offset is 0 so 
//the range is from 0 to numbers
UnrollerP<8>::step(numbers, [&] (size_t i) {
    arr[i] = i;
    tmp = arr[i] + sum;
    arr[i] = sum;
    sum = tmp;
}
);

统一

感谢 UsYer，他发现 `UnrollerS`、`UnrollerD`、`UnrollerM` 和 `UnrollerL` 可以合并成一个单一的、符合 STL 标准的模板

template<int Begin, int End, int Step = 1>
struct Unroller {
    template<typename Action>
    static void step(Action& action) {
        action(Begin);
        Unroller<Begin+Step, End, Step>::step(func);
    }
};

template<int End, int Step>
struct Unroller<End, End, Step> {
    template<typename Action>
    static void step(Action& action) {
    }
};

唯一的要求是 `Action` 必须是可调用的，所以任何带有 `operator()(int)` 的函数或仿函数都可以。之前的示例可以这样工作

int numbers[] = {1, 22, 333, 4444, 55555};

//use a free or static function:
Unroller<10, 20, 2>::step(Printer::act);

//use a functor:    
ArrayPrinter ap;
ap.items = numbers;
Unroller<1, 4>::step(ap);

//Unroll memberfunction with the help of c++1x's std::bind 
MultiFunctor mfunc;
Unroller<20,0,-1>::step(std::bind(&MultiFunctor::twice,&mfunc,std::placeholders::_1));

//Of course lambdas are supported as well, because they are callable too(std::function)
Unroller<0, 5>::step([&numbers](int i){cout << numbers[i] << ' ';});

结论

循环展开是一个花哨的工具，可以提高几个百分点的速度。如果您做得太过头（迭代次数太大），您的代码可能会变慢（程序代码不适合 L1 缓存）。只需亲自测试一下，感受其中的差异。有了这些模板，您就可以释放其强大的功能并保持代码的可维护性。

历史

v2.1：添加了部分展开和统一一些 unroller，感谢 UsYer。
v2.01：可下载示例项目。
v2.0：添加了对 lambda 函数、性能测试和示例项目等的支持。
v1.1：添加了成员函数指针和可变步长。
v1.0：初始版本。