JIT 优化性能 Visual Studio 2013 高级 C# 5.0 开发 .NET C#

深入探讨 .NET 循环性能、边界检查、迭代和展开

Ben M Watson

4.90/5 (60投票s)

2014年11月20日

CPOL

12分钟阅读

62062

深入了解 .NET 数组访问优化，以及如何通过模式化代码来提高效率。

引言

在本文中，我将详细介绍 .NET 如何处理涉及数组和集合访问的循环，以及您可以期望哪些类型的优化。我之前的一篇文章中提到过其中一些优化，但在本文中，我们将深入探讨并展示更多示例。我们将深入研究生成的 IL 以及实际执行的 x64 汇编代码。

这些优化对您的程序有影响吗？只有当您的程序是 CPU 密集型的，并且集合迭代是您处理的核心部分时，才会有影响。正如您将看到的，如果不小心，有一些方法会损害您的性能，但这只有在它占您程序很大一部分的情况下才有意义。我的理念是：大多数人永远不需要了解这些，但如果您需要，那么理解系统的每一层都很重要，这样您才能做出明智的选择。这就是我写我的书《Writing High-Performance .NET Code》的原因，也是我写这些文章的原因——为您提供有关 .NET 程序编译和执行时实际发生的情况的信息。

其中一些信息您可以在其他地方找到，但我希望向您展示如何自己获取这些信息，这样您就不必依赖别人公开这些信息，尤其是在新版本的 JIT 发布时。

我使用的工具：ILSpy 用于查看 IL，Visual Studio 2013 用于调试和查看带有 .NET 4.5 的程序集。我以 Release 模式编译代码，明确针对 x64。调试时，不要从 Visual Studio 开始调试——那样无法准确反映您的代码。相反，单独启动程序并附加调试器。我通过在 Main() 开始处调用 Debugger.Launch() 来轻松实现这一点，然后通过 Ctrl-F5 执行程序。

关于程序集的说明。我选择 x64 是因为如今它应该是开发的首选。对于下面提到的寄存器名称，请理解 eax 和 rax 等指的是同一个寄存器——rax 是 eax 的 64 位版本，本质上是 eax 的符号扩展版本，但它们都指向同一个物理位置。ecx/rdx、ecx/rcx 等也是如此。

边界检查

在 .NET 中存在边界检查是因为 CLR 必须始终确保托管代码的内存安全。它必须确保托管代码不可能损坏已定义数据结构外部的内存。这意味着以下代码不允许执行

int[] array = new int[100];
array[100] = 1;

尝试执行此操作应该会导致异常，而不是允许发生内存损坏。这意味着 CLR 必须在数组访问中注入边界检查。如果未执行某些优化，这可能会使重复的数组访问速度加倍。这种知识使一些人不愿意使用 .NET 来编写高性能代码，但我们将看到 JIT 编译器确实会在某些情况下进行这些优化（而在其他情况下则会不足）。但是，请记住我在引言中说过的话：这些类型的优化只有在您的程序是 CPU 密集型的，并且这些代码位于最内层、最常调用的代码中时才有意义。即便如此，即使您不必担心这种级别的性能，了解如何调查这些类型的细节也很有启发性。

基准循环

首先，让我们看看最基本的循环

[MethodImpl(MethodImplOptions.NoInlining)]
private static long GetSum(int[] array)
{ 
    long sum = 0; 
    for (int i =0; i < array.Length; i++)
    { 
        sum += array[i]; 
    } 
    return sum; 
}

它只是遍历整个数组并对所有元素求和。我明确禁用了内联，因为这可能会使分析有些混乱，而我们只想专注于循环优化，而不是 JIT 编译器此处可以且将会进行的其他改进。

在 IL 中，此方法如下所示（带有我自己的注释）

.method private hidebysig static 
	int64 GetSum (
		int32[] 'array'
	) cil managed noinlining
{
	// Method begins at RVA 0x20a4
	// Code size 26 (0x1a)
	.maxstack 3
	.locals init (
		[0] int64 sum,
		[1] int32 i
	)

	IL_0000: ldc.i4.0    // Push 0 onto evaluation stack as a 4-byte integer
	IL_0001: stloc.0     // Pop from evaluation stack and store into local variable at location 0, sum.
	IL_0002: ldc.i4.0    // Push 0 to evaluation stack
	IL_0003: stloc.1     // Pop and store in local variable i.
	IL_0004: br.s IL_0012     //  Jump to IL_0012 (unconditionally)
	// loop start (head: IL_0010)
		IL_0006: ldloc.0    // Push sum onto stack
		IL_0007: ldarg.0    // Push array onto stack
		IL_0008: ldloc.1    // Push i onto stack
		IL_0009: ldelem.i4  // Pop array and i off the stack and push array[i] onto the stack
		IL_000a: add        // Pop array[i] and sum off the stack, add, and push result to stack
		IL_000b: stloc.0    // Pop result off stack and store in sum
		IL_000c: ldloc.1    // Push i onto stack
		IL_000d: ldc.i4.1   // Push 1 onto stack
		IL_000e: add        // Pop i and 1 off of stack, add, and push result to stack
		IL_000f: stloc.1    // Pop result off stack and store in i

		IL_0010: ldloc.1    // Push i onto stack
		IL_0011: ldarg.0    // Push array onto stack
		IL_0012: ldlen      // Pop array off of stack, get length, and push length onto stack
		IL_0013: conv.i4    // Convert length to 4-byte integer
		IL_0014: blt.s IL_0006    // Pop i and length off of stack. If i < length, jump to IL_0006
	// end loop

	IL_0016: ldloc.0    // Push sum onto stack
	IL_0017: ret        // Return
} // end of method Program::GetSum

执行循环似乎做了很多工作。IL 是一种简单的基于堆栈的语言，非常冗长，但 JIT 编译器会将其转换为实际的机器代码，该代码利用了高效的 CPU 指令和寄存器。生成的程序集是这样的

00007FFADFCC02D0  mov         rdx,rcx                 // Copy array address to rdx
00007FFADFCC02D3  xor         ecx,ecx                 // Set ecx (sum) to 0
00007FFADFCC02D5  mov         r10,qword ptr [rdx+8]   // Copy array length to r10
00007FFADFCC02D9  test        r10d,r10d               // See if length is 0
00007FFADFCC02DC  jle         00007FFADFCC0303        // If so, skip the loop
00007FFADFCC02DE  mov         r8d,ecx                 // Copy 0 to r8 (source of ecx is not relevant)
00007FFADFCC02E1  xor         r9d,r9d                 // Set r9 to 0 (memory offset into array)
00007FFADFCC02E4  nop         word ptr [rax+rax]      // NOP
00007FFADFCC02F0  mov         eax,dword ptr [rdx+r9+10h]  // Copy array[i] to eax
00007FFADFCC02F5  add         ecx,eax                 // Add value to sum
00007FFADFCC02F7  inc         r8d                     // i++
00007FFADFCC02FA  add         r9,4                    // Increment memory address by 4 bytes
00007FFADFCC02FE  cmp         r8d,r10d                // i < length ?
00007FFADFCC0301  jl          00007FFADFCC02F0        // If so, loop back a few statements for another iteration
00007FFADFCC0303  mov         eax,ecx                 // Copy sum to return register
00007FFADFCC0305  ret                                 // Return

这里有几点需要注意

代码同时跟踪索引值 i 和数组值的直接内存地址，这比每次循环迭代执行乘法和偏移要高效得多。
没有数组边界检查！这有时是 .NET 新手担心的问题——我们如何在没有边界检查的情况下维护内存安全？在这种情况下，JIT 编译器认为不需要边界检查，因为它知道此循环的所有约束。

重新引入边界检查

现在让我们引入一些可变性。如果我们改用一些边界来迭代，这会改变情况吗？

[MethodImpl(MethodImplOptions.NoInlining)]
private static int GetSum(int[] array, int start, int end)
{
    int sum = 0;
    for (int i = start; i < end; i++)
    {
        sum += array[i];
    }
    return sum;
}
.
.
.
// Called with:
GetSum(array, 25, 75);

IL 的差异并不太有趣，所以我不会展示它，但机器指令肯定会改变

00007FFADFCC0320  sub         rsp,28h               // Adjust stack pointer
00007FFADFCC0324  mov         r9,rcx                // Copy array address to r9
00007FFADFCC0327  xor         ecx,ecx               // Set ecx (sum) to 0
00007FFADFCC0329  cmp         edx,r8d               // edx has i in it, initialized with start (25). Compare that to r8, which has the end value (75)
00007FFADFCC032C  jge         00007FFADFCC0348      // If start >= end, skip loop.
00007FFADFCC032E  mov         r10,qword ptr [r9+8]  // Copy array length to r10
00007FFADFCC0332  movsxd      rax,edx               // Copy start to rax
00007FFADFCC0335  cmp         rax,r10               // Compare i to array length
00007FFADFCC0338  jae         00007FFADFCC0350      // If i >= length, jump to this address (below)
00007FFADFCC033A  mov         eax,dword ptr [r9+rax*4+10h]  // Copy array[i] to eax
00007FFADFCC033F  add         ecx,eax               // sum += array[i]
00007FFADFCC0341  inc         edx                   // i++
00007FFADFCC0343  cmp         edx,r8d               // Compare i to end value
00007FFADFCC0346  jl          00007FFADFCC0332      // If i < jump to beginning of loop (above)
00007FFADFCC0348  mov         eax,ecx               // Copy sum to return register
00007FFADFCC034A  add         rsp,28h               // Fix up stack pointer
00007FFADFCC034E  ret                               // Return

00007FFADFCC0350  call        00007FFB3F7E6590      // Throw an exception!

最显著的区别是，现在在每次循环迭代中，它都会将数组索引与数组长度进行比较，如果越界，我们会得到一个异常。通过此更改，我们将循环的运行时间加倍了。边界检查就在那里。

那么为什么现在要这样做呢？是因为我们正在处理子数组吗？这很容易测试。您可以修改本文开头的原始函数，使其从 25 迭代到 75，您会发现编译器没有插入边界检查。

不，在这种情况下进行边界检查的原因是因为编译器无法证明输入值是 100% 安全的。

好的，那么。在循环之前检查开始和结束变量怎么样？我怀疑这很快会导致理解代码的麻烦。一旦约束不是常量，您就必须处理变量别名、副作用以及跟踪每个值的用法——这对 JIT 编译器来说并不实用。当然，在我这个简单的示例方法中，这可能很容易，但对于通用情况来说，如果不是不可能的话，也可能非常困难。

其他获取边界检查的方法

即使是对索引变量进行简单的修改，您都可以从逻辑上证明它不会超出数组边界，也可能导致出现边界检查

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Manipulate(int[] array)
{
    for (int i=0; i < array.Length; i++)
    {
        array[i] = array[i / 2];
    }
}

i / 2 不可能超出 array.Length，但边界检查仍然存在

00007FFADFCE03B0  sub         rsp,28h              // Adjust stack pointer  
00007FFADFCE03B4  mov         r8,qword ptr [rcx+8] // Copy array length to r8
00007FFADFCE03B8  test        r8d,r8d              // Test if length is 0
00007FFADFCE03BB  jle         00007FFADFCE03E8     // If 0, skip loop
00007FFADFCE03BD  xor         r9d,r9d              // Set i to 0
00007FFADFCE03C0  xor         r10d,r10d            // Set destination index to 0
00007FFADFCE03C3  mov         eax,r9d              // Set eax (source array index) to 0
00007FFADFCE03C6  cdq                              // sign extends eax into edx (needed for division)
00007FFADFCE03C7  sub         eax,edx              // eax = eax - edx 
00007FFADFCE03C9  sar         eax,1                // Shift right (divide by two)
00007FFADFCE03CB  movsxd      rax,eax              // copy 32-bit eax into 64-bit rax and sign extend
00007FFADFCE03CE  cmp         rax,r8               // Compare source index against array length
00007FFADFCE03D1  jae         00007FFADFCE03F0     // If greater than or equal, jump and throw exception
00007FFADFCE03D3  mov         eax,dword ptr [rcx+rax*4+10h]  // Copy value from array[i/2] to eax
00007FFADFCE03D7  mov         dword ptr [rcx+r10+10h],eax    // Copy value from eax to array[i]
            
00007FFADFCE03DC  inc         r9d                  // Increment i
00007FFADFCE03DF  add         r10,4                // Increment destination address by 4 bytes
00007FFADFCE03E3  cmp         r9d,r8d              // Compare i against array length
00007FFADFCE03E6  jl          00007FFADFCE03C3     // If less, loop again
00007FFADFCE03E8  add         rsp,28h              // Adjust stack
00007FFADFCE03EC  ret                              // return

00007FFADFCE03F0  call        00007FFB3F7E6590     // Throw exception

所以，即使我们可以证明它是安全的，编译器也不会。

还有一种方式，有点烦人。还记得第一个基准示例吗，那个没有边界检查的。如果我们有完全相同的代码，但它操作的是一个静态值而不是参数，会怎么样？

00007FFADFCE0500  mov         rax,0A42A235780h  
00007FFADFCE050A  mov         rax,qword ptr [rax]  
00007FFADFCE050D  movsxd      r9,ecx  
00007FFADFCE0510  mov         r8,qword ptr [rax+8]  
00007FFADFCE0514  cmp         r9,r8                 // bounds check!
00007FFADFCE0517  jae         00007FFADFCE0530  
00007FFADFCE0519  mov         eax,dword ptr [rax+r9*4+10h]  
00007FFADFCE051E  add         edx,eax  
00007FFADFCE0520  inc         ecx  
00007FFADFCE0522  cmp         ecx,r8d  
00007FFADFCE0525  jl          00007FFADFCE0500  
00007FFADFCE0527  mov         eax,edx  
00007FFADFCE0529  add         rsp,28h  
00007FFADFCE052D  ret  
                            
00007FFADFCE0530  call        00007FFB3F7E6590

边界检查就在那里。

好的，那么如果我们很聪明，并将静态数组传递给我们的基准方法，而它不知道它是静态的呢？在这种情况下，没有边界检查——它与基准情况完全相同。

JIT 无法优化但可以优化的另一个情况是向后迭代数组

[MethodImpl(MethodImplOptions.NoInlining)]
private static int GetBackwardsSum(int[] array)
{
    int sum = 0;
    for (int i = array.Length - 1; i >= 0; i--)
    {
        sum += array[i];
    }
    return sum;
}

不幸但真实——尽管这看起来像基准示例，但它没有优化。

最后，考虑重复访问的情况

for (int i = 2;i < array.Length - 1; i++)
{
    int x = array[i-2];
    int y = array[i-2];
}

在这种情况下，编译器将发出一个边界检查，该检查足以满足两个访问。

使用 fixed 消除边界检查

消除边界检查的另一种方法是使用 fixed 关键字。我不太推荐它，因为它存在安全性和安全性方面的考虑，您绝对应该仔细考虑。但是，如果您确实需要对具有某种非平凡访问模式但否则会触发边界检查的数组进行最佳访问，这可能适合您。要使用 fixed，您必须为项目启用 unsafe 代码。

以下是使用 fixed 迭代数组并避免边界检查的两种选择

[MethodImpl(MethodImplOptions.NoInlining)]
unsafe private static int GetBackwardsSumFixedSubscript(int[] array)
{
    int sum = 0;
    fixed (int* p = array)
    {
        for (int i = array.Length - 1; i >= 0; i--)
        {
            sum += p[i];
        }
    }
    return sum;
}

[MethodImpl(MethodImplOptions.NoInlining)]
unsafe private static int GetBackwardsSumFixedPointer(int[] array)
{
    int sum = 0;
    fixed (int* p = &array[array.Length - 1])
    {
        int* v = p;
        for (int i = array.Length - 1; i >= 0; i--)
        {
            sum += *v;
            v--;
        }
    }
    return sum;
}

任何一种都可以工作，但请记住，unsafe 确实意味着不安全！您可能会在此方法中踩到内存。我建议将所有 unsafe 代码隔离到其自己的模块、类或其他受限范围内，并在代码审查期间对其进行额外审查。您还可能需要处理部署 unsafe 代码的组织安全问题。如果您可以使用纯托管代码消除边界检查，请强烈优先考虑该方法。

Foreach 与 For

现在，让我们将焦点转移到一个抽象级别。在 .NET 中，您可以使用 foreach 关键字迭代大多数集合。在大多数情况下，foreach 是调用 GetEnumerator()、MoveNext() 等方法的语法糖。在这方面，可以公平地假设 foreach 的总体性能比简单的 for 或 while 循环要差。

然而，情况并非完全如此。在某些情况下，编译器会识别一个简单的数组 for 循环的本质，而不管使用的语法如何。

数组上的 Foreach

如果我们以如下方式将上面的基准循环转换为 foreach

private static int GetSumForeach(int[] array)
{
    int sum = 0;
    foreach(var val in array)
    {
        sum += val;
    }
    return sum;
}

在这种情况下生成的 IL 是什么？

.method private hidebysig static 
	int32 GetSumForeach (
		int32[] 'array'
	) cil managed 
{
	// Method begins at RVA 0x2098
	// Code size 28 (0x1c)
	.maxstack 2
	.locals init (
		[0] int32 sum,
		[1] int32 val,
		[2] int32[] CS$6$0000,
		[3] int32 CS$7$0001
	)

	IL_0000: ldc.i4.0
	IL_0001: stloc.0
	IL_0002: ldarg.0
	IL_0003: stloc.2
	IL_0004: ldc.i4.0
	IL_0005: stloc.3
	IL_0006: br.s IL_0014
	// loop start (head: IL_0014)
		IL_0008: ldloc.2
		IL_0009: ldloc.3
		IL_000a: ldelem.i4
		IL_000b: stloc.1
		IL_000c: ldloc.0
		IL_000d: ldloc.1
		IL_000e: add
		IL_000f: stloc.0
		IL_0010: ldloc.3
		IL_0011: ldc.i4.1
		IL_0012: add
		IL_0013: stloc.3

		IL_0014: ldloc.3
		IL_0015: ldloc.2
		IL_0016: ldlen
		IL_0017: conv.i4
		IL_0018: blt.s IL_0008
	// end loop

	IL_001a: ldloc.0
	IL_001b: ret
} // end of method Program::GetSumForeach

将其与基准循环的 IL 进行比较。唯一的区别是存在一些额外的、生成的局部变量来存储数组和索引变量。否则，它看起来就像一个循环。事实上，程序集代码也看起来完全相同（包括消除了边界检查）

00007FFADFCB01F0  mov         rdx,rcx  
00007FFADFCB01F3  xor         ecx,ecx  
00007FFADFCB01F5  mov         r10,qword ptr [rdx+8]  
00007FFADFCB01F9  test        r10d,r10d  
00007FFADFCB01FC  jle         00007FFADFCB0223  
00007FFADFCB01FE  mov         r8d,ecx  
00007FFADFCB0201  xor         r9d,r9d  
00007FFADFCB0204  nop         word ptr [rax+rax]  
00007FFADFCB0210  mov         eax,dword ptr [rdx+r9+10h]  
00007FFADFCB0215  add         ecx,eax  
00007FFADFCB0217  inc         r8d  
00007FFADFCB021A  add         r9,4  
00007FFADFCB021E  cmp         r8d,r10d  
00007FFADFCB0221  jl          00007FFADFCB0210  
00007FFADFCB0223  mov         eax,ecx  
00007FFADFCB0225  ret

List<T> 上的 For 循环

在看到 foreach 操作 List<T> 之前，我们应该看看 for 是如何工作的

private static int GetSumFor(List<int> array)
{
    int sum = 0;
    for (int i = 0; i < array.Count; i++)
    {
        sum += array[i];
    }
    return sum;
}

以及相应的 IL

.method private hidebysig static 
	int32 GetSumFor (
		class [mscorlib]System.Collections.Generic.List`1 'array'
	) cil managed 
{
	// Method begins at RVA 0x20a0
	// Code size 31 (0x1f)
	.maxstack 3
	.locals init (
		[0] int32 sum,
		[1] int32 i
	)

	IL_0000: ldc.i4.0
	IL_0001: stloc.0
	IL_0002: ldc.i4.0
	IL_0003: stloc.1
	IL_0004: br.s IL_0014
	// loop start (head: IL_0014)
		IL_0006: ldloc.0
		IL_0007: ldarg.0
		IL_0008: ldloc.1
		IL_0009: callvirt instance !0 class [mscorlib]System.Collections.Generic.List`1::get_Item(int32)
		IL_000e: add
		IL_000f: stloc.0
		IL_0010: ldloc.1
		IL_0011: ldc.i4.1
		IL_0012: add
		IL_0013: stloc.1

		IL_0014: ldloc.1
		IL_0015: ldarg.0
		IL_0016: callvirt instance int32 class [mscorlib]System.Collections.Generic.List`1::get_Count()
		IL_001b: blt.s IL_0006
	// end loop

	IL_001d: ldloc.0
	IL_001e: ret
} // end of method Program::GetSumFor

它看起来与数组上的 for 循环非常相似，但检索值和长度 (Count) 时使用了一些方法调用而不是直接内存访问。

然而，由此生成的程序集代码更有趣，并显示了一些微妙而重要的区别

00007FFADFCD0BE0  push        rbx  	// Stack frame setup
00007FFADFCD0BE1  push        rsi  
00007FFADFCD0BE2  push        rdi  
00007FFADFCD0BE3  sub         rsp,20h  
00007FFADFCD0BE7  mov         rdx,rcx    // Copy array address to rdx
00007FFADFCD0BEA  xor         ecx,ecx    // Set i to 0. i is used only for the loop condition.
00007FFADFCD0BEC  xor         r11d,r11d  // Set j to 0. j is used as the iterator on the internal array inside List
00007FFADFCD0BEF  mov         r8d,ecx    // Set sum to 0
00007FFADFCD0BF2  mov         ebx,dword ptr [rdx+18h]    // Copy length to ebx
// LOOP START
00007FFADFCD0BF5  cmp         ecx,ebx                    // Compare i to length
00007FFADFCD0BF7  jl          00007FFADFCD0C00           // if less, jump below
00007FFADFCD0BF9  mov         eax,r8d                    // else copy sum to return value
00007FFADFCD0BFC  jmp         00007FFADFCD0C26           // jump to end of function
00007FFADFCD0BFE  xchg        ax,ax                      // No-op

00007FFADFCD0C00  mov         eax,dword ptr [rdx+18h]    // Copy list's length to eax
00007FFADFCD0C03  cmp         ecx,eax                    // Compare i to length
00007FFADFCD0C05  jae         00007FFADFCD0C35           // if greater than or equal, jump and throw exception
00007FFADFCD0C07  mov         r9,qword ptr [rdx+8]       // Copy List's array address to r9
00007FFADFCD0C0B  movsxd      r10,r11d                   // Copy j  to r10 and sign extend.
00007FFADFCD0C0E  mov         rax,qword ptr [r9+8]       // Copy array's length to rax
00007FFADFCD0C12  cmp         r10,rax                    // Compare j to array' length
00007FFADFCD0C15  jae         00007FFADFCD0C30           // If greater than or equal, jump and throw exception
00007FFADFCD0C17  mov         eax,dword ptr [r9+r10*4+10h]  // copy array[j] (list[i]) to eax
00007FFADFCD0C1C  add         r8d,eax                    // sum += array[j]
00007FFADFCD0C1F  inc         ecx                        // i++
00007FFADFCD0C21  inc         r11d                       // j++
00007FFADFCD0C24  jmp         00007FFADFCD0BF5           // Jump to beginning of loop 
00007FFADFCD0C26  add         rsp,20h                    // Clean up stack
00007FFADFCD0C2A  pop         rdi  
00007FFADFCD0C2B  pop         rsi  
00007FFADFCD0C2C  pop         rbx  
00007FFADFCD0C2D  ret                                    // return

00007FFADFCD0C30  call        00007FFB3F7E6590           // Throw exception

00007FFADFCD0C35  mov         ecx,0Dh                    
00007FFADFCD0C3A  call        00007FFB3CD990F8           // Throw exception

以下是发生的情况

List<int> 上的方法调用已被内联——我们获得了对 List<int> 维护的底层数组的直接内存访问。
有一个对 i 的迭代，它将其与 List<int> 的长度进行比较。
有一个单独的对 j（我在注释中称之为 j）的迭代，它将其与 List<int> 包装的内部数组的长度进行比较。
- 其中任何一个迭代失败都会导致抛出异常。

重复的迭代实际上并没有使循环比纯数组慢两倍——仍然只有一次内存访问，其余所有操作都在寄存器上进行，这非常高效。但是，如果您在数组和 List<T> 之间进行选择，那么对于纯粹的效率，很难超越数组。

List<T> 上的 Foreach

最后，让我们将注意力转向 List<T> 上的 foreach。编译器（或 JIT）是否会将其转换为简单的 for 循环？

不会。这是 IL

.method private hidebysig static 
	int32 GetSumForeach (
		class [mscorlib]System.Collections.Generic.List`1 list
	) cil managed 
{
	// Method begins at RVA 0x20f4
	// Code size 50 (0x32)
	.maxstack 2
	.locals init (
		[0] int32 sum,
		[1] int32 val,
		[2] valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator CS$5$0000
	)

	IL_0000: ldc.i4.0
	IL_0001: stloc.0
	IL_0002: ldarg.0
	IL_0003: callvirt instance valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator<!--0--> class [mscorlib]System.Collections.Generic.List`1::GetEnumerator()
	IL_0008: stloc.2
	.try
	{
		IL_0009: br.s IL_0017
		// loop start (head: IL_0017)
			IL_000b: ldloca.s CS$5$0000
			IL_000d: call instance !0 valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator::get_Current()
			IL_0012: stloc.1
			IL_0013: ldloc.0
			IL_0014: ldloc.1
			IL_0015: add
			IL_0016: stloc.0

			IL_0017: ldloca.s CS$5$0000
			IL_0019: call instance bool valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator::MoveNext()
			IL_001e: brtrue.s IL_000b
		// end loop

		IL_0020: leave.s IL_0030
	} // end .try
	finally
	{
		IL_0022: ldloca.s CS$5$0000
		IL_0024: constrained. valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator
		IL_002a: callvirt instance void [mscorlib]System.IDisposable::Dispose()
		IL_002f: endfinally
	} // end handler

	IL_0030: ldloc.0
	IL_0031: ret
} // end of method Program::GetSumForeach

我不会展示此程序的程序集代码，但可以说，它也相当不同。

(IEnumerable<T>)array 上的 Foreach

您会期望这样的代码发生什么？

private static int GetSumForeach(int[] array)
{
    var collection = array as IEnumerable<int>;
    int sum = 0;
    foreach (var val in collection)
    {
        sum += val;
    }
    return sum;
}

在这种情况下，数组优化将丢失，您将得到与 List<T> 上的 foreach 相同的结果。

循环展开

循环展开是通过每次迭代执行更多工作来减少循环开销的做法，从而减少迭代次数。对于循环繁重的代码，它可以产生很大的影响。

您可以自己进行这种展开，例如通过以下示例

private static int GetSum(int[] array)
{
    int sum = 0;
    for (int i = 0; i < array.Length-4; i+=4)
    {
        sum += array[i];
        sum += array[i+1];
        sum += array[i+2];
        sum += array[i+3];
    }
    return sum;
}

在某些情况下，编译器可以为您完成这种优化，但要进行循环展开，必须遵守一个主要限制：循环约束必须是静态可确定的，以便编译器知道如何修改循环。

这就是为什么我们在之前示例的程序集中看不到任何循环展开的原因。循环的长度始终是可变的。如果我们改用如下内容呢？

const int ArrayLength = 100;

private static int GetSum(int[] array)
{
    int sum = 0;
    for (int i = 0; i < ArrayLength; i++)
    {
        sum += array[i];
    }
    return sum;
}

此方法的 IL 在外观上与之前相似，但程序集显示了优化

00007FFADFCE01C0  sub         rsp,28h              // Setup stack frame
00007FFADFCE01C4  mov         rdx,rcx              // Copy array address to rdx
00007FFADFCE01C7  xor         ecx,ecx              // Set sum to 0
00007FFADFCE01C9  xor         r8d,r8d              // Set address offset to 0
00007FFADFCE01CC  mov         rax,qword ptr [rdx+8]// Copy array length to rax
00007FFADFCE01D0  mov         r9d,60h              // Copy value 96 to r9
00007FFADFCE01D6  cmp         r9,rax               // Compare 96 to length
00007FFADFCE01D9  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01DB  mov         r9d,61h              // Copy value 97 to r9
00007FFADFCE01E1  cmp         r9,rax               // Compare to length
00007FFADFCE01E4  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01E6  mov         r9d,62h              // Copy value 98 to r9
00007FFADFCE01EC  cmp         r9,rax               // Compare to length
00007FFADFCE01EF  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01F1  mov         r9d,63h              // Copy value 99 to r9
00007FFADFCE01F7  cmp         r9,rax               // Compare to length
00007FFADFCE01FA  jae         00007FFADFCE0230     // If greater than or equal, jump and throw exception
00007FFADFCE01FC  nop         dword ptr [rax]      // Nop
// LOOP START
00007FFADFCE0200  mov         eax,dword ptr [rdx+r8+10h]  // Copy array[i] to eax
00007FFADFCE0205  add         ecx,eax                     //sum += array[i]
00007FFADFCE0207  mov         eax,dword ptr [rdx+r8+14h]  // Copy array[i+1] to eax
00007FFADFCE020C  add         ecx,eax                     // sum += array[i+1]
00007FFADFCE020E  mov         eax,dword ptr [rdx+r8+18h]  // Copy array[i+2] to eax
00007FFADFCE0213  add         ecx,eax                     // sum += array[i+2]
00007FFADFCE0215  mov         eax,dword ptr [rdx+r8+1Ch]  // Copy array[i+3] to eax
00007FFADFCE021A  add         ecx,eax                     // sum += array[i+3]
00007FFADFCE021C  add         r8,10h                      // Increment address offset by 16 bytes (4 integers)
00007FFADFCE0220  cmp         r8,190h                     // Compare address offset to 400
00007FFADFCE0227  jl          00007FFADFCE0200            // If less, loop around
00007FFADFCE0229  mov         eax,ecx                     // Copy sum to return value
00007FFADFCE022B  add         rsp,28h                     // Fix up stack
00007FFADFCE022F  ret                                     // return

00007FFADFCE0230  call        00007FFB3F7E6590            // throw exception

这基本操作是

检查我们是否可以安全地对 N 个值进行增量并溢出数组（在本例中 N 为 4）。
每个循环体执行 N 个操作。

然而，一旦您偏离了静态确定性，就无法依赖此优化。但是，如果您对数组有特殊的了解（例如，它们的长度始终是 4 的倍数），那么您可以在源代码中自己进行循环展开。

就当前的 JIT 编译器而言，如果您将此示例中的数组大小更改为 101，您将失去此优化。某些编译器会对此进行拆分——一个展开的循环，末尾附加剩余部分。JIT 编译器目前不这样做（请记住，它受到严格的时间限制）。

当然，最终的循环展开就是消除循环本身，并将每个操作单独列出。对于小型操作，这可能是可取的——这只是您必须在代码维护和性能之间做出的权衡。

摘要

如果您需要在性能更关键的组件上获得最后一点性能，尤其是在 CPU 密集型的情况下，了解代码中的微小差异如何转化为更有效或更低效的机器代码非常重要。

在所有集合中，数组的访问效率最高，但 List<T> 也非常接近。在迭代数组时，您应该尽量在可能的情况下消除边界检查。在决定 for 和 foreach 之间时，请记住，除了纯数组之外，foreach 可能涉及方法调用。

了解允许循环展开的条件。如果您能够构建代码以允许它，并且它能产生影响，那就去做吧，或者手动展开循环以获得性能提升。

最重要的是，找出如何自己回答这些类型的问题，这样下次出现深入的性能问题时，您就知道如何快速确定答案。