Working with Large Byte Arrays

bauemeise

9 Feb 2009

LGPL3

Allocating and copying large byte[] buffers

Download source code - 5.07 KB

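Introduction

This article compares several ways of allocating and copying large byte[] buffers in managed code.

Background

Sometimes you have to work with large byte arrays that are allocated and copied frequently. Video processing is a typical case: a 320x240 pixel picture in RGB24 format needs a byte[] of 320*240*3 = 230400 bytes. Choosing the right allocation and copy strategy can be crucial for your project.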
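The frame-size arithmetic above can be written as a small helper. This is only an illustrative sketch; the class and method names (`FrameSize`, `BytesPerFrame`) are hypothetical and do not come from the article's source code.

```csharp
using System;

static class FrameSize
{
    // Bytes needed for one uncompressed frame: width * height * bytes per pixel.
    // checked() turns a silent integer overflow into an OverflowException.
    public static int BytesPerFrame(int width, int height, int bytesPerPixel)
    {
        return checked(width * height * bytesPerPixel);
    }
}
```

For RGB24 (3 bytes per pixel), `FrameSize.BytesPerFrame(320, 240, 3)` gives the 230400 bytes mentioned above.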

Using the Code

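In my current project, I need to process hundreds of uncompressed RGB24 frames in real time on multi-core servers. To choose the architecture that fits the project best, I compared different memory allocation and copy mechanisms.

Because I know how hard it is to measure well, I decided on a very simple test that yields a rough but comparable result: I run a loop for 10 seconds and count the number of iterations.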
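The "loop for a fixed time and count" idea can be sketched as a reusable helper. Note that this is not the article's harness (which uses DateTime.UtcNow.Ticks directly inside each test method); it is a variant built on Stopwatch, and the names `LoopCounter` and `CountIterations` are assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;

static class LoopCounter
{
    // Runs body repeatedly until the given duration has elapsed
    // and returns how many times it completed.
    public static long CountIterations(Action body, TimeSpan duration)
    {
        Stopwatch watch = Stopwatch.StartNew();
        long count = 0;
        while (watch.Elapsed < duration)
        {
            body();
            count++;
        }
        return count;
    }
}
```

A call might look like `LoopCounter.CountIterations(() => { byte[] buf = new byte[230400]; }, TimeSpan.FromSeconds(10))`. The delegate call adds a small fixed cost per iteration, so absolute counts would not be comparable with the inline loops used in the article.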

Allocation

Looking around, I found five ways to allocate a large byte array:

new byte[]
Marshal.AllocHGlobal()
Marshal.AllocCoTaskMem()
CreateFileMapping() // this is shared memory
stackalloc byte[]

new byte[]

Here is the typical test loop, shown for new byte[]:

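private static void newbyte()
{
    // bufsize (bytes per buffer) and duration (test length in ticks)
    // are defined elsewhere in the test program.
    Console.Write("new byte[]: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    // Allocate buffers until the test duration has elapsed.
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        byte[] buf = new byte[bufsize];
        i++;
    }
    Console.WriteLine(i); // number of completed loops
}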

new byte[] is fully managed code.

Marshal.AllocHGlobal()

Allocates memory from the unmanaged memory of the process.

IntPtr p = Marshal.AllocHGlobal(bufsize);
Marshal.FreeHGlobal(p);

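Marshal.AllocHGlobal() returns an IntPtr, so the call itself does not require `unsafe` code. When you want to access the allocated memory, however, you will most often need `unsafe` code.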
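For simple cases, the allocated block can also be filled and read back without `unsafe`, using the Marshal.Copy overloads. The following is a minimal sketch under that assumption; `HGlobalDemo` and `RoundTrip` are hypothetical names, not part of the article's download.

```csharp
using System;
using System.Runtime.InteropServices;

static class HGlobalDemo
{
    // Copies a managed array into unmanaged memory and back again,
    // without any unsafe code.
    public static byte[] RoundTrip(byte[] source)
    {
        IntPtr p = Marshal.AllocHGlobal(source.Length);
        try
        {
            Marshal.Copy(source, 0, p, source.Length);   // managed -> unmanaged
            byte[] result = new byte[source.Length];
            Marshal.Copy(p, result, 0, result.Length);   // unmanaged -> managed
            return result;
        }
        finally
        {
            Marshal.FreeHGlobal(p); // unmanaged memory is never garbage collected
        }
    }
}
```

The try/finally matters here: unlike a byte[], the block leaks if an exception skips the FreeHGlobal call.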

Marshal.AllocCoTaskMem()

Allocates a memory block of the specified size from the COM task memory allocator.

IntPtr p = Marshal.AllocCoTaskMem(bufsize);
Marshal.FreeCoTaskMem(p);

As with Marshal.AllocHGlobal(), you will usually need `unsafe` code to access the allocated memory.
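Single bytes in such a block can be accessed through Marshal.WriteByte and Marshal.ReadByte without `unsafe`, at the price of one call per byte. A small sketch, with the hypothetical names `CoTaskMemDemo` and `WriteThenRead`:

```csharp
using System;
using System.Runtime.InteropServices;

static class CoTaskMemDemo
{
    // Allocates size bytes, writes one byte at offset and reads it back.
    public static byte WriteThenRead(int size, int offset, byte value)
    {
        if (offset < 0 || offset >= size)
            throw new ArgumentOutOfRangeException("offset");

        IntPtr p = Marshal.AllocCoTaskMem(size);
        try
        {
            Marshal.WriteByte(p, offset, value);
            return Marshal.ReadByte(p, offset);
        }
        finally
        {
            Marshal.FreeCoTaskMem(p); // must be paired with AllocCoTaskMem
        }
    }
}
```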

CreateFileMapping()

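To use shared memory in a managed code project, I wrote a small helper class around the CreateFileMapping() function.

Using the shared memory is very simple:

using (SharedMemory mem = new SharedMemory("abc", bufsize, true))
{
    // use mem
}

mem exposes a void* to the buffer and a length property. From within another process, you can access the same memory by passing false to the constructor (together with the same name).

SharedMemory uses `unsafe`.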
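The author's SharedMemory class is part of the download and is not reproduced here. As a rough stand-in, newer versions of .NET (4.0 and later) ship System.IO.MemoryMappedFiles, which offers a similar create/open pattern without P/Invoke. The sketch below is an assumption about how the same idea could look with that API, not the article's implementation; `SharedMemorySketch`, `Open` and `WriteThenRead` are hypothetical names.

```csharp
using System;
using System.IO.MemoryMappedFiles;

static class SharedMemorySketch
{
    // create == true  : create a new mapping (name may be null for a private one)
    // create == false : open a mapping another process created under the same name
    // Named mappings are a Windows feature; other platforms may not support them.
    public static MemoryMappedFile Open(string name, long size, bool create)
    {
        return create
            ? MemoryMappedFile.CreateNew(name, size)
            : MemoryMappedFile.OpenExisting(name);
    }

    // Writes one byte at position 0 and reads it back through a view accessor.
    public static byte WriteThenRead(string name, long size, byte value)
    {
        using (MemoryMappedFile map = Open(name, size, true))
        using (MemoryMappedViewAccessor view = map.CreateViewAccessor(0, size))
        {
            view.Write(0, value);
            return view.ReadByte(0);
        }
    }
}
```

On Windows, one process would call `Open("abc", bufsize, true)` and a second one `Open("abc", bufsize, false)`, mirroring the true/false constructor flag described above.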

stackalloc byte[]

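stackalloc allocates a block of bytes on the stack (the result is a byte*, not a managed byte[]). The block is therefore released when you return from the current method. If you are not careful with your stack usage, you may cause a stack overflow.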

unsafe static void stack()
{
    byte* buf = stackalloc byte[bufsize];
}

Using stackalloc also requires `unsafe`.
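Language versions released after this article (C# 7.2 and later, together with the Span<T> type) also allow stackalloc in safe code when the result is assigned to a Span<byte>. The sketch below assumes such a compiler and runtime; `StackAllocDemo` and `SumSmallBuffer` are hypothetical names, and the buffer is kept deliberately small to stay well within the stack.

```csharp
using System;

static class StackAllocDemo
{
    // Fills a small stack buffer with 0..255 and returns the sum of its bytes.
    public static int SumSmallBuffer()
    {
        Span<byte> buf = stackalloc byte[256]; // no unsafe needed with Span<byte>
        for (int i = 0; i < buf.Length; i++)
        {
            buf[i] = (byte)i;
        }

        int sum = 0;
        foreach (byte b in buf)
        {
            sum += b;
        }
        return sum; // 0 + 1 + ... + 255 = 32640
    }
}
```

The span is bounds-checked, which is a different trade-off from the raw byte* measured in this article.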

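Test Results

I do not want to discuss single-core/multi-core, NUMA/non-NUMA architectures, and so on. So I only print a few interesting results. Feel free to run the tests on your own machine!

Debug/Release

Running the test in Debug and in Release mode gives hugely different loop counts within the 10 seconds.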

Release

new byte[]:            425340907   100%
Marshal.AllocHGlobal:   19680751     5%
Marshal.AllocCoTaskMem: 21062645     5%
stackalloc:            341525631    80%
SharedMemory:             792007   0.2%

Debug

new byte[]:                71004   0.3%
Marshal.AllocHGlobal:   22660829    89%
Marshal.AllocCoTaskMem: 25557756   100%
stackalloc:               558497     2%
SharedMemory:             785470     3%

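As you can see, new byte[] and stackalloc byte[] depend heavily on the Debug/Release switch, while the other three methods do not. This is probably because those three are mostly handled by the kernel.

new byte[] and stackalloc byte[] are the fastest of the five in Release mode and the slowest in Debug mode. But keep in mind that the garbage collector also has to deal with every new byte[].

PC/Server

The two runs above were done on my PC (Intel dual core, Vista 64-bit). So let's compare that with a typical server (dual Xeon quad core, Windows Server 2008 64-bit), both in Release mode.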

                         Server      Workstation
new byte[]:            553541729      425340907   
Marshal.AllocHGlobal:   26460746       19680751 
Marshal.AllocCoTaskMem: 28294494       21062645 
stackalloc:            466980755      341525631    
SharedMemory:             817317         792007

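Because the test is single-threaded, the number of cores does not matter. Keep in mind that the garbage collector has its own thread.

32-bit/64-bit

Let's compare 32-bit and 64-bit (Release mode).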

                         x86-32bit   x64-64bit
new byte[]:             1046577767   516441931
Marshal.AllocHGlobal:     21034715    25152330
Marshal.AllocCoTaskMem:   23467574    27787971
stackalloc:               83956017   416630753
SharedMemory:               728858      793750

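Marshal.* and SharedMemory are slightly faster on x64. new byte[] is about twice as fast on x86 as on x64, and stackalloc byte[] is about five times as fast on x64 as on x86. I did not expect that!
I get the same picture on my server.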
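Since the results differ so much between the two targets, it helps to log which one a test run actually used. A minimal sketch, assuming .NET 4.0 or later for Environment.Is64BitProcess; `Bitness` and `Describe` are hypothetical names.

```csharp
using System;

static class Bitness
{
    // Describes whether the current process runs as 32-bit or 64-bit code.
    public static string Describe()
    {
        // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process.
        return "Pointer size: " + IntPtr.Size + " bytes, 64-bit process: "
            + Environment.Is64BitProcess;
    }
}
```

Printing `Bitness.Describe()` at the start of each run makes it clear whether an "Any CPU" build ended up as x86 or x64 code.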

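Conclusion

So think twice before you decide on an allocation method and a target platform!

MemCopy

Now let's look at some memory copy (memcopy) variants. I measure them with the same algorithm: one thread copies one byte[] to another in a loop for 10 seconds, and I count the number of copies.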

Array.Copy()
Marshal.Copy()
Kernel32.dll CopyMemory()
Buffer.BlockCopy()
MemCopyInt()
MemCopyLong()
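The managed entries in this list can be called as sketched below. How exactly the article's benchmark invokes each one is not shown in the text, so treat the wrappers (`CopyDemo`, `WithArrayCopy`, `WithBlockCopy`, `WithMarshalCopy`) as illustrative assumptions; the Marshal.Copy variant pins the destination array to obtain an IntPtr. The kernel32 CopyMemory P/Invoke is left out here.

```csharp
using System;
using System.Runtime.InteropServices;

static class CopyDemo
{
    public static byte[] WithArrayCopy(byte[] src)
    {
        byte[] dst = new byte[src.Length];
        Array.Copy(src, dst, src.Length);            // length counted in elements
        return dst;
    }

    public static byte[] WithBlockCopy(byte[] src)
    {
        byte[] dst = new byte[src.Length];
        Buffer.BlockCopy(src, 0, dst, 0, src.Length); // offsets and count in bytes
        return dst;
    }

    public static byte[] WithMarshalCopy(byte[] src)
    {
        byte[] dst = new byte[src.Length];
        // Pin dst so the garbage collector cannot move it while we hold its address.
        GCHandle handle = GCHandle.Alloc(dst, GCHandleType.Pinned);
        try
        {
            Marshal.Copy(src, 0, handle.AddrOfPinnedObject(), src.Length);
        }
        finally
        {
            handle.Free();
        }
        return dst;
    }
}
```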

Release/Debug

                                  Release    Debug
Array.Copy:                       360741    361740
Marshal.Copy:                     360680    359712
Kernel32NativeMethods.CopyMemory: 361314    358927
Buffer.BlockCopy:                 375440    374004
OwnMemCopyInt:                    217736     33833
OwnMemCopyLong:                   295372     54601

As expected, only my own MemCopy variants are much slower in Debug mode. Let's have a look at my own MemCopy:

static readonly int sizeOfInt = Marshal.SizeOf(typeof(int));
static public unsafe void MemCopy(IntPtr pSource, IntPtr pDest, int Len)
{
    unchecked
    {
        int size = sizeOfInt;
        int count = Len / size;
        int rest = Len % size;   // bytes left over after the 4-byte blocks
        int* ps = (int*)pSource.ToPointer(), pd = (int*)pDest.ToPointer();
        // Loop over the cnt in blocks of 4 bytes, 
        // copying an integer (4 bytes) at a time:
        for (int n = 0; n < count; n++)
        {
            *pd = *ps;
            pd++;
            ps++;
        }
        // Complete the copy by moving any bytes that weren't moved in blocks of 4:
        if (rest > 0)
        {
            byte* ps1 = (byte*)ps;
            byte* pd1 = (byte*)pd;
            for (int n = 0; n < rest; n++)
            {
                *pd1 = *ps1;
                pd1++;
                ps1++;
            }
        }
    }
}

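Even with `unchecked unsafe` code, the built-in copy functions are faster than a hand-written copy loop. In Release mode, OwnMemCopyInt reaches about 60% of the Array.Copy count; in Debug mode, it drops to under 10%.

32-bit/64-bit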
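The listing above is the int-based variant; the long-based one from the results table is not shown in the text. The following is a guess at how an 8-byte-at-a-time copy could look, written against byte[] arguments with `fixed` instead of IntPtr. It has to be compiled with unsafe code enabled, and the names `OwnCopy` and `MemCopyLong` as well as the argument checks are assumptions.

```csharp
using System;

static class OwnCopy
{
    // Copies len bytes from src to dst, 8 bytes at a time, then the remaining tail.
    public static unsafe void MemCopyLong(byte[] src, byte[] dst, int len)
    {
        if (src == null || dst == null) throw new ArgumentNullException();
        if (len < 0 || len > src.Length || len > dst.Length)
            throw new ArgumentOutOfRangeException("len");

        fixed (byte* pSrc = src, pDst = dst)
        {
            int count = len / sizeof(long);  // number of full 8-byte blocks
            int rest = len % sizeof(long);   // leftover bytes (0..7)

            long* ps = (long*)pSrc;
            long* pd = (long*)pDst;
            for (int n = 0; n < count; n++)
            {
                *pd++ = *ps++;
            }

            // Finish with the bytes that did not fill a whole block.
            byte* ps1 = (byte*)ps;
            byte* pd1 = (byte*)pd;
            for (int n = 0; n < rest; n++)
            {
                *pd1++ = *ps1++;
            }
        }
    }
}
```

The remainder is computed from the block size, not from the block count, so lengths that are not a multiple of 8 (and lengths below 8) are handled by the tail loop.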

                                   32Bit    64Bit
Array.Copy:                       230788   360741    
Marshal.Copy:                     460061   360680
Kernel32NativeMethods.CopyMemory: 365850   361314
Buffer.BlockCopy:                 368212   375440 
OwnMemCopyInt:                    218438   217736  
OwnMemCopyLong:                   286321   295372

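In 32-bit x86 code, Marshal.Copy is noticeably faster than in 64-bit code, while Array.Copy is much slower in 32-bit than in 64-bit. My own memcopy loop works on 32-bit integers, so its speed is the same in both. The kernel method is not affected by this setting.

Conclusion

Using the built-in memcopy functions is a good idea.

Points of Interest

Try the source code on your machine and compare the results.

History

9 Feb 2009: Initial release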