比 Zip 算法更好的压缩内存中数据算法

Andrey Dryazgov

4.08/5 (42投票s)

2006年5月24日

CPOL

6分钟阅读

187740

5195

关于 .NET 内存数据压缩引擎的文章

引言

有时您可能需要在内存中压缩数据而不创建磁盘文件。例如，可以压缩 XML 文件 CDATA 部分中的二进制数据。此外，您可能希望压缩通过低带宽连接在客户端和服务器计算机之间传输的数据。

建议的类实现了原始压缩算法。此方法基于“DEFLATE 压缩数据格式规范”(RFC 1951) 并进行了一些增强。其效果比常用的 zip 压缩略好。该算法在 Centrino 和其他具有 2 MB L2 缓存的平台上尤其有效。上图显示了在配备 Pentium M 处理器(1.8 GHz) 的计算机上压缩/解压缩 mscorlib.xml 文件的结果。在配备 AMD Athlon 2100+ 处理器(256 KB L2 缓存) 的计算机上运行相同的测试，压缩时间为 721 毫秒，解压缩时间为 40 毫秒。有趣的是，Intel 和 AMD 平台在不同操作上的性能差异如此明显(此处 Pentium M 上的 110 毫秒数据相当不准确，但由于精度很大)。

您可以将此压缩引擎与 .NET 2.0 中添加的标准 DeflateStream 压缩器进行比较。为此，请打开演示项目(Form1.cs) 的源代码，然后注释/取消注释几个代码块。它们在源代码中已标记。然后，尝试使用各种数据文件运行演示项目，并将结果与 MemCompressor 引擎进行比较。您会发现，标准的 DeflateStream 引擎相当糟糕。

使用 MemCompressor

MemCompressor 程序集包含两个主要类：AcedDeflator 和 AcedInflator。AcedDeflator 类有一个 Compress 方法。此方法接受一个字节数组，并返回一个包含相同数据的压缩形式的新字节数组。例如：

private int _originalSize;
private byte[] _originalData;

private int _originalChecksum;
private byte[] _compData;
private int _compSize;

private void CompressData()
{
    _originalChecksum = AcedUtils.Adler32(_originalData, 0, _originalSize);
    _compData = AcedDeflator.Instance.Compress(_originalData, 0, _originalSize,
        AcedCompressionLevel.Normal, 0, 0);
    _compSize = _compData.Length;
}

AcedInflator 类的工作方式完全相反。它有一个 Decompress 方法，该方法接受一个压缩数据的二进制数组。此方法将原始数据恢复到输出数组中。

private byte[] _decompData;
private int _decompSize;
private int _decompChecksum;

private void DecompressData()
{
    _decompData = AcedInflator.Instance.Decompress(_compData, 0, 0, 0);
    _decompSize = _decompData.Length;
    _decompChecksum = AcedUtils.Adler32(_decompData, 0, _decompSize);
    Debug.Assert(_decompChecksum == _originalChecksum);
}

AcedDeflator.Compress 方法除其他参数外，还接受 compressionLevel 参数。此参数用于选择速度和压缩率之间的平衡。您可以在此处传递以下值之一：

Store (简单地将数据复制到输出数组而不进行压缩)
Fastest (压缩率较低，但速度最快)
Fast (对冗余度较低的数据最佳)
Normal (对文本等冗余度较高的数据是更好的选择)
Maximum (速度较慢，但压缩率最高)

AcedDeflator 类

public class AcedDeflator

提供使用 LZ77 算法和霍夫曼编码相结合压缩二进制数据的方法，类似于 RFC 1951 中描述的方法。在多线程应用程序中，您应该同步对此类方法的调用。

public AcedDeflator()

创建 AcedDeflator 类的新实例。不建议直接调用此构造函数。如果可能，请使用 Instance static 属性返回的单个缓存实例 AcedDeflator。

public static AcedDeflator Instance { get; }

返回 AcedDeflator 类的缓存实例。 such an instance occupies a lot of memory (a little more than 2 MB). So, a single instance is created and cached in an internal static field of this class. This instance can be used each time when you want to compress some data. The only exception may be a multithreaded application where you compress data in several threads simultaneously. If so, you should create each instance with the regular constructor.

public byte[] Compress(byte[] sourceBytes, int sourceIndex, int sourceLength,
    AcedCompressionLevel compressionLevel, int beforeGap, int afterGap)

压缩 sourceBytes 数组中从 sourceIndex 开始、长度为 sourceLength 字节的二进制数据。压缩率取决于 compressionLevel 参数。此方法创建一个足够存储压缩数据的新字节数组。您还可以为压缩数据块之前保留 beforeGap 字节，为压缩数据块之后保留 afterGap 字节。如果您想为压缩数据提供自定义标头和/或尾部，并且不希望不必要地重新分配内存，这将非常有用。结果字节数组由 (beforeGap + Compressed_Data_Length + afterGap) 字节组成。

public int Compress(byte[] sourceBytes, int sourceIndex, int sourceLength,
    AcedCompressionLevel compressionLevel, byte[] destinationBytes,
    int destinationIndex)

此方法与前一个类似，但它不为输出数组分配内存。输出数组作为 destinationBytes 参数传递。压缩数据将从 destinationIndex 开始存储在该数组中。压缩数据的最大长度为 (sourceLength + 4) 字节。输出数组必须有足够的空间。此方法返回存储在输出数组中的字节数。如果将 null 传递给 destinationBytes 参数，则该方法会执行“虚拟”压缩。通过这种方式，您可以找出输出数据的长度。但是，“虚拟”压缩所需的时间几乎与常规压缩相同。

public static void Release()

调用此方法可释放对 AcedDeflator 类缓存实例的内部 static 引用。之后，您可以调用 GC.Collect() 来释放该缓存实例占用的内存。

AcedInflator 类

public class AcedInflator

提供使用 AcedDeflator 类压缩的二进制数据进行解压缩的方法。在多线程应用程序中，您应该同步对此类方法的调用。

public AcedInflator()

创建 AcedInflator 类的新实例。不建议直接调用此构造函数。如果可能，为了避免不必要的内存重新分配，请使用 Instance static 属性返回的单个缓存实例 AcedInflator。

public static AcedInflator Instance { get; }

返回 AcedInflator 类的缓存实例。当您首次访问此属性时，会创建 such an instance. Then, the instance is cached in an internal static field of this class. You can use it to decompress any data in a single-threaded application. In a multithreaded application, please create an AcedInflator for each the thread with the regular constructor.

public static int GetDecompressedLength(byte[] sourceBytes, int sourceIndex)

返回指定的压缩二进制数组(sourceBytes) 中用于存储解压缩数据的二进制数组的最小长度(以字节为单位)。sourceIndex 参数指定压缩数据开始的 sourceBytes 数组中的起始索引。

public byte[] Decompress(byte[] sourceBytes, int sourceIndex,
    int beforeGap, int afterGap)

解压缩 sourceBytes 数组中从 sourceIndex 开始的二进制数据。此方法创建一个足够存储解压缩数据的新字节数组。您还可以为解压缩数据块之前保留 beforeGap 字节，为解压缩数据块之后保留 afterGap 字节。如果您想为解压缩数据提供自定义标头和/或尾部，并且不希望不必要地重新分配内存，这将非常有用。结果数组将由 (beforeGap + Decompressed_Data_Length + afterGap) 字节组成。

public int Decompress(byte[] sourceBytes, int sourceIndex,
    byte[] destinationBytes, int destinationIndex)

与前一个方法相同，但必须在调用此方法之前为输出数组分配内存。您应该将目标数组传递给 destinationBytes 参数。destinationIndex 指定该数组中解压缩数据将开始存储的偏移量。输出数组必须有足够的空间来存储所有数据。要获取解压缩数据的大小，请调用此类中的 GetDecompressedLength 方法。Decompress 方法返回存储在输出(destinationBytes) 数组中的字节数。如果您将 null 传递给 destinationBytes 参数，则此方法不执行任何操作，仅返回解压缩数据的大小(以字节为单位)。

public static void Release()

调用此方法可释放对 AcedInflator 类缓存实例的内部 static 引用。之后，您可以调用 GC.Collect() 来释放该缓存实例占用的内存。

AcedUtils 类

public class AcedUtils

此类有一个 public static 方法来计算 Adler-32 校验和。

public static int Adler32(byte[] bytes, int offset, int length)

根据 RFC 1950，计算 bytes 数组中从 offset 索引开始的 length 字节的 Adler-32 校验和。

结论

当您考虑为应用程序添加压缩功能时，只需牢记这些简单的类。

历史

2006 年 5 月 23 日：初始帖子