C 语言中最慢的 LZSS 压缩器

Sanmayce

4.79/5 (5投票s)

2015 年 2 月 20 日

CPOL

6分钟阅读

37846

535

C 语言中高度优化的 LZSS 解压缩研究

引言

我分享我最新的LZSS解压缩器Nakamichi'Loonette'的愿望，是希望将来有人能在此基础上进行改进。

没有理智的人需要最慢的速度，但为了探索深度LZSS（针对英文文本）的潜力，我确实深入研究了。

背景

主要目标之一是通过超快的解压缩将I/O带宽（线性读取）提高2倍或3倍。NTFS使用某种LZSS来实现这一点。最近，我看到m.2 SSD的线性读取速度突破了1GB/s的障碍，像LzTurbo和LZ4这样的超强实现使得1024MB/s看起来像是70年代的汽车。现在，我们追求十倍的速度。

压缩方法的概览

直面ZIP的对决

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir

02/20/2015  06:00 PM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015  03:46 AM        13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015  05:37 PM        11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015  09:04 PM        11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>timer32.exe 
            7za.exe t Agatha_Christie_89-ebooks_TXT.tar.zip

7-Zip (A) 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
Processing archive: Agatha_Christie_89-ebooks_TXT.tar.zip
Testing     Agatha_Christie_89-ebooks_TXT.tar
Everything is Ok
Size:       33258496
Compressed: 11173343

Kernel  Time =     0.015 =    3%
User    Time =     0.421 =   92%
Process Time =     0.436 =   95%    Virtual  Memory =      2 MB
Global  Time =     0.457 =  100%    Physical Memory =      4 MB

战马7-zip的解压缩速度大约为33,258,496/0.436/1024/1024= 72MB/s，而Nakamichi'Loonette'的速度为256MB/s（如下文所示）。当然，ZIP看到的是64KB的缓冲区，而'Loonette'是512MB，这意味着在BIG&REDUNDANT TEXTS（大而冗余的文本）上，压缩率会从3:1提高到4:1。

如前所述，它之所以最慢，是因为只使用了粗暴的memmem()匹配查找。选择12字节或8字节的匹配长度，如果实现哈希，可以简化压缩。

选择的顺序是按大小排列的

// 12:2 = 6
// 8:2 =  4
// 12:3 = 4
// 12:4 = 3
// 8:3 =  2.6
// 8:4 =  2

第二个数字代表用于编码匹配偏移的字节数，即2、3或4。
压缩基于经典的LZSS（Lempel–Ziv–Storer–Szymanski），我借鉴了一位日本程序员（非常感谢Ito）的蓝图，并简化了编码器和解码器。
在下载区附加的存档中包含C源代码及其由最新的Intel C优化器生成的汇编对应代码。

解压缩方法的概览

我对超快的文本处理的痴迷导致了我所知的第三快的（LzTurbo和LZ4分别是第一和第二）文本解压缩器的出现。它被称为Nakamichi 'Tengu'，在i5 4690K @3.5GHz（无涡轮）下，可以解压缩1GB的英文文本，速度如下：

>Nakamichi_Tengu_XMM_PREFETCH_4096.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, 
    babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000016394080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

>lz4 -9 -Sx -b -T1 DDETT-Definitive_Decompression_English_Texts_Torture.tar

Nb of threads = 1 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 864 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 905969664 -> 315340335 ( 34.81%), 22.6 MB/s , 2116.8 MB/s

>lz4 -9 -Sx -b DDETT-Definitive_Decompression_English_Texts_Torture.tar

Nb of threads = 4 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 896 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 939524096 -> 329287352 ( 35.05%), 85.9 MB/s , 8265.6 MB/s

LZ4的作者Yann写了一个近乎完美的解压缩器，据我所知，它只使用了单线程2116/10394和多线程8265/4????的绝对最大值的1/5，太棒了！

然而，我想要说的是Nakamichi'Loonette'，它具有8KB/2MB/512MB或（16-3）位/（24-3）位/（32-3）位滑动窗口，匹配偏移为2/3/4字节。只使用了12（最多16）和8的匹配长度。

因此，这篇练习的重点是编写一个高度优化的LZSS解压缩函数，它能够以极高的速度提取压缩的英文文本，而忽略当今RAM和CPU缓存的延迟（以及带宽）。抛开那些枯燥的（在速度和大小方面）内存限制，看看Loonette解压缩的简洁性吧！

unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
    char* retLOCAL = ret;
    char* srcLOCAL = src;
    char* srcEndLOCAL = src+srcSize;
    unsigned int DWORDtrio;
    while (srcLOCAL < srcEndLOCAL) {
        DWORDtrio = *(unsigned int*)srcLOCAL;
#ifndef _N_GP
#ifdef _N_prefetch_4096
        _mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
#endif
#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|L|xxxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// L = 0b means Long MatchLength, (12-L) or 12
// L = 1b means Short MatchLength, (12-L) or 8
// OO = 00b means Literal                                                                        
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO), 
// 2 bytes long i.e. Sliding Window is 2*8-L-OO=(1+OO)*8-3=13 or   8KB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO), 
// 3 bytes long i.e. Sliding Window is 3*8-L-OO=(1+OO)*8-3=21 or   2MB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO), 
// 4 bytes long i.e. Sliding Window is 4*8-L-OO=(1+OO)*8-3=29 or 512MB     
        DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
        if ( (DWORDtrio & 0x03) == 0x00 ) {                                                       
                #ifdef _N_GP
        memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16);
                #endif
                #ifdef _N_XMM
        SlowCopy128bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
                #endif
        retLOCAL+= (DWORDtrio>>3);
        srcLOCAL+= (DWORDtrio>>3)+1;
        } else {
                #ifdef _N_GP
            memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), 16);
                #endif
                #ifdef _N_XMM
            SlowCopy128bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), retLOCAL );
                #endif
        srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2
        retLOCAL+= 12-(DWORDtrio&0x04);
        }
    }        
    return (unsigned int)(retLOCAL - ret);
}

/*
; 'Loonette' decompression loop, 76-14+2=100 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications 
; running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";

.B7.3::                         
  00014 41 0f 18 8a 00 
        10 00 00         prefetcht0 BYTE PTR [4096+r10]         
  0001c b8 ff ff ff ff   mov eax, -1                            
  00021 41 8b 12         mov edx, DWORD PTR [r10]               
  00024 89 d1            mov ecx, edx                           
  00026 83 f1 03         xor ecx, 3                             
  00029 c1 e1 03         shl ecx, 3                             
  0002c d3 e8            shr eax, cl                            
  0002e 23 d0            and edx, eax                           
  00030 89 d0            mov eax, edx                           
  00032 83 e0 03         and eax, 3                             
  00035 75 18            jne .B7.5 
.B7.4::                         
  00037 f3 41 0f 6f 42 
        01               movdqu xmm0, XMMWORD PTR [1+r10]       
  0003d c1 ea 03         shr edx, 3                             
  00040 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  00045 4c 03 ca         add r9, rdx                            
  00048 ff c2            inc edx                                
  0004a 4c 03 d2         add r10, rdx                           
  0004d eb 24            jmp .B7.6 
.B7.5::                         
  0004f 89 d1            mov ecx, edx                           
  00051 83 e2 04         and edx, 4                             
  00054 c1 e9 03         shr ecx, 3                             
  00057 f7 da            neg edx                                
  00059 ff c0            inc eax                                
  0005b 48 f7 d9         neg rcx                                
  0005e 83 c2 0c         add edx, 12                            
  00061 49 03 c9         add rcx, r9                            
  00064 4c 03 d0         add r10, rax                           
  00067 f3 0f 6f 01      movdqu xmm0, XMMWORD PTR [rcx]         
  0006b f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  00070 4c 03 ca         add r9, rdx                            
.B7.6::                         
  00073 4d 3b d0         cmp r10, r8                            
  00076 72 9c            jb .B7.3 
*/

一百字节的美！

与LZ4（在我笔记本Core 2 Q9550s @2833MHz上）的快速对决，看看现有的CPU-RAM子系统有多么拥挤！

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15
>Nakamichi_Loonette_XMM_PREFETCH_4096.exe
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source, 
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Usage: Nakamichi filename

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>Nakamichi_Loonette_XMM_PREFETCH_4096.exe 
Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source, 
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 4220
NumberOf(Short)Matches[Short]Window  (8): 412312
NumberOf(Short)Matches[Medium]Window (8): 848071
NumberOf(Short)Matches[Big]Window    (8): 353660
NumberOf(Long)Matches[Short]Window   (12): 191049
NumberOf(Long)Matches[Medium]Window  (12): 794178
NumberOf(Long)Matches[Big]Window     (12): 595339
RAM-to-RAM performance: 1839 bytes/s.

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir

02/20/2015  06:09 AM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015  03:46 AM        13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015  05:37 PM        11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015  03:06 AM            86,636 Nakamichi_Loonette.c
02/20/2015  03:07 AM           445,206 Nakamichi_Loonette_XMM_PREFETCH_4096.cod
02/20/2015  12:21 PM           117,248 Nakamichi_Loonette_XMM_PREFETCH_4096.exe

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>
Nakamichi_Loonette_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source, 
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 11745883 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000001040080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 192239 clocks or 2.727MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%

D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>lz4 -9 -Sx -b -T1  
Agatha_Christie_89-ebooks_TXT.tar
Nb of threads = 1 ; Compression Level = 9
Agatha_Christie :  33258496 ->  13365090 ( 40.19%),   12.7 MB/s ,  909.2 MB/s

Nakamichi'Loonette'有自己的主页：http://www.sanmayce.com/Hayabusa/。

关注点

希望有人和我一样热衷于改进它，这就是游戏的名称——'追求卓越。'

法国艺术家Rachelle Bartel，又名Lillycat，创作了这幅充满灵魂的艺术品。

2015年2月23日的补充

很高兴分享一个更精炼的'Loonette'，名为'Loonette-Hoshimi'。

鉴于阿加莎·克里斯蒂是英文散文女王，她完整的作品自然是英文文本压缩的最佳数据集。

《吉尼斯世界纪录》将克里斯蒂列为有史以来最畅销的小说家。她的小说销量约为20亿册.../维基百科/

D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 171
Legend: MatchLengths: 4=Tiny, 8=Short, 12=Medium, 16=Long; WindowSizes: 2/3/4=Short/Medium/Long
NumberOf(Tiny)Matches[Short]Window (4)[2]: 325551
NumberOf(Short)Matches[Short]Window (8)[2]: 240654
NumberOf(Medium)Matches[Short]Window (12)[2]: 80607
NumberOf(Long)Matches[Short]Window (16)[2]: 52966
NumberOf(Tiny)Matches[Medium]Window (4)[3]: 382909
NumberOf(Short)Matches[Medium]Window (8)[3]: 677654
NumberOf(Medium)Matches[Medium]Window (12)[3]: 440648
NumberOf(Long)Matches[Medium]Window (16)[3]: 239806
NumberOf(Short)Matches[Long]Window (8)[4]: 272624
NumberOf(Medium)Matches[Long]Window (12)[4]: 461732
NumberOf(Long)Matches[Long]Window (16)[4]: 266965
RAM-to-RAM performance: 1249 bytes/s.

D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 10854012 bytes ...
RAM-to-RAM performance: 320 MB/s.
Memory pool starting address: 0000000000E60080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 194229 clocks or 2.699MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 11%

D:\Nakamichi_Loonette_Intel15_Hoshimi>7za.exe a -tzip -mx9 Agatha_Christie_89-ebooks_TXT.tar.zip Agatha_Christie_89-ebooks_TXT.tar
D:\Nakamichi_Loonette_Intel15_Hoshimi>tangelo_32bit.exe c Agatha_Christie_89-ebooks_TXT.tar Agatha_Christie_89-ebooks_TXT.tar.tangelo
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method0 Agatha_Christie_89-ebooks_TXT.tar -method 0
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method1 Agatha_Christie_89-ebooks_TXT.tar -method 1
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method2 Agatha_Christie_89-ebooks_TXT.tar -method 2
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method3 Agatha_Christie_89-ebooks_TXT.tar -method 3
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method4 Agatha_Christie_89-ebooks_TXT.tar -method 4
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method5 Agatha_Christie_89-ebooks_TXT.tar -method 5

D:\Nakamichi_Loonette_Intel15_Hoshimi>dir

02/23/2015  03:33 PM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/23/2015  03:52 PM        33,277,230 Agatha_Christie_89-ebooks_TXT.tar.method0.zpaq
02/23/2015  03:52 PM        11,709,902 Agatha_Christie_89-ebooks_TXT.tar.method1.zpaq
02/23/2015  03:52 PM         9,557,649 Agatha_Christie_89-ebooks_TXT.tar.method2.zpaq
02/23/2015  03:53 PM         6,462,103 Agatha_Christie_89-ebooks_TXT.tar.method3.zpaq
02/23/2015  03:54 PM         6,387,670 Agatha_Christie_89-ebooks_TXT.tar.method4.zpaq
02/23/2015  03:56 PM         6,036,509 Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq
02/23/2015  03:03 PM        10,854,012 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/23/2015  03:10 PM         6,475,761 Agatha_Christie_89-ebooks_TXT.tar.tangelo
02/23/2015  03:12 PM        11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip
02/23/2015  05:51 AM            99,469 Nakamichi_Loonette_Hoshimi.c
02/23/2015  05:51 AM           771,399 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.cod
02/23/2015  05:51 AM           118,784 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe

D:\Nakamichi_Loonette_Intel15_Hoshimi>

在提高压缩率的同时，我幸运地提高了解压缩速度并减少了2字节的代码。

; Nakamichi 'Loonette-Hoshimi' decompression loop, 74-14+2=98 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";

.B7.3::                         
  00014 41 0f 18 8a 00 
        10 00 00         prefetcht0 BYTE PTR [4096+r10]         
  0001c b8 ff ff ff ff   mov eax, -1                            
  00021 41 8b 12         mov edx, DWORD PTR [r10]               
  00024 89 d1            mov ecx, edx                           
  00026 83 f1 03         xor ecx, 3                             
  00029 c1 e1 03         shl ecx, 3                             
  0002c d3 e8            shr eax, cl                            
  0002e 23 d0            and edx, eax                           
  00030 89 d0            mov eax, edx                           
  00032 83 e0 03         and eax, 3                             
  00035 75 18            jne .B7.5 
.B7.4::                         
  00037 f3 41 0f 6f 42 
        01               movdqu xmm0, XMMWORD PTR [1+r10]       
  0003d c1 ea 04         shr edx, 4                             
  00040 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  00045 4c 03 ca         add r9, rdx                            
  00048 ff c2            inc edx                                
  0004a 4c 03 d2         add r10, rdx                           
  0004d eb 22            jmp .B7.6 
.B7.5::                         
  0004f 89 d1            mov ecx, edx                           
  00051 83 e2 0c         and edx, 12                            
  00054 c1 e9 04         shr ecx, 4                             
  00057 ff c0            inc eax                                
  00059 83 c2 04         add edx, 4                             
  0005c 48 f7 d9         neg rcx                                
  0005f 49 03 c9         add rcx, r9                            
  00062 4c 03 d0         add r10, rax                           
  00065 f3 0f 6f 01      movdqu xmm0, XMMWORD PTR [rcx]         
  00069 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  0006e 4c 03 ca         add r9, rdx                            
.B7.6::                         
  00071 4d 3b d0         cmp r10, r8                            
  00074 72 9e            jb .B7.3

XMM传输在Core 2上速度很慢，它们是不对齐的，并且惩罚很高，尤其是在访问RAM时。因此，在Intel第三/四/五代上，Nakamichi 'Loonette-Hoshimi'将轻松超过1GB/s。下面我将尝试将Nakamichi 'Loonette-Hoshimi'与一款压缩巨头放在一起，或者说进行并列比较。

关于Dr. Matt Mahoney撰写的、已进入公有领域的出色的ZPAQ方法的说明

方法0不仅仅是tar，它将文件连接在一起，中间有一些头。它仍然是zpaq存档格式，其中文件被分片并去重，然后打包成块。这些块以非上下文建模格式存储，这意味着它们被分成更小的块（64K），并带有4字节的头来指示其大小。有单独的块存储碎片SHA-1哈希和文件名。因此，与tar相比，它有更多的开销，但如果存在重复文件或碎片，有时也会节省一些空间。

方法1使用LZ77，通过用指向先前出现的指针替换重复的字符串来压缩。

方法2相同，但花费更多时间寻找更好的匹配（使用后缀数组而不是哈希表）。

方法3使用BWT（上下文排序）或LZ77进行长匹配，以及一个一阶上下文模型和算术编码来处理文字，具体取决于文件类型。

方法4使用LZ77、BWT或高阶上下文模型。

方法5使用一种复杂的、高阶的上下文混合模型，具有超过20个预测组件。

此外，Dr. Mahoney写了一个非常可爱的工具，名为'fv'，它能绘制出给定文件的结构图，信息量非常大。它可以让你看到字符串重复的热点：图例（如何解读图像）

以下图表显示了长度为1（黑色）、2（红色）、4（绿色）和8（蓝色）的字符串匹配的分布。横轴表示文件中的位置。纵轴以对数刻度显示到上一个匹配的后向距离。向上阅读的主刻度线分别为1、10、100、1000等。

我将尝试解释3个不同的文件——原始文件、Hoshimi存档和最强的ZPAQ存档。

第一个'fv-Agatha_Christie_89-ebooks_TXT.tar.bmp'

顶部的蓝色条带表明，长度为8的匹配通常相隔至少10^4字节（4个主刻度线）直到文件的整个长度。
更准确地说，是3x10^4=30,000。
红色条带表明，长度为2的匹配相隔大约4x10到4x100字节。

第二个'fv-Agatha_Christie_89-ebooks_TXT.tar.Nakamichi.bmp'

一个好的压缩器应该能从图片中消除所有蓝色和绿色，但在这里，'Hoshimi'未能压缩一些模糊的绿色（长度为4字节的匹配）在1x10^6和1x10^7区域，也就是说，存在一些未压缩的4字节块，它们之间相隔1MB到10MB。
红色条带表明，长度为2的匹配相隔大约2x10^3到5x10^5字节。
没有颜色的垂直条带是'问题'，在文件的这些部分，压缩器被某些东西蒙蔽了，看不见任何相关性。

第三个'fv-Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq.bmp'

红色条带表明，长度为2的匹配相隔大约3x10^3到3x10^5字节。

（星見 - hoshimi）的字面翻译是“看星星”，或者更诗意地说“观星”。

花見（花見，字面意思是“赏花”）是日本传统的欣赏花朵短暂之美的习俗，……

在佛教中，月見（月見 - tsukimi）也很重要，还有一种瑜伽分支叫做Surya Yoga（拜日式瑜伽？-？）。
如果灵感来了，下一个变种将命名为'Tsukimi'。

2015年3月23日的补充

我不喜欢'if-else'分支，所以'Loonette-Hoshimi'变成了无分支，遗憾的是代码大小从98字节增加到125字节。至于解压缩速度，Q9550s的表现令人失望（请参阅下面的'enwik8'测试），然而，我并不在意这些惩罚，这个练习仍然很好。

unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
	char* retLOCAL = ret;
	char* srcLOCAL = src;
	char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	while (srcLOCAL < srcEndLOCAL) {
		DWORDtrio = *(unsigned int*)srcLOCAL;
//#ifndef _N_GP
//#ifdef _N_prefetch_4096
//		_mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
//#endif
//#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|LL|xxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// L = 00b means 04 MatchLength, (1+LL)<<2)
// L = 01b means 08 MatchLength, (1+LL)<<2)
// L = 10b means 12 MatchLength, (1+LL)<<2)
// L = 11b means 16 MatchLength, (1+LL)<<2)
// OO = 00b means Literal                                                                        
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO), 2 bytes long i.e. Sliding Window is 2*8-LL-OO=(1+OO)*8-4=12 or   4KB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO), 3 bytes long i.e. Sliding Window is 3*8-LL-OO=(1+OO)*8-4=20 or   1MB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO), 4 bytes long i.e. Sliding Window is 4*8-LL-OO=(1+OO)*8-4=28 or 256MB     

/*
// Branchfull [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		if ( (DWORDtrio & 0x03) == 0x00 ) {                                                       
				#ifdef _N_GP
		memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16);
				#endif
				#ifdef _N_XMM
		SlowCopy128bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
				#endif
		retLOCAL+= (DWORDtrio>>4);
		srcLOCAL+= (DWORDtrio>>4)+1;
		} else {
				#ifdef _N_GP
			memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), 16);
				#endif
				#ifdef _N_XMM
			SlowCopy128bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), retLOCAL );
				#endif
		srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2
		retLOCAL+= (1+((DWORDtrio>>2)&0x03))<<2; // 4/8/12/16
		}
// Branchfull ]
*/
// Branchless [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		Flag=!(DWORDtrio & 0x03);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		SlowCopy128bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
				#endif
				#ifdef _N_GP
//		memcpy(retLOCAL, (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), 16);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16);
				#endif
		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)((1+((DWORDtrio>>2)&0x03))<<2)&FlagMASKnegated) ; 
		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
	}        
	return (unsigned int)(retLOCAL - ret);
}

/*
; 'Nakamichi_Loonette-Hoshimi_branchless' decompression loop, bb-40+2=125 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";

.B7.3::                         
  00040 44 8b 32         mov r14d, DWORD PTR [rdx]              
  00043 44 89 f1         mov ecx, r14d                          
  00046 83 f1 03         xor ecx, 3                             
  00049 41 bf ff ff ff 
        ff               mov r15d, -1                           
  0004f c1 e1 03         shl ecx, 3                             
  00052 33 ff            xor edi, edi                           
  00054 41 d3 ef         shr r15d, cl                           
  00057 45 23 f7         and r14d, r15d                         
  0005a 4c 89 c9         mov rcx, r9                            
  0005d 45 89 f5         mov r13d, r14d                         
  00060 44 89 f5         mov ebp, r14d                          
  00063 41 83 e5 03      and r13d, 3                            
  00067 49 89 d7         mov r15, rdx                           
  0006a 0f 44 f8         cmove edi, eax                         
  0006d 41 83 e6 0c      and r14d, 12                           
  00071 c1 ed 04         shr ebp, 4                             
  00074 ff cf            dec edi                                
  00076 41 ff c5         inc r13d                               
  00079 41 89 ec         mov r12d, ebp                          
  0007c 49 89 fb         mov r11, rdi                           
  0007f 49 2b cc         sub rcx, r12                           
  00082 49 f7 d3         not r11                                
  00085 48 ff c9         dec rcx                                
  00088 4d 23 fb         and r15, r11                           
  0008b 48 23 cf         and rcx, rdi                           
  0008e ff c5            inc ebp                                
  00090 41 83 c6 04      add r14d, 4                            
  00094 4d 23 e3         and r12, r11                           
  00097 49 23 eb         and rbp, r11                           
  0009a 4c 23 ef         and r13, rdi                           
  0009d f3 42 0f 6f 44 
        39 01            movdqu xmm0, XMMWORD PTR [1+rcx+r15]   
  000a4 4c 23 f7         and r14, rdi                           
  000a7 49 03 ed         add rbp, r13                           
  000aa 4d 03 e6         add r12, r14                           
  000ad 48 03 d5         add rdx, rbp                           
  000b0 f3 41 0f 7f 01   movdqu XMMWORD PTR [r9], xmm0          
  000b5 4d 03 cc         add r9, r12                            
  000b8 49 3b d0         cmp rdx, r8                            
  000bb 72 83            jb .B7.3 
*/

在core 2 Q9550s上进行的'enwik8'测试

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette-Hoshimi_branchless.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 192 MB/s.
Memory pool starting address: 0000000002620080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193784 clocks or 2.706MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 7%

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2  enwik8

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000002710080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193453 clocks or 2.710MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2  enwik8

D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>

我相信我无法进一步优化它了。

历史

第一个Nakamichi是在一年前编写的，截至2015年2月20日，'Loonette'是最新的、最精炼的。