C 语言中最慢的 LZSS 压缩器
C 语言中高度优化的 LZSS 解压缩研究
- 下载 Nakamichi_Loonette-Hoshimi_branchless.zip - 3.3 MB
- 下载 Nakamichi_Loonette_Intel15_Hoshimi.zip - 179.2 KB
引言
我分享我最新的LZSS解压缩器Nakamichi'Loonette'的愿望,是希望将来有人能在此基础上进行改进。
没有理智的人需要最慢的速度,但为了探索深度LZSS(针对英文文本)的潜力,我确实深入研究了。
背景
主要目标之一是通过超快的解压缩将I/O带宽(线性读取)提高2倍或3倍。NTFS使用某种LZSS来实现这一点。最近,我看到m.2 SSD的线性读取速度突破了1GB/s的障碍,像LzTurbo和LZ4这样的超强实现使得1024MB/s看起来像是70年代的汽车。现在,我们追求十倍的速度。
压缩方法的概览
直面ZIP的对决
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir
02/20/2015 06:00 PM 33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015 03:46 AM 13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015 05:37 PM 11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015 09:04 PM 11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>timer32.exe
7za.exe t Agatha_Christie_89-ebooks_TXT.tar.zip
7-Zip (A) 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
Processing archive: Agatha_Christie_89-ebooks_TXT.tar.zip
Testing Agatha_Christie_89-ebooks_TXT.tar
Everything is Ok
Size: 33258496
Compressed: 11173343
Kernel Time = 0.015 = 3%
User Time = 0.421 = 92%
Process Time = 0.436 = 95% Virtual Memory = 2 MB
Global Time = 0.457 = 100% Physical Memory = 4 MB
战马7-zip的解压缩速度大约为33,258,496/0.436/1024/1024= 72MB/s,而Nakamichi'Loonette'的速度为256MB/s(如下文所示)。当然,ZIP看到的是64KB的缓冲区,而'Loonette'是512MB,这意味着在BIG&REDUNDANT TEXTS(大而冗余的文本)上,压缩率会从3:1提高到4:1。
如前所述,它之所以最慢,是因为只使用了粗暴的memmem()
匹配查找。选择12字节或8字节的匹配长度,如果实现哈希,可以简化压缩。
选择的顺序是按大小排列的
// 12:2 = 6
// 8:2 = 4
// 12:3 = 4
// 12:4 = 3
// 8:3 = 2.6
// 8:4 = 2
第二个数字代表用于编码匹配偏移的字节数,即2、3或4。
压缩基于经典的LZSS(Lempel–Ziv–Storer–Szymanski),我借鉴了一位日本程序员(非常感谢Ito)的蓝图,并简化了编码器和解码器。
在下载区附加的存档中包含C源代码及其由最新的Intel C优化器生成的汇编对应代码。
解压缩方法的概览
我对超快的文本处理的痴迷导致了我所知的第三快的(LzTurbo和LZ4分别是第一和第二)文本解压缩器的出现。它被称为Nakamichi 'Tengu',在i5 4690K @3.5GHz(无涡轮)下,可以解压缩1GB的英文文本,速度如下:
>Nakamichi_Tengu_XMM_PREFETCH_4096.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi
Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000016394080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%
>lz4 -9 -Sx -b -T1 DDETT-Definitive_Decompression_English_Texts_Torture.tar
Nb of threads = 1 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 864 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 905969664 -> 315340335 ( 34.81%), 22.6 MB/s , 2116.8 MB/s
>lz4 -9 -Sx -b DDETT-Definitive_Decompression_English_Texts_Torture.tar
Nb of threads = 4 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 896 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 939524096 -> 329287352 ( 35.05%), 85.9 MB/s , 8265.6 MB/s
LZ4的作者Yann写了一个近乎完美的解压缩器,据我所知,它只使用了单线程2116/10394和多线程8265/4????的绝对最大值的1/5,太棒了!
然而,我想要说的是Nakamichi'Loonette',它具有8KB/2MB/512MB或(16-3)位/(24-3)位/(32-3)位滑动窗口,匹配偏移为2/3/4字节。只使用了12(最多16)和8的匹配长度。
因此,这篇练习的重点是编写一个高度优化的LZSS解压缩函数,它能够以极高的速度提取压缩的英文文本,而忽略当今RAM和CPU缓存的延迟(以及带宽)。抛开那些枯燥的(在速度和大小方面)内存限制,看看Loonette解压缩的简洁性吧!
unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
char* retLOCAL = ret;
char* srcLOCAL = src;
char* srcEndLOCAL = src+srcSize;
unsigned int DWORDtrio;
while (srcLOCAL < srcEndLOCAL) {
DWORDtrio = *(unsigned int*)srcLOCAL;
#ifndef _N_GP
#ifdef _N_prefetch_4096
_mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
#endif
#endif
// |1stLSB |2ndLSB |3rdLSB |
// -------------------------------
// |OO|L|xxxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit 16bit] 24bit]
// L = 0b means Long MatchLength, (12-L) or 12
// L = 1b means Short MatchLength, (12-L) or 8
// OO = 00b means Literal
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO),
// 2 bytes long i.e. Sliding Window is 2*8-L-OO=(1+OO)*8-3=13 or 8KB
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO),
// 3 bytes long i.e. Sliding Window is 3*8-L-OO=(1+OO)*8-3=21 or 2MB
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO),
// 4 bytes long i.e. Sliding Window is 4*8-L-OO=(1+OO)*8-3=29 or 512MB
DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
if ( (DWORDtrio & 0x03) == 0x00 ) {
#ifdef _N_GP
memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16);
#endif
#ifdef _N_XMM
SlowCopy128bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
#endif
retLOCAL+= (DWORDtrio>>3);
srcLOCAL+= (DWORDtrio>>3)+1;
} else {
#ifdef _N_GP
memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), 16);
#endif
#ifdef _N_XMM
SlowCopy128bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), retLOCAL );
#endif
srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2
retLOCAL+= 12-(DWORDtrio&0x04);
}
}
return (unsigned int)(retLOCAL - ret);
}
/*
; 'Loonette' decompression loop, 76-14+2=100 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications
; running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";
.B7.3::
00014 41 0f 18 8a 00
10 00 00 prefetcht0 BYTE PTR [4096+r10]
0001c b8 ff ff ff ff mov eax, -1
00021 41 8b 12 mov edx, DWORD PTR [r10]
00024 89 d1 mov ecx, edx
00026 83 f1 03 xor ecx, 3
00029 c1 e1 03 shl ecx, 3
0002c d3 e8 shr eax, cl
0002e 23 d0 and edx, eax
00030 89 d0 mov eax, edx
00032 83 e0 03 and eax, 3
00035 75 18 jne .B7.5
.B7.4::
00037 f3 41 0f 6f 42
01 movdqu xmm0, XMMWORD PTR [1+r10]
0003d c1 ea 03 shr edx, 3
00040 f3 41 0f 7f 01 movdqu XMMWORD PTR [r9], xmm0
00045 4c 03 ca add r9, rdx
00048 ff c2 inc edx
0004a 4c 03 d2 add r10, rdx
0004d eb 24 jmp .B7.6
.B7.5::
0004f 89 d1 mov ecx, edx
00051 83 e2 04 and edx, 4
00054 c1 e9 03 shr ecx, 3
00057 f7 da neg edx
00059 ff c0 inc eax
0005b 48 f7 d9 neg rcx
0005e 83 c2 0c add edx, 12
00061 49 03 c9 add rcx, r9
00064 4c 03 d0 add r10, rax
00067 f3 0f 6f 01 movdqu xmm0, XMMWORD PTR [rcx]
0006b f3 41 0f 7f 01 movdqu XMMWORD PTR [r9], xmm0
00070 4c 03 ca add r9, rdx
.B7.6::
00073 4d 3b d0 cmp r10, r8
00076 72 9c jb .B7.3
*/
一百字节的美!
与LZ4(在我笔记本Core 2 Q9550s @2833MHz上)的快速对决,看看现有的CPU-RAM子系统有多么拥挤!
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15
>Nakamichi_Loonette_XMM_PREFETCH_4096.exe
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Usage: Nakamichi filename
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>Nakamichi_Loonette_XMM_PREFETCH_4096.exe
Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 4220
NumberOf(Short)Matches[Short]Window (8): 412312
NumberOf(Short)Matches[Medium]Window (8): 848071
NumberOf(Short)Matches[Big]Window (8): 353660
NumberOf(Long)Matches[Short]Window (12): 191049
NumberOf(Long)Matches[Medium]Window (12): 794178
NumberOf(Long)Matches[Big]Window (12): 595339
RAM-to-RAM performance: 1839 bytes/s.
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir
02/20/2015 06:09 AM 33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015 03:46 AM 13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015 05:37 PM 11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015 03:06 AM 86,636 Nakamichi_Loonette.c
02/20/2015 03:07 AM 445,206 Nakamichi_Loonette_XMM_PREFETCH_4096.cod
02/20/2015 12:21 PM 117,248 Nakamichi_Loonette_XMM_PREFETCH_4096.exe
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>
Nakamichi_Loonette_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 11745883 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000001040080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 192239 clocks or 2.727MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>lz4 -9 -Sx -b -T1
Agatha_Christie_89-ebooks_TXT.tar
Nb of threads = 1 ; Compression Level = 9
Agatha_Christie : 33258496 -> 13365090 ( 40.19%), 12.7 MB/s , 909.2 MB/s
Nakamichi'Loonette'有自己的主页:http://www.sanmayce.com/Hayabusa/。
关注点
希望有人和我一样热衷于改进它,这就是游戏的名称——'追求卓越。'
法国艺术家Rachelle Bartel,又名Lillycat,创作了这幅充满灵魂的艺术品。
2015年2月23日的补充
很高兴分享一个更精炼的'Loonette',名为'Loonette-Hoshimi'。
鉴于阿加莎·克里斯蒂是英文散文女王,她完整的作品自然是英文文本压缩的最佳数据集。
《吉尼斯世界纪录》将克里斯蒂列为有史以来最畅销的小说家。她的小说销量约为20亿册.../维基百科/
D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 171
Legend: MatchLengths: 4=Tiny, 8=Short, 12=Medium, 16=Long; WindowSizes: 2/3/4=Short/Medium/Long
NumberOf(Tiny)Matches[Short]Window (4)[2]: 325551
NumberOf(Short)Matches[Short]Window (8)[2]: 240654
NumberOf(Medium)Matches[Short]Window (12)[2]: 80607
NumberOf(Long)Matches[Short]Window (16)[2]: 52966
NumberOf(Tiny)Matches[Medium]Window (4)[3]: 382909
NumberOf(Short)Matches[Medium]Window (8)[3]: 677654
NumberOf(Medium)Matches[Medium]Window (12)[3]: 440648
NumberOf(Long)Matches[Medium]Window (16)[3]: 239806
NumberOf(Short)Matches[Long]Window (8)[4]: 272624
NumberOf(Medium)Matches[Long]Window (12)[4]: 461732
NumberOf(Long)Matches[Long]Window (16)[4]: 266965
RAM-to-RAM performance: 1249 bytes/s.
D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 10854012 bytes ...
RAM-to-RAM performance: 320 MB/s.
Memory pool starting address: 0000000000E60080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 194229 clocks or 2.699MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 11%
D:\Nakamichi_Loonette_Intel15_Hoshimi>7za.exe a -tzip -mx9 Agatha_Christie_89-ebooks_TXT.tar.zip Agatha_Christie_89-ebooks_TXT.tar
D:\Nakamichi_Loonette_Intel15_Hoshimi>tangelo_32bit.exe c Agatha_Christie_89-ebooks_TXT.tar Agatha_Christie_89-ebooks_TXT.tar.tangelo
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method0 Agatha_Christie_89-ebooks_TXT.tar -method 0
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method1 Agatha_Christie_89-ebooks_TXT.tar -method 1
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method2 Agatha_Christie_89-ebooks_TXT.tar -method 2
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method3 Agatha_Christie_89-ebooks_TXT.tar -method 3
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method4 Agatha_Christie_89-ebooks_TXT.tar -method 4
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method5 Agatha_Christie_89-ebooks_TXT.tar -method 5
D:\Nakamichi_Loonette_Intel15_Hoshimi>dir
02/23/2015 03:33 PM 33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/23/2015 03:52 PM 33,277,230 Agatha_Christie_89-ebooks_TXT.tar.method0.zpaq
02/23/2015 03:52 PM 11,709,902 Agatha_Christie_89-ebooks_TXT.tar.method1.zpaq
02/23/2015 03:52 PM 9,557,649 Agatha_Christie_89-ebooks_TXT.tar.method2.zpaq
02/23/2015 03:53 PM 6,462,103 Agatha_Christie_89-ebooks_TXT.tar.method3.zpaq
02/23/2015 03:54 PM 6,387,670 Agatha_Christie_89-ebooks_TXT.tar.method4.zpaq
02/23/2015 03:56 PM 6,036,509 Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq
02/23/2015 03:03 PM 10,854,012 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/23/2015 03:10 PM 6,475,761 Agatha_Christie_89-ebooks_TXT.tar.tangelo
02/23/2015 03:12 PM 11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip
02/23/2015 05:51 AM 99,469 Nakamichi_Loonette_Hoshimi.c
02/23/2015 05:51 AM 771,399 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.cod
02/23/2015 05:51 AM 118,784 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe
D:\Nakamichi_Loonette_Intel15_Hoshimi>
在提高压缩率的同时,我幸运地提高了解压缩速度并减少了2字节的代码。
; Nakamichi 'Loonette-Hoshimi' decompression loop, 74-14+2=98 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";
.B7.3::
00014 41 0f 18 8a 00
10 00 00 prefetcht0 BYTE PTR [4096+r10]
0001c b8 ff ff ff ff mov eax, -1
00021 41 8b 12 mov edx, DWORD PTR [r10]
00024 89 d1 mov ecx, edx
00026 83 f1 03 xor ecx, 3
00029 c1 e1 03 shl ecx, 3
0002c d3 e8 shr eax, cl
0002e 23 d0 and edx, eax
00030 89 d0 mov eax, edx
00032 83 e0 03 and eax, 3
00035 75 18 jne .B7.5
.B7.4::
00037 f3 41 0f 6f 42
01 movdqu xmm0, XMMWORD PTR [1+r10]
0003d c1 ea 04 shr edx, 4
00040 f3 41 0f 7f 01 movdqu XMMWORD PTR [r9], xmm0
00045 4c 03 ca add r9, rdx
00048 ff c2 inc edx
0004a 4c 03 d2 add r10, rdx
0004d eb 22 jmp .B7.6
.B7.5::
0004f 89 d1 mov ecx, edx
00051 83 e2 0c and edx, 12
00054 c1 e9 04 shr ecx, 4
00057 ff c0 inc eax
00059 83 c2 04 add edx, 4
0005c 48 f7 d9 neg rcx
0005f 49 03 c9 add rcx, r9
00062 4c 03 d0 add r10, rax
00065 f3 0f 6f 01 movdqu xmm0, XMMWORD PTR [rcx]
00069 f3 41 0f 7f 01 movdqu XMMWORD PTR [r9], xmm0
0006e 4c 03 ca add r9, rdx
.B7.6::
00071 4d 3b d0 cmp r10, r8
00074 72 9e jb .B7.3
XMM传输在Core 2上速度很慢,它们是不对齐的,并且惩罚很高,尤其是在访问RAM时。因此,在Intel第三/四/五代上,Nakamichi 'Loonette-Hoshimi'将轻松超过1GB/s。下面我将尝试将Nakamichi 'Loonette-Hoshimi'与一款压缩巨头放在一起,或者说进行并列比较。
关于Dr. Matt Mahoney撰写的、已进入公有领域的出色的ZPAQ方法的说明
方法0不仅仅是tar,它将文件连接在一起,中间有一些头。它仍然是zpaq存档格式,其中文件被分片并去重,然后打包成块。这些块以非上下文建模格式存储,这意味着它们被分成更小的块(64K),并带有4字节的头来指示其大小。有单独的块存储碎片SHA-1哈希和文件名。因此,与tar相比,它有更多的开销,但如果存在重复文件或碎片,有时也会节省一些空间。
方法1使用LZ77,通过用指向先前出现的指针替换重复的字符串来压缩。
方法2相同,但花费更多时间寻找更好的匹配(使用后缀数组而不是哈希表)。
方法3使用BWT(上下文排序)或LZ77进行长匹配,以及一个一阶上下文模型和算术编码来处理文字,具体取决于文件类型。
方法4使用LZ77、BWT或高阶上下文模型。
方法5使用一种复杂的、高阶的上下文混合模型,具有超过20个预测组件。
此外,Dr. Mahoney写了一个非常可爱的工具,名为'fv',它能绘制出给定文件的结构图,信息量非常大。它可以让你看到字符串重复的热点:图例(如何解读图像)
以下图表显示了长度为1(黑色)、2(红色)、4(绿色)和8(蓝色)的字符串匹配的分布。横轴表示文件中的位置。纵轴以对数刻度显示到上一个匹配的后向距离。向上阅读的主刻度线分别为1、10、100、1000等。
我将尝试解释3个不同的文件——原始文件、Hoshimi存档和最强的ZPAQ存档。
第一个'fv-Agatha_Christie_89-ebooks_TXT.tar.bmp'
顶部的蓝色条带表明,长度为8的匹配通常相隔至少10^4字节(4个主刻度线)直到文件的整个长度。
更准确地说,是3x10^4=30,000。
红色条带表明,长度为2的匹配相隔大约4x10到4x100字节。
第二个'fv-Agatha_Christie_89-ebooks_TXT.tar.Nakamichi.bmp'
一个好的压缩器应该能从图片中消除所有蓝色和绿色,但在这里,'Hoshimi'未能压缩一些模糊的绿色(长度为4字节的匹配)在1x10^6和1x10^7区域,也就是说,存在一些未压缩的4字节块,它们之间相隔1MB到10MB。
红色条带表明,长度为2的匹配相隔大约2x10^3到5x10^5字节。
没有颜色的垂直条带是'问题',在文件的这些部分,压缩器被某些东西蒙蔽了,看不见任何相关性。
第三个'fv-Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq.bmp'
红色条带表明,长度为2的匹配相隔大约3x10^3到3x10^5字节。
(星見 - hoshimi)的字面翻译是“看星星”,或者更诗意地说“观星”。
花見(花見,字面意思是“赏花”)是日本传统的欣赏花朵短暂之美的习俗,……
在佛教中,月見(月見 - tsukimi)也很重要,还有一种瑜伽分支叫做Surya Yoga(拜日式瑜伽?-?)。
如果灵感来了,下一个变种将命名为'Tsukimi'。
2015年3月23日的补充
我不喜欢'if-else'分支,所以'Loonette-Hoshimi'变成了无分支,遗憾的是代码大小从98字节增加到125字节。至于解压缩速度,Q9550s的表现令人失望(请参阅下面的'enwik8'测试),然而,我并不在意这些惩罚,这个练习仍然很好。
unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
char* retLOCAL = ret;
char* srcLOCAL = src;
char* srcEndLOCAL = src+srcSize;
unsigned int DWORDtrio;
unsigned int Flag;
uint64_t FlagMASK; //= 0xFFFFFFFFFFFFFFFF;
uint64_t FlagMASKnegated; //=0x0000000000000000;
while (srcLOCAL < srcEndLOCAL) {
DWORDtrio = *(unsigned int*)srcLOCAL;
//#ifndef _N_GP
//#ifdef _N_prefetch_4096
// _mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
//#endif
//#endif
// |1stLSB |2ndLSB |3rdLSB |
// -------------------------------
// |OO|LL|xxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit 16bit] 24bit]
// L = 00b means 04 MatchLength, (1+LL)<<2)
// L = 01b means 08 MatchLength, (1+LL)<<2)
// L = 10b means 12 MatchLength, (1+LL)<<2)
// L = 11b means 16 MatchLength, (1+LL)<<2)
// OO = 00b means Literal
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO), 2 bytes long i.e. Sliding Window is 2*8-LL-OO=(1+OO)*8-4=12 or 4KB
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO), 3 bytes long i.e. Sliding Window is 3*8-LL-OO=(1+OO)*8-4=20 or 1MB
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO), 4 bytes long i.e. Sliding Window is 4*8-LL-OO=(1+OO)*8-4=28 or 256MB
/*
// Branchfull [
DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
if ( (DWORDtrio & 0x03) == 0x00 ) {
#ifdef _N_GP
memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16);
#endif
#ifdef _N_XMM
SlowCopy128bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
#endif
retLOCAL+= (DWORDtrio>>4);
srcLOCAL+= (DWORDtrio>>4)+1;
} else {
#ifdef _N_GP
memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), 16);
#endif
#ifdef _N_XMM
SlowCopy128bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), retLOCAL );
#endif
srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2
retLOCAL+= (1+((DWORDtrio>>2)&0x03))<<2; // 4/8/12/16
}
// Branchfull ]
*/
// Branchless [
DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
Flag=!(DWORDtrio & 0x03);
// In here Flag=0|1
FlagMASKnegated= Flag - 1; // -1|0
FlagMASK= ~FlagMASKnegated;
#ifdef _N_XMM
// SlowCopy128bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
SlowCopy128bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
#endif
#ifdef _N_GP
// memcpy(retLOCAL, (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), 16);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16);
#endif
retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) + ((uint64_t)((1+((DWORDtrio>>2)&0x03))<<2)&FlagMASKnegated) ;
srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
}
return (unsigned int)(retLOCAL - ret);
}
/*
; 'Nakamichi_Loonette-Hoshimi_branchless' decompression loop, bb-40+2=125 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";
.B7.3::
00040 44 8b 32 mov r14d, DWORD PTR [rdx]
00043 44 89 f1 mov ecx, r14d
00046 83 f1 03 xor ecx, 3
00049 41 bf ff ff ff
ff mov r15d, -1
0004f c1 e1 03 shl ecx, 3
00052 33 ff xor edi, edi
00054 41 d3 ef shr r15d, cl
00057 45 23 f7 and r14d, r15d
0005a 4c 89 c9 mov rcx, r9
0005d 45 89 f5 mov r13d, r14d
00060 44 89 f5 mov ebp, r14d
00063 41 83 e5 03 and r13d, 3
00067 49 89 d7 mov r15, rdx
0006a 0f 44 f8 cmove edi, eax
0006d 41 83 e6 0c and r14d, 12
00071 c1 ed 04 shr ebp, 4
00074 ff cf dec edi
00076 41 ff c5 inc r13d
00079 41 89 ec mov r12d, ebp
0007c 49 89 fb mov r11, rdi
0007f 49 2b cc sub rcx, r12
00082 49 f7 d3 not r11
00085 48 ff c9 dec rcx
00088 4d 23 fb and r15, r11
0008b 48 23 cf and rcx, rdi
0008e ff c5 inc ebp
00090 41 83 c6 04 add r14d, 4
00094 4d 23 e3 and r12, r11
00097 49 23 eb and rbp, r11
0009a 4c 23 ef and r13, rdi
0009d f3 42 0f 6f 44
39 01 movdqu xmm0, XMMWORD PTR [1+rcx+r15]
000a4 4c 23 f7 and r14, rdi
000a7 49 03 ed add rbp, r13
000aa 4d 03 e6 add r12, r14
000ad 48 03 d5 add rdx, rbp
000b0 f3 41 0f 7f 01 movdqu XMMWORD PTR [r9], xmm0
000b5 4d 03 cc add r9, r12
000b8 49 3b d0 cmp rdx, r8
000bb 72 83 jb .B7.3
*/
在core 2 Q9550s上进行的'enwik8'测试
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette-Hoshimi_branchless.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 192 MB/s.
Memory pool starting address: 0000000002620080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193784 clocks or 2.706MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 7%
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000002710080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193453 clocks or 2.710MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>
我相信我无法进一步优化它了。
历史
第一个Nakamichi是在一年前编写的,截至2015年2月20日,'Loonette'是最新的、最精炼的。