最快的 strstr 类函数在 C 中!?
用于在 haystack 中搜索 needle 的优化函数
- 下载 Benchmark_source_(Two-Way_vs_Railgun_Trolldom).zip - 260.8 KB
- 下载 Railgun_Trolldom.c.zip - 36.5 KB
- 下载 Railgun_Swampshine_BailOut.c.zip - 9.5 KB
- 下载 Horspool_Ennearch_DecumanusBITified_in_ASSEMBLY.zip - 21.9 KB
- 下载 Horspool_Ennearch_DecumanusBITified_in_C.zip - 5.4 KB
- 下载 strstr_SHORT-SHOWDOWN_r6___.zip - 17.9 KB
2016年8月22日的补充
就我所知,最快的 memmem() 已经是经过优化的 Railgun_Swampshine_BailOut —— 它避免了BMH2和伪BMH4中的第二次模式比较。它被称为“Trolldom”,以表示其次要目的——通过“巨魔之地”(即导致大多数Boyer-Moore-Horspool变体出现二次行为的丑陋/变形的干草堆)快速遍历。GNU/GLIBC memmem() 有时是线性IC和亚线性IC,而 Railgun_Trolldom 则大部分是亚线性IC。
Railgun_Trolldom 比 Railgun_Swampshine 稍快(在Intel和GCC上都是),这主要是因为增加了一个位操作的BMH 2阶(8KB表开销而不是64KB)路径——对于长度小于77777字节的干草堆——预热更快。
因此,在Core 2 Q9550s @2.83GHz DDR2 @666MHz / i5-2430M @3.00GHz DDR3 @666MHz上的结果
--------------------------------------------------------------------------------------------------------------------------------
| Searcher | GNU/GLIBC memmem() | Railgun_Swampshine | Railgun_Trolldom |
|--------------------------------------------------------------------------------------------------|---------------------------|
| Testfile\Compiler | Intel 15.0 | GCC 5.10 | Intel 15.0 | GCC 5.10 | Intel 15.0 | GCC 5.10 |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 27,703 bytes | 4506/- | 5330/14725 | 13198/- | 11581/15171 | 19105/22449 | 15493/21642 |
| Name: An_Interview_with_Carlos_Castaneda.TXT | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 308,062 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 3,242,492,648 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 2,347,772 bytes | 190/- | 226/244 | 1654/- | 1729/1806 | 1794/1822 | 1743/1809 |
| Name: Gutenberg_EBook_Don_Quixote_996_(ANSI).txt | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 14,316,954 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 6,663,594,719,173 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 899,425 bytes | 582/- | 760/816 | 3094/- | 2898/3088 | 3255/3289 | 2915/3322 |
| Name: Gutenberg_EBook_Dokoe_by_Hakucho_Masamune_(Japanese_UTF8).txt | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 3,465,806 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 848,276,034,315 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 4,487,433 bytes | 104/- | 109/116 | 445/- | 458/417 | 450/411 | 467/425 |
| Name: Dragonfly_genome_shotgun_sequence_(ACGT_alphabet).fasta | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 20,540,375 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 13,592,530,857,131 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 954,035 bytes | 99/- | 144/144 | 629/- | 580/682 | 634/807 | 585/725 |
| Name: LAOTZU_Wu_Wei_(BINARY).pdf | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 27,594,933 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 8,702,455,122,519 | | | | |
|------------------------------------------------------------------------------------------------------------------------------|
| Size: 15,583,440 bytes | -/- | -/- | -/- | 663/771 | 675/778 | 663/757 |
| Name: Arabian_Nights_complete.html | | | | |
| LATENCY-WISE: Number of 'memmem()' Invocations: 72,313,262 | | | | |
| THROUGHPUT-WISE: Number of Total bytes Traversed: 105,631,163,854,099 | | | | |
--------------------------------------------------------------------------------------------------------------------------------
基准测试可在我的网盘和我的网站下载
https://1drv.ms/u/s!AmWWFXGMzDmEgl3izdwhSnIq0Lv5 www.sanmayce.com/Downloads/Nakamichi_Kintaro++_source_executables_64bit_(GCC510-vs-Intel150)_(TW-vs-RG)_BENCHMARK.zip
注0:Railgun_Trolldom比Railgun_Swampshine稍快(在Intel和GCC上都是),这主要是因为增加了一个位操作的BMH 2阶(8KB表开销而不是64KB)路径——对于长度小于77777字节的干草堆——预热更快。
注1:这些数字表示长度为4、5、6、8、9、10、12、13、14、16、17、18、24字节的模式/针头在4KB、256KB、1MB、256MB长的干草堆中进行memmem操作的速率(字节/秒)。
事实上,这些数字是使用LZSS和memmem()作为匹配查找器的压缩速度。
注2:《一千零一夜》可在此下载
https://ebooks.adelaide.edu.au/b/burton/richard/b97b/complete.html
注3:在i5-2430M上,TW正在追赶,因为此CPU处理指令更快,而RAM速度几乎相同,Railgun受到缓慢的RAM获取影响——预取器等很糟糕。
注4:使用简单的英文文本《1001夜故事集》,长度为15,583,440字节,遍历的干草堆数据累积大小接近100TB,即105,631,163,854,099 ~ 1024^4 = 1,099,511,627,776。
注5:使用简单的法文文本“Agatha_Christie_85-ebooks_(French)_TXT.tar”,长度为32,007,168字节,遍历的干草堆数据累积大小接近200TB ~ 234,427,099,834,376。
由于它是纯C代码,未使用任何内在函数,因此它可能在大多数*nix系统上无问题地工作。
我将基准测试C源代码及其28页的PDF格式列表包含在附件中,以便搜索技术爱好者更容易打印和消化。享受吧!
在此,我离题了,因为我想分享我的失望,我没有看到 Hamid 的 Turbo Family 超快速练习被利用/欣赏。接下来的转储揭示了附加的基准测试(使用Railgun_Trolldom,它在此“折磨”中解析了200多TB)的功能以及它与当今许多优化压缩器的比较。
由以下工具提供的文本解压缩图:
- TurboBench Copyright (c) 2013-2016 Powturbo Feb 21 2016;
- zstd 命令行界面 64位 v0.8.1, Yann Collet;
- Nakamichi 'Kintaro++' Kaze编写。
这个基准测试(你可以用它重现下面的转储)位于
https://1drv.ms/u/s!AmWWFXGMzDmEgl6-_1mrjcTj07B1
一旦打包的85部阿加莎法语小说被基准测试压缩,我们来看看在i5-2430M @3.00GHz DDR3 @666MHz上,解压速度与许多“常见嫌疑犯”(即压缩器)相比如何
D:\Nakamichi_Kintaro++_source_executables_64bit_(GCC510-vs-Intel150)_(TW-vs-RG)_BENCHMARK>Nakamichi_Kintaro++_Intel_15.0_64bit.exe Agatha_Christie_85-ebooks_(French)_TXT.tar
Nakamichi 'Kintaro++', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note1: This compile can handle files up to 1711MB.
Note2: The matchfinder/memmem() is Railgun_Trolldom.
Current priority class is HIGH_PRIORITY_CLASS.
Compressing 32007168 bytes ...
|; Each rotation means 64KB are encoded; Done 100%; Compression Ratio: 3.53:1
NumberOfFullLiterals (lower-the-better): 164
NumberOfFlushLiteralsHeuristic (bigger-the-better): 184323
Legend: WindowSizes: 1/2/3/4=Tiny/Short/Medium/Long
NumberOf(Tiny)Matches[Short]Window (4)[2]: 226869
NumberOf(Short)Matches[Short]Window (8)[2]: 119810
NumberOf(Medium)Matches[Short]Window (12)[2]: 71202
NumberOf(Long)Matches[Short]Window (16)[2]: 31955
NumberOf(MaxLong)Matches[Short]Window (24)[2]: 7078
NumberOf(Tiny)Matches[Medium]Window (5)[3]: 257313
NumberOf(Short)Matches[Medium]Window (9)[3]: 526493
NumberOf(Medium)Matches[Medium]Window (13)[3]: 285579
NumberOf(Long)Matches[Medium]Window (17)[3]: 158873
NumberOf(MaxLong)Matches[Medium]Window (24)[3]: 51276
NumberOf(Tiny)Matches[Long]Window (6)[4]: 41075
NumberOf(Short)Matches[Long]Window (10)[4]: 240454
NumberOf(Medium)Matches[Long]Window (14)[4]: 258653
NumberOf(Long)Matches[Long]Window (18)[4]: 209007
NumberOf(MaxLong)Matches[Long]Window (24)[4]: 190929
RAM-to-RAM performance: 605 bytes/s.
Compressed to 9076876 bytes.
LATENCY-WISE: Number of 'memmem()' Invocations: 102,091,852
THROUGHPUT-WISE: Number of Total bytes Traversed: 234,427,099,834,376
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>dir
08/22/2016 07:07 AM 9,076,876 Agatha_Christie_85-ebooks_(French)_TXT.tar.Nakamichi
08/22/2016 07:44 AM 124,928 Nakamichi_Kintaro++_Intel_15.0_64bit.exe
08/16/2016 01:55 PM 9,191,476 turbobenchs.exe
08/18/2016 07:24 AM 841,297 zstd-windows-v0.8.1_win64.exe
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>RUNME.BAT
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>Nakamichi_Kintaro++_Intel_15.0_64bit.exe Agatha_Christie_85-ebooks_(French)_TXT.tar.Nakamichi /report
Nakamichi 'Kintaro++', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note1: This compile can handle files up to 1711MB.
Note2: The matchfinder/memmem() is Railgun_Trolldom.
Current priority class is HIGH_PRIORITY_CLASS.
Decompressing 9076876 bytes ...
RAM-to-RAM performance: 512 MB/s.
Compression Ratio (bigger-the-better): 3.53:1
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>zstd-windows-v0.8.1_win64.exe -b12 "Agatha_Christie_85-ebooks_(French)_TXT.tar"
12#_(French)_TXT.tar : 32007168 -> 8965791 (3.570), 4.0 MB/s , 409.2 MB/s
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>zstd-windows-v0.8.1_win64.exe -b22 "Agatha_Christie_85-ebooks_(French)_TXT.tar"
22#_(French)_TXT.tar : 32007168 -> 6802321 (4.705), 1.0 MB/s , 267.4 MB/s
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobench_singlefile_noNakamichi_2G.bat "Agatha_Christie_85-ebooks_(French)_TXT.tar"
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobenchs.exe "Agatha_Christie_85-ebooks_(French)_TXT.tar" -ezlib,9/fastlz,2/lzturbo,19,29,39/bzip2/chameleon,2/snappy_c/zstd,12,21/density,3/lz4,16/lz5,15/lzham,4/brieflz/brotli,1
1/crush,2/lzma,9/zpaq,2/lzf/yappy/trle/memcpy/lzsse2,1,16 -g -B2G
Benchmark: 1 from 3
6739693 21.1 0.89 74.96 lzma 9 Agatha_Christie_85-ebooks_(French)_TXT.tar
6758120 21.1 0.75 159.97 lzham 4 Agatha_Christie_85-ebooks_(French)_TXT.tar
6801784 21.3 0.88 380.07 lzturbo 39 Agatha_Christie_85-ebooks_(French)_TXT.tar
6834372 21.4 1.15 280.52 zstd 21 Agatha_Christie_85-ebooks_(French)_TXT.tar
6920292 21.6 0.34 187.25 brotli 11 Agatha_Christie_85-ebooks_(French)_TXT.tar
7680438 24.0 8.73 19.14 bzip2 Agatha_Christie_85-ebooks_(French)_TXT.tar
8897625 27.8 3.15 56.09 zpaq 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
8964510 28.0 4.17 397.51 zstd 12 Agatha_Christie_85-ebooks_(French)_TXT.tar
8990840 28.1 0.89 590.17 lzturbo 29 Agatha_Christie_85-ebooks_(French)_TXT.tar
9076876 ? ? 512.00 Nakamichi Kintaro++ outside TurboBench
9147309 28.6 0.13 223.49 crush 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
10104473 31.6 1.25 351.10 lz5 15 Agatha_Christie_85-ebooks_(French)_TXT.tar
10691287 33.4 4.92 2013.23 lzsse2 16 Agatha_Christie_85-ebooks_(French)_TXT.tar
10973383 34.3 10.13 214.94 zlib 9 Agatha_Christie_85-ebooks_(French)_TXT.tar
12107431 37.8 0.94 2136.70 lzturbo 19 Agatha_Christie_85-ebooks_(French)_TXT.tar
12266943 38.3 15.08 1615.99 lz4 16 Agatha_Christie_85-ebooks_(French)_TXT.tar
13238993 41.4 75.25 149.55 brieflz Agatha_Christie_85-ebooks_(French)_TXT.tar
14364428 44.9 255.83 234.91 density 3 Agatha_Christie_85-ebooks_(French)_TXT.tar
15448532 48.3 12.27 1503.99 lzsse2 1 Agatha_Christie_85-ebooks_(French)_TXT.tar
17262978 53.9 56.14 1158.91 yappy Agatha_Christie_85-ebooks_(French)_TXT.tar
17678955 55.2 172.49 344.66 lzf Agatha_Christie_85-ebooks_(French)_TXT.tar
17868118 55.8 177.74 303.39 fastlz 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
17915390 56.0 926.24 1851.55 chameleon 2 Agatha_Christie_85-ebooks_(French)_TXT.tar
18694206 58.4 228.88 622.50 snappy_c Agatha_Christie_85-ebooks_(French)_TXT.tar
31952980 99.8 124.68 1230.32 trle Agatha_Christie_85-ebooks_(French)_TXT.tar
32007172 100.0 6096.18 5338.21 memcpy Agatha_Christie_85-ebooks_(French)_TXT.tar
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>turbobenchs.exe -l2
...
Brotli v16-02 Apache license https://github.com/google/brotli
Bzip2 v1.06 BSD like http://www.bzip.org/downloads.html https://github.com/asimonov-im/bzip2
Chameleon v15-03 Public Domain http://cbloomrants.blogspot.de/2015/03/03-25-15-my-chameleon.html
Crush v1.0.0 Public Domain http://sourceforge.net/projects/crush
Density v0.13.0 BSD license https://github.com/centaurean/density
FastLz v0.1.0 BSD like http://fastlz.org https://github.com/ariya/FastLZ
Lz4 v1.7.1 BSD license https://github.com/Cyan4973/lz4
Lz5 v1.4.1 BSD license https://github.com/inikep/lz5
Lzham v1.1 MIT license https://github.com/richgel999/lzham_codec_devel
Lzma v9.35 Public Domain http://7-zip.org https://github.com/jljusten/LZMA-SDK
lzsse v16-02 BSD license https://github.com/ConorStokes/LZSSE
Snappy-c v1.1.2 BSD Like https://github.com/andikleen/snappy-c
Yappy v2011
zlib v1.2.8 zlib license http://zlib.net https://github.com/madler/zlib
ZSTD v0.5.1 BSD license https://github.com/Cyan4973/zstd
TurboRLE ESC v16-01 https://sites.google.com/site/powturbo
TurboRLE v16-01 https://sites.google.com/site/powturbo
LzTurbo v1.3 https://sites.google.com/site/powturbo
...
D:\Benchmark_Goldenboy_vs_LzTurbo_vs_Zstd>
就是这样,那512MB/s是选择字面长度的量化方法的结果。现在,为了欣赏一个卓越(实际上是最快)的性能者,lzturbo 39以其惊人的380.07MB/s必须与Yann最新的Zstd和超简单的Nakamichi进行比较
Agatha_Christie_85-ebooks_(French)_TXT.tar 重新创建时
6801784 -> 32,007,168 380.07 MB/s lzturbo 39
8965791 -> 32,007,168 409.20 MB/s zstd-windows-v0.8.1_win64.exe -b12
8990840 -> 32,007,168 590.17 MB/s lzturbo 29
9076876 -> 32,007,168 512.00 MB/s Nakamichi Kintaro++
像 LzTurbo 的29/39模式那样,其性能着实令人惊叹,它激励人们超越当前臃肿的软件,重新思考为什么超快速实现尚未成为主流。
2016年8月13日的补充
两件事。
首先,上次的修复存在bug,我深表歉意,现在已修复,相当令人尴尬,因为它只是一个简单的左右边界检查。它不影响速度,表现为罕见的模式命中丢失。
既然我不相信说“对不起”,而相信把事情做好,那么接下来是我进一步贬低我业余工作的尝试
两年前,我没有充分注意将“Swampwalker”启发式算法添加到Railgun_Ennearch中,我的意思是,只做了一个快速测试,没有真正的验证——这并非我的失误,也不是粗心大意,而是对“即时”编写代码能力的过度自信。确实愚蠢,然而,当一个程序员在编写简单练习时获得动力,他会开始产生掌握主题的虚假信心,这肯定不好!
希望其他程序员能学会避免这种充满疏忽的风格。
要手动修复C代码,请将此行
if ( ((signed int)(i-(PRIMALposition-1)+(count-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) ) {
改为这一行
if ( ((signed int)(i-(PRIMALposition-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)]+((PRIMALlengthCANDIDATE-4+1)-1) <= pbTargetMax - 4) ) {
其次,想为搜索(即 memmem() 函数)提供最严苛的测试平台:它既有利于基准测试(实际应用中的速度),也有利于 bug 控制。
基准测试可在我的互联网硬盘下载
https://1drv.ms/u/s!AmWWFXGMzDmEglwjlUtnMJrfhosK
它也被压缩(4个测试文件没有),并附在本文章顶部。
速度对决有三个方面
- 比较 GCC 5.10 和 Intel 15.0 编译器生成的64位代码;
- 比较两组数据——英文文本搜索速度与基因组ACGT类型数据搜索速度;
- 比较经过调整的 Two-Way算法 (由Eric Blake实现并被GLIBC采用为memmem())与我的 Railgun_Swampshine。
注1:GLIBC memmem() 取自最新(2016年8月5日)的glibc 2.24 tar包
https://gnu.ac.cn/software/libc/
注2:Eric Blake说他通过添加一些次线性路径增强了Two-Way的线性度,而Railgun完全是关于次线性度的,所以请随意使用你自己的测试文件(最坏情况),只需让这样的文件喂给压缩器,然后我们就会看到线性Two-Way与Railgun_Swampshine的表现如何。
注3:只需从基准测试的源代码中复制粘贴“Railgun_Swampshine”或“Railgun_Ennearch”。
所以在Core 2 Q9550s @2.83GHz上的结果
---------------------------------------------------------------------------------------------
| testfile\Searcher | GNU/GLIBC memmem() | Railgun_Swampshine |
|-------------------------------------------------------------------------------------------|
| Compiler | Intel 15.0 | GCC 5.10 | Intel 15.0 | GCC 5.10 |
|-------------------------------------------------------------------------------------------|
| The_Project_Gutenberg_EBook_of_Don | 190 | 226 | 1654 | 1729 |
| _Quixote_996_(ANSI).txt | | | | |
| 2,347,772 bytes | | | | |
|-------------------------------------------------------------------------------------------|
| The_Project_Gutenberg_EBook_of_Dokoe | 582 | 760 | 3094 | 2898 |
| _by_Hakucho_Masamune_(Japanese_UTF-8).txt | | | | |
| 899,425 bytes | | | | |
|-------------------------------------------------------------------------------------------|
| Dragonfly_genome_shotgun_sequence | 104 | 109 | 445 | 458 |
| _(ACGT_alphabet).fasta | | | | |
| 4,487,433 bytes | | | | |
|-------------------------------------------------------------------------------------------|
| LAOTZU_Wu_Wei_(BINARY).pdf | 99 | 144 | 629 | 580 |
| 954,035 bytes | | | | |
|-------------------------------------------------------------------------------------------|
注:这些数字表示长度为4、5、6、8、9、10、12、13、14、16、17、18、24字节的模式/针头在4KB、256KB、1MB、256MB长的干草堆中进行memmem操作的速率(字节/秒)。
事实上,这些数字是使用LZSS和memmem()作为匹配查找器的压缩速度。
Two-Way明显慢于BMH 2阶,速度下降范围为
- 对于文本ANSI字母:1729/226= 7.6x
- 对于文本UTF8字母:2898/760= 3.8x
- 对于文本ACGT字母:458/109= 4.2x
- 对于二进制类字母:580/144= 4.0x
对于比我这里666MHz更快的RAM,以及数兆字节长的干草堆,加速比超过 8x。
该基准测试展示了memmem变体的真实行为(延迟和原始速度),我还添加了 Thierry Lecroq 的Two-Way实现
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
然而,Eric Blake 的实现更快,因此被选为速度对决。
我曾测量过遍历的干草堆的总长度,对于100MB以上的文件,它达到了万亿字节,即PB级别——这真是一种严峻的考验。
享受!
引言
下面介绍的C语言strstr函数(由Georgi 'Kaze'编写)名为 Railgun_Quadruplet_6ppp
,它基于著名的 Boyer-Moore-Horspool 和简化的 Karp-Rabin 算法。主要目标是:实现对各种模式/针头和字符串/干草堆长度(从2字节开始)最快的执行速度。
Railgun_Quadruplet_6ppp
现已过时,因为 Railgun_Quadruplet_7
的到来,它将BMH片段提升了超过一个百分点。请参阅下文以获取其无注释和表格化的源代码。
Railgun_Quadruplet_7
利用了CPU快速非对齐 DWORD
提取能力。
在 1TB以上 文本上进行的最严格、最精确的测试(52种模式与重复256次的197MB英文图书文本,结果耗时1000多秒,更多信息请参阅 strstr_SHORT-SHOWDOWN_r7_Heavy_Test_LOG.txt 文件)给出以下结果:
[strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe:]
仅返回首次命中偏移量的变体
Railgun_Quadruplet_6pp
52,即平均性能:1584KB/时钟周期Railgun_Quadruplet_7
52,即平均性能:1659KB/时钟周期
计数所有命中的变体
Railgun_6pp
8x2,即平均性能:1246KB/时钟周期Railgun_Quadruplet_7
8x2,即平均性能:1266KB/时钟周期
[strstr_SHORT-SHOWDOWN_Intel_v12_O3.exe:]
仅返回首次命中偏移量的变体
Railgun_Quadruplet_6pp
52,即平均性能:1773KB/时钟周期Railgun_Quadruplet_7
52,即平均性能:1685KB/时钟周期
计数所有命中的变体
Railgun_6pp
8x2,即平均性能:1054KB/时钟周期Railgun_Quadruplet_7
8x2,即平均性能:1076KB/时钟周期
注意:对于 Railgun_Quadruplet_7
在 DWORD
(相比 BYTE
)内存获取速度快的CPU上,我期望有显著提升,据我所知,Intel i7就是其中之一,可惜我没有机会使用它。
或者两条底线
计数所有命中的最快变体
Railgun_Quadruplet_7
8x2 平均性能:1266KB/时钟周期(使用 Microsoft_v16_Ox 实现了最佳性能)!
仅返回首次命中偏移量的最快变体
Railgun_Quadruplet_6pp
52 平均性能:1773KB/时钟周期(1,943,883,635,456 字节 / 1,070,220 时钟周期 = 1732 MB/s,使用 Intel_v12_O3 实现最佳性能)!
2014年4月27日的补充
我深表歉意。
“Swampwalker”启发式算法有一个讨厌的带符号/无符号错误,我在几个月前我的模糊搜索文章中说明了这个问题,现在这里也修复了。
错误是这样的(变量“i”和“PRIMALposition”是uint32_t)
下一行假设 -19 >= 0 为真
如果 ( (i-(PRIMALposition-1)) >= 0) printf ("THE NASTY-CASTY BUG: %d >= 0\n", i-(PRIMALposition-1));
下一行假设 -19 >= 0 为假
如果 ( (signed int)(i-(PRIMALposition-1)) >= 0) printf ("THE NASTY-CASTY BUG: %d >= 0\n", i-(PRIMALposition-1));
实际修复
...
if ( count <= 0 ) {
// I have to add out-of-range checks...
// i-(PRIMALposition-1) >= 0
// &pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4
// i-(PRIMALposition-1)+(count-1) >= 0
// &pbTarget[i-(PRIMALposition-1)+(count-1)] <= pbTargetMax - 4
// FIX from 2014-Apr-27:
// Because (count-1) is negative, above fours are reduced to next twos:
// i-(PRIMALposition-1)+(count-1) >= 0
// &pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4
// The line below is BUGGY:
//if ( (i-(PRIMALposition-1) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) && (&pbTarget[i-(PRIMALposition-1)+(count-1)] <= pbTargetMax - 4) ) {
// The line below is OKAY:
if ( ((signed int)(i-(PRIMALposition-1)+(count-1)) >= 0) && (&pbTarget[i-(PRIMALposition-1)] <= pbTargetMax - 4) ) {
...
我又检查了一遍代码,感觉没问题,我很喜欢。
下载区中存在bug的“Swampshine”函数已被修复后的版本替换。
2014年2月2日的补充
我可能已经完成了更新,这是最新的一个。
基准测试及其源代码位于:http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Swampshine_r2.7z。
以及优美的主循环
; The main loop of pseudo Boyer-Moore-Horspool order 2 in C:
; while (i <= cbTarget-cbPattern) {
; Gulliver = 1;
; if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
; if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
; if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
; count = cbPattern-4+1;
; while ( count > 0 && *(uint32_t *)(pbPattern+count-1) == *(uint32_t *)(&pbTarget[i]+(count-1)) )
; count = count-4;
; if ( count <= 0 ) return(pbTarget+i);
; }
; }
; } else Gulliver = cbPattern-(2-1);
; i = i + Gulliver;
; }
; The main loop of pseudo Boyer-Moore-Horspool order 2 in Assembly by Intel C++ Compiler XE for applications running on IA-32, Version 12.1.1.258, -O3 used:
.B2.28:
0019f 0f b7 6c 02 fe movzx ebp, WORD PTR [-2+edx+eax]
001a4 0f b6 34 2c movzx esi, BYTE PTR [esp+ebp]
001a8 85 f6 test esi, esi
001aa 0f 84 bd 00 00
00 je .B2.40
.B2.29:
001b0 0f b7 6c 02 fc movzx ebp, WORD PTR [-4+edx+eax]
001b5 0f b6 34 2c movzx esi, BYTE PTR [esp+ebp]
001b9 85 f6 test esi, esi
001bb 75 0c jne .B2.31
.B2.30:
001bd 8b ac 24 18 00
01 00 mov ebp, DWORD PTR [65560+esp]
001c4 e9 a6 00 00 00 jmp .B2.41
.B2.31:
001c9 8b b4 24 70 00
01 00 mov esi, DWORD PTR [65648+esp]
001d0 8b ac 24 14 00
01 00 mov ebp, DWORD PTR [65556+esp]
001d7 3b 2c 06 cmp ebp, DWORD PTR [esi+eax]
001da 0f 85 86 00 00
00 jne .B2.38
.B2.32:
001e0 8b b4 24 18 00
01 00 mov esi, DWORD PTR [65560+esp]
001e7 85 f6 test esi, esi
001e9 0f 8e 42 06 00
00 jle .B2.110
.B2.33:
001ef 33 ff xor edi, edi
001f1 8d 2c 10 lea ebp, DWORD PTR [eax+edx]
001f4 89 8c 24 00 00
01 00 mov DWORD PTR [65536+esp], ecx
001fb 89 94 24 04 00
01 00 mov DWORD PTR [65540+esp], edx
00202 89 9c 24 0c 00
01 00 mov DWORD PTR [65548+esp], ebx
00209 89 84 24 08 00
01 00 mov DWORD PTR [65544+esp], eax
00210 8b d7 mov edx, edi
00212 8b 8c 24 10 00
01 00 mov ecx, DWORD PTR [65552+esp]
00219 8b 9c 24 7c 00
01 00 mov ebx, DWORD PTR [65660+esp]
.B2.34:
00220 8b 44 0a fc mov eax, DWORD PTR [-4+edx+ecx]
00224 3b 44 2a fc cmp eax, DWORD PTR [-4+edx+ebp]
00228 75 11 jne .B2.36
.B2.35:
0022a 47 inc edi
0022b 8d 74 1a f9 lea esi, DWORD PTR [-7+edx+ebx]
0022f 83 c2 fc add edx, -4
00232 3b bc 24 1c 00
01 00 cmp edi, DWORD PTR [65564+esp]
00239 72 e5 jb .B2.34
.B2.36:
0023b 89 8c 24 10 00
01 00 mov DWORD PTR [65552+esp], ecx
00242 8b 8c 24 00 00
01 00 mov ecx, DWORD PTR [65536+esp]
00249 8b 94 24 04 00
01 00 mov edx, DWORD PTR [65540+esp]
00250 8b 84 24 08 00
01 00 mov eax, DWORD PTR [65544+esp]
00257 8b 9c 24 0c 00
01 00 mov ebx, DWORD PTR [65548+esp]
.B2.37:
0025e 85 f6 test esi, esi
00260 0f 8e cb 05 00
00 jle .B2.110
.B2.38:
00266 bd 01 00 00 00 mov ebp, 1
0026b eb 02 jmp .B2.41
.B2.40:
0026d 8b eb mov ebp, ebx
.B2.41:
0026f 03 c5 add eax, ebp
00271 3b c1 cmp eax, ecx
00273 0f 86 26 ff ff
ff jbe .B2.28
.B2.42:
; The main loop of pseudo Boyer-Moore-Horspool order 2: 00273-0019f+6 = 218 bytes
2014年1月29日的补充
非常感谢Harold,这位CP成员帮助我了解了i7 4770K如何执行Railgun。
最终,Railgun已经变得抗巨魔了。
还需要更多的调整,但我们不要破坏乐趣,来看看“Swampshine”修订版1有什么。
下面给出了三种“折磨”类型的结果:
1] “维基百科”折磨,XML数据
2] “基因组”折磨,“ACGT”数据
3] “最坏情况”折磨,重复数据
我的结论是:Railgun_Swampshine 是所有100字节左右模式的首选武器,无论战场多么泥泞,它都不会让你失望。
对目前的结果不满意,仍有一些校准工作要做,第一个修订版“Swampshine”中存在一些拖累
1] 使用“Swampwalker”的4阶(而不是2阶)。
2] 针头有前缀时的未优化情况(有前缀和后缀以及有后缀时性能尚可)。
3] 代码大小不理想:00a8c-00000+4 = 2704字节
4] 随着针头长度的增加,速度衰减比预期要陡峭得多。
Doing Search for Pattern(65bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 18941MB/s; PRIMALposition=1; PRIMALlength=65
Railgun_Ennearch precise performance: 18859MB/s
Railgun_DecumanusBITified precise performance: 19136MB/s
Boyer_Moore-Horspool precise performance: 6959MB/s
Doing Search for Pattern(83bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17734MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 18818MB/s
Railgun_DecumanusBITified precise performance: 18941MB/s
Boyer_Moore-Horspool precise performance: 7871MB/s
Doing Search for Pattern(98bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17791MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 19654MB/s
Railgun_DecumanusBITified precise performance: 18948MB/s
Boyer_Moore-Horspool precise performance: 7917MB/s
Doing Search for Pattern(113bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17360MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 20802MB/s
Railgun_DecumanusBITified precise performance: 19013MB/s
Boyer_Moore-Horspool precise performance: 8147MB/s
Doing Search for Pattern(130bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17486MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 22800MB/s
Railgun_DecumanusBITified precise performance: 18631MB/s
Boyer_Moore-Horspool precise performance: 9085MB/s
Doing Search for Pattern(140bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17311MB/s; PRIMALposition=1; PRIMALlength=68
Railgun_Ennearch precise performance: 23872MB/s
Railgun_DecumanusBITified precise performance: 17926MB/s
Boyer_Moore-Horspool precise performance: 9343MB/s
Doing Search for Pattern(151bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17072MB/s; PRIMALposition=79; PRIMALlength=73
Railgun_Ennearch precise performance: 23983MB/s
Railgun_DecumanusBITified precise performance: 17933MB/s
Boyer_Moore-Horspool precise performance: 8591MB/s
Doing Search for Pattern(162bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17725MB/s; PRIMALposition=79; PRIMALlength=84
Railgun_Ennearch precise performance: 26535MB/s
Railgun_DecumanusBITified precise performance: 19854MB/s
Boyer_Moore-Horspool precise performance: 9072MB/s
Doing Search for Pattern(201bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 17954MB/s; PRIMALposition=67; PRIMALlength=119
Railgun_Ennearch precise performance: 31708MB/s
Railgun_DecumanusBITified precise performance: 23392MB/s
Boyer_Moore-Horspool precise performance: 6983MB/s
Doing Search for Pattern(230bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 15113MB/s; PRIMALposition=165; PRIMALlength=66
Railgun_Ennearch precise performance: 33098MB/s
Railgun_DecumanusBITified precise performance: 25518MB/s
Boyer_Moore-Horspool precise performance: 7599MB/s
Doing Search for Pattern(255bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 14711MB/s; PRIMALposition=90; PRIMALlength=65
Railgun_Ennearch precise performance: 39911MB/s
Railgun_DecumanusBITified precise performance: 29435MB/s
Boyer_Moore-Horspool precise performance: 7602MB/s
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 7163MB/s; PRIMALposition=32; PRIMALlength=150
Railgun_Ennearch precise performance: 123574MB/s
Railgun_DecumanusBITified precise performance: 95773MB/s
Boyer_Moore-Horspool precise performance: 8370MB/s
Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 3713MB/s; PRIMALposition=32; PRIMALlength=150
Railgun_Ennearch precise performance: 196321MB/s
Railgun_DecumanusBITified precise performance: 178054MB/s
Boyer_Moore-Horspool precise performance: 11102MB/s
Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 2723MB/s; PRIMALposition=1506; PRIMALlength=125
Railgun_Ennearch precise performance: 287063MB/s
Railgun_DecumanusBITified precise performance: 267550MB/s
Boyer_Moore-Horspool precise performance: 13990MB/s
Doing Search for Pattern(2686bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 800MB/s; PRIMALposition=2018; PRIMALlength=266
Railgun_Ennearch precise performance: 227827MB/s
Railgun_DecumanusBITified precise performance: 240171MB/s
Boyer_Moore-Horspool precise performance: 11565MB/s
Doing Search for Pattern(3867bytes) into String(52428800bytes) as-one-line ...
Railgun_Swampshine precise performance: 492MB/s; PRIMALposition=2714; PRIMALlength=216
Railgun_Ennearch precise performance: 267842MB/s
Railgun_DecumanusBITified precise performance: 279599MB/s
Boyer_Moore-Horspool precise performance: 11846MB/s
我能说什么呢?巨魔反击了,我低估了它们的肮脏。
'Swampwalker' 的四阶如下
// Swampwalker heuristic order 4 (Needle should be bigger than 4) [
// Needle: 1234567890qwertyuiopasdfghjklzxcv PRIMALposition=01 PRIMALlength=33 '1234567890qwertyuiopasdfghjklzxcv'
// Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv PRIMALposition=29 PRIMALlength=04 'vvvv'
// Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv PRIMALposition=08 PRIMALlength=20 'vvvBOOMSHAKALAKAvvvv'
// Needle: Trollland PRIMALposition=01 PRIMALlength=09 'Trollland'
// Needle: Swampwalker PRIMALposition=01 PRIMALlength=11 'Swampwalker'
// Needle: licenselessness PRIMALposition=01 PRIMALlength=15 'licenselessness'
// Needle: alfalfa PRIMALposition=02 PRIMALlength=06 'lfalfa'
// Needle: Sandokan PRIMALposition=01 PRIMALlength=08 'Sandokan'
// Needle: shazamish PRIMALposition=01 PRIMALlength=09 'shazamish'
// Needle: Simplicius Simplicissimus PRIMALposition=06 PRIMALlength=20 'icius Simplicissimus'
// Needle: domilliaquadringenquattuorquinquagintillion PRIMALposition=01 PRIMALlength=32 'domilliaquadringenquattuorquinqu'
// Needle: boom-boom PRIMALposition=02 PRIMALlength=08 'oom-boom'
// Needle: vvvvv PRIMALposition=01 PRIMALlength=04 'vvvv'
// Needle: 12345 PRIMALposition=01 PRIMALlength=05 '12345'
// Needle: likey-likey PRIMALposition=03 PRIMALlength=09 'key-likey'
// Needle: BOOOOOM PRIMALposition=03 PRIMALlength=05 'OOOOM'
// Needle: aaaaaBOOOOOM PRIMALposition=02 PRIMALlength=09 'aaaaBOOOO'
// Needle: BOOOOOMaaaaa PRIMALposition=03 PRIMALlength=09 'OOOOMaaaa'
PRIMALlength=0;
for (i=0+(1); i < cbPattern-((4)-1)+(1)-(1); i++) { // -(1) because the last BB order 4 has no counterpart(s)
FoundAtPosition = cbPattern - ((4)-1) + 1;
PRIMALpositionCANDIDATE=i;
while ( PRIMALpositionCANDIDATE <= (FoundAtPosition-1) ) {
j = PRIMALpositionCANDIDATE + 1;
while ( j <= (FoundAtPosition-1) ) {
if ( *(uint32_t *)(pbPattern+PRIMALpositionCANDIDATE-(1)) == *(uint32_t *)(pbPattern+j-(1)) ) FoundAtPosition = j;
j++;
}
PRIMALpositionCANDIDATE++;
}
PRIMALlengthCANDIDATE = (FoundAtPosition-1)-i+1 +((4)-1);
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=i; PRIMALlength = PRIMALlengthCANDIDATE;}
if (cbPattern-i+1 <= PRIMALlength) break;
}
// Swampwalker heuristic order 4 (Needle should be bigger than 4) ]
“最坏情况”折磨的结果
干草堆:在文件“WORST.TXT”中
类型:最差
大小:200,000,609字节
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WORST.TXT
[200,000,000 'Z' bytes]
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZ
针头 #1:在文件“WCS_PREFIXEDandPOSTFIXED_609chars.txt”中
针头 #2:在文件“WCS_PREFIXED_309chars.txt”中
针头 #3:在文件“WCS_POSTFIXED_309chars.txt”中
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_PREFIXEDandPOSTFIXED_609chars.txt
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZ
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_PREFIXED_309chars.txt
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZSHAKALAKA
D:\_KAZE\_KAZE_32bit_memmem-like_showdown_Swampshine>type WCS_POSTFIXED_309chars.txt
SHAKALAKAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZ
对于文件“WCS_PREFIXEDandPOSTFIXED_609chars.txt”中的针头
Doing Search for Pattern(609bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance: 902MB/s
Railgun_Ennearch precise performance: 9MB/s
Railgun_DecumanusBITified precise performance: 14MB/s
Boyer_Moore-Horspool precise performance: 21MB/s
原始针头:“ZZZSHAKALAKAZZZZ”
PRIMALposition=298; PRIMALlength=16
对于文件“WCS_PREFIXED_309chars.txt”中的针头
Doing Search for Pattern(309bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance: 329MB/s
Railgun_Ennearch precise performance: 155MB/s
Railgun_DecumanusBITified precise performance: 102MB/s
Boyer_Moore-Horspool precise performance: 3048MB/s
原始针头:“ZZZZSHAKALAKA”
PRIMALposition=297; PRIMALlength=13
对于文件“WCS_POSTFIXED_309chars.txt”中的针头
Doing Search for Pattern(309bytes) into String(200000609bytes) as-one-line ...
Railgun_Swampshine precise performance: 902MB/s
Railgun_Ennearch precise performance: 451MB/s
Railgun_DecumanusBITified precise performance: 225MB/s
Boyer_Moore-Horspool precise performance: 223MB/s
原始针头:“SHAKALAKAZZZZ”
PRIMALposition=1; PRIMALlength=13
“Swampshine”的修订版2应该会移除中断。
一个解决方案是回退到“Gulliver”,经过转换(2阶,不是4阶)后,其2阶跳过肯定会超越Boyer_Moore-Horspool,但这只会在我无法想出其他方法时发生。
2014年1月23日的补充
最新最好的(针对非最坏情况)是“Railgun_Ennearch”——甚至比“Railgun_Decumanus”更好
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 19797MB/s
Railgun_DecumanusBITified precise performance: 17807MB/s
Boyer_Moore-Horspool precise performance: 3252MB/s
Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 37175MB/s
Railgun_DecumanusBITified precise performance: 33822MB/s
Boyer_Moore-Horspool precise performance: 4473MB/s
Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 138014MB/s
Railgun_DecumanusBITified precise performance: 121397MB/s
Boyer_Moore-Horspool precise performance: 5731MB/s
Doing Search for Pattern(2686bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 129381MB/s
Railgun_DecumanusBITified precise performance: 145128MB/s
Boyer_Moore-Horspool precise performance: 4541MB/s
Doing Search for Pattern(3867bytes) into String(52428800bytes) as-one-line ...
Railgun_Ennearch precise performance: 169574MB/s
Railgun_DecumanusBITified precise performance: 171716MB/s
Boyer_Moore-Horspool precise performance: 4514MB/s
对于52,428,800字节的维基百科干草堆和3,867字节的针头,Railgun_Ennearch 比 Boyer_Moore-Horspool 快了 (169574-4514)/4514*100% =3,656%,无注释。
基准测试及其源代码位于:http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Ennearch.7z。
想穿越巨魔之地,奔跑而不是步行,更不是爬行。
从一开始,Railgun就被设计为文本搜索器,后来它被扩展到非文本数据,而最坏情况(Worst Case Scenarios)曾是未知的致命沼泽地带,直到现在。
我正在编写第一个W.C.S. Railgun:“Railgun_Trolldom”。
有一个非常简单的启发式算法,它不会影响其他部分的性能(速度方面)。
它将被嵌入到伪4阶部分(当针头长度大于19时,即“if ( cbPattern>NeedleThreshold2vs4Nonus )”条件)。
它的目的是找到最右边出现次数最多的2字节块,立即举一个例子:
假设我们的干草堆长度为200,000,000+33字节,其中前200,000,000字节是'v'字节,而针头是最后33字节。
Haystack: ...vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv 200,000,033 bytes long
Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv 33 bytes long
Factorizing the needle, Total Building-Blocks are 33-2+1 BBs/suffixes:
BB/Suffix01: [vv]vvvvvvvvBOOMSHAKALAKAvvvvvvvvvv
BB/Suffix02: v[vv]vvvvvvvBOOMSHAKALAKAvvvvvvvvvv
...
BB/Suffix31: vvvvvvvvvvBOOMSHAKALAKAvvvvvvv[vv]v
BB/Suffix32: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvv[vv]
Distinct BBs are 33-2+1-(8+1+1+9)=13:
01: [vv] ! 8 duplicates !
02: [vB]OOMSHAKALAKAv
03: [BO]OMSHAKALAKAv
04: [OO]MSHAKALAKAv
05: [OM]SHAKALAKAv
06: [MS]HAKALAKAv
07: [SH]AKALAKAv
08: [HA]KALAKAv
09: [AK]ALAKAv
10: [KA]LAKAv
11: [AL]AKAv
12: [LA]KAv
13: [AK]Av ! 1 duplicate !
14: [KA]v ! 1 duplicate !
15: [Av]
16: [vv] ! 9 duplicates !
以下五个问题为这种(所谓的“Swamprunner”)启发式算法提供了绿灯:
问题1] “启动”(查找表中总元素/不同元素)比率是否大于1?
问题2] 不同元素的数量是否大于1?
问题3] 最常见且最右边的2字节后缀是什么(所谓的“TROLLish”构建块)?
问题4] 不包含TROLLish BB的最长且最右边的字符串是什么(所谓的“NewNeedle”)?
问题5] 不包含TROLLish BB的最长且最右边的字符串是否大于3?
清空并使用新的针头值重新加载查找表,为“NewNeedle”重新询问上述问题
让我们回答这五个问题
答案1] 是,(int)(33-2+1)/(13)=2 > 1
答案2] 是,13 > 1
答案3] 'vv'
答案4] 'vBOOMSHAKALAKAv'
答案5] 是,15 > 3
清零并用新针的值重新加载查找表,为“NewNeedle”重新回答上述问题
五盏绿灯,“沼泽跑者”正在路上。
使用2阶搜索,如果找到,则在相应的干草堆偏移处比较“LeftLeftover+NewNeedle+RightLeftover”。
“Railgun_Ennearch” + “Swamprunner”启发式算法 = “Railgun_Trolldom”
哈,我刚想起拿破仑在滑铁卢战役中最终失败的原因之一,他的炮兵由于泥泞的地形无法造成溅射伤害——炮弹直接陷入泥中。
希望我的电磁炮能在巨魔沼泽中制造泥泞的海啸,让无数巨魔乘子弹而行。
更新
我对“Swamprunner”不满意,我意识到它作用不大,必须实现更激进的启发式算法。
我不得不睡了一晚,以改进和简化过于复杂的“Swamprunner”,结果是shazamish(不是shamazing),新的启发式算法被称为“Swampwalker”。
与“Swampwalker”相比,“Swamprunner”太弱且不实用。
其思想是找到针头中最右边且最长的原始字符串(PRIMAL),原始字符串与素数非常相似,即它必须是不可分解的。
不可分解性保证了BMH 2阶和BMH伪4阶的卓越跳跃性能。
如果最长的原始字符串恰好长度为2(所有字节都相同),则执行标准搜索。
这种巧妙启发式算法的原理很简单,它应用于2阶,在每个位置搜索第一个重复项,重复项的位置构成下一次搜索的右边界,当然,下一个BB以同样的方式检查,直到达到右边界,所有这些都针对一次一个位置进行。
在每个位置,如果“PRIMALlength”等于或大于前一个“PRIMALlength”,则更新它。
最后我们得到两个值:“PRIMALposition”和“PRIMALlength”定义了PRIMAL NewNeedle。
再次,在每个步骤中,如果“PRIMALlength”等于或大于前一个“PRIMALlength”,则更新它。
Legend:
'[]' points to BB forming left or right boundary;
'{}' points to BB being searched for;
'()' position of duplicate and new right boundary;
下面有三个例子
00000000011111111112222222222333
12345678901234567890123456789012
Example #1 for Needle: 1234567890qwertyuiopasdfghjklzxcv NewNeedle = '1234567890qwertyuiopasdfghjklzxcv'
Example #2 for Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv NewNeedle = 'vv'
Example #3 for Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv NewNeedle = 'vvBOOMSHAKALA'
示例 #1
PRIMALlength=00; FoundAtPosition=33;
Step 01_00: {}[12]34567890qwertyuiopasdfghjklzxc[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{12}]34567890qwertyuiopasdfghjklzxc[v?] ! Searching for '12', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_02: [1{2]3}4567890qwertyuiopasdfghjklzxc[v?] ! Searching for '23', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
...
Step 01_30: [12]34567890qwertyuiopasdfghjkl{zx}c[v?] ! Searching for 'zx', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_31: [12]34567890qwertyuiopasdfghjklz{xc}[v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}1[23]4567890qwertyuiopasdfghjklzxc[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: 1[{23}]4567890qwertyuiopasdfghjklzxc[v?] ! Searching for '23', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_02: 1[2{3]4}567890qwertyuiopasdfghjklzxc[v?] ! Searching for '34', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
...
Step 02_29: 1[23]4567890qwertyuiopasdfghjkl{zx}c[v?] ! Searching for 'zx', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_30: 1[23]4567890qwertyuiopasdfghjklz{xc}[v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}1234567890qwertyuiopasdfghjklz[xc][v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: 1234567890qwertyuiopasdfghjklz[{xc}][v?] ! Searching for 'xc', FoundAtPosition = 33, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Result:
PRIMALposition=01 PRIMALlength=33, NewNeedle = '1234567890qwertyuiopasdfghjklzxcv'
示例 #2
PRIMALlength=00; FoundAtPosition=33;
Step 01_00: {}[vv]vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{v(v}]v)vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 02, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(02-1)-01+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}v[vv]vvvvvvvvvvvvvvvvvvvvvvvvvvvvv[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: v[{v(v}]v)vvvvvvvvvvvvvvvvvvvvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 03, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(03-1)-02+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[vv][v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv[{v(v}]v) ! Searching for 'vv', FoundAtPosition = 32, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(32-1)-31+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Result:
PRIMALposition=31 PRIMALlength=02, NewNeedle = 'vv'
示例 #3
PRIMALlength=00; FoundAtPosition=33;
Step 01_00: {}[vv]vvvvvvvvBOOMSHAKALAKAvvvvvvvvv[v?] ! For position #01 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=01, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-01+(2)=33 !
Step 01_01: [{v(v}]v)vvvvvvvBOOMSHAKALAKAvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 02, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(02-1)-01+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Step 02_00: {}v[vv]vvvvvvvBOOMSHAKALAKAvvvvvvvvv[v?] ! For position #02 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=02, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-02+(2)=32 !
Step 02_01: v[{v(v}]v)vvvvvvBOOMSHAKALAKAvvvvvvvvvv ! Searching for 'vv', FoundAtPosition = 03, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(03-1)-02+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 09_00: {}vvvvvvvv[vv]BOOMSHAKALAKAvvvvvvvvv[v?] ! For position #09 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=09, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-09+(2)=25 !
Step 09_01: vvvvvvvv[{vv}]BOOMSHAKALAKA(vv)vvvvvvvv ! Searching for 'vv', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_02: vvvvvvvv[v{v]B}OOMSHAKALAKA[vv]vvvvvvvv ! Searching for 'vB', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_03: vvvvvvvv[vv]{BO}OMSHAKALAKA[vv]vvvvvvvv ! Searching for 'BO', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_04: vvvvvvvv[vv]B{OO}MSHAKALAKA[vv]vvvvvvvv ! Searching for 'OO', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_05: vvvvvvvv[vv]BO{OM}SHAKALAKA[vv]vvvvvvvv ! Searching for 'OM', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_06: vvvvvvvv[vv]BOO{MS}HAKALAKA[vv]vvvvvvvv ! Searching for 'MS', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_07: vvvvvvvv[vv]BOOM{SH}AKALAKA[vv]vvvvvvvv ! Searching for 'SH', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_08: vvvvvvvv[vv]BOOMS{HA}KALAKA[vv]vvvvvvvv ! Searching for 'HA', FoundAtPosition = 24, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(24-1)-09+(2)=16 !
Step 09_09: vvvvvvvv[vv]BOOMSH{AK}AL(AK)Avvvvvvvvvv ! Searching for 'AK', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
Step 09_10: vvvvvvvv[vv]BOOMSHA{KA}L[AK]Avvvvvvvvvv ! Searching for 'KA', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
Step 09_11: vvvvvvvv[vv]BOOMSHAK{AL}[AK]Avvvvvvvvvv ! Searching for 'AL', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
Step 09_12: vvvvvvvv[vv]BOOMSHAKA{L[A}K]Avvvvvvvvvv ! Searching for 'LA', FoundAtPosition = 21, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(21-1)-09+(2)=13 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
...
Step 31_00: {}vvvvvvvv[vv]BOOMSHAKALAKAvvvvvvvvv[v?] ! For position #31 the initial boundaries are PRIMALpositionCANDIDATE=LeftBoundary=31, RightBoundary=FoundAtPosition-1, the CANDIDATE PRIMAL string length is RightBoundary-LeftBoundary+(2)=(33-1)-31+(2)=03 !
Step 31_01: vvvvvvvvvvBOOMSHAKALAKAvvvvvvv[{v(v}]v) ! Searching for 'vv', FoundAtPosition = 32, PRIMALlengthCANDIDATE=RightBoundary-LeftBoundary+(2)=(32-1)-31+(2)=02 !
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=PRIMALpositionCANDIDATE; PRIMALlength = PRIMALlengthCANDIDATE;}
Result:
PRIMALposition=09 PRIMALlength=13, NewNeedle = 'vvBOOMSHAKALA'
展开后,这种启发式算法看起来有点复杂,但实际上,用C语言实现时,它本身就是简洁的。
PRIMALlength=0;
for (i=0+(1); i < cbPattern-2+1+(1)-(1); i++) { // -(1) because the last BB order 2 has no counterpart(s)
FoundAtPosition = cbPattern;
PRIMALpositionCANDIDATE=i;
while ( PRIMALpositionCANDIDATE <= (FoundAtPosition-1) ) {
j = PRIMALpositionCANDIDATE + 1;
while ( j <= (FoundAtPosition-1) ) {
if ( *(unsigned short *)(pbPattern+PRIMALpositionCANDIDATE-(1)) == *(unsigned short *)(pbPattern+j-(1)) ) FoundAtPosition = j;
j++;
}
PRIMALpositionCANDIDATE++;
}
PRIMALlengthCANDIDATE = (FoundAtPosition-1)-i+(2);
if (PRIMALlengthCANDIDATE >= PRIMALlength) {PRIMALposition=i; PRIMALlength = PRIMALlengthCANDIDATE;}
}
哦,在搜索10^0..10^9999范围内最长的数字名称后,下一个独一无二的43个字母的单词出现了:
domilliaquadringenquattuorquinquagintillion
它代表着
10^7365 一个 do-millia-quadringen-quattuor-quinquagin-tillion
并将一些字符串转换为它们的原始字符串
Needle: 1234567890qwertyuiopasdfghjklzxcv PRIMALposition=01 PRIMALlength=33 '1234567890qwertyuiopasdfghjklzxcv'
Needle: vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv PRIMALposition=31 PRIMALlength=02 'vv'
Needle: vvvvvvvvvvBOOMSHAKALAKAvvvvvvvvvv PRIMALposition=09 PRIMALlength=13 'vvBOOMSHAKALA'
Needle: Trollland PRIMALposition=05 PRIMALlength=05 'lland'
Needle: Swampwalker PRIMALposition=03 PRIMALlength=09 'ampwalker'
Needle: licenselessness PRIMALposition=01 PRIMALlength=13 'licenselessne'
Needle: alfalfa PRIMALposition=04 PRIMALlength=04 'alfa'
Needle: Sandokan PRIMALposition=01 PRIMALlength=07 'Sandoka'
Needle: shazamish PRIMALposition=02 PRIMALlength=08 'hazamish'
Needle: Simplicius Simplicissimus PRIMALposition=08 PRIMALlength=15 'ius Simplicissi'
Needle: domilliaquadringenquattuorquinquagintillion PRIMALposition=01 PRIMALlength=19 'domilliaquadringenq'
Needle: DODO PRIMALposition=02 PRIMALlength=03 'ODO'
Needle: DODOD PRIMALposition=03 PRIMALlength=03 'DOD'
Needle: aaaDODO PRIMALposition=02 PRIMALlength=05 'aaDOD'
Needle: aaaDODOD PRIMALposition=02 PRIMALlength=05 'aaDOD'
Needle: DODOaaa PRIMALposition=02 PRIMALlength=05 'ODOaa'
Needle: DODODaaa PRIMALposition=03 PRIMALlength=05 'DODaa'
地照
n.
地球表面反射的阳光,照亮月球未被太阳直接照射的部分。也称为地光。
/遗产/
“Swampwalker”的美妙之处在于它的就地(无需额外空间)工作,这在大针头(例如8KB)的情况下尤其有效。我相信“Railgun_Ennearch”经过“Swampwalker”转换的强化,将能够击败最坏情况,从而使其更接近线性。这个最坏情况打击器将被命名为 Railgun_Swampshine,明白了,巨魔沼泽中的射击会如此激烈/血腥,以至于产生的等离子体会照亮月球的一部分。
2014年1月12日的补充
是时候进行Decuman打击了。
到目前为止,Railgun_Decumanus 是所有Railgun中最快、最强的,其跳跃性能优越,因为它首次和第二次快速检查使用了1+(1+1+1)次伪4阶查找——仅以两个条件语句实现。
Decumanus检查干草堆的4个后缀(在窗口最右边的10个字节内)是否存在于针头中,这4个检查提供了坚实的控制,同时保持了出色的速度性能,事实上,“Decumanus”意为巨大的东西。
Railgun_Decumanus 得到了极好的强化(参见GENOMEsque折磨图),以至于它摧毁了我所知道的所有打击器。
维基百科折磨测试中,3个超出图表的值为
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 20492MB/s
Boyer_Moore-Horspool precise performance: 3605MB/s
Doing Search for Pattern(1080bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 36408MB/s
Boyer_Moore-Horspool precise performance: 4896MB/s
Doing Search for Pattern(1755bytes) into String(52428800bytes) as-one-line ...
Railgun_Decumanus precise performance: 86141MB/s
Boyer_Moore-Horspool precise performance: 6281MB/s
花一分钟比较一下86141和6281,对于1755字节长的针头,重复后缀的2阶或/和4阶数量是显著的,后缀管理必须既快速又高效。
Boyer-Moore-Horspool及其1阶查找由于大量重复字节而表现不佳,我们指的是它的原生模式——文本数据。
即使是“Gulliver”,我旧的2阶BMH Railgun,也会被如此长的针头所伤害。
更糟糕的是,Railgun_Sekireigan_Wolfram_Nozomi 达到了可怜的28238MB/s,1271%比BMH提升——这就是4个伪4阶查找的力量。
基因组基准测试
维基百科基准测试
我故意将针头长度为0..3的部分留空/未处理,SSE2在小针头上表现出色,Mischa Sandberg最清楚,128位汇编指令简直摧毁了一切。
基准测试及其源代码位于:http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Decumanus.7z。
2013年11月12日的补充
我通过在第二层检查中添加额外的查找,同时不增加额外的分支,从而增强了Shockeroo的跳跃性能(同时保持相同的速度)!
包含其主循环汇编代码的C源代码可在下载区获取。打印友好版本:Railgun_Wolfram_4-pages.pdf。
“Railgun_Sekireigan_WolfRAM”的主体是这样的:
- 第1节:用于最短(2-3字节)针头的简化Karp-Rabin算法;
- 第2节:用于较短(4字节及以上)针头和“短”干草堆的Quadruplet算法;
- 第3a节:用于短(4-22字节)针头和“长”干草堆的Boyer-Moore-Horspool 2阶(2)+(2+2)算法;
- 第3b节:用于长(23-711字节)针头和“长”干草堆的Boyer-Moore-Horspool伪4阶(4)+(4)算法;
- 第4节:用于最长(712字节及以上)针头和“长”干草堆的简化Horspool伪12阶算法。
关于一项出色的跳跃性能研究的几点思考
一些“仍然”被认为是疯狂的想法:使用位操作的3阶,而不是伪3阶!2^24 = 16MB字节操作,或2^(24-3) = 2MB位操作。当搜索大型干草堆,例如使用Kazahana搜索42GB的维基百科时,这个2MB的查找表似乎没那么可怕。
// The code is like:
ulHashTarget = *(uint32_t *)&pbTarget[i+cbPattern-4]; // One memory access instead of 2
if ( bm_Horspool_Order3[ulHashTarget>>8] == 0 ) Gulliver = cbPattern-(3-1); else
if ( bm_Horspool_Order3[ulHashTarget&0xFFFFFF] == 0 ) Gulliver = cbPattern-(3-1)-1; else
{
// ...
}
当搜索大型干草堆,比如整个维基百科(一个42GB的文件)时,我期望3阶结合多线程(例如我的Kazahana工具中的16个线程),如果所有线程都使用这个共享的2MB查找表,将超越“WolfRAM”,有待完成。
2013年11月9日的补充
是时候震惊了。
这个震惊是苦乐参半的,我没能在外循环中更好地提升跳跃性能,而内循环却快得多(甚至在我的Core 2上)。“Railgun_Sekireigan_Shriekeroo”的第3a/3b节中的两大改进导致了“Railgun_Sekireigan_Shockeroo”的出现——迄今为止最快的MEMMEM函数。这种提升是双重的,通过减少两个循环(外部和内部)的迭代次数。对于外部循环,我们试图从右到左寻找更好的跳跃,但迄今为止尚未成功。对于内部循环,周期/迭代次数减少了4倍。第3a节的主循环
while (i <= cbTarget-cbPattern) {
Gulliver = 1; // 'Gulliver' is the skip
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
// This fast check ensures not missing a match (for remainder) when going under 0 in loop below:
// Order 4 [
// Let's try something "outrageous" like comparing
// with[out] overlap BBs 4bytes long instead of 1 byte back-to-back:
// Inhere we are using order 4, 'cbPattern - Order + 1' is the number
// of BBs for text 'cbPattern' bytes long, for example,
// for cbPattern=11 'fastest fox' and Order=4 we have BBs = 11-4+1=8:
//0:"fast" if the comparison failed here, 'count' is 1; 'Gulliver' is cbPattern-(4-1)-7
//1:"aste" if the comparison failed here, 'count' is 2; 'Gulliver' is cbPattern-(4-1)-6
//2:"stes" if the comparison failed here, 'count' is 3; 'Gulliver' is cbPattern-(4-1)-5
//3:"test" if the comparison failed here, 'count' is 4; 'Gulliver' is cbPattern-(4-1)-4
//4:"est " if the comparison failed here, 'count' is 5; 'Gulliver' is cbPattern-(4-1)-3
//5:"st f" if the comparison failed here, 'count' is 6; 'Gulliver' is cbPattern-(4-1)-2
//6:"t fo" if the comparison failed here, 'count' is 7; 'Gulliver' is cbPattern-(4-1)-1
//7:" fox" if the comparison failed here, 'count' is 8; 'Gulliver' is cbPattern-(4-1)
count = cbPattern-4+1;
while ( count > 0 && *(uint32_t *)(pbPattern+count-1) ==
*(uint32_t *)(&pbTarget[i]+(count-1)) )
//count--;
count = count-4; // - order, of course order 4 is much more SWEET&CHEAP - less loops
if ( count <= 0 ) return(pbTarget+i); else
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+count-1]] ==
0 ) Gulliver = count; // 1 or bigger, as it should
// Order 4 ]
}
}
// Trying to strengthen the Skip performance... here ONLY one additional
// lookup, for better/longer jumps more such lookups, unrolled to be added.
//if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
// Dummy me, the above line should play two roles (by putting
// it before the comparison cycle - which is nastily slow) instead of one:
// 1] If it is 0 that discards the need for further comparing - they cannot be equal.
// 2] Strengthen the skip.
} else Gulliver = cbPattern-(2-1);
i = i + Gulliver;
//GlobalI++; // Comment it, it is only for stats.
}
加速后有点失望,在i7上我期望不亚于“砰!”的声音。
基准测试及其源代码位于:http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Shockeroo.7z。
由于测试以毫秒为单位(足以满足原始目的),我添加了“Railgun_Sekireigan_Shockeroo”的总时钟周期,它执行了1024次,这是我的Core 2笔记本电脑上的结果
Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 51265299 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/19
Railgun_Sekireigan_Shockeroo performance: 2694KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 18642.000000clocks
...
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 52419431 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/3
Railgun_Sekireigan_Shockeroo performance: 17066KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 2532.000000clocks
编辑
Or, 7 bytes long needle is found at 51,265,299KB/18642clocks = ((51265299/18642)*1000)/1024/1024= 2.622GB/s on Core 2 laptop!!!
...
Or, 589 bytes long needle is found at 52,419,431KB/2532clocks = ((52419431/2532)*1000)/1024/1024= 19+GB/s on Core 2 laptop!!!
在Core i5 2430M @3000MHz 3MB L3缓存笔记本电脑上进行相同的测试
Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 51265299 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/13
Railgun_Sekireigan_Shockeroo performance: 3938KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 12345.000000clocks
...
Doing Search for Pattern(589bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 52419431 within next line:
Railgun_Sekireigan_Shockeroo_hits/Railgun_Sekireigan_Shockeroo_clocks: 0/2
Railgun_Sekireigan_Shockeroo performance: 25600KB/clock
Railgun_Sekireigan_Shockeroo TotalRoughSearchTime (lesser-the-better): 1283.000000clocks
Or, 7 bytes long needle is found at 51,265,299KB/12345clocks = ((51265299/12345)*1000)/1024/1024= 3.960GB/s on Core i5 laptop!!!
...
Or, 589 bytes long needle is found at 52,419,431KB/1283clocks = ((52419431/1283)*1000)/1024/1024= 38+GB/s on Core i5 laptop!!!
目前,我看不出如何提升“Shockeroo”,有什么想法吗?
2013年11月7日的补充
距离Railgun_MIMINO又近了一步……
BMH 2+2和BMH 4+4部分主循环中一个被忽视的优化终于完成了,所以现在“Railgun_Sekireigan_Shriekeroo”尖叫着比当前最快的更快。第3a节的主循环:
while (i <= cbTarget-cbPattern) {
Gulliver = 1;
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
//if ( Gulliver == 1 ) { // Means the Building-Block order 2 is found somewhere i.e. NO MAXIMUM SKIP
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2; else {
if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) return(pbTarget+i);
}
}
//}
// Trying to strengthen the Skip performance... here ONLY one additional lookup,
// for better/longer jumps more such lookups, unrolled to be added.
//if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
// Dummy me, the above line should play two roles (by putting
// it before the comparison cycle - which is nastily slow) instead of one:
// 1] If it is 0 that discards the need for further comparing - they cannot be equal.
// 2] Strengthen the skip.
} else Gulliver = cbPattern-(2-1);
i = i + Gulliver;
//GlobalI++; // Comment it, it is only for stats.
}
“Railgun_Sekireigan_Shriekeroo”的主体是这样的
- 第1节:用于最短(2-3字节)针头的简化Karp-Rabin算法;
- 第2节:用于较短(4字节及以上)针头和“短”干草堆的Quadruplet算法;
- 第3a节:Boyer-Moore-Horspool 2阶(2+2),用于短(4-22字节)针头和“长”干草堆;
- 第3b节:Boyer-Moore-Horspool伪4阶(4+4),用于长(23-711字节)针头和“长”干草堆;
- 第4节:用于最长(712字节及以上)针头和“长”干草堆的简化Horspool伪12阶算法。
基准测试及其源代码位于:http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Shriekeroo.7z
嗯,我还有一个想法(向后比较并查找不匹配的位置,这样Gulliver就不再是1而是更大了),这个修订版将命名为“Railgun_Sekireigan_Shockeroo”。
2013年11月5日的补充
飞吧,小飞象,飞吧!
长话短说:对于XML/英文文本,“Railgun_Sekireigan_TchittoGritto”无处不在地占据主导地位,而对于“ACGT”即基因组文件,它比经典的 BMH 快200%-400%!
我惊呆了,“Piri”修订版2中名为“TchittoGritto”的方法是如此迅捷,以至于它甚至超越了在基因组数据方面表现出色的华丽 BNDM!
对于“ACGT”数据,遍历212MB后
Needle 20 chars: BNDM_32 performance: 1235KB/clock Railgun_Sekireigan_TchittoGritto performance: 2012KB/clock Boyer_Moore-Horspool performance: 369KB/clock Needle 30 chars: BNDM_32 performance: 1659KB/clock Railgun_Sekireigan_TchittoGritto performance: 2241KB/clock Boyer_Moore-Horspool performance: 734KB/clock
我能说什么呢,BNDM 也许是我见过的最美的研究,但 Railgun_Sekireigan_TchittoGritto 超越了它,哈希算法简直击败了DAWG——只需要一个足够大的表格就能做到!此外,对于XML/英文文本,BNDM根本无法与“TchittoGritto”匹敌。速度惊人的表现不止于此,通过适当的针头阈值调整,其提升(与BMH相比)可以达到惊人的700%,详见下文。
BNDM代表“反向非确定性DAWG匹配”。DAWG“有向无环词图”:定义(1):一个有向无环图,表示给定字符串的后缀,其中每条边都标有一个字符。从根到节点的路径上的字符是该节点表示的子字符串。定义(2):一个识别一组词的有限状态机。
出于好奇,我想将BMH 1/2/4/12阶的实现并列比较,方法是向它们扔去“智人1号染色体,HuRef替代组装,全基因组鸟枪法序列”——一个212MB的小字母表(约6个不同字符)文件。众所周知,BMH在基因组数据方面表现不佳。在这里,经过大力强化的“Railgun_Sekireigan_TchittoGritto”证明了高阶BMH的迅捷性,它是一个4阶算法,但采用配对方式,即最右边的4+4——与真正的8阶相去甚远。搜索的是位于文件末尾的20/30/40/50/60/70字符长的针头。
正如您从下图所看到的,Railgun_Sekireigan_Bari (BMH 12阶) 以惊人的速度下降,对长度为60和70字符的针头获得动量,仅仅是因为
#define NeedleThresholdBIGSekirei 12+40 // Should be bigger than 'HasherezadeOrder'. BMH2 works up to this value (inclusive).
上述阈值(52)决定从2阶切换到12阶,即对于长度为53及以上的针头使用“Hasherezade”。“Railgun_Sekireigan_TchittoGritto”在达到712(其阈值)之前不会切换到12阶,这更适合XML/英文文本——在那里(4阶)它比12阶更快。
对于“ACGT”数据
Needle 60 chars: Railgun_Sekireigan_TchittoGritto performance: 2683KB/clock Railgun_Sekireigan_Bari performance: 3813KB/clock Railgun_Sekireigan_Bari_bit performance: 3813KB/clock Boyer_Moore-Horspool performance: 442KB/clock Needle 70 chars: Railgun_Sekireigan_TchittoGritto performance: 2619KB/clock Railgun_Sekireigan_Bari performance: 3952KB/clock Railgun_Sekireigan_Bari_bit performance: 3952KB/clock Boyer_Moore-Horspool performance: 488KB/clock
对于XML数据
Needle 7 chars: Railgun_Sekireigan_TchittoGritto performance: 2133KB/clock Railgun_Sekireigan_Bari performance: 2133KB/clock Needle 13 chars: Railgun_Sekireigan_TchittoGritto performance: 3200KB/clock Railgun_Sekireigan_Bari performance: 3200KB/clock ... Needle 201 chars: Railgun_Sekireigan_TchittoGritto performance: 7314KB/clock Railgun_Sekireigan_Bari performance: 4654KB/clock Needle 230 chars: Railgun_Sekireigan_TchittoGritto performance: 8533KB/clock Railgun_Sekireigan_Bari performance: 5120KB/clock Needle 255 chars: Railgun_Sekireigan_TchittoGritto performance: 8533KB/clock Railgun_Sekireigan_Bari performance: 5688KB/clock Needle 589 chars: Railgun_Sekireigan_TchittoGritto performance: 17066KB/clock Railgun_Sekireigan_Bari performance: 10240KB/clock
基准测试及其源代码位于:http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_TchittoGritto.7z
- “Чито Грито”是什么?
- 小鸟,小鸟,很小,没什么特别的。
/鲁比克和米米诺/
- 这是什么“Tchitto Gritto”?
- 小鸟,小鸟,不重要的一只,其实没什么特别的。
/鲁比克和米米诺/
“Mimino”在格鲁吉亚语中是მიმინო——“老鹰”,在英语中是Sparrowhawk,在拉丁语中是(Accipiter nisus);“猎鹰”是“shevardeni”。
总有一天,或者说某个晚上,我会写出最终的混合体:Railgun_MIMINO,但正如歌中所唱“我的明天永不来临”那样,它将成为我的绝唱。
2013年11月3日的补充
两件事。
首先,感谢Eiht先生再次帮助我,接下来的结果回答了几个有趣的问题:
- 现代CPU(i7 3930K @4.5GHz)如何执行“Railgun”;
- BYTE与BIT查找的行为方式;
- 两个32位(Microsoft VS2010和Intel v12.1)C优化器(Windows)能提供什么;
- Core 2与i7第三代并排比较;
Needle 7 chars: Railgun_Sekireigan_Bari performance: 5120KB/clock Railgun_Sekireigan_Bari_bit performance: 3657KB/clock Boyer_Moore-Horspool performance: 1765KB/clock Needle 13 chars: Railgun_Sekireigan_Bari performance: 6400KB/clock Railgun_Sekireigan_Bari_bit performance: 5688KB/clock Boyer_Moore-Horspool performance: 3011KB/clock Needle 21 chars: Railgun_Sekireigan_Bari performance: 7314KB/clock Railgun_Sekireigan_Bari_bit performance: 7314KB/clock Boyer_Moore-Horspool performance: 3657KB/clock Needle 28 chars: Railgun_Sekireigan_Bari performance: 8533KB/clock Railgun_Sekireigan_Bari_bit performance: 7314KB/clock Boyer_Moore-Horspool performance: 4266KB/clock Needle 37 chars: Railgun_Sekireigan_Bari performance: 8533KB/clock Railgun_Sekireigan_Bari_bit performance: 7314KB/clock Boyer_Moore-Horspool performance: 4654KB/clock Needle 51 chars: Railgun_Sekireigan_Bari performance: 10240KB/clock Railgun_Sekireigan_Bari_bit performance: 8533KB/clock Boyer_Moore-Horspool performance: 5120KB/clock
快速解答
* Railgun_Sekireigan_Bari 在所有方面都占据主导地位。
* 字节到位查找转换所需的指令仍然耗费大量时间。
* Intel 编译器为按位版本生成更快的代码,美味极了。
* Core 2 保持着优势,它仍然非常出色,考虑到我的“Bonboniera”是2200MHz,而Eiht先生的“Redemption”是4500MHz。
其次,我只是想再试一次旧想法(在右侧一次加载4字节,就像最初的Railgun在左侧做的那样),这导致了“Railgun_Sekireigan_Piri”的出现——一个针对10+长针头快得多的修订版。
“Railgun_Sekireigan_Piri”的主体是这样的
- 第1节:用于最短(2-3字节)针头的简化Karp-Rabin算法;
- 第2节:用于较短(4字节及以上)针头和“短”干草堆的Quadruplet算法;
- 第3节:用于短(4-198字节)针头和“长”干草堆的Boyer-Moore-Horspool伪4阶算法;
- 第4节:用于最长(199字节及以上)针头和“长”干草堆的简化Horspool伪12阶算法。
这个第一个修订版摇摇晃晃——通过检查前4个字节(左侧)来增强姿态(即跳过性能)应该能让Railgun发出震耳欲聋的尖叫!
第3部分可能改进
- 更强的“哈希”,仅加法最快;
- 对于短于12的针头,返回到2阶;
- 将“NeedleThresholdBIGSekireiPiri”从12+180增加到600;
- 合并两个查找数组并将4阶部分转换为位操作——这样开销就只有一个8KB数组。
基准测试及其源代码位于
http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown_Piri.7z
第3节如下所示——其他三节保持不变
...
if ( cbPattern<=NeedleThresholdBIGSekireiPiri ) {
countSTATIC = cbPattern-2-2;
ulHashPattern = *(uint32_t *)(pbPattern); // First four bytes
//ulHashTarget = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
i=0;
//for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]= cbPattern-1;} // cbPattern-(Order-1) for Horspool; 'memset' if not optimized
//for (j=0; j < cbPattern-1; j++) bm_Horspool_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed
for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]=0;}
// In line below we "hash" 4bytes to 2bytes i.e. 16bit table, how to compute TOTAL number of BBs, 'cbPattern - Order + 1' is the number of BBs for text 'cbPattern' bytes long, for example, for cbPattern=11 'fastest fox' and Order=4 we have BBs = 11-4+1=8:
//"fast"
//"aste"
//"stes"
//"test"
//"est "
//"st f"
//"t fo"
//" fox"
//for (j=0; j < cbPattern-4+1; j++) bm_Horspool_Order2[( *(unsigned short *)(pbPattern+j+0) + *(unsigned short *)(pbPattern+j+2) ) & ( (1<<16)-1 )]=1;
for (j=0; j < cbPattern-4+1; j++) bm_Horspool_Order2[( (*(uint32_t *)
(pbPattern+j+0)>>16)+(*(uint32_t *)
(pbPattern+j+0)&0xFFFFF) ) & ( (1<<16)-1 )]=1;
while (i <= cbTarget-cbPattern) {
Gulliver = 1;
//if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]] != 0 ) {
//if ( bm_Horspool_Order2[( *(unsigned short *)&pbTarget[i+cbPattern-1-1-2] +
// *(unsigned short *)&pbTarget[i+cbPattern-1-1] ) & ( (1<<16)-1 )] != 0 ) {
// 001d2 0f b7 6c 01 fc movzx ebp, WORD PTR [-4+ecx+eax]
// 001d7 0f b7 5c 01 fe movzx ebx, WORD PTR [-2+ecx+eax]
// 001dc 03 eb add ebp, ebx
// TO-DO: The above line should be with one only memory access i.e.,
// to fetch 4 bytes at once and then to mix the low and high part.
if ( bm_Horspool_Order2[( (*(uint32_t *)&pbTarget[i+cbPattern-1-1-2]>>16)+
(*(uint32_t *)&pbTarget[i+cbPattern-1-1-2]&0xFFFFF) ) & ( (1<<16)-1 )] != 0 ) {
//if ( Gulliver == 1 ) { // Means the Building-Block order 2 is found somewhere i.e. NO MAXIMUM SKIP
if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-
count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) return(pbTarget+i);
}
//}
// Trying to strengthen the Skip performance... here ONLY one additional
// lookup, for better/longer jumps more such lookups, unrolled to be added.
// Since 'Piri' revision not trying anymore, instead trying to reduce main loop size while going to order 4.
//if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
} else Gulliver = cbPattern-(2-1)-2; // -2 because we check the 4 rightmost bytes not 2.
i = i + Gulliver;
//GlobalI++; // Comment it, it is only for stats.
}
return(NULL);
// The above loop in Assembly:
/*
; mark_description "Intel(R) C++ Compiler XE for applications
; running on IA-32, Version 12.1.1.258 Build 20111011";
; mark_description "-O3 -FAcs -Festrstr_SHORT-SHOWDOWN_LV_WIKI_32bit_IntelV12";
.B2.29:
001d2 8b 5c 01 fc mov ebx, DWORD PTR [-4+ecx+eax]
001d6 8b eb mov ebp, ebx
001d8 c1 ed 10 shr ebp, 16
001db 03 eb add ebp, ebx
001dd 0f b7 fd movzx edi, bp
001e0 0f b6 1c 3c movzx ebx, BYTE PTR [esp+edi]
001e4 85 db test ebx, ebx
001e6 74 52 je .B2.38
.B2.30:
001e8 3b 14 06 cmp edx, DWORD PTR [esi+eax]
001eb 0f 85 3c 04 00
00 jne .B2.83
.B2.31:
001f1 8b 9c 24 08 00
01 00 mov ebx, DWORD PTR [65544+esp]
001f8 8b eb mov ebp, ebx
001fa f7 dd neg ebp
001fc 85 db test ebx, ebx
001fe 0f 84 21 04 00
00 je .B2.82
.B2.32:
00204 8b bc 24 04 00
01 00 mov edi, DWORD PTR [65540+esp]
0020b 8b b4 24 00 00
01 00 mov esi, DWORD PTR [65536+esp]
00212 03 f8 add edi, eax
.B2.33:
00214 8a 54 35 04 mov dl, BYTE PTR [4+ebp+esi]
00218 3a 54 3d 04 cmp dl, BYTE PTR [4+ebp+edi]
0021c 0f 85 ee 03 00
00 jne .B2.81
.B2.34:
00222 45 inc ebp
00223 4b dec ebx
00224 75 ee jne .B2.33
.B2.35:
00226 8b b4 24 2c 00
01 00 mov esi, DWORD PTR [65580+esp]
.B2.36:
0022d 03 c6 add eax, esi
0022f 81 c4 18 00 01
00 add esp, 65560
00235 5d pop ebp
00236 5b pop ebx
00237 5f pop edi
00238 5e pop esi
00239 c3 ret
.B2.38:
0023a 8b 9c 24 10 00
01 00 mov ebx, DWORD PTR [65552+esp]
.B2.39:
00241 03 c3 add eax, ebx
00243 3b 84 24 14 00
01 00 cmp eax, DWORD PTR [65556+esp]
0024a 76 86 jbe .B2.29
.B2.40:
0024c 33 c0 xor eax, eax
0024e 81 c4 18 00 01
00 add esp, 65560
00254 5d pop ebp
00255 5b pop ebx
00256 5f pop edi
00257 5e pop esi
00258 c3 ret
.B2.41:
...
.B2.81:
00610 89 b4 24 00 00
01 00 mov DWORD PTR [65536+esp], esi
00617 8b 94 24 0c 00
01 00 mov edx, DWORD PTR [65548+esp]
0061e 8b b4 24 2c 00
01 00 mov esi, DWORD PTR [65580+esp]
.B2.82:
00625 85 db test ebx, ebx
00627 0f 84 00 fc ff
ff je .B2.36
.B2.83:
0062d bb 01 00 00 00 mov ebx, 1
00632 e9 0a fc ff ff jmp .B2.39
; The whole section (main loop plus the exit) is 00258-001d2+1 + 00632-00610+5 = 135+39 bytes long.
*/
} else { // if ( cbPattern<=NeedleThresholdBIGSekireiPiri )
...
当夜色渐逝——寂静地退入空荡荡的殿堂
突然一阵悸动——终于重现,在我让它坠落的地方
追随一个梦的漫游——一个让我的灵魂鲜活的梦
相信一片开阔的天空——相信一份爱。
与陌生人共舞——不顾他微笑中潜藏的危险
当露珠凝结——呼吸着晨曦,像一个熟睡的孩子。
如果光明的记忆终将消逝——地平线延伸至冰冷湛蓝
直到你的心自由飞翔——那我将为你守候太阳
直到你触及开阔的天空——那我将为你守候太阳
直到我们永不道别——那我将为你守候太阳。
/Giorgio Moroder - (Theme From) Midnight Express (vocal Chris Bennett) 歌词/
致 Piri。
2013年10月16日的补充
是时候推出(在非病理情况下,Mischa更清楚)无可争议的最快MEMMEM函数,名为“Railgun_Sekireigan_Bari”。这是“Railgun_Sekireigan”自然而然的下一步:现在检查我们最右边的2字节左侧的另外2字节。这第二次检查使得跳过/跳跃不再经常是1字节,类似于两足站立的Sekirei,因此与(修订版1)相比,它允许更快的摆动,在修订版1中,Sekirei仅依赖/摆动一条腿(请参阅“评论与讨论”部分和视频,其中摆动速度显然较低)。第一版和第二版仅有一行代码不同。
if ( bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1-2]] == 0 ) Gulliver = cbPattern-(2-1)-2;
一个快速示例
Haystack: "...my friend Bari..." Pattern: "my friend Piri" Jump for r.1: "my friend Piri" +1 Jump for r.2: "my friend Piri" +11 Window: "Bari" First Order 2 check: "ri" it is within the lookup table i.e. its slot holds 1, skip is 1 position Second Order 2 check: "Ba" it is outwith the lookup table i.e. its slot holds 0, skip is cbPattern-(2-1)-2=14-(2-1)-2=11 positions Lookup table, next slots hold 1, rest 0: 'my' 'y ' ' f' 'fr' 'ri' 'ie' 'en' 'nd' 'd ' ' P' 'Pi' 'ir' 'ri' Pattern Building-Block "Ba" holds 0.
基准测试及其源代码位于
http://www.sekireigan.com/_KAZE_32bit_memmem-like_showdown.7z
这些天我和妈妈的苦涩,我们忠实的狗在它(不是它)的黄金时期去世了,只有5岁,怎么说再见呢?
此函数献给我们动物朋友巴里,一个纯洁的生命形象。
星暗月黑,
夜寂晨鸣,
告马击鼓,
主去子逝,
海深天晦,
鲸默潮平,
告盐沼击鼓,
主去子逝,
暗生明,明转暗,
三黑轿,三白车,
聚散皆因缘,
兄弟已逝,心亦随之。
...
/'Gone' 歌词由Ioanna Gika创作并演唱/
2013年10月11日的补充
是时候让MEMMEM函数中的“Hasherezade”女王退位,让“Sekirei”登场了。
直言不讳地说,“Sekireigan”意为“Sekirei之眼”。这是一种剑术技巧,能够以极快的速度移动并同时从多个角度攻击。雪村可以利用Sekireigan来获得极高的速度。至于“Sekirei”,它代表“林鹡鸰”,林鹡鸰与它的鹡鸰亲戚不同之处在于它有左右摇摆尾巴的奇怪习惯,而不是像其他鹡鸰那样上下摆动。你是否看到了横向摆动(同时完全移除垂直摆动)与Boyer-Moore-Horspool 2阶算法最大限度简化后的主循环(不分心去查找像“Hasherezade”中那样的其他跳过值)之间的联系,它以疯狂的速度左右摇摆:查看干草堆窗口最右边的2字节是否存在于64KB的字节表(如果能找到),然后从左侧开始比较,依此类推。这种巨大提升的关键在于最大限度地简化Horspool 2阶部分,主循环只有99字节!
两年前我为什么不这样做,别问我,如果你问,答案是——我不想……故意,跳跃性能很好,我以为“慢一点但更强更好”,现在这种过度杀伤已经结束了。事实上,这种简化是“Railgun_Sekireigan”与它的过度杀伤/对应物“Railgun_Hasherezade”修订版3唯一不同的地方。
嗯,唯一的弱点在于BMH 2阶循环中对于5-~28字节长的模式,步进/跳过是1,为了提升它们,64KB表格应该(像以前一样)包含构建块的跳过值,而不是像我简化后的0/1,这个范围是Volnitsky先生有300-400MB/s优势的唯一范围。这是故意的,也许在未来的修订版中,我会将BMH查找表改为位操作,即小8倍,也就是8KB——完全可以装入Vishera的16KB L1缓存。我的愿望是在不同的CPU(主要是Haswell,Vishera)上进行丰富的对决,可惜目前没有机会,如果有人愿意帮助我并运行基准测试,我很乐意在这里分享结果。
如何计算BBs的总数,“cbPattern - Order + 1”是长度为“cbPattern”字节的文本的BBs数量。例如,对于cbPattern=11的“fastest fox”和Order=8,我们有BBs = 11-8+1=4
'fastest '
'astest f'
'stest fo'
'test fox'
对于那些想要深入理解并拥有一款能够从任何文件中提取构建块的超级工具的人,我在下一个软件包中包含了我的BBs提取器LeprechaunBBhex,它用C语言编写,可在Windows/Linux上编译,并且以令人难以置信的速度运行。如果你运行批处理文件“RIP_BBs.bat”,它将创建并报告以下内容:
对于我们的测试文件
enwiki-20130904-pages-articles.7z.001: 52,428,800 bytes; 52,428,800 - Order + 1 Total 'Order' bytes long Building-Blocks
ENWIKI_50MB_order2.txt : 93,294 bytes; 15,549 Distinct Building-Blocks 2 bytes long
ENWIKI_50MB_order4.txt : 10,593,630 bytes; 1,059,363 Distinct Building-Blocks 4 bytes long
ENWIKI_50MB_order8.txt : 225,515,718 bytes; 12,528,651 Distinct Building-Blocks 8 bytes long
ENWIKI_50MB_order12.txt : 726,063,728 bytes; 27,925,528 Distinct Building-Blocks 12 bytes long
From above we can obtain the REDUNDANCY of WIKI 50MB dump:
- for order 2: 15,549/(52,428,800 - 2 + 1)*100% = 0.02%
- for order 4: 1,059,363/(52,428,800 - 4 + 1)*100% = 2.02%
- for order 8: 12,528,651/(52,428,800 - 8 + 1)*100% = 23.89%
- for order 12: 27,925,528/(52,428,800 - 12 + 1)*100% = 53.26%
百分比越低,哈希函数上的负载越小,这对于减少冲突数量来说是好事。鉴于2阶有15,549个不同的BBs,而查找表有65,536个槽位,这意味着一个好的哈希函数可以实现零冲突。至于12阶,默认哈希大小是:#define HashTableSizeSekirei 17-1
,这给出了27,925,528个键与65,536个槽位,或426:1,这不是一个令人愉快的比率。在一个测试中,当HashTableSizeSekirei
是20而不是17-1时(即26:1),速度下降(对于17-1)并不令人真正担忧。
为了了解XML(维基百科)和TXT(英文电子书文本)格式之间的区别,以下是OSHO.TXT的统计数据:
OSHO.TXT : 206,908,949 bytes; 206,908,949 - Order + 1 Total 'Order' bytes long Building-Blocks
OSHO_order2.txt : 26,544 bytes; 4,424 Distinct Building-Blocks 2 bytes long
OSHO_order4.txt : 2,480,190 bytes; 248,019 Distinct Building-Blocks 4 bytes long
OSHO_order8.txt : 161,216,928 bytes; 8,956,496 Distinct Building-Blocks 8 bytes long
OSHO_order12.txt: 1,138,861,490 bytes; 43,802,365 Distinct Building-Blocks 12 bytes long
From above we can obtain the REDUNDANCY of OSHO books:
- for order 2: 4,424/(206,908,949 - 2 + 1)*100% = 0.002%
- for order 4: 248,019/(206,908,949 - 4 + 1)*100% = 0.11%
- for order 8: 8,956,496/(206,908,949 - 8 + 1)*100% = 4.32%
- for order 12: 43,802,365/(206,908,949 - 12 + 1)*100% = 21.16%
数据显示,我们的XML和TXT文件的冗余度差异很大。粗略地说,OSHO的2阶和4阶比WIKI少四倍,这表明哈希器(哈希2/4字节键)对OSHO会更有效。
http://www.sanmayce.com/Downloads/_KAZE_32bit_memmem-like_showdown.7z
MEMMEM在扩展时性能惊人,与其这里的针头/干草堆比例13/52,428,800相比,请想象或感受一下按比例放大的情况(相同的比例),即52,428,800/211,444,543,803,076。噢,200TB。
我向所有感受到“KAT DELUNA - Unstoppable”中那种势不可挡的节奏的人致敬。
...
明天再为我感到可怜吧
那时我会像在俱乐部里砸球一样
新闻里,我将像摇滚明星一样摇摆
我势不可挡
势不可挡
势不可挡
势不可挡
如此超音速,几乎是讽刺,我把它当成一场国际象棋游戏
你想像女王一样骑乘,那么你得到了最好的
我有律动,你喜欢模仿
那是免费的,是的,那是我!嗯哼
想要我做什么疯狂的事,移动得比光速还快
给你看看一个艺人总是做得对的方式
你需要一个备用计划,因为我要给你一枪
看我翻转吧
是的,那是我!嗯哼
我的红鞋和我的装逼病入膏肓,需要医学院
我从不犯错,但我能让她在床上动起来
你喜欢我的态度,没人能达到我的高度
看,没什么能阻止我,我觉得没什么能在我之后
...
现在摇摆你的身体,
就像丛林热在你血管里奔涌
我们疯狂起来,俱乐部在你的大脑里轰鸣
我说跳,你就跳,跳,咚咚咚
我想看你像这节拍一样跳
手链上的爱之吻,有点花哨,但这就是我的玩法
你想过来看看,然后对我大喊大叫
也许我失心疯了,也许没有,因为我玩得很溜
就像燃烧的火焰,是的,那是我,嗯哼
跟-跟-跟-跟-跟着领头人
跟-跟-跟-跟-跟着领头人
跟-跟-跟-跟-跟着领头人
“……我的时尚太病态了,我需要医学院……”,哈哈哈。
2013年10月9日的补充
进行了一些调整,听了切斯特的歌,并移除了针头长度255字符的限制,最终导致了“Railgun_Hasherezade”修订版3的出现——一个成熟的MEMMEM函数。这个第三版现在可以搜索长于255字符的模式/针头,并使用Horspool 12阶算法(检查干草堆窗口最后12个字节)。这12个字节经过哈希,每个槽位为0或1,表示是否存在。哈希表可通过DEFINE调整,也可选择位操作或字节操作。
“Railgun_Hasherezade”的骨架是这样的
- 用于最短针头的简化Karp-Rabin算法;
- 用于较短针头的Quadruplet算法;
- 用于短针头的Boyer-Moore-Horspool/Boyer-Moore-Sunday 1阶算法和Boyer-Moore-Horspool 2阶算法;
- 用于最长针头的简化Horspool 12阶算法。
这种多种搜索策略使得该函数高度可定制,特别是当你需要更高的Horspool阶数,比如128,那么我所知的最快哈希函数(这里使用的是“FNV1A-penumbra”的“FNV1A_YoshimitsuTRIAD”)就会火力全开,目前12阶只使用其主循环一次。
对于大模式,阶数越大,跳过/跳跃的距离就越大。例如,如果模式是13字节,阶数是12,则跳跃距离是13-(12-1)或只有2个位置,但对于8192字节长的模式,则变为8192-(12-1)。事实上,12阶(3x4字节经过哈希)只是比较大块的开始,在下一个包中,我搜索7/13/21/28/37/51/65/83/98/113/130/140/151/162/201/230/255/589/1080/1755字节长的模式,在“enwiki-20130904-pages-articles.7z.001”中,文件长度为52,428,800字节。
http://www.sanmayce.com/Downloads/_KAZE_32bit_memmem-like_showdown.7z
每个模式都位于这个50MB文件的末尾,这意味着每次搜索只存在一次命中。
这项探索始于替代所有旧的针对短干草堆/针头(主要是逐行搜索)的“strstr”方法,并达到了该函数已成为一个成熟的“memmem”研究的程度。
当世界沉睡时
我能听到夜晚的寂静
我能感受到夜晚的跳动
我看到了一些小偷
他们偷走了我的吉他
他们不会弹奏
演奏者是我的心
不要照镜子
守住阵线
必须赢
不要浪费你的时间
只需记住所有早期的日子
现在回来
但请不要迷失方向
当世界沉睡时
我能听到城镇的声音
但这个夜晚正在坠落到地面
在午夜时分
我已杀死我所有的希望
只有我的罪行写在墙上
CHESTER - HOLD THE LINE (1987) 是如此不可思议——一首不朽的歌曲/练习曲。
在我看来,这些练习曲(无论是音乐还是编程)有很多共同之处,就像凯莉唱的《Finer Feelings》一样。
2013年10月7日的补充
好的,既然这篇文章的目的是半回答半提问,后者已经找到了答案,从而完成了对最快的strstr函数,即这里的“memmem”C函数的搜索,我想。两年前,当我完成strstr探索的最后一个修订版时,我偶然发现了Leonid Volnitsky的网站,其描述写道:“平均情况下最快的子字符串搜索算法。源代码,基准测试。”
我立即写了一封电子邮件,请他测试我的函数和他的函数,但由于某种原因没有收到回复,我对自己说“随便吧”,并放弃了冲突的整个想法。我等着看他是否会抽出时间将我的函数纳入他的长长的竞争对手列表中,但没有,最终我想为自己并列比较他的使用哈希的美妙想法(我大约在2010年有过完全相同的想法,但当时不喜欢链式哈希,现在仍然不喜欢)以及“Railgun_Gulliver”和“Railgun_Hasherezade”。我的测试平台由两个32位可执行文件组成——一个用Microsoft VS2010编译(使用/Ox),另一个用Intel 12.1编译(使用/O3)。这两个可执行文件在维基百科转储的前50MB中搜索12个长度为021/028/037/051/065/083/098/113/130/140/151/162字符的模式。所有模式都位于文件的末尾同一行中,因此所有搜索的遍历距离都是50MB。警告!软件包中使用的“volnitsky2”和“volnitsky4”不完整,它们仅用于测试。我讨厌给出结果却无法重现,因此(一如既往)我将所有内容都包含在这个基准测试软件包中:
http://www.sanmayce.com/Downloads/_KAZE_32bit_strstr-like_showdown.7z
在那里你可以找到汇编等价物以及编译器生成的C源代码。
strstr_SHORT-SHOWDOWN_LV_WIKI_32bit_MicrosoftV16.cod
strstr_SHORT-SHOWDOWN_LV_WIKI_32bit_IntelV12.cod
显然,“volnitsky2”以高达(3413-2048)/2048*100% = 67%的提升占据主导地位,非常非常好。对于OSHO.TXT和更短的针头,提升高达280%,简直超棒!下面是维基百科50MB的转储:
32bit_MicrosoftV16 executables:
Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 1651KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 1505KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 1462KB/clock
volnitsky2 performance: 2327KB/clock
volnitsky4 performance: 1651KB/clock
Doing Search for Pattern(13bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2327KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 1969KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 1896KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3011KB/clock
Doing Search for Pattern(21bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3011KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2560KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(28bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2844KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2844KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(37bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3011KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3011KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(51bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3200KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3413KB/clock
Doing Search for Pattern(65bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3413KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(83bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3413KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3200KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3413KB/clock
volnitsky2 performance: 3657KB/clock
volnitsky4 performance: 3011KB/clock
Doing Search for Pattern(98bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 3200KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 3011KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 3200KB/clock
volnitsky2 performance: 3938KB/clock
volnitsky4 performance: 3657KB/clock
Doing Search for Pattern(113bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2438KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2694KB/clock
volnitsky2 performance: 2694KB/clock
volnitsky4 performance: 3413KB/clock
Doing Search for Pattern(130bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 1969KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2226KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2226KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3413KB/clock
Doing Search for Pattern(140bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2048KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2226KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2327KB/clock
volnitsky2 performance: 3413KB/clock
volnitsky4 performance: 3200KB/clock
Doing Search for Pattern(151bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2048KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2438KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2327KB/clock
volnitsky2 performance: 3200KB/clock
volnitsky4 performance: 3200KB/clock
Doing Search for Pattern(162bytes) into String(52428800bytes) as-one-line ...
Railgun_Quadruplet_7Gulliver performance: 2133KB/clock
Railgun_Hasherezade_O4_BIT_14bit performance: 2560KB/clock
Railgun_Hasherezade_O8_BYTE_14bit performance: 2560KB/clock
volnitsky2 performance: 3413KB/clock
volnitsky4 performance: 3413KB/clock
底线是:Volnitsky写下了所有人都忽视的东西,太棒了。奇怪的是,不喜欢带探测的哈希法甚至阻止了Sanmayce进一步探索。我希望在他之前有人能做到,比如我。看到他如何不欣赏Stephen R. van den Berg神秘/超快的C语言练习,这告诉我很多,也足够了,在我再次写信给他之前,他应该学会欣赏——这是我所知道的最重要的事情。
2012年2月1日的补充
只是想通过检查窗口最右边的8字节(8阶Horspool)与针头所有8字节后缀的匹配情况来提高跳过性能,这导致了 Railgun_Quadruplet_7Hasherezade 的出现。
这里给出一个例子
对于长度为43字节的针头“and every day a continuous cleaning goes on”,我们有43-8+1个构建块,即8字节长的后缀,如下所示:
and ever 117,013 nd every 108,604 d every 155,516 every d 170,959 every da 115,291 very day 073,191 ery day 097,042 ry day a 083,793 y day a 011,244 day a c 115,855 day a co 101,797 ay a con 222,568 y a cont 029,130 a conti 020,978 a contin 258,405 continu 252,691 continuo 123,607 ontinuou 056,546 ntinuous 135,857 tinuous 015,332 inuous c 250,584 nuous cl 048,224 uous cle 106,616 ous clea 137,020 us clean 035,751 s cleani 178,989 cleanin 213,855 cleaning 063,337 leaning 097,138 eaning g 062,366 aning go 247,590 ning goe 036,571 ing goes 041,142 ng goes 228,365 g goes o 229,696 goes on 176,852
每个数字表示哪个哈希槽被占用(这里是0..262143个槽,即使用了18位)。
好的,这是一个简化例子
string: tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba-tumba-lumba needle: and every day a continuous cleaning goes on and every day a continuous cleaning goes on +14 Horspool Order 1 and every day a continuous cleaning goes on +36 Horspool Order 8
目标:当窗口最右侧的8字节“-tumba-l”看起来不像任何针头前缀时,即未找到时,进行跳跃。这个最大跳跃等于cbPattern-(Order-1)或43-(8-1)=36。
这里是Horspool的跳过表
a n d e v e r y d a y a c o n t i n u o u s c l e a n i n g g o e s o n
12 09 32 02 04 37 04 35 30 02 32 12 30 02 12 02 15 01 09 23 10 09 18 01 18 03 02 15 14 04 12 09 10 09 06 02 06 01 04 03 02 01 09
我们有 'l' 对 'n',其中 'l' 指向14,然而 'Hasherezade' 能够在 '-tumba-l' 的哈希码不等于针头任何后缀的哈希码时跳过/跳跃36个位置。 '-tumba-l' 的哈希码是083,467,它不匹配针头任何哈希码,所以我们跳过36字节。这种提升是 'Hasherezade' 和 'Gulliver' 修订版2之间的唯一区别。
// Scheherezade -> Hasherezade
// [Italian: buffa, feminine of buffo, comic.]
/*
First off: I am heavily disappointed from Speed-Performance of 'Hasherezade'
and comparing Skip-Performance of 'Gulliver' and 'Hasherezade' doesn't make me happy, either.
Pattern: fastest fox
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 1,683KB/clock / 00,775%, 26,672,940 skips/iterations
Railgun_Quadruplet_7 performance: 1,642KB/clock / 00,669%, 30,925,578 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 0,944KB/clock / 00,912%, 22,663,583 skips/iterations
Railgun_Quadruplet_7deuce performance: 0,801KB/clock / 00,797%, 25,945,709 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 0,704KB/clock / 00,931%, 22,213,101 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 2,104KB/clock / 00,977%, 21,166,516 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 1,496KB/clock / 00,980%, 21,112,657 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,646 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 1,642KB/clock / 00,980%, 21,112,735 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 1,057KB/clock / 00,669%, 30,925,578 skips/iterations
Pattern: and every day a continuous cleaning goes on
Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 2,557KB/clock / 01,495%, 13,832,201 skips/iterations
Railgun_Quadruplet_7 performance: 2,590KB/clock / 01,404%, 14,731,326 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 1,888KB/clock / 02,222%, 09,308,136 skips/iterations
Railgun_Quadruplet_7deuce performance: 1,741KB/clock / 01,971%, 10,494,460 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 1,642KB/clock / 02,229%, 09,279,871 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 3,157KB/clock / 03,795%, 05,450,890 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,322KB/clock / 04,100%, 05,046,049 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,971KB/clock / 04,100%, 05,046,445 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 04,090%, 05,058,667 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 1,683KB/clock / 01,447%, 14,294,963 skips/iterations
Pattern: And let this be your very fundamental insight... about everything. Just for one year, don't choose.
Doing Search for Pattern(99bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 2,845KB/clock / 02,248%, 09,200,917 skips/iterations
Railgun_Quadruplet_7 performance: 2,928KB/clock / 02,105%, 09,827,389 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 2,270KB/clock / 03,283%, 06,302,096 skips/iterations
Railgun_Quadruplet_7deuce performance: 2,149KB/clock / 03,196%, 06,472,407 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 2,172KB/clock / 03,282%, 06,303,154 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 2,928KB/clock / 07,820%, 02,645,814 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,494KB/clock / 09,575%, 02,160,890 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,568%, 02,162,493 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 3,015KB/clock / 09,438%, 02,192,204 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 2,349KB/clock / 02,135%, 09,689,008 skips/iterations
Pattern: Then, singing among the savage branches, it impales itself upon the longest,
sharpest spine. And, dying, it rises above its own agony to outcarol the lark and the nightingale.
Doing Search for Pattern(175bytes) into String(206908949bytes) as-one-line ...
Railgun_Quadruplet_7sun performance: 2,658KB/clock / 03,142%, 06,583,682 skips/iterations
Railgun_Quadruplet_7 performance: 2,658KB/clock / 03,095%, 06,685,179 skips/iterations
Railgun_Quadruplet_7sunhorse performance: 2,377KB/clock / 04,776%, 04,331,865 skips/iterations
Railgun_Quadruplet_7deuce performance: 2,349KB/clock / 04,728%, 04,376,097 skips/iterations
Railgun_Quadruplet_7Elsiane performance: 2,525KB/clock / 04,778%, 04,330,200 skips/iterations
Railgun_Quadruplet_7Gulliver performance: 2,245KB/clock / 12,024%, 01,720,711 skips/iterations
Railgun_Quadruplet_7Hasherezade performance: 2,434KB/clock / 17,127%, 01,208,072 skips/iterations 18bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 17,084%, 01,211,064 skips/iterations 14bit HashTable
Railgun_Quadruplet_7Hasherezade performance: 2,590KB/clock / 16,379%, 01,263,238 skips/iterations 10bit HashTable
Boyer_Moore_Flensburg performance: 2,525KB/clock / 03,198%, 06,469,280
skips/iterations Five times more main-cycles and faster than 'Hasherezade', pshaw!
*/
#define ROL(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define HashTableSize 18
// Revision: 1, 2012-Feb-01, the main disadvantage: the preprocessing overhead PLUS a hasher.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Hasherezade_count_hits (char * pbTarget,
char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
register unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; //BMH needed
int a, j;
unsigned int bm_bc[256]; //BMH needed
unsigned int bm_bc2nd[256]; //BMS needed
unsigned char bm_Horspool_Order2[256*256]; //BMHSS(Elsiane) needed,
// 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
unsigned char bm_Hasherezade_HASH[1<<(HashTableSize)];
// Jesteressing 8bytes (Horspool Order 8) for fast lookup, should
// be bitwise (i.e. 8times smaller) since it says yes/no for presence.
uint32_t hash32;
unsigned long Gulliver; // or unsigned char or unsigned short
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) ==
*(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
Railgunhits++; //return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if ( cbPattern<4)
if (cbTarget<961) { // This value is arbitrary(don't know how exactly),
// it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
/*
// A better strstr, with no asm code
// Written by Mischa Sandberg
// http://mischasan.wordpress.com
// static char const *
// scanstrm(char const *tgt, char const *pat, int len)
// {
// uint32_t head = MSBF32(pat), wind = 0, next;
//
// pat += 4, len -= 4;
// while ((next = *(uint8_t const*)tgt++)) {
// wind = ( wind << 8 ) + next;
// if (wind == head && !memcmp(tgt, pat, len))
// return tgt - 4;
// }
// return NULL;
//}
ulHashPattern = 0;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
ulHashPattern = ( ulHashPattern << 8 ) + *(uint8_t const*)pbPattern++;
AdvanceHopperGrass = 0;
cbPattern -= 4;
while ((ulHashTarget = *(uint8_t const*)pbTarget++)) {
AdvanceHopperGrass = ( AdvanceHopperGrass << 8 ) + ulHashTarget;
if (AdvanceHopperGrass == ulHashPattern && !memcmp(pbTarget, pbPattern, cbPattern))
Railgunhits++; //return pbTarget - 4;
}
return NULL;
*/
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
// Three unnecessary comparisons here, but 'AdvanceHopperGrass'
// must be calculated - it has a higher priority.
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
} else { // The goal here: to avoid memory accesses by stressing the registers.
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if (cbTarget<961)
countSTATIC = cbPattern-2-2;
for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j;
ulHashPattern = *(unsigned long *)(pbPattern); // First four bytes
//ulHashTarget = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
AdvanceHopperGrass = 0;
i=0;
// Elsiane r.2 [
for (a=0; a < 256*256; a++) {bm_Horspool_Order2[a]= cbPattern-1;}
// cbPattern-(Order-1) for Horspool; 'memset' if not optimized
// alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa)
// fast 4 0-1-2 fa as st
for (j=0; j < cbPattern-1; j++) bm_Horspool_Order2[*(unsigned short *)
(pbPattern+j)]=j; // Rightmost appearance/position is needed
// Elsiane r.2 ]
if ( cbPattern>10) {
// Hasherezade r.1 [
// OSHO.TXT has 00,046,486 03bytes distinct BBs
// OSHO.TXT has 00,248,019 04bytes distinct BBs
// OSHO.TXT has 00,855,682 05bytes distinct BBs
// OSHO.TXT has 02,236,138 06bytes distinct BBs
// OSHO.TXT has 04,803,152 07bytes distinct BBs
// OSHO.TXT has 08,956,496 08bytes distinct BBs
// to be hashed in 18bit i.e. 256KB i.e. 262,144 slots i.e. 34 vs 1.
// OSHO.TXT has 15,006,172 09bytes distinct BBs
// OSHO.TXT has 22,992,127 10bytes distinct BBs
// Note: BB stands for Building-Block (also suffix)
for (a=0; a < 1<<(HashTableSize); a++) {bm_Hasherezade_HASH[a]= 0;}
// to-do: bit to replace byte; 'memset' if not optimized
// cbPattern - Order + 1 i.e. number of BBs for 11 'fastest fox'
// 11-8+1=4: 'fastest ', 'astest f', 'stest fo', 'test fox'
for (j=0; j < cbPattern-8+1; j++) {
hash32 = (2166136261 ^ (ROL(*(uint32_t *)(pbPattern+j),5)^*(uint32_t *)(pbPattern+j+4))) * 709607;
bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]=1;
/*
for (a=0; a<8; a++)
printf("%c",*(char *)(pbPattern+j+a) );
printf(" %lu\n",( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 ));
//Input Pattern(up to 19+2000 chars): and every day a continuous cleaning goes on
//Doing Search for Pattern(43bytes) into String(206908949bytes) as-one-line ...
BBs Slot(HashCode for 18bit HashTable)
and ever 117013
nd every 108604
d every 155516
every d 170959
every da 115291
very day 73191
ery day 97042
ry day a 83793
y day a 11244
day a c 115855
day a co 101797
ay a con 222568
y a cont 29130
a conti 20978
a contin 258405
continu 252691
continuo 123607
ontinuou 56546
ntinuous 135857
tinuous 15332
inuous c 250584
nuous cl 48224
uous cle 106616
ous clea 137020
us clean 35751
s cleani 178989
cleanin 213855
cleaning 63337
leaning 97138
eaning g 62366
aning go 247590
ning goe 36571
ing goes 41142
ng goes 228365
g goes o 229696
goes on 176852
*/
}
// Hasherezade r.1 ]
while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];
if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
//i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
Gulliver = bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
else
Gulliver = bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
} else if ( Gulliver != cbPattern-1 ) // CASE #2: if equal means the pair (char order 2)
// is not found i.e. Gulliver remains intact, skip the whole pattern
// and fall back (Order-1) chars i.e. one char for Order 2
Gulliver = cbPattern - Gulliver - 2;
// CASE #3: the pair is found and not as suffix i.e. rightmost position
// The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window do
// not look like any of Needle prefixes i.e. are not to be found. This maximum
// jump equals cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small
// one but for Needle 31 bytes the jump equals 31-(8-1)=24
if (Gulliver < cbPattern-(8-1)) {
hash32 = (2166136261 ^ (ROL(*(uint32_t *)
(pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;
if ( bm_Hasherezade_HASH[( hash32 ^ (hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )
Gulliver = cbPattern-(8-1);
}
i = i + Gulliver;
AdvanceHopperGrass++;
/*
; 4155 : // The goal: to jump when the rightmost 8bytes (Order 8 Horspool) of window
// do not look like any of Needle prefixes i.e. are not to be found. This maximum jump equals
// cbPattern-(Order-1) or 11-(8-1)=4 for 'fastest fox' - a small one
// but for Needle 31 bytes the jump equals 31-(8-1)=24
; 4156 : if (Gulliver < cbPattern-(8-1)) {
01f16 8d 43 f9 lea eax, DWORD PTR [ebx-7]
01f19 3b c8 cmp ecx, eax
01f1b 73 30 jae SHORT $LN18@Railgun_Qu@8
; 4157 : hash32 = (2166136261 ^ (ROL(*(uint32_t *)
(pbTarget+i+cbPattern-8),5)^*(uint32_t *)(pbTarget+i+cbPattern-8+4))) * 709607;
01f1d 8b 44 32 f8 mov eax, DWORD PTR [edx+esi-8]
01f21 c1 c0 05 rol eax, 5
01f24 33 44 32 fc xor eax, DWORD PTR [edx+esi-4]
01f28 35 c5 9d 1c 81 xor eax, -2128831035 ; 811c9dc5H
01f2d 69 c0 e7 d3 0a
00 imul eax, 709607 ; 000ad3e7H
; 4158 : if ( bm_Hasherezade_HASH[( hash32 ^
(hash32 >> 16) ) & ( (1<<(HashTableSize))-1 )]==0 )
01f33 8b f8 mov edi, eax
01f35 c1 ef 10 shr edi, 16 ; 00000010H
01f38 33 f8 xor edi, eax
01f3a 81 e7 ff ff 03
00 and edi, 262143 ; 0003ffffH
01f40 80 bc 3c 28 08
01 00 00 cmp BYTE PTR _bm_Hasherezade_HASH$[esp+edi+329776], 0
01f48 75 03 jne SHORT $LN18@Railgun_Qu@8
; 4159 : Gulliver = cbPattern-(8-1);
01f4a 8d 4b f9 lea ecx, DWORD PTR [ebx-7]
$LN18@Railgun_Qu@8:
; 4160 : }
; 4161 : i = i + Gulliver;
; 4162 : AdvanceHopperGrass++;
*/
}
} else {
while (i <= cbTarget-cbPattern-1) { // -1 because Sunday is used
Gulliver = bm_Horspool_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];
if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
//i = i + 1; // r.1, obviuosly this is the worst skip so turning to 'SunHorse': lines below
if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] )
Gulliver = bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]];
else
Gulliver = bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
} else if ( Gulliver != cbPattern-1 )
// CASE #2: if equal means the pair (char order 2) is not found
// i.e. Gulliver remains intact, skip the whole pattern
// and fall back (Order-1) chars i.e. one char for Order 2
Gulliver = cbPattern - Gulliver - 2;
// CASE #3: the pair is found and not as suffix i.e. rightmost position
i = i + Gulliver;
// 32323218 Order 1 Horspool Skip-table A
// 01234568 Order 1 Horspool Skip-table B
// fa af fa af fa as st Order 2 Horspool Skip-table B
// 0 1 2 3 4 5 6
// HIKARIfast
// fafafast
// fafafast +2 Order 1 'a' vs 't'
// fafafast +2 = (cbPattern-SkipB-Order = 8-5-1 = 2) Order 1 'a' vs 't'
// fafafast +2 = (cbPattern-SkipB-Order = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3
// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
// 0 1 2 3 4 5 6
// HIKARIfast
// longpace
// longpace +2 Order 1 'a' vs 'e'
// longpace +7 = (cbPattern-(Order-1) = 8-(2-1) = 7) Order 2 'fa' vs 'ce' i.e. CASE #2
AdvanceHopperGrass++;
}
} //if ( cbPattern>10) {
if (i == cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
AdvanceHopperGrass++;
}
GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
GlobalI += AdvanceHopperGrass;
printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",
(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
速度性能令人失望,我也看不出如何加速这种哈希增强的变体。有什么方法可以加快速度吗?
2012年1月26日的补充
进入 Railgun_Quadruplet_7Gulliver …最有前途的变体。
我很高兴分享我最新的练习曲——这是我一周之久的副任务的结果,任务中充满了音乐和编程的混乱。这是第一个(毫不费力地)突破3GB/s障碍的搜索函数,在一台内存写入速度为3GB/s的计算机上。
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [
/*
Raising Horspool/Sunday order from 1 to 2 came quite naturally after some carefree wander of mine.
I consider adding a new 'else if' block for haystacks longer than 960 in order to avoid the
big skip table overhead, something like 512*1024, thus fastest r7sun will operate
on 961..512*1024 long haystacks whereas 'Gulliver' on bigger than 512*1024.
I am not happy with Skip-Performance of revision 1: this should be amended...
'Gulliver' makes giant skips/jumps on patterns some 8+ chars long but for short
ones falls behind. Here r.1 is given being a Horspool order 2, of course Sunday order 2 is coming.
I restrain myself of dreaming about power behind order 3,
for bigger orders I expect nothing less than Dream-Performance.
Of course the enhancements await advent, they are very easy
to be seen, but they will appear in next revisions of 'Gulliver'.
Tuning continues but the skeleton is built, I see 'Gulliver' as a really High-Performance etude.
And not to be empty-handed here the Gulliver's swiftness is benchmarked on String(206,908,949bytes) as-one-line:
Pattern: fast
Railgun_Quadruplet_7sun: 1,074KB/clock / 0,456%, 45,330,622 iterations
Railgun_Quadruplet_7: 1,000KB/clock / 0,377%, 54,788,054 iterations
Railgun_Quadruplet_7sunhorse: 0,651KB/clock / 0,480%, 43,103,056 iterations
Railgun_Quadruplet_7deuce: 0,575KB/clock / 0,389%, 53,138,919 iterations
Railgun_Quadruplet_7Elsiane: 0,511KB/clock / 0,551%, 37,541,955 iterations
Railgun_Quadruplet_7Gulliver: 0,649KB/clock / 0,298%, 69,403,843 iterations
Boyer_Moore_Flensburg: 0,491KB/clock / 0,377%, 54,788,139 iterations
Pattern: faster
Railgun_Quadruplet_7sun: 1,347KB/clock / 0,591%, 34,996,936 iterations
Railgun_Quadruplet_7: 1,329KB/clock / 0,514%, 40,194,194 iterations
Railgun_Quadruplet_7sunhorse: 0,762KB/clock / 0,656%, 31,504,148 iterations
Railgun_Quadruplet_7deuce: 0,662KB/clock / 0,567%, 36,434,006 iterations
Railgun_Quadruplet_7Elsiane: 0,538KB/clock / 0,710%, 29,101,626 iterations
Railgun_Quadruplet_7Gulliver: 1,010KB/clock / 0,491%, 42,066,438 iterations
Boyer_Moore_Flensburg: 0,687KB/clock / 0,514%, 40,194,282 iterations
Pattern: fastest
Railgun_Quadruplet_7sun: 1,554KB/clock / 0,687%, 30,084,306 iterations
Railgun_Quadruplet_7: 1,507KB/clock / 0,599%, 34,540,430 iterations
Railgun_Quadruplet_7sunhorse: 0,918KB/clock / 0,761%, 27,188,853 iterations
Railgun_Quadruplet_7deuce: 0,786KB/clock / 0,663%, 31,175,827 iterations
Railgun_Quadruplet_7Elsiane: 0,635KB/clock / 0,818%, 25,281,493 iterations
Railgun_Quadruplet_7Gulliver: 1,209KB/clock / 0,592%, 34,946,434 iterations
Boyer_Moore_Flensburg: 0,798KB/clock / 0,621%, 33,278,240 iterations
Pattern: fastest fox
Railgun_Quadruplet_7sun: 1,727KB/clock / 0,775%, 26,672,940 iterations
Railgun_Quadruplet_7: 1,669KB/clock / 0,669%, 30,925,578 iterations
Railgun_Quadruplet_7sunhorse: 0,957KB/clock / 0,912%, 22,663,583 iterations
Railgun_Quadruplet_7deuce: 0,821KB/clock / 0,797%, 25,945,709 iterations
Railgun_Quadruplet_7Elsiane: 0,719KB/clock / 0,931%, 22,213,101 iterations Gulliver outskips Elsiane!
Railgun_Quadruplet_7Gulliver: 1,804KB/clock / 0,977%, 21,167,180 iterations monstrous!
Boyer_Moore_Flensburg: 1,080KB/clock / 0,669%, 30,925,649 iterations
Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun: 2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7: 2,767KB/clock / 1,511%, 13,689,243 iterations
Railgun_Quadruplet_7sunhorse: 1,788KB/clock / 2,138%, 09,677,267 iterations
Railgun_Quadruplet_7deuce: 1,656KB/clock / 2,053%, 10,075,650 iterations
Railgun_Quadruplet_7Elsiane: 1,542KB/clock / 2,143%, 09,652,548 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 2,917%, 07,091,357 iterations over the top!
Boyer_Moore_Flensburg: 1,820KB/clock / 1,554%, 13,307,181 iterations
Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun: 2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7: 2,658KB/clock / 1,447%, 14,292,419 iterations
Railgun_Quadruplet_7sunhorse: 1,870KB/clock / 2,234%, 09,259,505 iterations
Railgun_Quadruplet_7deuce: 1,757KB/clock / 2,011%, 10,287,584 iterations
Railgun_Quadruplet_7Elsiane: 1,656KB/clock / 2,240%, 09,236,188 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 3,821%, 05,413,684 iterations unstoppable!
Boyer_Moore_Flensburg: 1,712KB/clock / 1,540%, 13,431,751 iterations
Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun: 3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7: 3,157KB/clock / 2,742%, 07,545,141 iterations
Railgun_Quadruplet_7sunhorse: 2,557KB/clock / 4,138%, 04,999,777 iterations
Railgun_Quadruplet_7deuce: 2,494KB/clock / 4,029%, 05,135,444 iterations
Railgun_Quadruplet_7Elsiane: 2,494KB/clock / 4,141%, 04,995,463 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 7,218%, 02,866,342 iterations simply amazing!
Boyer_Moore_Flensburg: 2,767KB/clock / 2,745%, 07,536,097 iterations
Note:
[
In the above benchmark (on Pentium Merom 2166Mhz) 'Gulliver' is limited by memory bandwidth only, due to:
Pattern: fastest fox with biggest strides
Railgun_Quadruplet_7sun: 2,658KB/clock / 1,584%, 13,060,463 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 2,917%, 07,091,357 iterations
Pattern: fastest fox with biggest strides known to me
Railgun_Quadruplet_7sun: 2,590KB/clock / 1,548%, 13,363,356 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 3,821%, 05,413,684 iterations
Pattern: fastest fox with biggest strides known to me up to 2012 January 26 namely 'Gulliver'
Railgun_Quadruplet_7sun: 3,108KB/clock / 2,890%, 07,159,321 iterations
Railgun_Quadruplet_7Gulliver: 3,108KB/clock / 7,218%, 02,866,342 iterations
You can see how Railgun_Quadruplet_7sun struggles to catch up, for longest pattern
it achieves the maximum 3,108KB/clock but at a nasty cost: 7,159,321 iterations
whereas 'Gulliver' achieves same (in fact much higher) speed with heart-touching 2,866,342 iterations only.
For my Pentium T3400 2166 MHz Toshiba Satellite L305 Dual DDR2-667 5-5-5-13, 'Everest' benchmark program gives:
Memory Read: 4,605 MB/s
Memory Write: 3,323 MB/s
'Gulliver' would run at 5GB/s without any problem if it were not for
the hardware limitation, a fantastic performance in my eyes.
]
*/
//
// Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit):
// Two tests were done - the first/second '6x2'/'52' searches 12/52 patterns
// (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text:
//
// Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary:
//
// '6x2' test (12 patterns 4-19 chars in length)
//
// Railgun_Quadruplet_7sun 6x2 i.e. average performance: 2,092KB/clock / 0,153,936% / 005,572,202,064
// Railgun_Quadruplet_7 6x2 i.e. average performance: 2,032KB/clock / 0,137,568% / 006,454,416,320
// Railgun_Quadruplet_7sunhorse 6x2 i.e. average performance: 1,262KB/clock / 0,173,840% / 005,145,288,080
// Railgun_Quadruplet_7deuce 6x2 i.e. average performance: 1,134KB/clock / 0,155,744% / 005,995,597,776
// Railgun_Quadruplet_7Elsiane 6x2 i.e. average performance: 0,938KB/clock / 0,184,096% / 004,710,045,936
// Boyer_Moore_Flensburg 6x2 i.e. average performance: 1,163KB/clock / 0,139,168% / 006,428,106,496
// Railgun_Quadruplet_7Gulliver 6x2 i.e. average performance: 1,552KB/clock / 0,152,912% / 006,959,446,480
// Brute_Force_Dummy 6x2 i.e. average performance: 0,343KB/clock / 0,019,200% / 039,726,516,640
//
// '52' test (52 patterns 4-175 chars in length)
//
// Railgun_Quadruplet_7sun 52 i.e. average performance: 1,652KB/clock / 0,768,416% / 022,362,482,640
// Railgun_Quadruplet_7 52 i.e. average performance: 1,636KB/clock / 0,708,768% / 025,368,049,712
// Railgun_Quadruplet_7sunhorse 52 i.e. average performance: 1,052KB/clock / 0,939,520% / 020,187,534,736
// Railgun_Quadruplet_7deuce 52 i.e. average performance: 0,929KB/clock / 0,864,048% / 023,292,952,992
// Railgun_Quadruplet_7Elsiane 52 i.e. average performance: 0,809KB/clock / 0,979,072% / 018,535,811,280
// Boyer_Moore_Flensburg 52 i.e. average performance: 0,928KB/clock / 0,720,272% / 025,227,269,760
// Railgun_Quadruplet_7Gulliver 52 i.e. average performance: 1,296KB/clock / 1,145,808% / 026,131,772,688
// Brute_Force_Dummy 52 i.e. average performance: 0,279KB/clock / 0,083,200% / 172,148,232,496
//
// Revision: 1, 2012-Jan-26, the main disadvantage: the preprocessing overhead.
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7Gulliver_count_hits (char * pbTarget,
char * pbPattern, unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
register unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; //BMH needed
int a, j;
unsigned int bm_bc[256]; //BMH needed
unsigned int bm_bc2nd[256]; //BMS needed
//BMHSS(Elsiane) needed, 'char' limits patterns to 255, if 'long' then table becomes 256KB, grrr.
unsigned char bm_Sunday_Order2[256*256];
unsigned long Gulliver; // or unsigned char or unsigned short
if (cbPattern > cbTarget)
return(NULL);
if ( cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
if ( cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
Railgunhits++; //return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if ( cbPattern<4)
if (cbTarget<961) { // This value is arbitrary(don't know how exactly),
// it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
// Three unnecessary comparisons here, but 'AdvanceHopperGrass'
// must be calculated - it has a higher priority.
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern));
} else { // The goal here: to avoid memory accesses by stressing the registers.
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if (cbTarget<961)
countSTATIC = cbPattern-2-2;
for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;}
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j;
// Elsiane r.2 [
for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= cbPattern-1;} // 'memset' if not optimized
// alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa)
// fast 4 0-1-2 fa as st
for (j=0; j < cbPattern-1; j++)
bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=j; // Rightmost appearance/position is needed
// Elsiane r.2 ]
ulHashPattern = *(unsigned short *)(pbPattern+cbPattern-1-1); // Last two bytes
ulHashTarget = *(unsigned long *)(pbPattern); // First four bytes
AdvanceHopperGrass = 0;
i=0;
while (i <= cbTarget-cbPattern-1-1) {
Gulliver = bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern-1-1]];
if ( Gulliver == cbPattern-2 ) { // CASE #1: means the pair (char order 2) is found
if ( *(unsigned long *)&pbTarget[i] == ulHashTarget) {
count = countSTATIC; // Last two chars already matched, to be fixed with -2
while ( count !=0 && *(char *)(pbPattern+
(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
i = i + 1;
} else if ( Gulliver == cbPattern-1 ) // CASE #2: means the pair (char order 2) is not found
i = i + Gulliver; // the pair is not found, skip the whole pattern and fall back one char
else
i = i + cbPattern - Gulliver - 2;
// CASE #3: the pair is found and not as suffix i.e. rightmost position
// 32323218 Order 1 Horspool
// fa af fa af fa as st Order 2 Horspool
// 0 1 2 3 4 5 6
// HIKARIfast
// fafafast
// fafafast +2 Order 1 'a' vs 't'
// fafafast +2 = (cbPattern-Gulliver-2 = 8-4-2 = 2) Order 2 'fa' vs 'st' i.e. CASE #3
// 76543218 Order 1 Horspool
// lo on ng gp pa ac ce Order 2 Horspool
// 0 1 2 3 4 5 6
// HIKARIfast
// longpace
// longpace +2 Order 1 'a' vs 'e'
// longpace +7 = (cbPattern-1 = 8-1 = 7) Order 2 'fa' vs 'ce' i.e. CASE #2
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern-1) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4)
== *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
i++;
AdvanceHopperGrass++;
}
if (i == cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) {
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4)
== *(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
count--;
if ( count == 0) Railgunhits++; //return(pbTarget+i);
}
AdvanceHopperGrass++;
}
GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100);
GlobalI += AdvanceHopperGrass;
printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n",
(int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass);
return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ]
这首核心歌曲深深打动了我(也因此给了我灵感)。
足够坚强,告诉你为何
我感到足够,告诉你一些与我非常亲近的事
在我心灵的眼中
感受我们所思考的原因
你只是感受这一切
我正跨越溪流,跌落中站立
害怕着说我
我足够坚强,找到我的勇气
找到我一直近在咫尺的方式,它在我心中酝酿
我正跨越溪流,跌落中站立
害怕着说我
我足够坚强,找到我的勇气
找到我一直近在咫尺的方式,它在我心中成长
/艺术家:Elsiane - Across The Stream/
我向所有感受并保持艺术气息的人致敬。
2012年1月24日的补充
只是想看看Sunday的技术(应用于两个字符,2阶)有什么作用,这导致了Railgun_Quadruplet_7Elsiane的出现,迄今为止最大的跨步者,我(已经)梦想着Railgun_Quadruplet_7Gulliver...
Horspool-Sunday-Sunday jumps to: --------------------------------\ | Horspool-Sunday jumps to: -------------------------------\ | Horspool jumps to: ------------------------------\ | String: faster Needle: fast Sunday skip-table: 43215 Horspool skip-table: 32144
对于上述例子
Horspool 跳到 A,其中 A=4 'e' 位置
Sunday 跳到 B,其中 B=5 'r' 位置
Horspool-Sunday ('SunHorse') 跳到 max(A,B) 即 5
Horspool-Sunday-Sunday ('Elsiane') 跳到 max(A,B,C) 即 max(5,6)=6,因为针头中既没有 'e' 也没有 'r',即紧跟在 'r' 之后
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm [ // Note: 'Elsiane' is the strongest (i.e. with biggest-strides) // Boyer-Moore-Sunday-Horspool variant known to me, // named after a 'very close to me' duo(two musicians), viva Elsieanne Caplette! // Benchmark (on Pentium Merom 2166MHz, Windows XP 32bit, CL v16 32bit): // Two tests were done - the first/second '6x2'/'52' searches 12/52 // patterns (for all appearances i.e. counting all hits) into 206,908,949 bytes long English text: // // Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary: // // Function '6x2' test (12 patterns 4-19 chars in length) // // Railgun_Quadruplet_7Elsiane 0,934KB/clock / 184,096% / 004,710,045,936 Iterations Muffin // Railgun_Quadruplet_7sunhorse 1,261KB/clock / 173,840% / 005,145,288,080 Iterations BONBON // Railgun_Quadruplet_7deuce 1,132KB/clock / 155,744% / 005,995,597,776 Iterations Leaves-The-Town // Railgun_Quadruplet_7sun 2,091KB/clock / 153,936% / 005,572,202,064 Iterations Bitter-Sweet // Boyer_Moore_Flensburg 1,162KB/clock / 139,168% / 006,428,106,496 Iterations Loopy // Railgun_Quadruplet_7 2,029KB/clock / 137,568% / 006,454,416,320 Iterations Loopier // Brute_Force_Dummy 0,342KB/clock / 019,200% / 039,726,516,640 Iterations Loopiest // // Function '52' test (52 patterns 4-175 chars in length) // // Railgun_Quadruplet_7Elsiane 0,809KB/clock / 979,072% / 018,535,811,280 Iterations Muffin // Railgun_Quadruplet_7sunhorse 1,051KB/clock / 939,520% / 020,187,534,736 Iterations BONBON // Railgun_Quadruplet_7deuce 0,928KB/clock / 864,048% / 023,292,952,992 Iterations Leaves-The-Town // Railgun_Quadruplet_7sun 1,651KB/clock / 768,416% / 022,362,482,640 Iterations Bitter-Sweet // Boyer_Moore_Flensburg 0,928KB/clock / 720,272% / 025,227,269,760 Iterations Loopy // Railgun_Quadruplet_7 1,632KB/clock / 708,768% / 025,368,049,712 Iterations Loopier // Brute_Force_Dummy 0,279KB/clock / 083,200% / 172,148,232,496 Iterations Loopiest // // Revision: 1, 2012-Jan-24, the main disadvantage: the preprocessing overhead. // Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char. char * Railgun_Quadruplet_7Elsiane_count_hits (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern) { char * pbTargetMax = pbTarget + cbTarget; register unsigned long ulHashPattern; register unsigned long ulHashTarget; signed long count; signed long countSTATIC; unsigned char SINGLET; unsigned long Quadruplet2nd; unsigned long Quadruplet3rd; unsigned long Quadruplet4th; unsigned long AdvanceHopperGrass; long i; //BMH needed int a, j; unsigned int bm_bc[256]; //BMH needed unsigned int bm_bc2nd[256]; //BMS needed unsigned char bm_Sunday_Order2[256*256]; //BMHSS(Elsiane) needed, of course those 65536 bytes are // scary in next revision reduce them to 8192 i.e. bitwise it if (cbPattern > cbTarget) return(NULL); if ( cbPattern<4) { pbTarget = pbTarget+cbPattern; ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1)); if ( cbPattern==3) { for ( ;; ) { if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) { if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) Railgunhits++; //return((pbTarget-3)); } if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++; pbTarget++; if (pbTarget > pbTargetMax) return(NULL); } } else { } for ( ;; ) { if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) ) Railgunhits++; //return((pbTarget-2)); if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++; pbTarget++; if (pbTarget > pbTargetMax) return(NULL); } } else { //if ( cbPattern<4) if (cbTarget<961) { // This value is arbitrary(don't k // now how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'. pbTarget = pbTarget+cbPattern; ulHashPattern = *(unsigned long *)(pbPattern); SINGLET = ulHashPattern & 0xFF; Quadruplet2nd = SINGLET<<8; Quadruplet3rd = SINGLET<<16; Quadruplet4th = SINGLET<<24; for ( ;; ) { AdvanceHopperGrass = 0; ulHashTarget = *(unsigned long *)(pbTarget-cbPattern); if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, // but 'AdvanceHopperGrass' must be calculated - it has a higher priority. count = cbPattern-1; while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) { if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++; count--; } if ( count == 0) Railgunhits++; //return((pbTarget-cbPattern)); } else { // The goal here: to avoid memory accesses by stressing the registers. if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) { AdvanceHopperGrass++; if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) { AdvanceHopperGrass++; if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++; } } } AdvanceHopperGrass++; pbTarget = pbTarget + AdvanceHopperGrass; if (pbTarget > pbTargetMax) return(NULL); } } else { //if (cbTarget<961) countSTATIC = cbPattern-2-2; for (a=0; a < 256; a++) {bm_bc[a]=cbPattern; bm_bc2nd[a]=cbPattern+1;} for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; for (j=0; j < cbPattern; j++) bm_bc2nd[pbPattern[j]]=cbPattern-j; // Elsiane [ for (a=0; a < 256*256; a++) {bm_Sunday_Order2[a]= 0;} // 'memset' if not optimized // alfalfa 7 long 6 BBs (al lf fa al lf fa) 3 distinct BBs (al lf fa) // fast 4 0-1-2 fa as st //for (j=0; j < cbPattern-1; j++) bm_Sunday_Order2[*(unsigned short *)(pbPattern+j)]=1; // The idea here is to use Order 2 of Sunday's approach: // It takes all rows and columns containing a char (e.g. A) // from the pattern to be marked (as 1 i.e. active) // 000 001 002 003 ... A ... 255 // 000 * // 001 * // 002 * // 003 * // ... * // A * * * * * * * * // ... * // 255 * // TO-DO: bytewise -> bitwise for (j=0; j < cbPattern; j++) for (a=0; a < 256; a++) { bm_Sunday_Order2[(a<<8) + pbPattern[j]]=1; // columns filling bm_Sunday_Order2[(pbPattern[j]<<8) + a]=1; // rows filling } // Elsiane ] ulHashPattern = *(unsigned long *)(pbPattern); AdvanceHopperGrass = 0; i=0; while (i <= cbTarget-cbPattern-1-1) { if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) { count = countSTATIC; while ( count !=0 && *(char *)(pbPattern+ (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) count--; if ( count == 0) Railgunhits++; //return(pbTarget+i); } if ( bm_Sunday_Order2[*(unsigned short *)&pbTarget[i+cbPattern]] == 0 ) i = i + cbPattern + 1 + 1; else if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] ) i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]; else i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]]; AdvanceHopperGrass++; } if (i == cbTarget-cbPattern-1) { if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) { count = countSTATIC; while ( count !=0 && *(char *)(pbPattern+ (countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) count--; if ( count == 0) Railgunhits++; //return(pbTarget+i); } i++; AdvanceHopperGrass++; } if (i == cbTarget-cbPattern) { if ( *(unsigned long *)&pbTarget[i] == ulHashPattern) { count = countSTATIC; while ( count !=0 && *(char *) (pbPattern+(countSTATIC-count)+4) == *(char *) (&pbTarget[i]+(countSTATIC-count)+4) ) count--; if ( count == 0) Railgunhits++; //return(pbTarget+i); } AdvanceHopperGrass++; } GlobalSP += (int)((double)cbTarget/AdvanceHopperGrass*100); GlobalI += AdvanceHopperGrass; //printf("Skip-Performance(bigger-the-better): %d%%, %d skips/iterations\n", // (int)((double)cbTarget/AdvanceHopperGrass*100), AdvanceHopperGrass); return(NULL); } //if (cbTarget<961) } //if ( cbPattern<4) } // ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Sunday-Horspool algorithm ] /* strstr_SHORT-SHOWDOWN, revision 7Elsiane_vs_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF, written by Kaze. Full credits to: R.S. Boyer, J.S. Moore, R.N. Horspool, D.M. Sunday. Input Pattern(up to 19+2000 chars): faster Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 591%, 34996936 skips/iterations Railgun_Quadruplet_7sun_hits/Railgun_Quadruplet_7sun_clocks: 744/154 Railgun_Quadruplet_7sun performance: 1312KB/clock Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 514%, 40194194 skips/iterations Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 744/158 Railgun_Quadruplet_7 performance: 1278KB/clock Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 656%, 31504148 skips/iterations Railgun_Quadruplet_7sunhorse_hits/Railgun_Quadruplet_7sunhorse_clocks: 744/272 Railgun_Quadruplet_7sunhorse performance: 742KB/clock Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 567%, 36434006 skips/iterations Railgun_Quadruplet_7deuce_hits/Railgun_Quadruplet_7deuce_clocks: 744/313 Railgun_Quadruplet_7deuce performance: 645KB/clock Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 710%, 29101626 skips/iterations Railgun_Quadruplet_7Elsiane_hits/Railgun_Quadruplet_7Elsiane_clocks: 744/383 Railgun_Quadruplet_7Elsiane performance: 527KB/clock Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 514%, 40194216 skips/iterations Boyer_Moore_Flensburg_hits/Boyer_Moore_Flensburg_clocks: 744/299 Boyer_Moore_Flensburg performance: 675KB/clock Doing Search for Pattern(6bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 106%, 194548807 skips/iterations Two-Way_hits/Two-Way_clocks: 744/836 Two-Way performance: 241KB/clock D:\_KAZE_strstr_SHORT-SHOWDOWN_7Elsiane_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF> */
我非常喜欢D.M. Sunday的这种前瞻性方法,但看不出如何加快速度,有什么想法吗?
2012年1月22日的补充
[
一场对决
'SunHorse' vs Railgun_Quadruplet_7sun vs Railgun_Quadruplet_7 vs Boyer_Moore vs Brute_Force
好的,Railgun_Quadruplet_7sun 来了——新的最快打击者。进行了两次测试——第一次/第二次“6x2”/“52”搜索12/52个模式(所有出现,即计算所有命中)到206,908,949字节长的英文文本中。
首先,7deuce已经成为过去,那是一场厄运,是的。哇,我在上次枪战中受到了附带损害,D.M. Sunday在速度性能和跳跃性能上都表现出色。
Speed-Performance (measured in KB/clock, bigger-the-better) Summary: Function '6x2' test (12 patterns 4-19 chars in length) Railgun_Quadruplet_7sun 2,092KB/clock / 153,936% / 005,572,202,064 Iterations BONBON Railgun_Quadruplet_7 2,030KB/clock / 137,568% / 006,454,416,320 Iterations Leaves-The-Town Railgun_Quadruplet_7sunhorse 1,250KB/clock / 173,840% / 005,145,288,080 Iterations UNDERRATED Boyer_Moore_Flensburg 1,154KB/clock / 139,168% / 006,428,115,568 Iterations Slow Railgun_Quadruplet_7deuce 1,125KB/clock / 155,744% / 005,995,597,776 Iterations Slower Brute_Force_Dummy 0,338KB/clock / 019,200% / 039,726,516,640 Iterations Slowest Function '52' test (52 patterns 4-175 chars in length) Railgun_Quadruplet_7sun 1,657KB/clock / 768,416% / 022,362,482,640 Iterations BONBON Railgun_Quadruplet_7 1,637KB/clock / 708,768% / 025,368,049,712 Iterations Leaves-The-Town Railgun_Quadruplet_7sunhorse 1,046KB/clock / 939,520% / 020,187,534,736 Iterations UNDERRATED Railgun_Quadruplet_7deuce 0,923KB/clock / 864,048% / 023,292,952,992 Iterations Slow Boyer_Moore_Flensburg 0,919KB/clock / 720,272% / 025,227,312,176 Iterations Slower Brute_Force_Dummy 0,276KB/clock / 083,200% / 172,148,232,496 Iterations Slowest Skip-Performance (measured in %, bigger-the-better; Iterations, lesser-the-better) Summary: Function '6x2' test (12 patterns 4-19 chars in length) Railgun_Quadruplet_7sunhorse 1,250KB/clock / 173,840% / 005,145,288,080 Iterations BONBON Railgun_Quadruplet_7deuce 1,125KB/clock / 155,744% / 005,995,597,776 Iterations Leaves-The-Town Railgun_Quadruplet_7sun 2,092KB/clock / 153,936% / 005,572,202,064 Iterations Bitter-Sweet Boyer_Moore_Flensburg 1,154KB/clock / 139,168% / 006,428,115,568 Iterations Loopy Railgun_Quadruplet_7 2,030KB/clock / 137,568% / 006,454,416,320 Iterations Loopier Brute_Force_Dummy 0,338KB/clock / 019,200% / 039,726,516,640 Iterations Loopiest Function '52' test (52 patterns 4-175 chars in length) Railgun_Quadruplet_7sunhorse 1,046KB/clock / 939,520% / 020,187,534,736 Iterations BONBON Railgun_Quadruplet_7deuce 0,923KB/clock / 864,048% / 023,292,952,992 Iterations Leaves-The-Town Railgun_Quadruplet_7sun 1,657KB/clock / 768,416% / 022,362,482,640 Iterations Bitter-Sweet Boyer_Moore_Flensburg 0,919KB/clock / 720,272% / 025,227,312,176 Iterations Loopy Railgun_Quadruplet_7 1,637KB/clock / 708,768% / 025,368,049,712 Iterations Loopier Brute_Force_Dummy 0,276KB/clock / 083,200% / 172,148,232,496 Iterations Loopiest
对于英文文本,我毫不犹豫地选择哪个函数:它是Railgun_Quadruplet_7sunhorse (Boyer-Moore通过Sunday-Horspool强化)——所有Boyer-Moore变体中最强的(包括原始版本),可惜[仍然]不是最快的。我无法弄清楚这种不成比例的原因:“sunhorse”具有惊人的(对于英文文本)跳过性能,即模式与文本的比较次数非常低,但速度性能(在此CPU上)却滞后。
测试包可在此下载
http://www.sanmayce.com/Downloads/_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF.7z
测试包包含
D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>dir 01/22/2012 07:00 AM 59,198 FH_Flensburg.zip 01/22/2012 07:00 AM 206,908,949 OSHO.TXT 01/22/2012 07:00 AM 469,283 Results_Pentium_Merom_2166Mhz.txt 01/22/2012 07:00 AM 175 RUNME_IT_TAKES_some_20_MINUTES.BAT 01/22/2012 07:00 AM 191 strstr_COMPILE_Microsoft.bat 01/22/2012 07:00 AM 603,625 strstr_SHORT-SHOWDOWN.c 01/22/2012 07:00 AM 633,053 strstr_SHORT-SHOWDOWN.cod 01/22/2012 07:00 AM 108,032 strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe D:\_KAZE_strstr_SHORT-SHOWDOWN_7sun_vs_7_vs_7sunhorse_vs_7deuce_vs_BMF>
一篇非常易于理解的文章,网址为 http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bm.htm。在此,我对汉斯·维尔纳·朗教授(弗伦斯堡应用科学大学媒体信息学与技术信息学研究所)表示“三倍感谢”,我使用了您的代码(经过直接移植)。在他的页面(http://www.inf.fh-flensburg.de/lang/)上,有非常好的算法介绍——确实是一个非常好的资源。我期望Railgun_Quadruplet_7sunhorse成为首选武器,但仅限于那些分支/内存处理效率远高于我的Pentium Merom的CPU。
我仍然不理解Turbo Boyer-Moore算法,也许它能显著提升跳过性能。这个项目仍在研究中,所以我称之为探索。欢迎任何能指出更快函数的人在此与我们分享。
我最喜欢的被低估的武器的主循环如下:
; 3209 : i=0; ; 3210 : while (i <= cbTarget-cbPattern-1) { 010e6 8b 4c 24 20 mov ecx, DWORD PTR _cbTarget$GSCopy$[esp+2104] 010ea 89 44 24 28 mov DWORD PTR _ulHashPattern$[esp+2104], eax 010ee 33 c0 xor eax, eax 010f0 2b ca sub ecx, edx 010f2 89 4c 24 14 mov DWORD PTR tv554[esp+2104], ecx 010f6 8d 0c 13 lea ecx, DWORD PTR [ebx+edx] 010f9 89 44 24 10 mov DWORD PTR _AdvanceHopperGrass$[esp+2104], eax 010fd 89 4c 24 18 mov DWORD PTR tv451[esp+2104], ecx $LN13@Railgun_Qu@5: ; 3211 : ulHashTarget=*(unsigned long *)&pbTarget[i]; ; 3212 : if ( ulHashTarget == ulHashPattern) 01101 8b 54 24 28 mov edx, DWORD PTR _ulHashPattern$[esp+2104] 01105 39 14 18 cmp DWORD PTR [eax+ebx], edx 01108 75 41 jne SHORT $LN8@Railgun_Qu@5 ; 3213 : { ; 3214 : count = countSTATIC; 0110a 8b 7c 24 1c mov edi, DWORD PTR _countSTATIC$[esp+2104] ; 3215 : while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 // respectively i.e. no need of comparing in-between chars. 0110e 85 ff test edi, edi 01110 74 2d je SHORT $LN85@Railgun_Qu@5 ; 3213 : { ; 3214 : count = countSTATIC; 01112 8d 56 04 lea edx, DWORD PTR [esi+4] 01115 8d 74 18 04 lea esi, DWORD PTR [eax+ebx+4] 01119 8d a4 24 00 00 00 00 npad 7 $LL10@Railgun_Qu@5: ; 3215 : while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) == *(char *)(&pbTarget[i]+(countSTATIC-count)+4) ) { // if pattern length is 4 or 5 we have count=-1 and count=0 // respectively i.e. no need of comparing in-between chars. 01120 8a 0a mov cl, BYTE PTR [edx] 01122 3a 0e cmp cl, BYTE PTR [esi] 01124 75 11 jne SHORT $LN9@Railgun_Qu@5 ; 3216 : count--; 01126 46 inc esi 01127 42 inc edx 01128 4f dec edi 01129 75 f5 jne SHORT $LL10@Railgun_Qu@5 ; 3217 : } ; 3218 : if ( count == 0) Railgunhits++; //return(pbTarget+i); 0112b 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104] 0112f ff 05 00 00 00 00 inc DWORD PTR _Railgunhits 01135 eb 14 jmp SHORT $LN8@Railgun_Qu@5 $LN9@Railgun_Qu@5: 01137 85 ff test edi, edi 01139 75 0c jne SHORT $LN111@Railgun_Qu@5 0113b 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104] $LN85@Railgun_Qu@5: 0113f ff 05 00 00 00 00 inc DWORD PTR _Railgunhits 01145 eb 04 jmp SHORT $LN8@Railgun_Qu@5 $LN111@Railgun_Qu@5: 01147 8b 74 24 24 mov esi, DWORD PTR _pbPattern$GSCopy$[esp+2104] $LN8@Railgun_Qu@5: ; 3219 : } ; 3220 : ; 3221 : // i+=cbPattern; ; 3222 : // i= i - bm_bc2nd[(unsigned char)pbTarget[i]]; ; 3223 : ; 3224 : //if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < // ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]) ) ; 3225 : // i= i + ((cbPattern) - bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]); ; 3226 : if ( bm_bc[(unsigned char)pbTarget[i+cbPattern-1]] < bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]] ) 0114b 8b 54 24 18 mov edx, DWORD PTR tv451[esp+2104] 0114f 0f b6 4c 02 ff movzx ecx, BYTE PTR [edx+eax-1] 01154 0f b6 14 02 movzx edx, BYTE PTR [edx+eax] 01158 8b 4c 8c 30 mov ecx, DWORD PTR _bm_bc$[esp+ecx*4+2104] 0115c 8b 94 94 30 04 00 00 mov edx, DWORD PTR _bm_bc2nd$[esp+edx*4+2104] 01163 3b ca cmp ecx, edx 01165 73 04 jae SHORT $LN7@Railgun_Qu@5 ; 3227 : i= i + bm_bc2nd[(unsigned char)pbTarget[i+(cbPattern)]]; 01167 03 c2 add eax, edx ; 3228 : else 01169 eb 02 jmp SHORT $LN6@Railgun_Qu@5 $LN7@Railgun_Qu@5: ; 3229 : i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]]; 0116b 03 c1 add eax, ecx $LN6@Railgun_Qu@5: ; 3207 : ; 3208 : AdvanceHopperGrass = 0; ; 3209 : i=0; ; 3210 : while (i <= cbTarget-cbPattern-1) { 0116d 8b 4c 24 14 mov ecx, DWORD PTR tv554[esp+2104] ; 3230 : ; 3231 : ; 3232 : AdvanceHopperGrass++; 01171 ff 44 24 10 inc DWORD PTR _AdvanceHopperGrass$[esp+2104] 01175 49 dec ecx 01176 3b c1 cmp eax, ecx 01178 76 87 jbe SHORT $LN13@Railgun_Qu@5 ; 3233 : }
尺寸是
01178 - 01101 +1 +1 -4 = 75字节,-4是因为 'inc DWORD PTR _AdvanceHopperGrass$[esp+2104]
' 仅用于统计目的。
]
背景
功能主页:http://www.sanmayce.com/Railgun/index.html
注意:为了重现图片中(下方)所示的测试,您需要从函数主页下载测试包(57.2 MB),感谢Randor指出。
关于strstr搜索技术的非常有用的资源
使用16种模式,对197MB的干草堆(纯英文书籍文本)进行测试。
机器是一台2.1GHz的普通笔记本电脑,搭载Intel Merom CPU。
Using the Code
代码非常简单明了。
// Here the strstr-like function Railgun_Quadruplet_6ppp is given,
// written by Georgi 'Kaze', 2011-Sep-07.
//
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [
// Caution: For better speed the case 'if (cbPattern==1)' was removed,
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_6ppp (char * pbTarget,
char * pbPattern,
unsigned long cbTarget,
unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
//unsigned long count; //r.6+
signed long count;
//unsigned long countSTATIC; //r.6+
signed long countSTATIC;
// unsigned long countRemainder;
/*
const unsigned char SINGLET = *(char *)(pbPattern);
const unsigned long Quadruplet2nd = SINGLET<<8;
const unsigned long Quadruplet3rd = SINGLET<<16;
const unsigned long Quadruplet4th = SINGLET<<24;
*/
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i; //BMH needed
int a, j, bm_bc[ASIZE]; //BMH needed
unsigned char ch; //BMH needed
// unsigned char lastch, firstch; //BMH needed
if (cbPattern > cbTarget)
return(NULL);
// Doesn't work when cbPattern = 1
// The next IF-fragment works very well with cbPattern>1,
// OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality)
// SINCE either cbPattern=2 or cbPattern=3!
if ( cbPattern<4) { // This IF makes me unhappy: it slows down from
// 390KB/clock to 367KB/clock for 'fast' pattern.
// This fragment(for 2..3 pattern lengths) is needed
// because I need a function different than strchr, but
// sticking to strstr i.e. lengths above 1 are to be handled.
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
// countSTATIC = cbPattern-2;
if ( cbPattern==3) {
for ( ;; )
{
if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else {
}
for ( ;; )
{
// The line below gives for 'cbPattern'>=1:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543
// Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock
/*
if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) +
*(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern,
(unsigned int)cbPattern) )
return((long)(pbTarget-cbPattern));
*/
// The fragment below gives for 'cbPattern'>=2:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546
// Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock
/*
//For 2 and 3 [
if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
// count = countSTATIC;
count = cbPattern-2;
// while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) ==
// *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
while ( count && *(char *)(pbPattern+1) == *(char *)(pbTarget-2) )
{ // Crippling i.e. only 2 and 3 chars are allowed!
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
}
if ( (char)(ulHashPattern>>8) != *(pbTarget-cbPattern+1) ) pbTarget++;
//For 2 and 3 ]
*/
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) )
return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
// The fragment below gives for 'cbPattern'>=2:
// Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554
// Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock
/*
if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
count = countSTATIC>>2;
countRemainder = countSTATIC % 4;
while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) ==
*(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) {
count--;
}
//if (count == 0) { // Disastrous degradation only from this line
// (317KB/clock when 1+2x4+2+1 bytes pattern:
// 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes
// pattern: 'underdog'), otherwise 368KB/clock.
while ( countRemainder && *(char *)(pbPattern+1+
(countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+
(countSTATIC-countRemainder)) ) {
countRemainder--;
}
//if ( countRemainder == 0) return((long)(pbTarget-cbPattern));
if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern));
//}
}
*/
pbTarget++;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if ( cbPattern<4)
if (cbTarget<961) // This value is arbitrary(don't know how exactly),
// it ensures(at least must)
// better performance than 'Boyer_Moore_Horspool'.
{
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
// countSTATIC = cbPattern-1;
//SINGLET = *(char *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; )
{
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here,
// but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
// count = countSTATIC;
// while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) ==
// *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
// if ( countSTATIC==AdvanceHopperGrass+count && SINGLET !=
// *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) )
// AdvanceHopperGrass++;
// count--;
// }
count = cbPattern-1;
while ( count && *(char *)(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET !=
*(char *)(pbTarget-count) ) AdvanceHopperGrass++;
count--;
}
if ( count == 0) return((pbTarget-cbPattern));
} else { // The goal here: to avoid memory accesses by stressing the registers.
if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax)
return(NULL);
}
} else { //if (cbTarget<961)
//countSTATIC = cbPattern-2; //r.6+
//countSTATIC = cbPattern-2-3; //r.6++
countSTATIC = cbPattern-2-2; //r.6+++ This line fixes the bug from r.6++!
ulHashPattern = *(unsigned long *)(pbPattern);
/* Preprocessing */
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
/* Searching */
//lastch=pbPattern[cbPattern-1];
//firstch=pbPattern[0];
i=0;
while (i <= cbTarget-cbPattern) {
ch=pbTarget[i+cbPattern-1];
//if (ch ==lastch)
//if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
//if (ch == lastch && pbTarget[i] == firstch &&
//memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);
// Kaze: The idea(to prevent execution of slower 'memcmp')
// is borrowed from Karp-Rabin i.e. to perform a slower check only
// when the target "looks like".
//if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) // r.6+
//if (ch == pbPattern[cbPattern-1] && *(long *)&pbTarget[i]
// == *(long *)&pbPattern[0]) // No problema here since we have 4[+]
// long pattern here. Overlapping (1 byte recompared) when length=4, grmbl.
if (ch == pbPattern[cbPattern-1] &&
*(long *)&pbTarget[i] == ulHashPattern) // No problema here since
// we have 4[+] long pattern here. Overlapping
// (1 byte recompared) when length=4, grmbl.
{
count = countSTATIC;
//while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) ==
//*(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) { // r.6+
while ( count !=0 && *(char *)(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+(countSTATIC-count)) )
{ // if pattern length is 4 or 5 we have count=-1 and count=0
// respectively i.e. no need of comparing in-between chars.
count--;
}
if ( count <= 0) return(pbTarget+i);
}
i+=bm_bc[ch];
}
return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]
咚咚咚,我来了,这是 Railgun_Quadruplet
修订版7
// Caution: For better speed the case 'if (cbPattern==1)' was removed,
// so Pattern must be longer than 1 char.
char * Railgun_Quadruplet_7 (char * pbTarget, char * pbPattern,
unsigned long cbTarget, unsigned long cbPattern)
{
char * pbTargetMax = pbTarget + cbTarget;
register unsigned long ulHashPattern;
unsigned long ulHashTarget;
signed long count;
signed long countSTATIC;
unsigned char SINGLET;
unsigned long Quadruplet2nd;
unsigned long Quadruplet3rd;
unsigned long Quadruplet4th;
unsigned long AdvanceHopperGrass;
long i;
int a, j, bm_bc[ASIZE];
unsigned char ch;
if (cbPattern > cbTarget) return(NULL);
if (cbPattern<4) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = ( (*(char *)(pbPattern))<<8 ) +
*(pbPattern+(cbPattern-1));
if (cbPattern==3) {
for ( ;; ) {
if ( ulHashPattern ==
( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
if ( *(char *)(pbPattern+1) ==
*(char *)(pbTarget-2) ) return((pbTarget-3));
}
if ( (char)(ulHashPattern>>8) !=
*(pbTarget-2) ) pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
}
for ( ;; ) {
if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 )
+ *(pbTarget-1) ) return((pbTarget-2));
if ( (char)(ulHashPattern>>8) != *(pbTarget-1) )
pbTarget++;
pbTarget++;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
if (cbTarget<961) {
pbTarget = pbTarget+cbPattern;
ulHashPattern = *(unsigned long *)(pbPattern);
SINGLET = ulHashPattern & 0xFF;
Quadruplet2nd = SINGLET<<8;
Quadruplet3rd = SINGLET<<16;
Quadruplet4th = SINGLET<<24;
for ( ;; ) {
AdvanceHopperGrass = 0;
ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);
if ( ulHashPattern == ulHashTarget ) {
count = cbPattern-1;
while ( count && *(char *)
(pbPattern+(cbPattern-count)) ==
*(char *)(pbTarget-count) ) {
if ( cbPattern-1==
AdvanceHopperGrass+count &&
SINGLET != *(char *)(pbTarget-count)
) AdvanceHopperGrass++;
count--;
}
if ( count == 0)
return((pbTarget-cbPattern));
} else {
if ( Quadruplet2nd != (ulHashTarget
& 0x0000FF00) ) {
AdvanceHopperGrass++;
if ( Quadruplet3rd !=
(ulHashTarget & 0x00FF0000) ) {
AdvanceHopperGrass++;
if ( Quadruplet4th !=
(ulHashTarget & 0xFF000000)
) AdvanceHopperGrass++;
}
}
}
AdvanceHopperGrass++;
pbTarget = pbTarget + AdvanceHopperGrass;
if (pbTarget > pbTargetMax) return(NULL);
}
} else {
countSTATIC = cbPattern-2-2;
ulHashPattern = *(unsigned long *)(pbPattern);
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=
cbPattern-j-1;
i=0;
while (i <= cbTarget-cbPattern) {
if ( *(unsigned long *)&pbTarget[i] ==
ulHashPattern ) {
count = countSTATIC;
while ( count !=0 && *(char *)
(pbPattern+1+3+(countSTATIC-count)) ==
*(char *)(&pbTarget[i]+1+3+
(countSTATIC-count)) ) count--;
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
}
}
}
(已过时的)r.6+++显着加速的(BMH片段中的)调整是这样的
//countSTATIC = cbPattern-2; //r.6+
//countSTATIC = cbPattern-2-3;
//countSTATIC = cbPattern-2-2; // r.6+++ I suppose that the awful degradation
// comes from 2bytes more (from either 'if (countSTATIC<0)
// countSTATIC=0;' or 'count >0' fixes) which make the
// function unfittable in code cache lines?!
//countSTATIC = cbPattern-2-3; // r.7- At last no recompared bytes
// in-between chars
countSTATIC = cbPattern-2-2; // r.7
ulHashPattern = *(unsigned long *)(pbPattern);
//chPTR=(unsigned char *)&chchchch+3;
// Next line fixes the BUG from r.6++: but with awful speed degradation!
// So the bug is fixed in the definitions by setting 'countSTATIC = cbPattern-2-2;',
// bug appears only for patterns with lengths of 4, The setback is one unnecessary
// comparison for 5bytes patterns, stupidly such setback exists (from before) for
// 4bytes as well.
//if (countSTATIC<0) countSTATIC=0;
/* Preprocessing */
for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1;
/* Searching */
//lastch=pbPattern[cbPattern-1];
//firstch=pbPattern[0];
i=0;
while (i <= cbTarget-cbPattern) {
//ch=pbTarget[i+cbPattern-1];
//ch=pbTarget[i];
//if ( pbTarget[i] == pbPattern[0] && *(unsigned long *)
//&pbTarget[i+cbPattern-1-3] == ulHashPattern) // No problema here since
// we have 4[+] long pattern here. Overlapping
// (1 byte recompared) when length=4, grmbl.
if ( *(unsigned long *)&pbTarget[i] == ulHashPattern ) // The lesson I learned
// from r.7- now applied in r.7: instead of extracting 'ch'
// having higher address now the lower address is extracted
// first in order (hopefully, the test confirms it) the next
// 32bytes (including 'ch') to be cached i.e. to comparison
// part is faster.
{
count = countSTATIC;
while ( count !=0 && *(char *)(pbPattern+(countSTATIC-count)+4) ==
*(char *)(&pbTarget[i]+(countSTATIC-count)+4) )
{ // if pattern length
// is 4 or 5 we have count=-1 and count=0 respectively i.e. no need of
// comparing in-between chars.
count--;
}
if ( count == 0) return(pbTarget+i);
}
i= i + bm_bc[(unsigned char)pbTarget[i+cbPattern-1]];
}
return(NULL);
为了了解我为进行此调整所必须经历的路径,您可以查看讨论板(r.7m主题)。下面的测试未能显示完整的提升,仅仅是因为它们平均了所有长度的时间,从而丢失/减弱了高速运行的时间(与模式长度2-3-4-5-6相比,慢了4-3-2倍)。
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): frustration
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 1355/101
Railgun_6pp performance: 2000KB/clock
Doing Search for Pattern(11bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1355/91
Railgun_Quadruplet_7 performance: 2220KB/clock
Note: Executing the next two tests 256 times i.e. the search is for 256x8x2 patterns!
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_6pp performance: 922KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_6pp performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_6pp performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_6pp performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_6pp performance: 1167KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_6pp performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_6pp performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_6pp performance: 93KB/clock
Found ('underdog') at 137424498 position,
Railgun_Quadruplet_6pp performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_6pp performance: 1603KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_6pp performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_6pp performance: 1836KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_6pp performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_6pp performance: 3KB/clock
Railgun_Quadruplet_6pp 8x2 i.e. average performance: 1342KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Not-Counting-Hits-Just-Returns-First-One
Found ('an') at 157 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('to') at 853 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('TDK') at -4325408 position, Railgun_Quadruplet_7 performance: 990KB/clock
Found ('the') at 511 position, Railgun_Quadruplet_7 performance: 0KB/clock
Found ('fast') at 42381 position, Railgun_Quadruplet_7 performance: 41KB/clock
Found ('easy') at 30163 position, Railgun_Quadruplet_7 performance: 29KB/clock
Found ('grmbl') at -4325408 position, Railgun_Quadruplet_7 performance: 1069KB/clock
Found ('email') at 69587982 position, Railgun_Quadruplet_7 performance: 1078KB/clock
Found ('pasting') at 134600126 position, Railgun_Quadruplet_7 performance: 1383KB/clock
Found ('amazing') at 1629806 position, Railgun_Quadruplet_7 performance: 1591KB/clock
Found ('underdog') at 137424498 position, Railgun_Quadruplet_7 performance: 1698KB/clock
Found ('superdog') at -4325408 position, Railgun_Quadruplet_7 performance: 1422KB/clock
Found ('participants') at 16060 position, Railgun_Quadruplet_7 performance: 15KB/clock
Found ('skillessness') at -4325408 position,
Railgun_Quadruplet_7 performance: 2149KB/clock
Found ('I should have known') at 41831880 position,
Railgun_Quadruplet_7 performance: 2403KB/clock
Found ('human consciousness') at 3863 position,
Railgun_Quadruplet_7 performance: 3KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1354KB/clock
ReallyTraversed: 310,477,846,528 bytes
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('to') 1076629 time(s), BM_HORSPOOL performance: 293KB/clock
Found ('TDK') 0 time(s), BM_HORSPOOL performance: 516KB/clock
Found ('the') 2114180 time(s), BM_HORSPOOL performance: 379KB/clock
Found ('fast') 5945 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('easy') 5191 time(s), BM_HORSPOOL performance: 585KB/clock
Found ('grmbl') 0 time(s), BM_HORSPOOL performance: 759KB/clock
Found ('email') 1 time(s), BM_HORSPOOL performance: 756KB/clock
Found ('pasting') 2 time(s), BM_HORSPOOL performance: 918KB/clock
Found ('amazing') 323 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('underdog') 4 time(s), BM_HORSPOOL performance: 1069KB/clock
Found ('superdog') 0 time(s), BM_HORSPOOL performance: 1074KB/clock
Found ('participants') 147 time(s), BM_HORSPOOL performance: 1603KB/clock
Found ('skillessness') 0 time(s), BM_HORSPOOL performance: 1422KB/clock
Found ('I should have known') 1 time(s), BM_HORSPOOL performance: 1433KB/clock
Found ('human consciousness') 519 time(s), BM_HORSPOOL performance: 1820KB/clock
BM_HORSPOOL 8x2 i.e. average performance: 666KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_6pp performance: 678KB/clock
Found ('to') 1076629 time(s), Railgun_6pp performance: 643KB/clock
Found ('TDK') 0 time(s), Railgun_6pp performance: 1174KB/clock
Found ('the') 2114180 time(s), Railgun_6pp performance: 678KB/clock
Found ('fast') 5945 time(s), Railgun_6pp performance: 918KB/clock
Found ('easy') 5191 time(s), Railgun_6pp performance: 990KB/clock
Found ('grmbl') 0 time(s), Railgun_6pp performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_6pp performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_6pp performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_6pp performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_6pp performance: 1603KB/clock
Found ('superdog') 0 time(s), Railgun_6pp performance: 1820KB/clock
Found ('participants') 147 time(s), Railgun_6pp performance: 2149KB/clock
Found ('skillessness') 0 time(s), Railgun_6pp performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_6pp performance: 2557KB/clock
Found ('human consciousness') 519 time(s), Railgun_6pp performance: 2126KB/clock
Railgun_6pp 8x2 i.e. average performance: 1218KB/clock
Doing Search for 8x2 Patterns into String(206908949bytes) as-one-line ...
Found ('an') 1987797 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('to') 1076629 time(s), Railgun_Quadruplet_7 performance: 678KB/clock
Found ('TDK') 0 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('the') 2114180 time(s), Railgun_Quadruplet_7 performance: 643KB/clock
Found ('fast') 5945 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('easy') 5191 time(s), Railgun_Quadruplet_7 performance: 1074KB/clock
Found ('grmbl') 0 time(s), Railgun_Quadruplet_7 performance: 1287KB/clock
Found ('email') 1 time(s), Railgun_Quadruplet_7 performance: 1167KB/clock
Found ('pasting') 2 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('amazing') 323 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('underdog') 4 time(s), Railgun_Quadruplet_7 performance: 1820KB/clock
Found ('superdog') 0 time(s), Railgun_Quadruplet_7 performance: 1603KB/clock
Found ('participants') 147 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('skillessness') 0 time(s), Railgun_Quadruplet_7 performance: 2126KB/clock
Found ('I should have known') 1 time(s), Railgun_Quadruplet_7 performance: 2557KB/clock
Found ('human consciousness') 519 time(s),
Railgun_Quadruplet_7 performance: 2557KB/clock
Railgun_Quadruplet_7 8x2 i.e. average performance: 1232KB/clock
^C
D:\_KAZE_new-stuff>
在我看来,这是速度的七大戒律之一:不要本末倒置,也就是说,对于相邻的(32?)内存访问,首先提取地址最低的字节。strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe,修订版7,一些快速输入。
D:\_KAZE_new-stuff>strstr_SHORT-SHOWDOWN_Microsoft_v16_Ox.exe
strstr_SHORT-SHOWDOWN, revision 7, written by Kaze.
Input Pattern(up to 19+2000 chars): fast
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_6pp_hits/Railgun_6pp_clocks: 5945/215
Railgun_6pp performance: 939KB/clock
Doing Search for Pattern(4bytes) into String(206908949bytes) as-one-line ...
WARNING! The next line works with BMH only for pattern 4[+] long,
otherwise (for 2 and 3) other searcher takes over!
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 5945/193
Railgun_Quadruplet_7 performance: 1046KB/clock
^C
...
Input Pattern(up to 19+2000 chars): from
Railgun_6pp_hits/Railgun_6pp_clocks: 83155/196
Railgun_6pp performance: 1030KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 83155/194
Railgun_Quadruplet_7 performance: 1041KB/clock
Input Pattern(up to 19+2000 chars): rabea
Railgun_6pp_hits/Railgun_6pp_clocks: 0/176
Railgun_6pp performance: 1148KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): makes
Railgun_6pp_hits/Railgun_6pp_clocks: 9281/172
Railgun_6pp performance: 1174KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 9281/162
Railgun_Quadruplet_7 performance: 1247KB/clock
Input Pattern(up to 19+2000 chars): monkey
Railgun_6pp_hits/Railgun_6pp_clocks: 1691/141
Railgun_6pp performance: 1433KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1691/139
Railgun_Quadruplet_7 performance: 1453KB/clock
Input Pattern(up to 19+2000 chars): punny
Railgun_6pp_hits/Railgun_6pp_clocks: 0/156
Railgun_6pp performance: 1295KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/155
Railgun_Quadruplet_7 performance: 1303KB/clock
Input Pattern(up to 19+2000 chars): funny
Railgun_6pp_hits/Railgun_6pp_clocks: 179/157
Railgun_6pp performance: 1287KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 179/156
Railgun_Quadruplet_7 performance: 1295KB/clock
Input Pattern(up to 19+2000 chars): ramjet
Railgun_6pp_hits/Railgun_6pp_clocks: 0/152
Railgun_6pp performance: 1329KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): fallen
Railgun_6pp_hits/Railgun_6pp_clocks: 2578/151
Railgun_6pp performance: 1338KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 2578/138
Railgun_Quadruplet_7 performance: 1464KB/clock
Input Pattern(up to 19+2000 chars): elephant
Railgun_6pp_hits/Railgun_6pp_clocks: 1198/124
Railgun_6pp performance: 1629KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1198/112
Railgun_Quadruplet_7 performance: 1804KB/clock
Input Pattern(up to 19+2000 chars): layalina
Railgun_6pp_hits/Railgun_6pp_clocks: 0/121
Railgun_6pp performance: 1669KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/110
Railgun_Quadruplet_7 performance: 1836KB/clock
Input Pattern(up to 19+2000 chars): I should have known
Railgun_6pp_hits/Railgun_6pp_clocks: 1/89
Railgun_6pp performance: 2270KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/84
Railgun_Quadruplet_7 performance: 2405KB/clock
Input Pattern(up to 19+2000 chars): human consciousness
Railgun_6pp_hits/Railgun_6pp_clocks: 519/80
Railgun_6pp performance: 2525KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 519/75
Railgun_Quadruplet_7 performance: 2694KB/clock
Input Pattern(up to 19+2000 chars): But he's looking right through me
Railgun_6pp_hits/Railgun_6pp_clocks: 0/84
Railgun_6pp performance: 2405KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/76
Railgun_Quadruplet_7 performance: 2658KB/clock
Input Pattern(up to 19+2000 chars): I'm living in an age that calls darkness light
Railgun_6pp_hits/Railgun_6pp_clocks: 0/72
Railgun_6pp performance: 2806KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/67
Railgun_Quadruplet_7 performance: 3015KB/clock
Input Pattern(up to 19+2000 chars): Following the wanderings of a dream -
a dream that keeps my soul alive
Railgun_6pp_hits/Railgun_6pp_clocks: 0/69
Railgun_6pp performance: 2928KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): I notice what matters and
I got nothing to lose but darkness and shadows
Railgun_6pp_hits/Railgun_6pp_clocks: 0/66
Railgun_6pp performance: 3061KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/63
Railgun_Quadruplet_7 performance: 3207KB/clock
Input Pattern(up to 19+2000 chars): Beth Ditto
Railgun_6pp_hits/Railgun_6pp_clocks: 0/118
Railgun_6pp performance: 1712KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/107
Railgun_Quadruplet_7 performance: 1888KB/clock
Input Pattern(up to 19+2000 chars): SR71
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/182
Railgun_Quadruplet_7 performance: 1110KB/clock
Input Pattern(up to 19+2000 chars): Apache
Railgun_6pp_hits/Railgun_6pp_clocks: 1/162
Railgun_6pp performance: 1247KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/133
Railgun_Quadruplet_7 performance: 1519KB/clock
Input Pattern(up to 19+2000 chars): Fibonacci
Railgun_6pp_hits/Railgun_6pp_clocks: 0/106
Railgun_6pp performance: 1906KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/98
Railgun_Quadruplet_7 performance: 2061KB/clock
Input Pattern(up to 19+2000 chars): Tesla
Railgun_6pp_hits/Railgun_6pp_clocks: 0/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): Albert
Railgun_6pp_hits/Railgun_6pp_clocks: 932/154
Railgun_6pp performance: 1312KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 932/137
Railgun_Quadruplet_7 performance: 1474KB/clock
Input Pattern(up to 19+2000 chars): Toshiba
Railgun_6pp_hits/Railgun_6pp_clocks: 0/130
Railgun_6pp performance: 1554KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 0/121
Railgun_Quadruplet_7 performance: 1669KB/clock
Input Pattern(up to 19+2000 chars): Orion
Railgun_6pp_hits/Railgun_6pp_clocks: 1/175
Railgun_6pp performance: 1154KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 1/160
Railgun_Quadruplet_7 performance: 1262KB/clock
Input Pattern(up to 19+2000 chars): pharaoh
Railgun_6pp_hits/Railgun_6pp_clocks: 26/129
Railgun_6pp performance: 1566KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 26/122
Railgun_Quadruplet_7 performance: 1656KB/clock
Input Pattern(up to 19+2000 chars): lemon
Railgun_6pp_hits/Railgun_6pp_clocks: 33/178
Railgun_6pp performance: 1135KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 33/161
Railgun_Quadruplet_7 performance: 1255KB/clock
Input Pattern(up to 19+2000 chars): grammar
Railgun_6pp_hits/Railgun_6pp_clocks: 218/123
Railgun_6pp performance: 1642KB/clock
Railgun_Quadruplet_7_hits/Railgun_Quadruplet_7_clocks: 218/117
Railgun_Quadruplet_7 performance: 1727KB/clock
D:\_KAZE_new-stuff>
嗯,模式“SR71”与“fast”相比,给我一个暗示,事情离“最快”还差得很远。
关注点
关于我所做测试的有趣之处:“memcmp”和“memcpy”如果正确地内联,会给出更好的结果。反复出现的一点是:永远不要想当然。一个错误(不影响之前结果,正确)已被修复,现在r.6+++已准备就绪/可以运行。
正如您从C(无注释和表格化)和ASM源代码中看到的那样(请参阅页面顶部进行下载)
The Microsoft 16.00.30319.01 compiled Railgun_Quadruplet_6ppp is 00c6b-00a00+5 = 624 bytes long.
The Quadruplet (where cbTarget<961) fragment loop is 00bc3-00b17+5 = 177 bytes long.
The Boyer-Moore-Horspool (where cbTarget>=961) fragment loop is 00c61-00c20+2 = 67 bytes long.
对于更有经验的程序员来说,有一个有趣的问题:**如果将BMH循环减少3字节(即减少到64字节),我们能否获得额外的/显著的提升?**
历史
- 修订版6+++ 编写于 2011年9月7日
- 修订版7 编写于 2011年9月25日
- 修订版7“Sun”编写于 2012年1月22日
- 修订版7“Elsiane”编写于 2012年1月24日
- 修订版7“Gulliver”编写于 2012年1月26日
- 修订版7“Hasherezade”编写于 2012年2月1日
- 修订版7“Decumanus”编写于 2014年1月12日