最快的全文本(矢量和标量)精确搜索器?!
一个报告精确匹配数量的全文命令行工具,速度非常快!
引言
现在是向量时代了。
现在是时候使用向量来基准测试在干草堆中找针这一基本任务了。
实现了一个简单的向量memmem()
练习,很高兴与大家分享。展示了用C编写的(标量和向量,并排展示)超快速memmem()
函数。
这里附带的控制台工具只是对这些函数的包装,本身就是一个框架,NyoTengu 可以很好地作为进一步开发搜索工具的蓝图(起点),具有更多功能。
背景
这个主题在书籍中随处可见,但精确搜索却没有得到优化,这是一个令人难过的现实。我不了解任何在现代机器上以10GB/s的速度运行的函数(更不用说工具了),请看下面的我的主流笔记本电脑如何执行这些共享的C练习 - 使用GCC和ICL编译。
全文精确对决 — 最快的GREP vs NyoTengu
测试机器:运行Windows 10的笔记本电脑,i5-7200U @2.5GHz,8GB DDR4 2133MHz
Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle: "Linus Torvalds"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle1.txt | 11,714,194,285 B/s; 0.084 seconds | 602 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle1.txt | 11,441,771,162 B/s; 0.086 seconds | 602 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "Linus Torvalds" li...tar | N.A.; 0.256 seconds | 602 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle: "5.8.5"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle2.txt | 9,553,323,495 B/s; 0.103 seconds | 74028 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle2.txt | 9,461,464,615 B/s; 0.104 seconds | 74028 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "5.8.5" li...tar | N.A.; 0.212 seconds | 74028 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: enwiki-20201001-all-titles (1,193,068,592 bytes)
Needle: "grep"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe enwiki-20201001-all-titles grep.txt | 12,051,197,898 B/s; 0.099 seconds | 173 |
| NyoTengu_YMM_GCC730_64bit.exe enwiki-20201001-all-titles grep.txt | 11,583,190,213 B/s; 0.103 seconds | 173 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "grep" enwiki-2...titles | N.A.; 0.378 seconds | 173 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: gcc-7.3.0.tar (615,567,360 bytes)
Needle: "SIMD"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe gcc-7.3.0.tar SIMD.txt | 11,837,833,846 B/s; 0.052 seconds | 3606 |
| NyoTengu_YMM_GCC730_64bit.exe gcc-7.3.0.tar SIMD.txt | 11,192,133,818 B/s; 0.055 seconds | 3606 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "SIMD" gcc-7.3.0.tar | N.A.; 0.121 seconds | 3606 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle: "ACGT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 4.txt | 3,508,357,471 B/s; 0.956 seconds | 1607850 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 4.txt | 4,651,858,173 B/s; 0.721 seconds | 1607850 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "ACGT" "SCB_DNA...-28" | N.A.; 5.701 seconds | 1607850 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle: "AACCGGTT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 8.txt | 2,187,860,236 B/s; 1.533 seconds | 2723 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 8.txt | 1,968,303,839 B/s; 1.704 seconds | 2723 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "AACCGGTT" "SCB_...-28" | N.A.; 5.320 seconds | 2723 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
为了重现基准测试,这里是软件包(除了四个语料库)
四个语料库可以在软件包之外下载
- https://linuxkernel.org.cn/pub/linux/kernel/v5.x/linux-5.8.5.tar.xz
- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
- https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-all-titles.gz
- https://ftp.gnu.org/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz
注意:最快的GREP - RIPGREP (https://github.com/
结论:当前的NyoTengu 比当前的RIPGREP 快2倍到5倍,这是256位世界,谁知道512位世界会怎样……
我打算用Railgun_Trolldom_64
的比较器替换memcmp()
实例(不使用外部代码)……
我相信,Railgun_Nyotengu_XMM_YMM_ZMM()
目前是可用的最快的memmem()
,如果有人能证明我错了,我将非常高兴。 :P
Using the Code
附带的NyoTengu.c 包含3022行代码,可以按以下方式编译
在Windows上,使用ICL
icl /O3 Nyotengu.c -DXMMtengu -D_WIN32_ENVIRONMENT_
-D_N_HIGH_PRIORITY /FeNyotengu_XMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DYMMtengu -D_WIN32_ENVIRONMENT_
-D_N_HIGH_PRIORITY /FeNyotengu_YMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DZMMtengu -D_WIN32_ENVIRONMENT_
-D_N_HIGH_PRIORITY /FeNyotengu_ZMM_IntelV150_32bit /FAcs
在*nix上,使用GCC
gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC_64bit.elf
-DYMMtengu -D_POSIX_ENVIRONMENT_
以下片段是由GCC 7.3.0以这种方式生成的
gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.exe
-DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY
gcc -O3 -mavx2 -fomit-frame-pointer -S NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.asm
-DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY
因此,整个ASM片段
.seh_proc Railgun_Nyotengu_XMM_YMM_ZMM
Railgun_Nyotengu_XMM_YMM_ZMM:
pushq %r15
.seh_pushreg %r15
pushq %r14
.seh_pushreg %r14
pushq %r13
.seh_pushreg %r13
pushq %r12
.seh_pushreg %r12
pushq %rbp
.seh_pushreg %rbp
pushq %rdi
.seh_pushreg %rdi
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
subq $120, %rsp
.seh_stackalloc 120
.seh_endprologue
movl %r8d, %r13d
movq %rcx, %rsi
movq %rdx, %r10
leaq (%rcx,%r13), %r15
cmpl %r9d, %r8d
jb .L293
movl %r9d, %ebp
cmpl $3, %r9d
jbe .L321
movl (%rdx), %eax
movl %eax, 80(%rsp)
cmpl $127, %r8d
ja .L269
movl %eax, %edi
movl %eax, %r13d
movl %eax, %r14d
movzbl %al, %esi
sall $24, %edi
leal -1(%r9), %r9d
sall $8, %r13d
leaq (%rcx,%rbp), %rdx
sall $16, %r14d
movl %edi, 72(%rsp)
movslq %r9d, %rdi
movzwl %r13w, %r13d
andl $16711680, %r14d
negq %rbp
movq %rdi, 80(%rsp)
movl %eax, %r12d
movl %esi, 32(%rsp)
jmp .L275
.p2align 4,,10
.L270:
movl %eax, %r8d
movl $1, %ecx
andl $65280, %r8d
cmpl %r13d, %r8d
je .L274
movl %eax, %r8d
movl $2, %ecx
andl $16711680, %r8d
cmpl %r14d, %r8d
je .L274
xorl %ecx, %ecx
andl $-16777216, %eax
cmpl 72(%rsp), %eax
setne %cl
addq $3, %rcx
.L274:
addq %rcx, %rdx
cmpq %rdx, %r15
jb .L293
.L275:
leaq (%rdx,%rbp), %rbx
movl (%rbx), %eax
cmpl %r12d, %eax
jne .L270
movq %rdx, %r8
movl %r9d, %eax
subq 80(%rsp), %r8
movl $1, %ecx
xorl %edi, %edi
jmp .L271
.p2align 4,,10
.L273:
leal (%rax,%rdi), %esi
cmpl %esi, %r9d
jne .L272
cmpl %r11d, 32(%rsp)
setne %r11b
movzbl %r11b, %r11d
addl %r11d, %edi
.L272:
addl $1, %ecx
addq $1, %r8
subl $1, %eax
je .L319
.L271:
movl %ecx, %r11d
movsbl (%r10,%r11), %r11d
cmpb (%r8), %r11b
je .L273
leal 1(%rdi), %ecx
addq %rcx, %rdx
cmpq %rdx, %r15
jnb .L275
.p2align 4,,10
.L293:
xorl %ebx, %ebx
.L319:
movq %rbx, %rax
addq $120, %rsp
popq %rbx
popq %rsi
popq %rdi
popq %rbp
popq %r12
popq %r13
popq %r14
popq %r15
ret
.p2align 4,,10
.L321:
leal -1(%r9), %eax
movsbl (%rdx), %ebx
addq %rbp, %rsi
movsbl (%rdx,%rax), %eax
sall $8, %ebx
addl %eax, %ebx
movzbl %bh, %edi
cmpl $3, %r9d
je .L265
.p2align 4,,10
.L261:
movsbl -2(%rsi), %eax
movsbl -1(%rsi), %ecx
sall $8, %eax
addl %ecx, %eax
cmpl %eax, %ebx
je .L322
leaq 1(%rsi), %rax
cmpb %dil, %cl
je .L267
addq $2, %rsi
cmpq %rsi, %r15
jnb .L261
jmp .L293
.p2align 4,,10
.L324:
cmpb %cl, 1(%r10)
je .L323
leaq 1(%rsi), %rax
cmpb %cl, %dil
je .L286
.L325:
movq %rsi, %rax
xorl %esi, %esi
cmpb %dil, %dl
setne %sil
leaq 2(%rsi,%rax), %rsi
.L263:
cmpq %rsi, %r15
jb .L293
.L265:
movsbl -3(%rsi), %eax
movsbl -1(%rsi), %r8d
movzbl -2(%rsi), %ecx
sall $8, %eax
movl %r8d, %edx
addl %r8d, %eax
cmpl %eax, %ebx
je .L324
leaq 1(%rsi), %rax
cmpb %cl, %dil
jne .L325
.L286:
movq %rax, %rsi
jmp .L263
.p2align 4,,10
.L267:
cmpq %rax, %r15
jb .L293
movq %rax, %rsi
jmp .L261
.p2align 4,,10
.L269:
shrl $5, %r8d
leaq 32(%rcx), %rbx
vpbroadcastd 80(%rsp), %ymm4
movq %r15, 104(%rsp)
leal -1(%r8), %r14d
movq %rbx, 88(%rsp)
leaq 4096(%rcx), %rax
movl %r9d, 216(%rsp)
salq $5, %r14
movq %rax, %r12
movq %r14, 72(%rsp)
xorl %r14d, %r14d
movq %r13, 96(%rsp)
movq %rdx, %r13
vmovdqu %ymm4, 32(%rsp)
.L278:
vmovdqu 32(%rsp), %ymm3
prefetcht0 (%r12)
vpcmpeqd 1(%rsi,%r14), %ymm3, %ymm0
vpcmpeqd (%rsi,%r14), %ymm3, %ymm1
vpcmpeqd 3(%rsi,%r14), %ymm3, %ymm2
vpor %ymm0, %ymm1, %ymm1
vpcmpeqd 2(%rsi,%r14), %ymm3, %ymm0
vpor %ymm0, %ymm2, %ymm0
vpor %ymm0, %ymm1, %ymm0
vmovmskps %ymm0, %eax
testl %eax, %eax
jne .L326
.L276:
addq $32, %r14
cmpq 72(%rsp), %r14
jb .L278
movq %r13, %r10
movq 96(%rsp), %r13
movq 104(%rsp), %r15
movl 216(%rsp), %r12d
subq %r14, %r13
cmpq %r13, %rbp
ja .L294
movl 80(%rsp), %ebx
leal -1(%r12), %r9d
leaq (%r14,%rbp), %rax
negq %rbp
addq %rsi, %rax
movzbl %bl, %edi
movl %ebx, %edx
movl %ebx, %r13d
movl %ebx, %r12d
movl %edi, 32(%rsp)
sall $8, %edx
movl %ebx, %edi
sall $16, %r13d
sall $24, %edi
movzwl %dx, %r14d
andl $16711680, %r13d
movl %edi, 72(%rsp)
movslq %r9d, %rdi
movq %rdi, 80(%rsp)
jmp .L284
.p2align 4,,10
.L279:
movl %edx, %r8d
movl $1, %ecx
andl $65280, %r8d
cmpl %r14d, %r8d
je .L283
movl %edx, %r8d
movl $2, %ecx
andl $16711680, %r8d
cmpl %r13d, %r8d
je .L283
xorl %ecx, %ecx
andl $-16777216, %edx
cmpl 72(%rsp), %edx
setne %cl
addq $3, %rcx
.L283:
addq %rcx, %rax
cmpq %rax, %r15
jb .L294
.L284:
leaq (%rax,%rbp), %rbx
movl (%rbx), %edx
cmpl %r12d, %edx
jne .L279
movq %rax, %r8
movl %r9d, %edx
subq 80(%rsp), %r8
movl $1, %ecx
xorl %edi, %edi
.p2align 4,,10
.L280:
movl %ecx, %r11d
movsbl (%r10,%r11), %r11d
cmpb (%r8), %r11b
jne .L327
leal (%rdx,%rdi), %esi
cmpl %esi, %r9d
jne .L281
cmpl %r11d, 32(%rsp)
setne %r11b
movzbl %r11b, %r11d
addl %r11d, %edi
.L281:
addl $1, %ecx
addq $1, %r8
subl $1, %edx
jne .L280
vzeroupper
jmp .L319
.p2align 4,,10
.L326:
movq 88(%rsp), %rax
leaq (%r14,%rsi), %r15
leaq (%rax,%r14), %rdi
vzeroupper
jmp .L277
.p2align 4,,10
.L328:
addq $1, %r15
cmpq %r15, %rdi
je .L276
.L277:
movq %rbp, %r8
movq %r13, %rdx
movq %r15, %rcx
movq %r15, %rbx
call memcmp
testl %eax, %eax
jne .L328
jmp .L319
.p2align 4,,10
.L322:
leaq -2(%rsi), %rbx
jmp .L319
.p2align 4,,10
.L327:
leal 1(%rdi), %ecx
jmp .L283
.p2align 4,,10
.L323:
leaq -3(%rsi), %rbx
jmp .L319
.p2align 4,,10
.L294:
xorl %ebx, %ebx
vzeroupper
jmp .L319
.seh_endproc
我希望其他程序员能够进一步完善它,有一个玩具可以玩,可以很快地取得成果。
关注点
AVX512工具的性能还有待观察,如果一位成员可以在支持它的CPU上运行基准测试,请这样做并在下面的评论区分享输出。
历史
- 2020年10月17日:完成初始版本