65.9K
CodeProject 正在变化。 阅读更多。
Home

最快的全文本(矢量和标量)精确搜索器?!

starIconstarIconstarIconstarIcon
emptyStarIcon
starIcon

4.33/5 (3投票s)

2020年10月21日

公共领域

2分钟阅读

viewsIcon

20653

downloadIcon

246

一个报告精确匹配数量的全文命令行工具,速度非常快!

引言

现在是向量时代了。
现在是时候使用向量来基准测试在干草堆中找针这一基本任务了。

实现了一个简单的向量memmem()练习,很高兴与大家分享。展示了用C编写的(标量和向量,并排展示)超快速memmem()函数。

这里附带的控制台工具只是对这些函数的包装,本身就是一个框架,NyoTengu 可以很好地作为进一步开发搜索工具的蓝图(起点),具有更多功能。

背景

这个主题在书籍中随处可见,但精确搜索却没有得到优化,这是一个令人难过的现实。我不了解任何在现代机器上以10GB/s的速度运行的函数(更不用说工具了),请看下面的我的主流笔记本电脑如何执行这些共享的C练习 - 使用GCC和ICL编译。

全文精确对决 — 最快的GREP vs NyoTengu

测试机器:运行Windows 10的笔记本电脑,i5-7200U @2.5GHz,8GB DDR4 2133MHz

Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle:   "Linus Torvalds"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle1.txt                 | 11,714,194,285 B/s; 0.084 seconds |     602 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle1.txt                    | 11,441,771,162 B/s; 0.086 seconds |     602 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "Linus Torvalds" li...tar |               N.A.; 0.256 seconds |     602 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle:   "5.8.5"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle2.txt                 |  9,553,323,495 B/s; 0.103 seconds |   74028 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle2.txt                    |  9,461,464,615 B/s; 0.104 seconds |   74028 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "5.8.5" li...tar          |               N.A.; 0.212 seconds |   74028 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: enwiki-20201001-all-titles (1,193,068,592 bytes)
Needle:   "grep"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe enwiki-20201001-all-titles grep.txt         | 12,051,197,898 B/s; 0.099 seconds |     173 |
| NyoTengu_YMM_GCC730_64bit.exe enwiki-20201001-all-titles grep.txt            | 11,583,190,213 B/s; 0.103 seconds |     173 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "grep" enwiki-2...titles  |               N.A.; 0.378 seconds |     173 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: gcc-7.3.0.tar (615,567,360 bytes)
Needle:   "SIMD"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe gcc-7.3.0.tar SIMD.txt                      | 11,837,833,846 B/s; 0.052 seconds |    3606 |
| NyoTengu_YMM_GCC730_64bit.exe gcc-7.3.0.tar SIMD.txt                         | 11,192,133,818 B/s; 0.055 seconds |    3606 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "SIMD" gcc-7.3.0.tar      |               N.A.; 0.121 seconds |    3606 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle:   "ACGT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 4.txt               |  3,508,357,471 B/s; 0.956 seconds | 1607850 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 4.txt                  |  4,651,858,173 B/s; 0.721 seconds | 1607850 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "ACGT" "SCB_DNA...-28"    |               N.A.; 5.701 seconds | 1607850 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle:   "AACCGGTT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 8.txt               |  2,187,860,236 B/s; 1.533 seconds |    2723 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 8.txt                  |  1,968,303,839 B/s; 1.704 seconds |    2723 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "AACCGGTT" "SCB_...-28"   |               N.A.; 5.320 seconds |    2723 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

为了重现基准测试,这里是软件包(除了四个语料库)

四个语料库可以在软件包之外下载

  1. https://linuxkernel.org.cn/pub/linux/kernel/v5.x/linux-5.8.5.tar.xz
  2. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
  3. https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-all-titles.gz
  4. https://ftp.gnu.org/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz

注意:最快的GREP - RIPGREP (https://github.com/BurntSushi/ripgrep) 在GitHub上获得了22,000颗星,据我所知,仅次于拥有99,000颗星的Linux内核!

结论:当前的NyoTengu 比当前的RIPGREP 快2倍到5倍,这是256位世界,谁知道512位世界会怎样……

我打算用Railgun_Trolldom_64的比较器替换memcmp()实例(不使用外部代码)……

我相信,Railgun_Nyotengu_XMM_YMM_ZMM() 目前是可用的最快的memmem(),如果有人能证明我错了,我将非常高兴。 :P

Using the Code

附带的NyoTengu.c 包含3022行代码,可以按以下方式编译

在Windows上,使用ICL

icl /O3 Nyotengu.c -DXMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_XMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DYMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_YMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DZMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_ZMM_IntelV150_32bit /FAcs

在*nix上,使用GCC

gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC_64bit.elf 
        -DYMMtengu -D_POSIX_ENVIRONMENT_

以下片段是由GCC 7.3.0以这种方式生成的

gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.exe 
        -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY
gcc -O3 -mavx2 -fomit-frame-pointer -S NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.asm 
        -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY

因此,整个ASM片段

    .seh_proc    Railgun_Nyotengu_XMM_YMM_ZMM
Railgun_Nyotengu_XMM_YMM_ZMM:
    pushq    %r15
    .seh_pushreg    %r15
    pushq    %r14
    .seh_pushreg    %r14
    pushq    %r13
    .seh_pushreg    %r13
    pushq    %r12
    .seh_pushreg    %r12
    pushq    %rbp
    .seh_pushreg    %rbp
    pushq    %rdi
    .seh_pushreg    %rdi
    pushq    %rsi
    .seh_pushreg    %rsi
    pushq    %rbx
    .seh_pushreg    %rbx
    subq    $120, %rsp
    .seh_stackalloc    120
    .seh_endprologue
    movl    %r8d, %r13d
    movq    %rcx, %rsi
    movq    %rdx, %r10
    leaq    (%rcx,%r13), %r15
    cmpl    %r9d, %r8d
    jb    .L293
    movl    %r9d, %ebp
    cmpl    $3, %r9d
    jbe    .L321
    movl    (%rdx), %eax
    movl    %eax, 80(%rsp)
    cmpl    $127, %r8d
    ja    .L269
    movl    %eax, %edi
    movl    %eax, %r13d
    movl    %eax, %r14d
    movzbl    %al, %esi
    sall    $24, %edi
    leal    -1(%r9), %r9d
    sall    $8, %r13d
    leaq    (%rcx,%rbp), %rdx
    sall    $16, %r14d
    movl    %edi, 72(%rsp)
    movslq    %r9d, %rdi
    movzwl    %r13w, %r13d
    andl    $16711680, %r14d
    negq    %rbp
    movq    %rdi, 80(%rsp)
    movl    %eax, %r12d
    movl    %esi, 32(%rsp)
    jmp    .L275
    .p2align 4,,10
.L270:
    movl    %eax, %r8d
    movl    $1, %ecx
    andl    $65280, %r8d
    cmpl    %r13d, %r8d
    je    .L274
    movl    %eax, %r8d
    movl    $2, %ecx
    andl    $16711680, %r8d
    cmpl    %r14d, %r8d
    je    .L274
    xorl    %ecx, %ecx
    andl    $-16777216, %eax
    cmpl    72(%rsp), %eax
    setne    %cl
    addq    $3, %rcx
.L274:
    addq    %rcx, %rdx
    cmpq    %rdx, %r15
    jb    .L293
.L275:
    leaq    (%rdx,%rbp), %rbx
    movl    (%rbx), %eax
    cmpl    %r12d, %eax
    jne    .L270
    movq    %rdx, %r8
    movl    %r9d, %eax
    subq    80(%rsp), %r8
    movl    $1, %ecx
    xorl    %edi, %edi
    jmp    .L271
    .p2align 4,,10
.L273:
    leal    (%rax,%rdi), %esi
    cmpl    %esi, %r9d
    jne    .L272
    cmpl    %r11d, 32(%rsp)
    setne    %r11b
    movzbl    %r11b, %r11d
    addl    %r11d, %edi
.L272:
    addl    $1, %ecx
    addq    $1, %r8
    subl    $1, %eax
    je    .L319
.L271:
    movl    %ecx, %r11d
    movsbl    (%r10,%r11), %r11d
    cmpb    (%r8), %r11b
    je    .L273
    leal    1(%rdi), %ecx
    addq    %rcx, %rdx
    cmpq    %rdx, %r15
    jnb    .L275
    .p2align 4,,10
.L293:
    xorl    %ebx, %ebx
.L319:
    movq    %rbx, %rax
    addq    $120, %rsp
    popq    %rbx
    popq    %rsi
    popq    %rdi
    popq    %rbp
    popq    %r12
    popq    %r13
    popq    %r14
    popq    %r15
    ret
    .p2align 4,,10
.L321:
    leal    -1(%r9), %eax
    movsbl    (%rdx), %ebx
    addq    %rbp, %rsi
    movsbl    (%rdx,%rax), %eax
    sall    $8, %ebx
    addl    %eax, %ebx
    movzbl    %bh, %edi
    cmpl    $3, %r9d
    je    .L265
    .p2align 4,,10
.L261:
    movsbl    -2(%rsi), %eax
    movsbl    -1(%rsi), %ecx
    sall    $8, %eax
    addl    %ecx, %eax
    cmpl    %eax, %ebx
    je    .L322
    leaq    1(%rsi), %rax
    cmpb    %dil, %cl
    je    .L267
    addq    $2, %rsi
    cmpq    %rsi, %r15
    jnb    .L261
    jmp    .L293
    .p2align 4,,10
.L324:
    cmpb    %cl, 1(%r10)
    je    .L323
    leaq    1(%rsi), %rax
    cmpb    %cl, %dil
    je    .L286
.L325:
    movq    %rsi, %rax
    xorl    %esi, %esi
    cmpb    %dil, %dl
    setne    %sil
    leaq    2(%rsi,%rax), %rsi
.L263:
    cmpq    %rsi, %r15
    jb    .L293
.L265:
    movsbl    -3(%rsi), %eax
    movsbl    -1(%rsi), %r8d
    movzbl    -2(%rsi), %ecx
    sall    $8, %eax
    movl    %r8d, %edx
    addl    %r8d, %eax
    cmpl    %eax, %ebx
    je    .L324
    leaq    1(%rsi), %rax
    cmpb    %cl, %dil
    jne    .L325
.L286:
    movq    %rax, %rsi
    jmp    .L263
    .p2align 4,,10
.L267:
    cmpq    %rax, %r15
    jb    .L293
    movq    %rax, %rsi
    jmp    .L261
    .p2align 4,,10
.L269:
    shrl    $5, %r8d
    leaq    32(%rcx), %rbx
    vpbroadcastd    80(%rsp), %ymm4
    movq    %r15, 104(%rsp)
    leal    -1(%r8), %r14d
    movq    %rbx, 88(%rsp)
    leaq    4096(%rcx), %rax
    movl    %r9d, 216(%rsp)
    salq    $5, %r14
    movq    %rax, %r12
    movq    %r14, 72(%rsp)
    xorl    %r14d, %r14d
    movq    %r13, 96(%rsp)
    movq    %rdx, %r13
    vmovdqu    %ymm4, 32(%rsp)
.L278:
    vmovdqu    32(%rsp), %ymm3
    prefetcht0    (%r12)
    vpcmpeqd    1(%rsi,%r14), %ymm3, %ymm0
    vpcmpeqd    (%rsi,%r14), %ymm3, %ymm1
    vpcmpeqd    3(%rsi,%r14), %ymm3, %ymm2
    vpor    %ymm0, %ymm1, %ymm1
    vpcmpeqd    2(%rsi,%r14), %ymm3, %ymm0
    vpor    %ymm0, %ymm2, %ymm0
    vpor    %ymm0, %ymm1, %ymm0
    vmovmskps    %ymm0, %eax
    testl    %eax, %eax
    jne    .L326
.L276:
    addq    $32, %r14
    cmpq    72(%rsp), %r14
    jb    .L278
    movq    %r13, %r10
    movq    96(%rsp), %r13
    movq    104(%rsp), %r15
    movl    216(%rsp), %r12d
    subq    %r14, %r13
    cmpq    %r13, %rbp
    ja    .L294
    movl    80(%rsp), %ebx
    leal    -1(%r12), %r9d
    leaq    (%r14,%rbp), %rax
    negq    %rbp
    addq    %rsi, %rax
    movzbl    %bl, %edi
    movl    %ebx, %edx
    movl    %ebx, %r13d
    movl    %ebx, %r12d
    movl    %edi, 32(%rsp)
    sall    $8, %edx
    movl    %ebx, %edi
    sall    $16, %r13d
    sall    $24, %edi
    movzwl    %dx, %r14d
    andl    $16711680, %r13d
    movl    %edi, 72(%rsp)
    movslq    %r9d, %rdi
    movq    %rdi, 80(%rsp)
    jmp    .L284
    .p2align 4,,10
.L279:
    movl    %edx, %r8d
    movl    $1, %ecx
    andl    $65280, %r8d
    cmpl    %r14d, %r8d
    je    .L283
    movl    %edx, %r8d
    movl    $2, %ecx
    andl    $16711680, %r8d
    cmpl    %r13d, %r8d
    je    .L283
    xorl    %ecx, %ecx
    andl    $-16777216, %edx
    cmpl    72(%rsp), %edx
    setne    %cl
    addq    $3, %rcx
.L283:
    addq    %rcx, %rax
    cmpq    %rax, %r15
    jb    .L294
.L284:
    leaq    (%rax,%rbp), %rbx
    movl    (%rbx), %edx
    cmpl    %r12d, %edx
    jne    .L279
    movq    %rax, %r8
    movl    %r9d, %edx
    subq    80(%rsp), %r8
    movl    $1, %ecx
    xorl    %edi, %edi
    .p2align 4,,10
.L280:
    movl    %ecx, %r11d
    movsbl    (%r10,%r11), %r11d
    cmpb    (%r8), %r11b
    jne    .L327
    leal    (%rdx,%rdi), %esi
    cmpl    %esi, %r9d
    jne    .L281
    cmpl    %r11d, 32(%rsp)
    setne    %r11b
    movzbl    %r11b, %r11d
    addl    %r11d, %edi
.L281:
    addl    $1, %ecx
    addq    $1, %r8
    subl    $1, %edx
    jne    .L280
    vzeroupper
    jmp    .L319
    .p2align 4,,10
.L326:
    movq    88(%rsp), %rax
    leaq    (%r14,%rsi), %r15
    leaq    (%rax,%r14), %rdi
    vzeroupper
    jmp    .L277
    .p2align 4,,10
.L328:
    addq    $1, %r15
    cmpq    %r15, %rdi
    je    .L276
.L277:
    movq    %rbp, %r8
    movq    %r13, %rdx
    movq    %r15, %rcx
    movq    %r15, %rbx
    call    memcmp
    testl    %eax, %eax
    jne    .L328
    jmp    .L319
    .p2align 4,,10
.L322:
    leaq    -2(%rsi), %rbx
    jmp    .L319
    .p2align 4,,10
.L327:
    leal    1(%rdi), %ecx
    jmp    .L283
    .p2align 4,,10
.L323:
    leaq    -3(%rsi), %rbx
    jmp    .L319
    .p2align 4,,10
.L294:
    xorl    %ebx, %ebx
    vzeroupper
    jmp    .L319
    .seh_endproc

我希望其他程序员能够进一步完善它,有一个玩具可以玩,可以很快地取得成果。

关注点

AVX512工具的性能还有待观察,如果一位成员可以在支持它的CPU上运行基准测试,请这样做并在下面的评论区分享输出。

历史

  • 2020年10月17日:完成初始版本
© . All rights reserved.