
The fastest b3sum, implemented with io_uring: faster than cat to /dev/null


27 Jul 2023

MIT

79 min read


Source code and explanation of an io_uring-based b3sum implementation

Introduction

I wrote two versions of a program that computes the BLAKE3 hash of a file (source code here): a single-threaded C version and a multithreaded C++ version, both using io_uring. The single-threaded version is around 25% faster than the official Rust b3sum on my system, and **slightly faster than cat to /dev/null on my system**. It is also slightly faster than **fio**, which I used to measure my system's sequential read speed (see the **Supplementary Notes** for the arguments I passed to **fio** and other details). The single-threaded version hashes a 10GiB file in 2.899s, which works out to roughly 3533MiB/s, **about the same as the advertised read speed of my NVMe drive ("3500MB/s")**. This leads me to believe it is the fastest possible file hashing program on my system: a file hashing program cannot finish hashing a file before it has finished reading the last block of that file from disk. My multithreaded implementation is around 1% slower than my single-threaded implementation.

I provide informal proofs that both versions of my program are correct (i.e., always return the correct result) and never get stuck (i.e., always eventually finish). The informal proofs, along with an explanation of the code, are given in the **Code Explanation** section of this article.

My program is **extremely simple and generic**, and **can easily be modified** to suit your needs, as long as what you need is to **read a few blocks of a file at a time and process them while waiting for the next blocks to arrive**. I think this is an extremely common use case, so I made sure it is **very easy** to adapt: if you want to do something different with each block of the file, you only need to replace the call to the hash function with a call to your own data-processing function, as sketched below.
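For example, here is a hedged sketch of what that swap could look like in the single-threaded version shown later (resume_consumer is the consumer loop in that program; process_block is a name invented for this illustration, not part of the actual code):

static void process_block(const unsigned char* buf, size_t nbytes, off_t offset_in_file)
{
    // Hypothetical user-supplied function: do whatever you want with one block of the file.
}

static void resume_consumer()
{
    while (g_ringbuf[consumer_head].state == AVAILABLE_FOR_CONSUMPTION){
        // The original program calls blake3_hasher_update here; to adapt it,
        // call your own data-processing function instead:
        process_block(g_ringbuf[consumer_head].buf_addr,
                      g_ringbuf[consumer_head].nbytes_expected,
                      g_ringbuf[consumer_head].offset_of_block_in_file);
        g_ringbuf[consumer_head].state = ALREADY_CONSUMED;
        increment_buffer_index(&consumer_head);
    }
}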

Benchmarks

For these tests I used the same 1 GiB (or 10 GiB) input file and dropped the page cache before every test, ensuring that the programs always read from disk. Every command was run 10 times, and I used the "real" figure from the time command to compute the statistics. I ran these commands on a Debian 12 system (uname -r returns "6.1.0-9-amd64") with an ext4 filesystem, no disk encryption, and no LVM.

Command Min Median Max
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 1 0.404s 0.4105s 0.416s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 2 0.474s 0.4755s 0.481s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 3 0.44s 0.4415s 0.451s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 4 0.443s 0.4475s 0.452s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 5 0.454s 0.4585s 0.462s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 6 0.456s 0.4605s 0.463s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 7 0.461s 0.4635s 0.468s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 8 0.461s 0.464s 0.47s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --no-mmap 0.381s 0.386s 0.394s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./b3sum_linux 1GB.txt --no-mmap 0.379s 0.39s 0.404s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt | ./example 0.364s 0.3745s 0.381s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt > /dev/null 0.302s 0.302s 0.303s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K | ./example 0.338s 0.341s 0.348s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K of=/dev/null 0.303s 0.306s 0.308s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M | ./example 0.538s 0.5415s 0.544s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M of=/dev/null 0.302s 0.303s 0.304s
fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting 0.302s 0.3025s 0.303s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 512 2 1 0 2 0 0 0.301s 0.301s 0.302s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 512 2 1 0 2 0 0 0.303s 0.304s 0.305s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 128 20 0 0 8 0 0 0.375s 0.378s 0.384s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 128 20 0 0 8 0 0 0.304s 0.305s 0.307s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time xxhsum 1GB.txt 0.318s 0.3205s 0.325s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 10GB.txt > /dev/null 2.903s 2.904s 2.908s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 10GB.txt 512 4 1 0 4 0 0 2.898s 2.899s 2.903s

The liburing_b3sum_singlethread and liburing_b3sum_multithread entries in the table above are my own io_uring-based b3sum implementations (more details below), and I have verified that **my b3sum implementations always produce the same BLAKE3 hash output as the official b3sum implementation**. The 1GB.txt file was generated with this command:

$ dd if=/dev/urandom of=1GB.txt bs=1G count=1

I installed b3sum with this command:

$ cargo install b3sum
$ b3sum --version 
b3sum 1.4.1

I downloaded the b3sum_linux program from the BLAKE3 GitHub Releases page (it was the latest Linux binary):

$ ./b3sum_linux --version 
b3sum 1.4.1

I compiled the example program from the example.c file in the BLAKE3 C repository, following the instructions in that repository:

$ gcc -O3 -o example example.c blake3.c blake3_dispatch.c blake3_portable.c \
 blake3_sse2_x86-64_unix.S blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S \
 blake3_avx512_x86-64_unix.S

I installed xxhsum with this command:

$ apt install xxhash
$ xxhsum --version  
xxhsum 0.8.1 by Yann Collet  
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0

In short

As the results above show, my single-threaded implementation is slightly faster than my multithreaded implementation.

If the single-threaded version is faster, why mention the multithreaded version at all? Because, as the table above shows, the single-threaded version needs O_DIRECT to be fast (the flag controlling whether O_DIRECT is used is the third number after the file name in the command line arguments). **The multithreaded version is fast even without O_DIRECT** (as the table shows, the multithreaded version hashes a 1GiB file in 0.304s with O_DIRECT and 0.305s without it). Why would anyone care whether the program uses O_DIRECT? Because O_DIRECT does not put the file into the page cache (it doesn't evict the file from the page cache either; it simply bypasses it). That is useful when you just want to process (e.g., hash) a lot of big files as fast as possible, touching each file only once, for example to produce a list of hashes of every file on your system. It is not so good when you have a single file that you want to hash with this program and then immediately process with another program (e.g., xxhsum). That is why I mention my multithreaded version: unlike the single-threaded version, it is fast even without O_DIRECT.

One major caveat is that my programs currently use the C version of the BLAKE3 library. There are two versions of BLAKE3, the Rust version and the C version, and only the Rust version supports multithreading; the C version currently does not:

**Unlike the Rust implementation, the C implementation doesn't currently support multithreading.** A future version of this library could add support by taking an optional dependency on OpenMP or similar. Alternatively, we could expose a lower-level API to allow callers to implement concurrency themselves. The former would be more convenient and less error-prone, but the latter would give callers the most possible control. The best choice depends on the specific use case, so if you have a use case for multithreaded hashing in C, please file a GitHub issue and let us know.

Multithreading can be beneficial when your IO is faster than your single-core hashing speed. For example, if you have a fast NVMe drive, say one that can read at 7GB/s, while a single core of your CPU can only compute the BLAKE3 hash at around 4GB/s, then you could benefit from multithreading, which lets multiple CPU cores hash the file simultaneously. So the most obvious next step is to add io_uring support to the Rust b3sum implementation, so that Linux systems can benefit from faster IO. Another option is to add multithreading support to the C BLAKE3 implementation.

(Actually, having said all that, I'm not sure multithreading is always faster. On most storage devices, sequential reads are much faster than random reads, and the current multithreaded Rust implementation appears to read all over the file rather than reading it sequentially. So I think there may be situations where random reads slow down IO so much that it is actually slower than just reading the file sequentially and hashing on a single core. But that obviously depends on the hardware you have. The ideal would be multicore hashing with sequential reads, but I don't know whether the BLAKE3 algorithm allows that. If not, I suppose you would have to hash each block of the file separately, giving you a list of hashes, which you could then store, or, if you want to save space, store a hash of that list. That was actually my original plan, until I discovered that single-core BLAKE3 is faster than the file read speed on my system.)

So I am quite sure that my program runs at the theoretical speed limit of my system, i.e., that it is the fastest file hashing program actually possible on my system and that **there is no further room for optimization**, and this fact pleases me more than anything else. I wrote this article because I wanted to explain my program and how I got it to this point, and because I find it surprising that (as far as I can tell) nobody else has used io_uring to build a file hashing program: it is simply a perfect use case for io_uring, the lowest-hanging fruit I could think of.

Another caveat is that io_uring is Linux-only, so this program does not work on other operating systems.

Code Explanation

I will first explain the single-threaded version, then the multithreaded version.

So, here is the single-threaded version:

/* SPDX-License-Identifier: MIT */
/*
 * Compile with: gcc -Wall -O3 -D_GNU_SOURCE liburing_b3sum_singlethread.c -luring libblake3.a -o liburing_b3sum_singlethread
 * For an explanation of how this code works, see my article/blog post: https://1f604.com/b3sum
 *
 * This program is a modified version of the liburing cp program from Shuveb Hussain's io_uring tutorial.
 * Original source code here: https://github.com/axboe/liburing/blob/master/examples/io_uring-cp.c
 * The modifications were made by 1f604.
 *
 * The official io_uring documentation can be seen here:
 *    - https://kernel.dk/io_uring.pdf
 *    - https://kernel-recipes.org/en/2022/wp-content/uploads/2022/06/axboe-kr2022-1.pdf
 *
 * Acronyms: SQ = submission queue, SQE = submission queue entry, CQ = completion queue, CQE = completion queue event
 */
#include "blake3.h"
#include "liburing.h"

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/ioctl.h>

/* Constants */
static const int ALIGNMENT = 4 * 1024; // Needed only because O_DIRECT requires aligned memory

/* ===============================================
 * ========== Start of global variables ==========
 * ===============================================
 * Declared static because they are only supposed to be visible within this .c file.
 *
 * --- Command line options ---
 * The following variables are set by the user from the command line.
 */
static int g_queuedepth;              /* This variable is not really the queue depth, but is more accurately described as
                                       * "the limit on the number of incomplete requests". Allow me to explain.
                                       * io_uring allows you to have more requests in-flight than the size of your submission
                                       * (and completion) queues. How is this possible? Well, when you call io_uring_submit,
                                       * normally it will submit ALL of the requests in the submission queue, which means that
                                       * when the call returns, your submission queue is now empty, even though the requests
                                       * are still "in-flight" and haven't been completed yet!
                                       * In earlier kernels, you could overflow the completion queue because of this.
                                       * Thus it says in the official io_uring documentation (https://kernel.dk/io_uring.pdf):
                                       *     Since the sqe lifetime is only that of the actual submission of it, it's possible
                                       *     for the application to drive a higher pending request count than the SQ ring size
                                       *     would indicate. The application must take care not to do so, or it could risk
                                       *     overflowing the CQ ring.
                                       * That is to say, the official documentation recommended that applications should ensure
                                       * that the number of in-flight requests does not exceed the size of the submission queue.
                                       * This g_queuedepth variable is therefore a limit on the total number of "incomplete"
                                       * requests, which is the number of requests on the submission queue plus the number of
                                       * requests that are still "in flight".
                                       * See num_unfinished_requests for details on how this is implemented. */
static int g_use_o_direct;            // whether to use O_DIRECT
static int g_process_in_inner_loop;   // whether to process the data inside the inner loop

static int g_use_iosqe_io_drain;      // whether to issue requests with the IOSQE_IO_DRAIN flag
static int g_use_iosqe_io_link;       // whether to issue requests with the IOSQE_IO_LINK flag
//static int g_use_ioring_setup_iopoll; // when I enable IORING_SETUP_IOPOLL, on my current system,
					// it turns my process into an unkillable zombie that uses 100% CPU that never terminates.
                                        // when I was using encrypted LVM, it just gave me error: Operation not supported.
                                        // I decided to not allow users to enable that option because I didn't want them
                                        // to accidentally launch an unkillable never-ending zombie process that uses 100% CPU.
                                        // I observed this behavior in fio too when I enabled --hipri on fio, it also turned into
                                        // an unkillable never-ending zombie process that uses 100% CPU.

static size_t g_blocksize;            // This is the size of each buffer in the ringbuf, in bytes.
                                      // It is also the size of each read from the file.
static size_t g_numbufs;              // This is the number of buffers in the ringbuf.

/* --- Non-command line argument global variables --- */
blake3_hasher g_hasher;
static int g_filedescriptor;          // This is the file descriptor of the file we're hashing.
static size_t g_filesize;             // This will be filled in by the function that gets the file size.
static size_t g_num_blocks_in_file;   // The number of blocks in the file, where each block is g_blocksize bytes.
                                      // This will be calculated by a ceiling division of filesize by blocksize.
static size_t g_size_of_last_block;   // The size of the last block in the file. See calculate_numblocks_and_size_of_last_block.
static int producer_head = 0;         // Position of the "producer head". see explanation in my article/blog post
static int consumer_head = 0;         // Position of the "consumer head". see explanation in my article/blog post

enum ItemState {                      // describes the state of a cell in the ring buffer array ringbuf, see my article/blog post for detailed explanation
    AVAILABLE_FOR_CONSUMPTION,        // consumption and processing in the context of this program refers to hashing
    ALREADY_CONSUMED,
    REQUESTED_BUT_NOT_YET_COMPLETED,
};

struct my_custom_data {               // This is the user_data associated with read requests, which are placed on the submission ring.
                                      // In applications using io_uring, the user_data struct is generally used to identify which request a
                                      // completion is for. In the context of this program, this structure is used both to identify which
                                      // block of the file the read syscall had just read, as well as for the producer and consumer to
                                      // communicate with each other, since it holds the cell state.
                                      // This can be thought of as a "cell" in the ring buffer, since it holds the state of the cell as well
                                      // as a pointer to the data (i.e. a block read from the file) that is "in" the cell.
                                      // Note that according to the official io_uring documentation, the user_data struct only needs to be
                                      // valid until the submit is done, not until completion. Basically, when you submit, the kernel makes
                                      // a copy of user_data and returns it to you with the CQE (completion queue entry).
    unsigned char* buf_addr;          // Pointer to the buffer where the read syscall is to place the bytes from the file into.
    size_t nbytes_expected;           // The number of bytes we expect the read syscall to return. This can be smaller than the size of the buffer
                                      // because the last block of the file can be smaller than the other blocks.
    //size_t nbytes_to_request;       // The number of bytes to request. This is always g_blocksize. I made this decision because O_DIRECT requires
                                      // nbytes to be a multiple of filesystem block size, and it's simpler to always just request g_blocksize.
    off_t offset_of_block_in_file;    // The offset of the block in the file that we want the read syscall to place into the memory location
                                      // pointed to by buf_addr.
    enum ItemState state;             // Describes whether the item is available to be hashed, already hashed, or requested but not yet available for hashing.
    int ringbuf_index;                // The slot in g_ringbuf where this "cell" belongs.
                                      // I added this because once we submit a request on submission queue, we lose track of it.
                                      // When we get back a completion, we need an identifier to know which request the completion is for.
                                      // Alternatively, we could use something to map the buf_addr to the ringbuf_index, but this is just simpler.
};

struct my_custom_data* g_ringbuf;     // This is a pointer to an array of my_custom_data structs. These my_custom_data structs can be thought of as the
                                      // "cells" in the ring buffer (each struct contains the cell state), thus the array that this points to can be
                                      // thought of as the "ring buffer" referred to in my article/blog post, so read that to understand how this is used.
                                      // See the allocate_ringbuf function for details on how and where the memory for the ring buffer is allocated.

/* ===============================================
 * =========== End of global variables ===========
 * ===============================================*/

static int setup_io_uring_context(unsigned entries, struct io_uring *ring)
{
    int rc;
    int flags = 0;
    rc = io_uring_queue_init(entries, ring, flags);
    if (rc < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-rc));
        return -1;
    }
    return 0;
}

static int get_file_size(int fd, size_t *size)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    if (S_ISREG(st.st_mode)) {
        *size = st.st_size;
        return 0;
    } else if (S_ISBLK(st.st_mode)) {
        unsigned long long bytes;
        if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
            return -1;
        *size = bytes;
        return 0;
    }
    return -1;
}

static void add_read_request_to_submission_queue(struct io_uring *ring, size_t expected_return_size, off_t fileoffset_to_request)
{
    assert(fileoffset_to_request % g_blocksize == 0);
    int block_number = fileoffset_to_request / g_blocksize; // the number of the block in the file
    /*  We do a modulo to map the file block number to the index in the ringbuf
        e.g. if ring buf_addr has 4 slots, then
            file block 0 -> ringbuf index 0
            file block 1 -> ringbuf index 1
            file block 2 -> ringbuf index 2
            file block 3 -> ringbuf index 3
            file block 4 -> ringbuf index 0
            file block 5 -> ringbuf index 1
            file block 6 -> ringbuf index 2
        And so on.
    */
    int ringbuf_idx = block_number % g_numbufs;
    struct my_custom_data* my_data = &g_ringbuf[ringbuf_idx];
    assert(my_data->ringbuf_index == ringbuf_idx); // The ringbuf_index of a my_custom_data struct should never change.

    my_data->offset_of_block_in_file = fileoffset_to_request;
    assert (my_data->buf_addr); // We don't need to change my_data->buf_addr since we set it to point into the backing buffer at the start of the program.
    my_data->nbytes_expected = expected_return_size;
    my_data->state = REQUESTED_BUT_NOT_YET_COMPLETED; /* At this point:
                                                       *     1. The producer is about to send it off in a request.
                                                       *     2. The consumer shouldn't be trying to read this buffer at this point.
                                                       * So it is okay to set the state to this here.
                                                       */

    struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        puts("ERROR: FAILED TO GET SQE");
        exit(1);
    }

    io_uring_prep_read(sqe, g_filedescriptor, my_data->buf_addr, g_blocksize, fileoffset_to_request);
    // io_uring_prep_read sets sqe->flags to 0, so we need to set the flags AFTER calling it.

    if (g_use_iosqe_io_drain)
        sqe->flags |= IOSQE_IO_DRAIN;
    if (g_use_iosqe_io_link)
        sqe->flags |= IOSQE_IO_LINK;

    io_uring_sqe_set_data(sqe, my_data);
}

static void increment_buffer_index(int* head) // moves the producer or consumer head forward by one position
{
    *head = (*head + 1) % g_numbufs; // wrap around when we reach the end of the ringbuf.
}


static void resume_consumer() // Conceptually, this resumes the consumer "thread".
{                             // As this program is single-threaded, we can think of it as cooperative multitasking.
    while (g_ringbuf[consumer_head].state == AVAILABLE_FOR_CONSUMPTION){
        // Consume the item.
        // The producer has already checked that nbytes_expected is the same as the amount of bytes actually returned.
        // If the read syscall returned something different to nbytes_expected then the program would have terminated with an error message.
        // Therefore it is okay to assume here that nbytes_expected is the same as the amount of actual data in the buffer.
        blake3_hasher_update(&g_hasher, g_ringbuf[consumer_head].buf_addr, g_ringbuf[consumer_head].nbytes_expected);

        // We have finished consuming the item, so mark it as consumed and move the consumer head to point to the next cell in the ringbuffer.
        g_ringbuf[consumer_head].state = ALREADY_CONSUMED;
        increment_buffer_index(&consumer_head);
    }
}


static void producer_thread()
{
    int rc;
    unsigned long num_blocks_left_to_request = g_num_blocks_in_file;
    unsigned long num_blocks_left_to_receive = g_num_blocks_in_file;
    unsigned long num_unfinished_requests = 0;
    /* A brief note on how the num_unfinished_requests variable is used:
     * As mentioned earlier, in io_uring it is possible to have more requests in-flight than the
     * size of the completion ring. In earlier kernels this could cause the completion queue to overflow.
     * In later kernels there was an option added (IORING_FEAT_NODROP) which, when enabled, means that
     * if a completion event occurs and the completion queue is full, then the kernel will internally
     * store the event until the completion queue has room for more entries.
     * Therefore, on newer kernels, it isn't necessary, strictly speaking, for the application to limit
     * the number of in-flight requests. But, since it is almost certainly the case that an application
     * can submit requests at a faster rate than the system is capable of servicing them, if we don't
     * have some backpressure mechanism, then the application will just keep on submitting more and more
     * requests, which will eventually lead to the system running out of memory.
     * Setting a hard limit on the total number of in-flight requests serves as a backpressure mechanism
     * to prevent the number of requests buffered in the kernel from growing without bound.
     * The implementation is very simple: we increment num_unfinished_requests whenever a request is placed
     * onto the submission queue, and decrement it whenever an entry is removed from the completion queue.
     * Once num_unfinished_requests hits the limit that we set, then we cannot issue any more requests
     * until we receive more completions, therefore the number of new completions that we receive is exactly
     * equal to the number of new requests that we will place, thus ensuring that the number of in-flight
     * requests can never exceed g_queuedepth.
    */

    off_t next_file_offset_to_request = 0;
    struct io_uring uring;
    if (setup_io_uring_context(g_queuedepth, &uring)) {
        puts("FAILED TO SET UP CONTEXT");
        exit(1);
    }
    struct io_uring* ring = &uring;

    while (num_blocks_left_to_receive) { // This loop consists of 3 steps:
                                         // Step 1. Submit read requests.
                                         // Step 2. Retrieve completions.
                                         // Step 3. Run consumer.
        /*
         * Step 1: Make zero or more read requests until g_queuedepth limit is reached, or until the producer head reaches
         *         a cell that is not in the ALREADY_CONSUMED state (see my article/blog post for explanation).
         */
        unsigned long num_unfinished_requests_prev = num_unfinished_requests; // The only purpose of this variable is to keep track of whether
                                                                              // or not we added any new requests to the submission queue.
        while (num_blocks_left_to_request) {
            if (num_unfinished_requests >= g_queuedepth)
                break;
            if (g_ringbuf[producer_head].state != ALREADY_CONSUMED)
                break;

            /* expected_return_size is the number of bytes that we expect this read request to return.
             * expected_return_size will be the block size until the last block of the file.
             * when we get to the last block of the file, expected_return_size will be the size of the last
             * block of the file, which is calculated in calculate_numblocks_and_size_of_last_block
             */
            size_t expected_return_size = g_blocksize;
            if (num_blocks_left_to_request == 1) // if we're at the last block of the file
                expected_return_size = g_size_of_last_block;

            add_read_request_to_submission_queue(ring, expected_return_size, next_file_offset_to_request);
            next_file_offset_to_request += expected_return_size;
            ++num_unfinished_requests;
            --num_blocks_left_to_request;
            // We have added a request for the read syscall to write into the cell in the ringbuffer.
            // The add_read_request_to_submission_queue has already marked it as REQUESTED_BUT_NOT_YET_COMPLETED,
            // so now we just need to move the producer head to point to the next cell in the ringbuffer.
            increment_buffer_index(&producer_head);
        }

        // If we added any read requests to the submission queue, then submit them.
        if (num_unfinished_requests_prev != num_unfinished_requests) {
            rc = io_uring_submit(ring);
            if (rc < 0) {
                fprintf(stderr, "io_uring_submit: %s\n", strerror(-rc));
                exit(1);
            }
        }

        /*
         * Step 2: Remove all the items from the completion queue.
         */
        bool first_iteration = 1;                   // On the first iteration of the loop, we wait for at least one cqe to be available,
                                                    // then remove one cqe.
                                                    // On each subsequent iteration, we try to remove one cqe without waiting.
                                                    // The loop terminates only when there are no more items left in the completion queue,
        while (num_blocks_left_to_receive) {        // Or when we've read in all of the blocks of the file.
            struct io_uring_cqe *cqe;
            if (first_iteration) {                  // On the first iteration we always wait until at least one cqe is available
                rc = io_uring_wait_cqe(ring, &cqe); // This should always succeed and give us one cqe.
                first_iteration = 0;
            } else {
                rc = io_uring_peek_cqe(ring, &cqe); // This will fail once there are no more items left in the completion queue.
                if (rc == -EAGAIN) { // A return code of -EAGAIN means that there are no more items left in the completion queue.
                    break;
                }
            }
            if (rc < 0) {
                fprintf(stderr, "io_uring_peek_cqe: %s\n",
                            strerror(-rc));
                exit(1);
            }
            assert(cqe);

            // At this point we have a cqe, so let's see what our syscall returned.
            struct my_custom_data *data = (struct my_custom_data*) io_uring_cqe_get_data(cqe);

            // Check if the read syscall returned an error
            if (cqe->res < 0) {
                // we're not handling EAGAIN because it should never happen.
                fprintf(stderr, "cqe failed: %s\n",
                        strerror(-cqe->res));
                exit(1);
            }
            // Check if the read syscall returned an unexpected number of bytes.
            if ((size_t)cqe->res != data->nbytes_expected) {
                // We're not handling short reads because they should never happen on a disk-based filesystem.
                if ((size_t)cqe->res < data->nbytes_expected) {
                    puts("panic: short read");
                } else {
                    puts("panic: read returned more data than expected (wtf). Is the file changing while you're reading it??");
                }
                exit(1);
            }

            assert(data->offset_of_block_in_file % g_blocksize == 0);

            // If we reach this point, then it means that there were no errors: the read syscall returned exactly what we expected.
            // Since the read syscall returned, this means it has finished filling in the cell with ONE block of data from the file.
            // This means that the cell can now be read by the consumer, so we need to update the cell state.
            g_ringbuf[data->ringbuf_index].state = AVAILABLE_FOR_CONSUMPTION;
            --num_blocks_left_to_receive; // We received ONE block of data from the file
            io_uring_cqe_seen(ring, cqe); // mark the cqe as consumed, so that its slot can get reused
            --num_unfinished_requests;
            if (g_process_in_inner_loop)
                resume_consumer();
        }
        /* Step 3: Run consumer. This might be thought of as handing over control to the consumer "thread". See my article/blog post. */
        resume_consumer();
    }
    resume_consumer();

    close(g_filedescriptor);
    io_uring_queue_exit(ring);

    // Finalize the hash. BLAKE3_OUT_LEN is the default output length, 32 bytes.
    uint8_t output[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&g_hasher, output, BLAKE3_OUT_LEN);

    // Print the hash as hexadecimal.
    printf("BLAKE3 hash: ");
    for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i) {
        printf("%02x", output[i]);
    }
    printf("\n");
}

static void process_cmd_line_args(int argc, char* argv[])
{
    if (argc != 9) {
        printf("%s: infile g_blocksize g_queuedepth g_use_o_direct g_process_in_inner_loop g_numbufs g_use_iosqe_io_drain g_use_iosqe_io_link\n", argv[0]);
        exit(1);
    }
    g_blocksize = atoi(argv[2]) * 1024; // The command line argument is in KiBs
    g_queuedepth = atoi(argv[3]);
    if (g_queuedepth > 32768)
        puts("Warning: io_uring queue depth limit on Kernel 6.1.0 is 32768...");
    g_use_o_direct = atoi(argv[4]);
    g_process_in_inner_loop = atoi(argv[5]);
    g_numbufs = atoi(argv[6]);
    g_use_iosqe_io_drain = atoi(argv[7]);
    g_use_iosqe_io_link = atoi(argv[8]);
}

static void open_and_get_size_of_file(const char* filename)
{
    if (g_use_o_direct){
        g_filedescriptor = open(filename, O_RDONLY | O_DIRECT);
        puts("opening file with O_DIRECT");
    } else {
        g_filedescriptor = open(filename, O_RDONLY);
        puts("opening file without O_DIRECT");
    }
    if (g_filedescriptor < 0) {
        perror("open file");
        exit(1);
    }
    if (get_file_size(g_filedescriptor, &g_filesize)){
        puts("Failed getting file size");
        exit(1);
    }
}

static void calculate_numblocks_and_size_of_last_block()
{
    // this is the mathematically correct way to do ceiling division
    // (assumes no integer overflow)
    g_num_blocks_in_file = (g_filesize + g_blocksize - 1) / g_blocksize;
    // calculate the size of the last block of the file
    if (g_filesize % g_blocksize == 0)
        g_size_of_last_block = g_blocksize;
    else
        g_size_of_last_block = g_filesize % g_blocksize;
}

static void allocate_ringbuf()
{
    // We only make 2 memory allocations in this entire program and they both happen in this function.
    assert(g_blocksize % ALIGNMENT == 0);
    // First, we allocate the entire underlying ring buffer (which is a contiguous block of memory
    // containing all the actual buffers) in a single allocation.
    // This is one big piece of memory which is used to hold the actual data from the file.
    // The buf_addr field in the my_custom_data struct points to a buffer within this ring buffer.
    unsigned char* ptr_to_underlying_ring_buffer;
    // We need aligned memory, because O_DIRECT requires it.
    if (posix_memalign((void**)&ptr_to_underlying_ring_buffer, ALIGNMENT, g_blocksize * g_numbufs)) {
        puts("posix_memalign failed!");
        exit(1);
    }
    // Second, we allocate an array containing all of the my_custom_data structs.
    // This is not an array of pointers, but an array holding all of the actual structs.
    // All the items are the same size, which makes this easy
    g_ringbuf = (struct my_custom_data*) malloc(g_numbufs * sizeof(struct my_custom_data));

    // We partially initialize all of the my_custom_data structs here.
    // (The other fields are initialized in the add_read_request_to_submission_queue function)
    off_t cur_offset = 0;
    for (int i = 0; i < g_numbufs; ++i) {
        g_ringbuf[i].buf_addr = ptr_to_underlying_ring_buffer + cur_offset; // This will never change during the runtime of this program.
        cur_offset += g_blocksize;
        // g_ringbuf[i].bufsize = g_blocksize; // all the buffers are the same size.
        g_ringbuf[i].state = ALREADY_CONSUMED; // We need to set all cells to ALREADY_CONSUMED at the start. See my article/blog post for explanation.
        g_ringbuf[i].ringbuf_index = i; // This will never change during the runtime of this program.
    }
}

int main(int argc, char *argv[])
{
    assert(sizeof(size_t) >= 8); // we want 64 bit size_t for files larger than 4GB...
    process_cmd_line_args(argc, argv); // we need to first parse the command line arguments
    open_and_get_size_of_file(argv[1]);
    calculate_numblocks_and_size_of_last_block();
    allocate_ringbuf();

    // Initialize the hasher.
    blake3_hasher_init(&g_hasher);

    // Run the main loop
    producer_thread();
}

How does it work?

Basically, my program reads the file from disk and hashes it. In slightly more detail: my program issues read requests for blocks of the file, and hashes the blocks as they come back. So it is just about the simplest possible program that can use io_uring.

To really understand my program, the reader needs to know that io_uring is an asynchronous API built around two rings: a submission queue (SQ) and a completion queue (CQ). Using it is simple: put your requests (called submission queue entries, SQEs) on the SQ, call submit(), and then poll or wait on the CQ for completions, called completion queue events (CQEs). The important thing to remember about io_uring is that, by default, requests submitted to io_uring can complete in any order. There are two flags that force io_uring to complete requests in order (IOSQE_IO_DRAIN and IOSQE_IO_LINK), but they slow the program down enormously, defeating the point of using io_uring in the first place (which makes sense given how they work: these flags prevent the next SQE from starting until the previous one has completed, which all but defeats the purpose of an asynchronous API... actually, I found that IOSQE_IO_LINK didn't even work: I still got out-of-order completions, but it did slow my program down considerably. That was an earlier version, though). You can easily check this yourself: my program exposes these flags as command line arguments. I initially tried to use these flags, and only after failing to get acceptable performance with them did I finally accept that I had to deal with out-of-order completions. There are use cases that don't need file blocks to be processed in order (file copying, for example). Unfortunately, file hashing is not one of them: when hashing a file, we generally need to process the blocks in order (at least, the BLAKE3 C API does not offer an option to hash blocks in arbitrary order). So we need a way to handle out-of-order completions.
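To make the submit/complete cycle concrete, here is a minimal standalone sketch of one read request going through liburing (error handling elided; an illustration only, not part of my program):

/* Minimal sketch of the io_uring request/completion cycle using liburing. */
#include "liburing.h"
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                   // SQ and CQ with 8 entries

    int fd = open("1GB.txt", O_RDONLY);
    static char buf[4096];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); // grab a free SQE slot
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   // describe the read: fd, buffer, length, file offset
    io_uring_sqe_set_data(sqe, buf);                    // user_data comes back to us in the CQE
    io_uring_submit(&ring);                             // hand all queued SQEs to the kernel

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                     // block until one completion arrives
    printf("read returned %d bytes for request %p\n", cqe->res, io_uring_cqe_get_data(cqe));
    io_uring_cqe_seen(&ring, cqe);                      // release the CQE slot for reuse

    io_uring_queue_exit(&ring);
    return 0;
}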

So, I think the simplest way to handle this is to allocate an array big enough to hold the whole file, and then, whenever a completion arrives, put it into the corresponding slot in the array. For example, if a file consists of 100 blocks, we allocate an array of 100 slots, and when the completion for block 5 arrives, we put it into slot 5 of the array. That way, the file hashing function can simply walk this array, waiting for the next slot in the array to be filled with a completion before moving on.
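Here is a hedged sketch of that simpler whole-file-array design (all names are invented for this illustration; my actual program replaces this with the ring buffer described next):

struct slot {
    unsigned char* buf;  // buffer holding one block of the file
    size_t nbytes;       // number of bytes the read returned
    int ready;           // set to 1 when the completion for this block arrives
};
static struct slot* slots;  // one slot per block: g_num_blocks_in_file entries, all ready == 0 initially

// On each completion, the file offset tells us which slot the block belongs in:
//     slots[offset_of_block_in_file / g_blocksize].ready = 1;

// The hashing function walks the array in order, waiting for each slot to fill:
static void hash_whole_file_in_order(void)
{
    for (size_t i = 0; i < g_num_blocks_in_file; ++i) {
        while (!slots[i].ready) { /* wait (i.e., drain completions) until block i arrives */ }
        blake3_hasher_update(&g_hasher, slots[i].buf, slots[i].nbytes);
    }
}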

To reach the speed limit, I needed one more optimization: using a fixed-size ring buffer instead of allocating a single buffer big enough to hold every block of the file. Besides being faster, this also has the advantage of using less memory, which is actually why I implemented it in the first place. I didn't expect it to improve my program's performance; I only implemented it because my computer kept running out of memory when hashing large files. So I was pleasantly surprised to find that it actually made my program faster.

Now, the ring buffer implementation is a bit more complicated, so let's think this through: what is the correctness property of this algorithm? I thought of it this way:

The consumer will process all of the blocks in the file, exactly once, and in order.

Writing a consumer that processes file blocks exactly once and in order is actually trivial. The tricky part is guaranteeing that it will process all of the blocks in the file. How do we make sure it processes file blocks exactly once and in order? Easy, use this implementation: the consumer simply walks the array, looking at the state and block offset of the next cell; if the cell's state is "ready to process" and the block offset is the block it wants, the consumer processes it. All the consumer has to do is keep internal track of the offset of the last block it consumed, and add the block size to get the offset of the next block it wants.

But there is one more consideration we have to keep in mind: the queue depth. More precisely, as I explained in the code comments, it is essentially the number of requests in flight. Why does this matter? Because io_uring is an asynchronous API: submit returns immediately, so you can keep putting in more and more requests and submitting them, and eventually (at least on some kernel versions) the system runs out of memory. So we need to limit the number of requests that can be in flight at any given time, and the simplest way to do that is with a counter: we decrement the counter when we issue a request, and increment it when we receive a completion. When the counter reaches 0, we stop issuing new requests and just wait for completions. That way, the number of in-flight requests can never exceed the initial value of the counter.

With this design, we cannot issue requests for the whole file at once. If the file contains, say, a million blocks, we cannot issue a million read requests at program start and then just poll for completions, because we cannot allow more than a small number of requests to be in flight at any given time. Instead, we have to restrict ourselves to requesting only a few blocks at a time, and issue new requests only after at least one request has completed.

Let's take an example. Suppose we have a ring buffer with 5 slots and a queue depth limit of 3, meaning we can have at most 3 requests in flight at a time. At program start we can issue 3 requests. The question is, which 3 requests do we issue? The answer is obvious: the first 3 blocks (corresponding to the first 3 slots), and we want to issue them in the order block 0, block 1, block 2. Why? Because even though requests are not guaranteed to complete in the order they were issued, the kernel will at least try to service them in the order they were issued (it is a ring, after all). We want the first block to come back first, because that lets the consumer start working as soon as possible. For efficiency we want the consumer to always have some data to work on, so we want the first block to come back first (rather than, say, the second block coming back first). Generalizing this, it means we want to issue requests in the order that unblocks the consumer as soon as possible. That means if the consumer is currently waiting for block i to become "available for processing", the producer must issue the request for block i before issuing requests for any other block. Projecting forward, the producer should likewise issue the request for block i+1 before the request for block i+2, and the request for block i+2 before the request for block i+3, and so on. Therefore, the producer should issue requests in the order 0, 1, 2, 3, ... up to the last block.

So, the consumer now has what is essentially a "consumer head" that keeps moving forward to the next cell in the ring buffer whenever it can, but the same is true of the "producer head", which I think is more accurately called the "requester head". So we have two pointers:

  1. A "consumer head" pointer, which lets the consumer walk linearly through the ring buffer, processing cells sequentially, one at a time.
  2. A "requester head" pointer, which lets the producer walk linearly through the ring buffer, issuing requests for cells sequentially, one at a time.

The "requester head" points at the cell for which the producer will issue its next read request; the read syscall will then fill that cell with a block from the file. Once the producer has issued a request for one block, it should go on to issue the request for the next block, as I explained, unless it can't. When can't it? There are two cases. First, if the number of in-flight requests has reached the limit, it cannot issue any new requests until we receive completions. Second, it cannot issue a request if the next cell is unavailable. It is important to remember that blocks are mapped onto cells in the ring buffer, and issuing a request for the next block of the file requires the next cell of the ring buffer to be "free", that is, safe to overwrite. Now, if the consumer is currently reading that cell, then obviously the producer cannot issue a request for it. But there are other situations where it cannot issue a request for the cell, for example if it has already issued a request for that cell but the consumer has not yet read it. I think this is easier to understand with an example.

There are basically two cases: either the queue depth is smaller than the number of slots in the ring buffer, or the queue depth is larger. Let's go through them one at a time. Look at this diagram:

To start with, suppose we set the queue depth to 3 and the ring buffer size to 5 slots. Initially (Figure 1), the ring buffer is empty, so the consumer cannot move, but the producer can start issuing requests, and so it does. Since the queue depth is 3, the producer can issue at most 3 requests up front. So now (Figure 2) we have issued 3 requests, one each for blocks 0, 1 and 2, corresponding to slots 0, 1 and 2 respectively. Here I use the symbol "r" to denote a slot for which we have issued a read request that has not yet returned. The consumer cannot read those slots, since there is no data in them yet. Now suppose the reads for slots 1 and 2 have returned but the read for slot 0 still hasn't (Figure 3). Even though slots 1 and 2 are ready to read, the consumer cannot advance, because it is still blocked waiting for slot 0 to become available for processing. But since we received completions for blocks 1 and 2, we can issue 2 more requests, so we issue requests for the next 2 blocks we haven't requested yet, which go into slots 3 and 4. Now (Figure 4), suppose the read call for slot 4 has returned but slot 0 still hasn't. The number of in-flight requests is now only 2, so the queue depth is not stopping the producer from issuing more read requests. The consumer, however, is still blocked waiting for slot 0 to become consumable. At this point, can the producer overwrite it? The producer now wants to issue the read request for block 5, which would be written into slot 0, but the consumer hasn't seen block 0 yet, so the producer obviously must not issue the request for slot 0 before the consumer is done with it. From this we can derive a general principle. Earlier we mapped each slot to a list of blocks, e.g., with a 5-slot ring buffer, slot 0 maps to block 0, block 5, block 10, block 15 and so on. What I hope the diagram shows is that we must make sure block 5 is not requested before the consumer is finished with block 0, because they both map to the same slot, which means that requesting block 5 before block 0 has been consumed would overwrite the slot where block 0 is stored. More generally, by making sure that we only request the "next" block for a slot after the consumer has finished with the "previous" block for that slot, we guarantee that we never overwrite a block that is being processed or has yet to be processed.

So that is one way to look at it. Another way is to think about cell states. In the diagrams, we showed 2 states that a cell can be in: the "r" state and the "a" state. Now we can ask: what state must a cell be in before the producer is allowed to issue a request for it? Well, if the cell is in the "r" state, the producer has already issued a request for it and the consumer has not yet consumed it (obviously, since it is not yet available for consumption), which means that overwriting the cell would overwrite a block we have already requested. The situation I am describing is when the producer moves to the next cell and finds it in the "r" state. That can only mean one thing: the producer has gone all the way around the ring buffer and is now looking at a cell that is still waiting to receive that cell's "previous" block, and we have already explained why issuing a request in that situation would be wrong. Actually, I think this way of thinking, in terms of cell states, is easier to reason about. So how many cell states are there? I can think of several:

  1. Empty (at program start)
  2. Requested but not yet completed
  3. Read completed and available for processing
  4. Being processed
  5. Finished processing

But do we really need to represent all of these states? Well, what are the states ultimately for? They exist to guarantee the correctness properties of the program, namely:

  1. The consumer does not consume the wrong blocks; it consumes exactly the right blocks, exactly once, in exactly the right order
  2. The consumer does not get stuck

To guarantee these correctness properties, we only need to ensure the following:

  1. The consumer advances exactly when it can
  2. The producer advances exactly when it can

I use "exactly when" to convey two requirements: the consumer/producer must not advance when it is unsafe to do so (correctness), and must advance as soon as it is safe to do so (efficiency). Since this is the only purpose of the cell states, we can reduce the states to these 4:

  1. Available to the consumer but not the producer
  2. Available to the producer but not the consumer
  3. Available to both
  4. Available to neither

In fact, state 3 cannot exist, because we can never have a cell that is accessible to both the consumer and the producer at the same time: that would be a data race (the producer writing to a cell while the consumer reads from it) and therefore undefined behavior. So there are really only 3 states:

  1. Available to the consumer but not the producer (I will call this the "a" state, for "available for processing")
  2. Available to the producer but not the consumer (I will call this the "c" state, for "consumed")
  3. Available to neither (I will call this the "r" state, for "requested but not yet completed")

We can then map the states listed earlier onto these 3 states.

So now we have the complete lifecycle of a cell in the ring buffer. At program start, every cell is empty, so the consumer cannot consume them but the producer can issue requests for them; therefore we initially need to mark every cell in the array as being in the "c" state, so that the producer will issue requests for them but the consumer won't try to consume them. Once the producer has issued a request for a cell, the cell goes into the "r" state until the read request completes. Why? Because, as explained above, if the producer has issued a read request for a cell but the read hasn't returned yet, the producer must not overwrite the cell, and the consumer can't process it either since it hasn't been filled in, so the "r" state is appropriate here. Next, the read request returns, so we switch the cell to the "a" state. Why? Because after a read has returned for a cell, that cell can now be processed by the consumer, but we don't want the producer to issue a request for it until the consumer is done, so the "a" state is appropriate here. While the consumer is processing the cell, the producer must not issue a request for it, so the cell can stay in the "a" state. When the consumer has finished processing the cell, the producer may now issue a request for it, but the consumer must not try to consume it again (it has already hashed that file block, and feeding it into the hash update function a second time would produce an incorrect result), so we must switch the cell to the "c" state. Finally, once the producer head reaches the cell again, the cycle starts over.
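In code form, the lifecycle is the cycle "c" -> "r" -> "a" -> "c". The actual program assigns the state field directly, but as an illustration (this helper is not in the program), the legal transitions could be made explicit with assertions:

// Sketch only: documents the legal c -> r -> a -> c cycle using the
// enum ItemState from the single-threaded program above.
static void transition(enum ItemState* state, enum ItemState next)
{
    switch (next) {
    case REQUESTED_BUT_NOT_YET_COMPLETED:               // producer issues a read request...
        assert(*state == ALREADY_CONSUMED);             // ...only over a consumed ("c") cell
        break;
    case AVAILABLE_FOR_CONSUMPTION:                     // a read completes...
        assert(*state == REQUESTED_BUT_NOT_YET_COMPLETED); // ...only for a requested ("r") cell
        break;
    case ALREADY_CONSUMED:                              // the consumer hashes a cell...
        assert(*state == AVAILABLE_FOR_CONSUMPTION);    // ...only once it became "a"
        break;
    }
    *state = next;
}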

But how does this design satisfy the correctness properties we want the program to have? We want some kind of proof that the consumer never gets stuck. So, suppose the consumer head is currently pointing at slot i. What could stop it from moving to slot i+1? The only thing is the current slot not being in the "a" state; if it were in the "a" state, the consumer would process it and move to the next slot. So we only need to prove that if a slot is in a non-"a" state, it eventually transitions to the "a" state. How do we prove that? Well, there are only two non-"a" states: "r" and "c". If a slot is in the "r" state, it is simply waiting for a read syscall to return, and when that read returns the state becomes "a", so every "r" slot eventually becomes "a". If a slot is in the "c" state, we need some guarantee that the producer will eventually turn it into the "r" state. This is also easy to show: the producer keeps walking the array turning every "c" into an "r", stopping only when the number of in-flight requests has reached the limit or when the producer encounters a cell that is not in the "c" state (note that I am talking here about the high-level conceptual design, not any concrete implementation). All we have to show is that the producer can never be prevented from reaching a "c" cell by a non-"c" cell. Well, this is probably easier to see with a diagram:

The diagram above shows a ring buffer of size 5 with a queue depth of 3 (the diagram shows what happens when the queue depth is less than the ring buffer size: it just means the requester loop stops issuing requests until more completions arrive. As you can see, as the producer head walks the ring buffer one slot at a time, requesting the next block of the file each time, the consumer follows the producer head, so each new slot the consumer processes holds the next block of the file. If the queue depth is greater than or equal to the ring buffer size, things are even simpler: the requester loop simply issues requests for every cell in the array that is in the "c" state). At the start, both the consumer and producer heads point at the first slot, slot 0. At this point the consumer head is stuck and cannot move, but the producer head can, so the producer head goes first: it submits requests and moves 3 slots forward until the queue depth is reached, and then we check for completions. Now different things can happen. On the left is what happens if all 3 requests have returned by then. In that case, all the slots that were in the "r" state are now in the "a" state, so when we run the consumer, it happily walks through all those "a" slots, processing them as it goes and turning them into "c" slots, until it eventually reaches slot 3, which is still in the "c" state, so the consumer stops and hands control back to the producer. In the other universe, slot 0 is still in the "r" state, so the consumer keeps waiting on slot 0 and hands control back to the producer, which goes on to issue requests for slots 3 and 4, since the number of in-flight requests has dropped below the queue depth. Then we ask for completions and see that slot 4 has returned, so we switch it to "a"; we try running the consumer again, but it is still waiting on slot 0, and then we are back in the producer, which is also waiting on slot 0.

So, I want to show that the producer can never be permanently blocked, because, as the diagram shows, as the producer moves through the ring buffer, it leaves behind a sequence of cells in the "r" state, which eventually transition to the "a" state. Likewise, as the consumer moves through the ring buffer, it leaves behind a sequence of cells in the "c" state. This guarantees that the "c" cells and the "a"/"r" cells never "mix": all the cells behind the consumer head but ahead of the producer head are "c", while all the cells behind the producer head but ahead of the consumer head are "r" or "a". Therefore, as the diagram shows, when the producer head reaches a slot in the "r" or "a" state, it has caught up with the consumer head, because if it were behind the consumer head it would be looking at "c" cells. So it must be pointing at the same cell as the consumer head. If the cell is in the "a" state, the consumer will process it and turn it into the "c" state; if it is in the "r" state, it will eventually become "a", and then the consumer will process it and turn it into the "c" state.

The reasoning above applies to a generic multithreaded implementation where the producer and consumer run on different threads. But what about our single-threaded ring buffer implementation? Well, the sequence of events I showed above actually matches the sequence of events in my single-threaded ring buffer implementation, which consists of one main loop made up of 3 sub-loops that execute in order: a "requester loop" that issues requests, a "completion loop" that checks for completions, and finally a "consumer loop" that runs the consumer (see the skeleton below). I want to show that this execution order can never get stuck. First, the requester loop issues zero or more requests. As discussed, it issues requests until the number of in-flight requests reaches the queue depth, or until the producer head reaches a cell that is not in the "c" state. So this loop runs very quickly and can never get stuck. Next, we wait until the completion queue contains at least one completion, and then remove all completions from the completion queue. Can this loop get stuck? That could only happen if we waited forever to see a completion. And that could only happen if there were no cells in the "r" state. Why? Because if at least one cell is in the "r" state, a read syscall will eventually return and we will get a completion event. So, is it possible that there are no cells in the "r" state when we are waiting on the completion queue?
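In skeleton form, one iteration of producer_thread's main loop looks like this (a distillation of the code above, not new logic):

while (num_blocks_left_to_receive) {
    // 1. Requester loop: issue read requests while the in-flight count is
    //    below g_queuedepth and the producer head's cell is in the "c" state,
    //    then submit them in one batch.
    // 2. Completion loop: wait for at least one CQE, then drain the CQ,
    //    turning each completed cell from "r" into "a".
    // 3. Consumer loop: hash "a" cells in order, turning each into "c",
    //    stopping at the first cell that is not "a".
}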

Let's consider what conditions would be required for that to happen:

  1. There are no requests in flight
  2. The requester loop did not issue any requests

So what would it mean for the requester loop to have issued no requests while there are also no requests in flight? This is only possible if the cell the producer head points at is not in the "c" state. And since there are no cells in the "r" state, that cell must be in the "a" state. I explained earlier that the only way this can happen is if the producer head has caught up with the consumer head (since the consumer leaves a trail of "c" cells behind the consumer head). That means the producer and consumer heads are pointing at the same cell. Now, since the requester loop issued no requests, the producer head did not move between the start and the end of the requester loop, so at the start of the requester loop both heads were already pointing at the same cell, which was in the "a" state. That means the previous consumer loop ended with the consumer head pointing at a cell in the "a" state (it cannot have been in the "r" state then, because we only update cell states from "r" to "a" in the completion loop, so a cell's state cannot change from "r" to "a" between the start of the consumer loop and the end of the requester loop. Another way to see it: if the consumer loop had ended with the consumer head pointing at a cell in the "r" state, the cell would remain in the "r" state throughout the requester loop, since the requester loop cannot turn an "r" cell into an "a" cell, which means the cell would be in the "r" state when we enter the completion loop, and I just explained why that situation is impossible). But that is impossible, because the consumer tries to process every cell in the "a" state, turning it into the "c" state, so the consumer head cannot end a consumer loop pointing at a cell in the "a" state.

Therefore, we have shown that the completion loop cannot get stuck. The last loop, the consumer loop, cannot get stuck either, since it just walks the array processing cells until it hits a cell that is not in the "a" state, at which point it terminates.

So we have established that our single-threaded ring buffer implementation can never get stuck. What about correctness? Well, since we ensure correctness through the states, we just have to make sure the states are always correct. In the single-threaded implementation, the three loops execute sequentially, which guarantees that the states are always correct. When the requester loop terminates, the cells it touched, i.e., the cells it transitioned into the "r" state, really are in the "r" state. When the completion loop terminates, the cells it touched, i.e., the cells it transitioned into the "a" state, really are in the "a" state. When the consumer loop terminates, the cells it touched, i.e., the cells it transitioned into the "c" state, really are in the "c" state. And since the loops run sequentially, it is guaranteed that, from the point of view of each loop, the cell states are always correct, so, for example, the consumer can be sure that a cell in the "a" state really is available for it to process. In a multithreaded implementation, synchronization is needed to stop the producer thread and the consumer thread from accessing the cell states at the same time, since that would be a data race and therefore undefined behavior.

Now, what about re-issuing requests? For example, if a read request fails, or you get a partial read, you might want to re-issue the read request (with updated parameters). In my program the consumer consumes in order and the producer also issues requests in order, but re-issuing a request means issuing a request for a block you have already requested before. That is actually completely fine. Why? Because my program doesn't actually care about the order in which requests are issued. It only cares that the producer and consumer heads move through the array one cell at a time, that cells are mapped to the right blocks, and that the cell states are always correct, in the sense that if a cell transitions into the "a" state, it must be fine for the consumer to process it. And that is exactly consistent with the use case where a read request returns an error or an incomplete read and you want to re-issue it: in that case, you simply don't transition the cell into the "a" state; you leave it in the "r" state and re-issue the request. From the consumer's and producer's point of view, it just looks like the cell stayed in the "r" state for longer than usual. You can implement this in the completion loop: if the return value of the syscall (the res field in the cqe) is not what you expected, you can re-issue the request right there.
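Here is a hedged sketch of what that could look like in the completion loop (my actual program simply exits on a short read; this variant is an untested illustration). Note that with O_DIRECT the offset and length must stay suitably aligned, so the simplest correct retry is to re-read the whole block:

if (cqe->res < 0 || (size_t)cqe->res != data->nbytes_expected) {
    // Leave the cell in the "r" state and request the same block again.
    struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, g_filedescriptor, data->buf_addr,
                       g_blocksize, data->offset_of_block_in_file);
    io_uring_sqe_set_data(sqe, data);
    io_uring_submit(ring);
    io_uring_cqe_seen(ring, cqe); // this CQE has been handled; a new one will arrive later
    continue; // do NOT decrement num_unfinished_requests: the block is in flight again
}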

Overview

Here is a simple overview:

  1. Initially, the producer starts walking the array, moving the "requester head" forward. The slots it leaves behind are all initially marked "requested but not yet completed", and over time they become "available for consumption" in a non-deterministic order.
  2. The consumer follows behind, moving the "consumer head" forward whenever the next slot becomes "available for consumption", leaving behind a sequence of slots marked "consumed".
  3. The producer eventually wraps around to the start of the array, where it waits for the first slot of the array to become "consumed", since the only other 2 states are "requested but not yet completed" and "available for consumption", and it must not issue requests for slots in those states, because they are semantically "not yet consumed".

So the producer keeps waiting for the next slot to become "consumed", while the consumer keeps waiting for the next slot to become "available for consumption". The requester leaves behind a sequence of "requested but not yet completed" slots, all of which eventually become "available for consumption", and the consumer leaves behind a sequence of "consumed" slots.

Here is a diagram to illustrate this:

So, in the diagram above, the P arrow marks the position of the producer's "requester head" (it is conceptually the producer; it is not necessarily an actual thread), and the C arrow marks the consumer's "consumer head". You can see that the producer's "requester head" moves forward as long as the slot ahead of it is in the "c" ("consumed") state, and the consumer's "consumer head" moves forward as long as the slot ahead of it is in the "a" ("available for consumption") state. So we can see that, eventually, all the slots behind the requester head will turn into the "a" state, while the slots behind the consumer head are immediately marked with the "c" state. We therefore have the property that all slots "ahead of the consumer head but behind the producer head" eventually become "a", while all slots "ahead of the producer head but behind the consumer head" are always "c". At the start, all slots are marked "c", so the producer head moves first while the consumer head cannot move. Then, as the producer leaves behind a trail of "r"s, some of those "r"s turn into "a"s, so the consumer head starts moving too. Thus we established the main cycle, where the producer moves forward consuming the "c"s and leaving behind "r"s which turn into "a"s, while the consumer follows behind consuming the "a"s and turning them back into "c"s.

The multithreaded implementation

In this section, I explain how the multithreaded implementation differs from the single-threaded implementation, and why I believe it is correct and never gets stuck. To understand this section, you need to have read the previous sections, which explain how the single-threaded implementation works and why I believe it is correct and never gets stuck.

Here is the code of the C++ multithreaded version of my program:

/* SPDX-License-Identifier: MIT */
/*
 * Compile with: g++ -Wall -O3 -D_GNU_SOURCE liburing_b3sum_multithread.cc -luring libblake3.a -o liburing_b3sum_multithread
 * For an explanation of how this code works, see my article/blog post: https://1f604.com/b3sum
 *
 * Note: the comments in this program are copied unchanged from the single-threaded implementation in order to make it easier for a reader using `diff` to see the code changes from the single-thread version
 * 
 * This program is a modified version of the liburing cp program from Shuveb Hussain's io_uring tutorial.
 * Original source code here: https://github.com/axboe/liburing/blob/master/examples/io_uring-cp.c
 * The modifications were made by 1f604.
 *
 * The official io_uring documentation can be seen here:
 *    - https://kernel.dk/io_uring.pdf
 *    - https://kernel-recipes.org/en/2022/wp-content/uploads/2022/06/axboe-kr2022-1.pdf
 *
 * Acronyms: SQ = submission queue, SQE = submission queue entry, CQ = completion queue, CQE = completion queue event
 */
#include "blake3.h"
#include "liburing.h"

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/ioctl.h>

// Multithreading stuff
#include <atomic>
#include <thread>


/* Constants */
static const int ALIGNMENT = 4 * 1024; // Needed only because O_DIRECT requires aligned memory

/* ===============================================
 * ========== Start of global variables ==========
 * ===============================================
 * Declared static because they are only supposed to be visible within this .c file.
 *
 * --- Command line options ---
 * The following variables are set by the user from the command line.
 */
static int g_queuedepth;              /* This variable is not really the queue depth, but is more accurately described as
                                       * "the limit on the number of incomplete requests". Allow me to explain.
                                       * io_uring allows you to have more requests in-flight than the size of your submission
                                       * (and completion) queues. How is this possible? Well, when you call io_uring_submit,
                                       * normally it will submit ALL of the requests in the submission queue, which means that
                                       * when the call returns, your submission queue is now empty, even though the requests
                                       * are still "in-flight" and haven't been completed yet!
                                       * In earlier kernels, you could overflow the completion queue because of this.
                                       * Thus it says in the official io_uring documentation (https://kernel.dk/io_uring.pdf):
                                       *     Since the sqe lifetime is only that of the actual submission of it, it's possible
                                       *     for the application to drive a higher pending request count than the SQ ring size
                                       *     would indicate. The application must take care not to do so, or it could risk
                                       *     overflowing the CQ ring.
                                       * That is to say, the official documentation recommended that applications should ensure
                                       * that the number of in-flight requests does not exceed the size of the submission queue.
                                       * This g_queuedepth variable is therefore a limit on the total number of "incomplete"
                                       * requests, which is the number of requests on the submission queue plus the number of
                                       * requests that are still "in flight".
                                       * See num_unfinished_requests for details on how this is implemented. */
static int g_use_o_direct;            // whether to use O_DIRECT
static int g_process_in_inner_loop;   // whether to process the data inside the inner loop

static int g_use_iosqe_io_drain;      // whether to issue requests with the IOSQE_IO_DRAIN flag
static int g_use_iosqe_io_link;       // whether to issue requests with the IOSQE_IO_LINK flag
//static int g_use_ioring_setup_iopoll; // when I enable either IORING_SETUP_SQPOLL or IORING_SETUP_IOPOLL, on my current system,
//static int g_use_ioring_setup_sqpoll; // it turns my process into an unkillable zombie that uses 100% CPU that never terminates.
                                        // when I was using encrypted LVM, it just gave me error: Operation not supported.
                                        // I decided to not allow users to enable these options because I didn't want them
                                        // to accidentally launch an unkillable never-ending zombie process that uses 100% CPU.
                                        // I observed this problem in fio too when I enabled --hipri on fio, it also turned into
                                        // an unkillable never-ending zombie process that uses 100% CPU.

static size_t g_blocksize;            // This is the size of each buffer in the ringbuf, in bytes.
                                      // It is also the size of each read from the file.
static size_t g_numbufs;              // This is the number of buffers in the ringbuf.

/* --- Non-command line argument global variables --- */
blake3_hasher g_hasher;
static int g_filedescriptor;          // This is the file descriptor of the file we're hashing.
static size_t g_filesize;             // This will be filled in by the function that gets the file size.
static size_t g_num_blocks_in_file;   // The number of blocks in the file, where each block is g_blocksize bytes.
                                      // This will be calculated by a ceiling division of filesize by blocksize.
static size_t g_size_of_last_block;   // The size of the last block in the file. See calculate_numblocks_and_size_of_last_block.
static int producer_head = 0;         // Position of the "producer head". see explanation in my article/blog post
static int consumer_head = 0;         // Position of the "consumer head". see explanation in my article/blog post

#define AVAILABLE_FOR_CONSUMPTION 1
#define ALREADY_CONSUMED 2
#define REQUESTED_BUT_NOT_YET_COMPLETED 3

struct my_custom_data {               // This is the user_data associated with read requests, which are placed on the submission ring.
                                      // In applications using io_uring, the user_data struct is generally used to identify which request a
                                      // completion is for. In the context of this program, this structure is used both to identify which
                                      // block of the file the read syscall had just read, as well as for the producer and consumer to
                                      // communicate with each other, since it holds the cell state.
                                      // This can be thought of as a "cell" in the ring buffer, since it holds the state of the cell as well
                                      // as a pointer to the data (i.e. a block read from the file) that is "in" the cell.
                                      // Note that according to the official io_uring documentation, the user_data struct only needs to be
                                      // valid until the submit is done, not until completion. Basically, when you submit, the kernel makes
                                      // a copy of user_data and returns it to you with the CQE (completion queue entry).
    unsigned char* buf_addr;          // Pointer to the buffer where the read syscall is to place the bytes from the file into.
    size_t nbytes_expected;           // The number of bytes we expect the read syscall to return. This can be smaller than the size of the buffer
                                      // because the last block of the file can be smaller than the other blocks.
    //size_t nbytes_to_request;       // The number of bytes to request. This is always g_blocksize. I made this decision because O_DIRECT requires
                                      // nbytes to be a multiple of filesystem block size, and it's simpler to always just request g_blocksize.
    off_t offset_of_block_in_file;    // The offset of the block in the file that we want the read syscall to place into the memory location
                                      // pointed to by buf_addr.
    std::atomic_int state;             // Describes whether the item is available to be hashed, already hashed, or requested but not yet available for hashing.
    int ringbuf_index;                // The slot in g_ringbuf where this "cell" belongs.
                                      // I added this because once we submit a request on submission queue, we lose track of it.
                                      // When we get back a completion, we need an identifier to know which request the completion is for.
                                      // Alternatively, we could use something to map the buf_addr to the ringbuf_index, but this is just simpler.
};

// multithreading function
static void update_cell_state(struct my_custom_data* data, int new_state) { // State updates need to be synchronized because both threads look at the state
                                  // In effect they communicate via state updates.
                                  // both the consumer and producer threads block/wait for state change to continue
    data->state = new_state;
}


struct my_custom_data* g_ringbuf;     // This is a pointer to an array of my_custom_data structs. These my_custom_data structs can be thought of as the
                                      // "cells" in the ring buffer (each struct contains the cell state), thus the array that this points to can be
                                      // thought of as the "ring buffer" referred to in my article/blog post, so read that to understand how this is used.
                                      // See the allocate_ringbuf function for details on how and where the memory for the ring buffer is allocated.

/* ===============================================
 * =========== End of global variables ===========
 * ===============================================*/

static int setup_io_uring_context(unsigned entries, struct io_uring *ring)
{
    int rc;
    int flags = 0;
    rc = io_uring_queue_init(entries, ring, flags);
    if (rc < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-rc));
        return -1;
    }
    return 0;
}

static int get_file_size(int fd, size_t *size)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    if (S_ISREG(st.st_mode)) {
        *size = st.st_size;
        return 0;
    } else if (S_ISBLK(st.st_mode)) {
        unsigned long long bytes;
        if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
            return -1;
        *size = bytes;
        return 0;
    }
    return -1;
}

static void add_read_request_to_submission_queue(struct io_uring *ring, size_t expected_return_size, off_t fileoffset_to_request)
{
    assert(fileoffset_to_request % g_blocksize == 0);
    int block_number = fileoffset_to_request / g_blocksize; // the number of the block in the file
    /*  We do a modulo to map the file block number to the index in the ringbuf
        e.g. if ring buf_addr has 4 slots, then
            file block 0 -> ringbuf index 0
            file block 1 -> ringbuf index 1
            file block 2 -> ringbuf index 2
            file block 3 -> ringbuf index 3
            file block 4 -> ringbuf index 0
            file block 5 -> ringbuf index 1
            file block 6 -> ringbuf index 2
        And so on.
    */
    int ringbuf_idx = block_number % g_numbufs;
    struct my_custom_data* my_data = &g_ringbuf[ringbuf_idx];
    assert(my_data->ringbuf_index == ringbuf_idx); // The ringbuf_index of a my_custom_data struct should never change.

    my_data->offset_of_block_in_file = fileoffset_to_request;
    assert (my_data->buf_addr); // We don't need to change my_data->buf_addr since we set it to point into the backing buffer at the start of the program.
    my_data->nbytes_expected = expected_return_size;
    update_cell_state(my_data, REQUESTED_BUT_NOT_YET_COMPLETED);
    /* my_data->state = REQUESTED_BUT_NOT_YET_COMPLETED; At this point:
                                                       *     1. The producer is about to send it off in a request.
                                                       *     2. The consumer shouldn't be trying to read this buffer at this point.
                                                       * So it is okay to set the state to this here.
                                                       */

    struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        puts("ERROR: FAILED TO GET SQE");
        exit(1);
    }

    io_uring_prep_read(sqe, g_filedescriptor, my_data->buf_addr, g_blocksize, fileoffset_to_request);
    // io_uring_prep_read sets sqe->flags to 0, so we need to set the flags AFTER calling it.

    if (g_use_iosqe_io_drain)
        sqe->flags |= IOSQE_IO_DRAIN;
    if (g_use_iosqe_io_link)
        sqe->flags |= IOSQE_IO_LINK;

    io_uring_sqe_set_data(sqe, my_data);
}

static void increment_buffer_index(int* head) // moves the producer or consumer head forward by one position
{
    *head = (*head + 1) % g_numbufs; // wrap around when we reach the end of the ringbuf.
}


static void consumer_thread() // This is the consumer thread. Unlike in the single-threaded version, where the consumer
{                             // was "resumed" cooperatively, in this version it runs concurrently as a real thread.
    for (int i = 0; i < g_num_blocks_in_file; ++i){
        while (g_ringbuf[consumer_head].state != AVAILABLE_FOR_CONSUMPTION) {} // busywait
        // Consume the item.
        // The producer has already checked that nbytes_expected is the same as the amount of bytes actually returned.
        // If the read syscall returned something different to nbytes_expected then the program would have terminated with an error message.
        // Therefore it is okay to assume here that nbytes_expected is the same as the amount of actual data in the buffer.
        blake3_hasher_update(&g_hasher, g_ringbuf[consumer_head].buf_addr, g_ringbuf[consumer_head].nbytes_expected);

        // We have finished consuming the item, so mark it as consumed and move the consumer head to point to the next cell in the ringbuffer.
        update_cell_state(&g_ringbuf[consumer_head], ALREADY_CONSUMED);
        increment_buffer_index(&consumer_head);
    }
    // Finalize the hash. BLAKE3_OUT_LEN is the default output length, 32 bytes.
    uint8_t output[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&g_hasher, output, BLAKE3_OUT_LEN);

    // Print the hash as hexadecimal.
    printf("BLAKE3 hash: ");
    for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i) {
        printf("%02x", output[i]);
    }
    printf("\n");
}


static void producer_thread()
{
    int rc;
    unsigned long num_blocks_left_to_request = g_num_blocks_in_file;
    unsigned long num_blocks_left_to_receive = g_num_blocks_in_file;
    unsigned long num_unfinished_requests = 0;
    /* A brief note on how the num_unfinished_requests variable is used:
     * As mentioned earlier, in io_uring it is possible to have more requests in-flight than the
     * size of the completion ring. In earlier kernels this could cause the completion queue to overflow.
     * In later kernels there was an option added (IORING_FEAT_NODROP) which, when enabled, means that
     * if a completion event occurs and the completion queue is full, then the kernel will internally
     * store the event until the completion queue has room for more entries.
     * Therefore, on newer kernels, it isn't necessary, strictly speaking, for the application to limit
     * the number of in-flight requests. But, since it is almost certainly the case that an application
     * can submit requests at a faster rate than the system is capable of servicing them, if we don't
     * have some backpressure mechanism, then the application will just keep on submitting more and more
     * requests, which will eventually lead to the system running out of memory.
     * Setting a hard limit on the total number of in-flight requests serves as a backpressure mechanism
     * to prevent the number of requests buffered in the kernel from growing without bound.
     * The implementation is very simple: we increment num_unfinished_requests whenever a request is placed
     * onto the submission queue, and decrement it whenever an entry is removed from the completion queue.
     * Once num_unfinished_requests hits the limit that we set, then we cannot issue any more requests
     * until we receive more completions, therefore the number of new completions that we receive is exactly
     * equal to the number of new requests that we will place, thus ensuring that the number of in-flight
     * requests can never exceed g_queuedepth.
    */

    off_t next_file_offset_to_request = 0;
    struct io_uring uring;
    if (setup_io_uring_context(g_queuedepth, &uring)) {
        puts("FAILED TO SET UP CONTEXT");
        exit(1);
    }
    struct io_uring* ring = &uring;

    while (num_blocks_left_to_receive) { // This loop consists of 2 steps:
                                         // Step 1. Submit read requests.
                                         // Step 2. Retrieve completions.
                                         // (The single-threaded version has a Step 3 here that runs the consumer;
                                         // in this version the consumer runs concurrently in its own thread.)
        /*
         * Step 1: Make zero or more read requests until g_queuedepth limit is reached, or until the producer head reaches
         *         a cell that is not in the ALREADY_CONSUMED state (see my article/blog post for explanation).
         */
        unsigned long num_unfinished_requests_prev = num_unfinished_requests; // The only purpose of this variable is to keep track of whether
                                                                              // or not we added any new requests to the submission queue.
        while (num_blocks_left_to_request) {
            if (num_unfinished_requests >= g_queuedepth)
                break;
            // If the cell under the producer head hasn't been consumed yet, we can't reuse it, so stop issuing requests for now.
            if (g_ringbuf[producer_head].state != ALREADY_CONSUMED)
                break;

            /* expected_return_size is the number of bytes that we expect this read request to return.
             * expected_return_size will be the block size until the last block of the file.
             * when we get to the last block of the file, expected_return_size will be the size of the last
             * block of the file, which is calculated in calculate_numblocks_and_size_of_last_block
             */
            size_t expected_return_size = g_blocksize;
            if (num_blocks_left_to_request == 1) // if we're at the last block of the file
                expected_return_size = g_size_of_last_block;

            add_read_request_to_submission_queue(ring, expected_return_size, next_file_offset_to_request);
            next_file_offset_to_request += expected_return_size;
            ++num_unfinished_requests;
            --num_blocks_left_to_request;
            // We have added a request for the read syscall to write into the cell in the ringbuffer.
            // The add_read_request_to_submission_queue has already marked it as REQUESTED_BUT_NOT_YET_COMPLETED,
            // so now we just need to move the producer head to point to the next cell in the ringbuffer.
            increment_buffer_index(&producer_head);
        }

        // If we added any read requests to the submission queue, then submit them.
        if (num_unfinished_requests_prev != num_unfinished_requests) {
            rc = io_uring_submit(ring);
            if (rc < 0) {
                fprintf(stderr, "io_uring_submit: %s\n", strerror(-rc));
                exit(1);
            }
        }

        /*
         * Step 2: Remove all the items from the completion queue.
         */
        bool first_iteration = 0;                   // NOTE: this is deliberately initialized to 0 in the multithreaded version,
                                                    // so we never wait for a cqe: every iteration just tries to remove one cqe
                                                    // without waiting. (Keeping the single-threaded version's initial wait here
                                                    // can make this version get stuck; see the explanation below the code.)
                                                    // The loop terminates only when there are no more items left in the completion queue,
        while (num_blocks_left_to_receive) {        // Or when we've read in all of the blocks of the file.
            struct io_uring_cqe *cqe;
            if (first_iteration) {                  // In this version, first_iteration starts at 0, so this branch never runs;
                rc = io_uring_wait_cqe(ring, &cqe); // it is kept only to mirror the structure of the single-threaded version.
                first_iteration = 0;
            } else {
                rc = io_uring_peek_cqe(ring, &cqe); // This will fail once there are no more items left in the completion queue.
                if (rc == -EAGAIN) { // A return code of -EAGAIN means that there are no more items left in the completion queue.
                    break;
                }
            }
            if (rc < 0) {
                fprintf(stderr, "io_uring_peek_cqe: %s\n",
                            strerror(-rc));
                exit(1);
            }
            assert(cqe);

            // At this point we have a cqe, so let's see what our syscall returned.
            struct my_custom_data *data = (struct my_custom_data*) io_uring_cqe_get_data(cqe);

            // Check if the read syscall returned an error
            if (cqe->res < 0) {
                // we're not handling EAGAIN because it should never happen.
                fprintf(stderr, "cqe failed: %s\n",
                        strerror(-cqe->res));
                exit(1);
            }
            // Check if the read syscall returned an unexpected number of bytes.
            if ((size_t)cqe->res != data->nbytes_expected) {
                // We're not handling short reads because they should never happen on a disk-based filesystem.
                if ((size_t)cqe->res < data->nbytes_expected) {
                    puts("panic: short read");
                } else {
                    puts("panic: read returned more data than expected (wtf). Is the file changing while you're reading it??");
                }
                exit(1);
            }

            assert(data->offset_of_block_in_file % g_blocksize == 0);

            // If we reach this point, then it means that there were no errors: the read syscall returned exactly what we expected.
            // Since the read syscall returned, this means it has finished filling in the cell with ONE block of data from the file.
            // This means that the cell can now be read by the consumer, so we need to update the cell state.
            update_cell_state(&g_ringbuf[data->ringbuf_index], AVAILABLE_FOR_CONSUMPTION);
            --num_blocks_left_to_receive; // We received ONE block of data from the file
            io_uring_cqe_seen(ring, cqe); // mark the cqe as consumed, so that its slot can get reused
            --num_unfinished_requests;
            // (The single-threaded version calls resume_consumer() here when g_process_in_inner_loop is set.
            // In this version the consumer runs concurrently in its own thread, so there is nothing to resume.)
        }
        /* (The single-threaded version runs the consumer here as Step 3. In this version the consumer is a separate thread.) */
    }

    close(g_filedescriptor);
    io_uring_queue_exit(ring);
}

static void process_cmd_line_args(int argc, char* argv[])
{
    if (argc != 9) {
        printf("%s: infile g_blocksize g_queuedepth g_use_o_direct g_process_in_inner_loop g_numbufs g_use_iosqe_io_drain g_use_iosqe_io_link\n", argv[0]);
        exit(1);
    }
    g_blocksize = atoi(argv[2]) * 1024; // The command line argument is in KiBs
    g_queuedepth = atoi(argv[3]);
    if (g_queuedepth > 32768)
        puts("Warning: io_uring queue depth limit on Kernel 6.1.0 is 32768...");
    g_use_o_direct = atoi(argv[4]);
    g_process_in_inner_loop = atoi(argv[5]);
    g_numbufs = atoi(argv[6]);
    g_use_iosqe_io_drain = atoi(argv[7]);
    g_use_iosqe_io_link = atoi(argv[8]);
}

static void open_and_get_size_of_file(const char* filename)
{
    if (g_use_o_direct){
        g_filedescriptor = open(filename, O_RDONLY | O_DIRECT);
        puts("opening file with O_DIRECT");
    } else {
        g_filedescriptor = open(filename, O_RDONLY);
        puts("opening file without O_DIRECT");
    }
    if (g_filedescriptor < 0) {
        perror("open file");
        exit(1);
    }
    if (get_file_size(g_filedescriptor, &g_filesize)){
        puts("Failed getting file size");
        exit(1);
    }
}

static void calculate_numblocks_and_size_of_last_block()
{
    // this is the mathematically correct way to do ceiling division
    // (assumes no integer overflow)
    g_num_blocks_in_file = (g_filesize + g_blocksize - 1) / g_blocksize;
    // calculate the size of the last block of the file
    if (g_filesize % g_blocksize == 0)
        g_size_of_last_block = g_blocksize;
    else
        g_size_of_last_block = g_filesize % g_blocksize;
}

static void allocate_ringbuf()
{
    // We only make 2 memory allocations in this entire program and they both happen in this function.
    assert(g_blocksize % ALIGNMENT == 0);
    // First, we allocate the entire underlying ring buffer (which is a contiguous block of memory
    // containing all the actual buffers) in a single allocation.
    // This is one big piece of memory which is used to hold the actual data from the file.
    // The buf_addr field in the my_custom_data struct points to a buffer within this ring buffer.
    unsigned char* ptr_to_underlying_ring_buffer;
    // We need aligned memory, because O_DIRECT requires it.
    if (posix_memalign((void**)&ptr_to_underlying_ring_buffer, ALIGNMENT, g_blocksize * g_numbufs)) {
        puts("posix_memalign failed!");
        exit(1);
    }
    // Second, we allocate an array containing all of the my_custom_data structs.
    // This is not an array of pointers, but an array holding all of the actual structs.
    // All the items are the same size, which makes this easy
    g_ringbuf = (struct my_custom_data*) malloc(g_numbufs * sizeof(struct my_custom_data));

    // We partially initialize all of the my_custom_data structs here.
    // (The other fields are initialized in the add_read_request_to_submission_queue function)
    off_t cur_offset = 0;
    for (int i = 0; i < g_numbufs; ++i) {
        g_ringbuf[i].buf_addr = ptr_to_underlying_ring_buffer + cur_offset; // This will never change during the runtime of this program.
        cur_offset += g_blocksize;
        // g_ringbuf[i].bufsize = g_blocksize; // all the buffers are the same size.
        g_ringbuf[i].state = ALREADY_CONSUMED; // We need to set all cells to ALREADY_CONSUMED at the start. See my article/blog post for explanation.
        g_ringbuf[i].ringbuf_index = i; // This will never change during the runtime of this program.
    }
}

int main(int argc, char *argv[])
{
    assert(sizeof(size_t) >= 8); // we want 64 bit size_t for files larger than 4GB...
    process_cmd_line_args(argc, argv); // we need to first parse the command line arguments
    open_and_get_size_of_file(argv[1]);
    calculate_numblocks_and_size_of_last_block();
    allocate_ringbuf();

    // Initialize the hasher.
    blake3_hasher_init(&g_hasher);

    // Run the threads
    std::thread t1(producer_thread), t2(consumer_thread);
    t1.join();
    t2.join();
}

There are only a handful of differences between the single-threaded and the multithreaded version, two of which are crucial; I will explain those below. Well, let's start with the unimportant differences. First, the multithreaded version uses std::thread (added in C++11), so it has to be compiled with g++, and since g++ can also compile C code... you can use g++ to compile both the single-threaded and the multithreaded versions. Accordingly, the two newly added headers are <atomic> and <thread>. Well, since we're using std::thread for multithreading, including <thread> is obvious. But why do we need <atomic>? Why not <mutex> or <condition_variable>? Because we use atomic operations, rather than mutexes or condition variables, for synchronization.
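
For reference, a plausible compile command for the multithreaded version, mirroring the gcc command given in the instrumented listing's header comment further down (the .cpp file name here is my assumption, based on the binary names in the benchmark table):

$ g++ -Wall -O3 -D_GNU_SOURCE liburing_b3sum_multithread.cpp -luring libblake3.a -o liburing_b3sum_multithread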

One of the two crucial changes in my program is that the type of the state field in the my_custom_data struct changed from an enum to a std::atomic_int. This is absolutely necessary, because otherwise the producer could write to state at the same time as the consumer reads it, or vice versa (for a formal definition of what "at the same time" means in this context, see my blog post: https://1f604.blogspot.com/2023/06/atomic-weapons-part-1.html), which would be a data race. It needs to be atomic because the consumer thread has to "see" all of the changes that the producer thread made to memory, which in this case means the producer thread filling the buffer with data from the file. Atomics solve this problem because a write to an atomic variable is a store/release and a read from an atomic variable is a load/acquire, and it is guaranteed that if thread A writes to an atomic variable and thread B reads that same atomic variable, then once B reads the atomic variable and sees some value, everything that thread A did before writing that value will be visible to B. In this case it means that once the consumer thread reads the state, it "sees" all the blocks that the producer filled in before writing to the state variable. More concretely: if thread A sets a cell's state to "a" (i.e., the cell is now filled with a block from the file) only after making sure that the read syscall has written the data into that cell, then if thread B sees that the cell's state is "a", thread B is guaranteed to also see the changes that the read syscall made to the cell (i.e., the fully written data). Another way to think about it is that the producer and consumer threads conceptually communicate through the state variable: it is shared memory that both threads read and write, so it must be synchronized, otherwise the two threads could access it at the same time, which is a data race and therefore undefined behavior.
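
To make the store/release, load/acquire guarantee concrete, here is a minimal, self-contained sketch of the pattern (my illustration, not code from the program): the producer fills a buffer and then publishes it through an atomic flag; the consumer spins on the flag and only then reads the buffer.

// Sketch only. Compile with something like: g++ -O2 -pthread example.cpp
#include <atomic>
#include <thread>
#include <cstdio>

static std::atomic_int state{0};   // plays the role of a cell's state field
static char buffer[32];            // plays the role of a cell's data block

int main()
{
    std::thread producer([] {
        snprintf(buffer, sizeof(buffer), "block 0 contents"); // fill the buffer first...
        state.store(1); // ...then publish. A store to an atomic acts as a release.
    });
    std::thread consumer([] {
        while (state.load() != 1) {} // a load from an atomic acts as an acquire; busywait
        // Because the consumer's load saw the value written by the producer's store,
        // everything the producer wrote before that store (the buffer) is visible here.
        puts(buffer);
    });
    producer.join();
    consumer.join();
    return 0;
}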

Now, as explained earlier, the consumer thread needs to wait for the cell under the consumer head to enter the "a" state before it can proceed. In the single-threaded version, this was just a loop that exited whenever it encountered a non-"a" cell. But now that the consumer is a separate thread, we can finally replace the consumer loop with a direct implementation of the conceptual consumer thread: a loop that consumes the current cell as soon as it becomes "a", then moves on to the next cell and repeats. This, however, requires us to implement a "wait", which the old consumer loop never needed, because the consumer has to wait for a cell to enter the "a" state before processing it. In any case, implementing the wait with an atomic variable is very easy: you just write a while loop with an empty body, like this: while(current_cell.state is not "a") {}. This is known as "busywaiting", and although it supposedly performs quite well, I didn't notice any performance difference between the busywait version and the condition-variable-wait version in the "allocate enough memory to hold the entire file" variant of the program, so I doubt busywaiting brings any performance advantage here. Rather, I chose it because I think it is easier to reason about. A sketch of the condition-variable alternative follows below.
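
Here is a rough sketch of what the condition-variable version of the wait could look like. This is a generic illustration of the pattern, not the actual code from my program, and the names in it are made up:

#include <condition_variable>
#include <mutex>
#include <thread>
#include <cstdio>

static std::mutex m;
static std::condition_variable cv;
static int cell_state = 0; // hypothetical stand-in for a cell's state field; 1 means "a"

int main()
{
    std::thread consumer([] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return cell_state == 1; }); // sleeps instead of spinning
        puts("cell is now available for consumption");
    });
    std::thread producer([] {
        { std::lock_guard<std::mutex> lk(m); cell_state = 1; } // set the state under the lock
        cv.notify_one();                                       // then wake the consumer
    });
    producer.join();
    consumer.join();
    return 0;
}

The trade-off is that cv.wait puts the consumer to sleep instead of burning a core, at the cost of taking a mutex on every state change; as noted above, I measured no difference between the two approaches on this workload.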

The other crucial change in my program is from bool first_iteration = 1; to bool first_iteration = 0;. Why is this a crucial change? Because without it, the program can get stuck. How? Recall that in our single-threaded program, the completion loop waits for one completion on its first iteration. Now that we have turned the producer function into a thread, keeping that wait can cause the program to get stuck. Let me demonstrate.

The figure above shows a ring buffer with 2 slots. At the start, every cell is in the "c" state. Now the consumer blocks, waiting for slot 0 to become "a", while the producer starts its requester loop and walks the array, turning every cell into the "r" state. So now slots 0 and 1 are both "r". Next, since the producer cannot issue any new requests, it enters the completion loop and blocks, waiting for a completion. So far so good. Then a completion for slot 0 arrives, so the producer sets that cell's state to "a", which unblocks the consumer, and the consumer starts processing slot 0. The producer is also unblocked and can continue, so it goes back to the requester loop, finds nothing to do, then returns to the completion loop and waits for the next completion. At this point the consumer is still working on slot 0 while the producer is waiting for the next completion. Now, either the consumer finishes processing its cell first, or the completion arrives first. Suppose the completion arrives first. **Then the producer transitions slot 1 into the "a" state, so that both slots are now "a"**. The producer then returns to the requester loop, where it finds that it cannot advance the producer head because slot 0 is still in the "a" state (the consumer isn't done with it yet), so it exits the requester loop and enters the completion loop, waiting for the next completion. And here is the problem: **both slots are now in the "a" state, which means there are no requests in flight, and therefore there will never be another completion!** The consumer will finish processing slot 0 and turn it into the "c" state, but the producer is still waiting for a completion! Then the consumer will finish processing slot 1 and turn it into the "c" state, **but the producer is still waiting for a completion that will never arrive, because there are no requests in flight!!!** So removing the wait from the first iteration of the completion loop is absolutely crucial; without that change the program can get stuck.
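
To restate that trace compactly (with c = ALREADY_CONSUMED, r = REQUESTED_BUT_NOT_YET_COMPLETED, a = AVAILABLE_FOR_CONSUMPTION):

start:             [c][c]  consumer blocks on slot 0; producer requests both slots -> [r][r], then waits on the CQ
slot 0 completes:  [a][r]  consumer starts hashing slot 0; producer finds nothing to request, waits on the CQ again
slot 1 completes:  [a][a]  no requests are in flight anymore
producer loops:    it cannot issue a request (slot 0 is still "a"), so it waits on the CQ, and waits forever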

So, is this program correct? Let's consider the correctness property, which is that the consumer must process all of the blocks of the file, in order, exactly once. To prove this, it suffices to show that the consumer only processes a cell when it is actually safe for it to do so. We use the cell states to guarantee this property, and we have said that if the states are always correct, then the program is correct too. So let's consider what it means for the states to always be correct.

From the consumer's point of view, it means that **whenever** the consumer sees a cell in the "a" state, it is **always** actually safe for the consumer to process it. It is fine for a cell to be safe to process but not in the "a" state, because then the consumer will simply not consume it, and nothing bad happens. So from the consumer's perspective, correctness is simply: if a cell is in the "a" state, then it really can be processed. The producer guarantees this by first obtaining the completion event and only then updating the state, which ensures that when the consumer sees that a cell's state has changed to "a", the consumer also sees the changes the producer made to the cell (i.e., filling it with a block from the file).

From the producer's point of view, it means that **whenever** the producer sees a cell in the "c" state, it is **always** actually safe for the producer to issue a request for that cell. It is fine for a cell to be safe to request but not in the "c" state, because then the producer will simply do nothing, and nothing bad happens. So from the producer's perspective, correctness is simply: if a cell is in the "c" state, then it really is safe to issue a request for it. The consumer guarantees this by first processing the cell and only then updating its state to "c": it updates the cell state only after it has finished processing the cell, which ensures that whenever the producer sees that a cell's state has changed to "c", the consumer really is done with that block and no longer needs the data in the cell.

So, by ensuring that the cell states are always correct, we ensure that the consumer only processes a block when it really can, and that the producer only issues a request when it really can. That is, the consumer only reads a block when it is safe to read it, and the producer only writes to a block when it is safe to write to it.

Now, what about the "eventually terminates" property? We want to show that the program can never get stuck. Let's start with the consumer, because if we can show that the consumer never gets stuck, that is enough to show that the program never gets stuck. The consumer thread is extremely simple: it just waits for states to become "a". So the only way the consumer can get stuck is if the cell it is waiting on never becomes "a", meaning it either stays in the "r" state forever or stays in the "c" state forever. The producer is responsible for transitioning cells into the "a" state (from the "r" or "c" state), which means that **the consumer can only get stuck if the producer is stuck.** So now we have to look at the producer.

Can the producer get stuck? The producer thread consists of two loops: a requester loop followed by a completion loop. The requester loop places requests onto the SQ until it can't place any more (because the number of unfinished requests has reached the queue depth, or because the cell under the producer head is not in the "c" state), at which point it terminates and submits the requests. The producer thread then enters the completion loop, where it walks through the items in the completion queue, removing each one and transitioning the corresponding cell into the "a" state, until there are no more items in the completion queue, at which point the completion loop terminates. The main difference, as mentioned before, is that the single-threaded version waits for one completion on the first iteration of the completion loop, while the multithreaded version does not. In the multithreaded version, if there are no completions in the completion queue on the first iteration of the completion loop, the completion loop terminates immediately without waiting. The same goes for the requester loop (and this was always the case): if it cannot issue a request, the requester loop terminates immediately.

The producer essentially has 2 jobs: first, it keeps checking the cell under the producer head, waiting for it to become "c" so that it can issue a request for it, thereby doing the job of transitioning the cell under the producer head from "c" to "r"; and second, it keeps checking for completions, thereby doing the job of transitioning cells from "r" to "a". There is no "waiting": the producer just keeps alternating between these two loops. There are no blocking instructions in either loop, just "issue requests until you can't" followed by "pop from the completion queue until you can't". So it is easy to see that each loop runs in a finite amount of time. The requester loop must always eventually terminate, because the producer will eventually be unable to advance the producer head (either because it has hit a cell that is not in the "c" state, or because it has reached the limit on unfinished requests). To see this, imagine an infinitely fast consumer, so that "a" cells instantly become "c" cells as the consumer passes over them. Even then the requester loop still terminates, because turning "r" cells into "a" cells requires a run of the completion loop, which means the ring buffer will eventually fill up with "r" cells, forcing the requester loop to terminate. The completion loop always terminates because at any given moment there are only finitely many requests in flight, and no new requests can be issued during the completion loop, so each full iteration of the completion loop decreases the number of in-flight requests by 1; eventually there will be no requests in flight, so the completion queue will be empty and the completion loop will terminate. So both the requester loop and the completion loop are guaranteed to terminate, which means the producer will always keep alternating between its two jobs; it cannot get stuck on just one of them.

So when can the producer get stuck? Well, it can't actually get stuck in one loop, because it always keeps alternating between the two. This means we can be certain that a cell in the "r" state will eventually transition into the "a" state, because that is what the producer's second loop does. The producer's other job is turning "c" cells into "r" cells, and here again we can be certain that whenever the cell under the producer head transitions into the "c" state, it is guaranteed to eventually transition into the "r" state, because that is what the producer's first loop does.

So the only way the producer can be stuck is if the producer head is forever waiting for the cell under it to transition into the "c" state. Now, if that cell is in the "r" state, then it will eventually transition into "a" as described above, so we only need to consider the case where the cell is in the "a" state. In fact, this applies to the whole ring buffer: since "r" cells are guaranteed to eventually become "a" cells, any situation where the producer head is stuck can be considered as one where every cell in the ring buffer is in either the "a" state or the "c" state. Here is an example of such a situation:
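
For instance, with a 4-slot ring buffer, such a situation could look like this (an illustrative sketch; P marks the producer head, C marks the consumer head):

      P         C
    [ a ][ a ][ c ][ c ]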

In the situation above, the producer thread is blocked waiting for the cell under P to transition into the "c" state, while the consumer thread is blocked waiting for the cell under C to transition into the "a" state. Both heads must be blocked, because if either one were not blocked, it would eventually move to the other head's position and unblock it. For example, if the consumer were not blocked, it would eventually move to P's position and transition the cell under P into the "c" state. Likewise, if the producer were not blocked, it would eventually move to C's position and transition the cell under C into the "r" state, and "r" eventually becomes "a". So for either one to be blocked, the other must also be blocked. That is, for the consumer to be blocked, the producer must also be blocked, and for the producer to be blocked, the consumer must also be blocked. Therefore, for the program to be blocked, we must be in a situation where the producer and the consumer are blocked at the same time.

So, to show that the program cannot block, we only need to show that the consumer and the producer can never be blocked at the same time. Let's start from the beginning of the program. At the start, every cell is in the "c" state and only the producer head can move. The producer will then start moving through the ring buffer, turning "c" cells into "r" cells, which in turn become "a" cells, with the consumer head following close behind. Now, when does the producer block? It blocks when the producer head points at a cell in the "r" or "a" state. Since the producer is the only entity that can transition cells into the "r" or "a" state, and the ring buffer starts with every cell in the "c" state, and the producer moves forward one cell at a time, turning every cell it touches into the "r" state (which then becomes "a"), the only way the producer can end up blocked on a cell in the "r" or "a" state is if it has gone all the way around the ring buffer, i.e., the entire ring buffer is now full of cells in the "r" or "a" state. To visualize this, imagine that all cells start out in the "c" state; as the producer moves forward, it is like a snake whose body/tail consists of cells in the "r" state turning into "a", while the consumer is like another snake, whose body/tail consists of cells in the "c" state, following right behind the producer snake. If the producer is faster than the consumer, it will eventually "eat up" all of the "c" cells and catch up to its own tail... and that is the only way it can ever arrive at a cell in the "a" or "r" state.

So, because of the initial layout of the ring buffer (every cell in the "c" state), the only situation in which the producer can be blocked is when every cell in the ring buffer is in the "r" or "a" state, and since we know that every "r" cell eventually transitions into an "a" cell, eventually the ring buffer will consist entirely of "a" cells. But in that state the consumer cannot possibly be blocked, because there is nothing for it to wait on! So in the **only** possible scenario where the producer is blocked, the consumer is not blocked, and therefore the producer and the consumer can never be blocked at the same time. And since we established earlier that the program cannot block unless the producer and the consumer are blocked at the same time, this means the program can never block.

How many requests are actually in flight?

A natural question is: what are good values for the queue depth and the ring buffer size? I found that for the single-threaded O_DIRECT version, small values like 2-4 work well. I wrote an instrumented version of the single-threaded ring buffer implementation and observed that with a queue depth and ring buffer size of 4, the number of in-flight requests on entry to the consumer loop was consistently 3, which makes complete sense: on my machine the hashing speed is greater than the file read speed, which means the consumer processes blocks as fast as the read syscalls can return them. So if we issue requests for the next 4 blocks and then wait for at least one block to come back, then as soon as the first block comes back, the completion loop finishes and the consumer loop runs and processes that block. Then we go back to the requester loop, which issues 1 request, and then we wait on the completion queue again. Since the wait finishes as soon as a block comes back, the disk has only just begun reading the next block of the file when the CPU starts processing the block that just arrived. And since the CPU processes blocks much faster than the disk can deliver them, the CPU finishes processing its block before the disk has finished reading the next one, so we have to wait again. As a result, the number of in-flight requests is always one less than the queue depth and ring buffer size: always 3 when the queue depth and ring buffer size are 4, and always 1 when they are 2.

$ echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./instru 1GB.txt 512 4 1 0 4 0 0
opening file with O_DIRECT
BLAKE3 hash: 90e7144c7a091d505102e433fb4f2160d1d56d7f5b39291206819e5648c1e639
Number of requests in-flight:
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 0, 0, 
Num records: 2049

real	0m0.302s
user	0m0.237s
sys	0m0.022s

I suppose a natural follow-up question is: what would we see if the CPU were slower than the disk? Suppose the queue depth and ring buffer size are 4. Initially, the requester loop issues 4 requests and the completion loop takes one block off, so 3 requests are in flight. Now the consumer loop runs, but before the consumer finishes, suppose the disk has fetched the next 2 blocks of the file, so now only 1 request is in flight. Then the requester loop runs again and issues 1 new request, so now 2 requests are in flight. This time the completion loop comes back with 2 blocks instead of 1, and the consumer has to process both of them. By the time the consumer has finished processing both blocks, the disk should have finished reading the other two in-flight blocks and spent some time sitting idle. But at the moment the consumer loop began, two blocks were in flight, even though by the time the consumer loop ends those two requests have long since completed. Indeed, when I ran code that deliberately slowed down the consumer, the number of in-flight requests recorded at the start of the consumer loop was mostly 2 (occasionally 3).

$ echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread_instrumented 1GB.txt 512 4 1 0 4 0 0
opening file with O_DIRECT
BLAKE3 hash: 90e7144c7a091d505102e433fb4f2160d1d56d7f5b39291206819e5648c1e639
Number of requests in-flight:
3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0, 
Num records: 1042

real	0m0.431s
user	0m0.414s
sys	0m0.016s

I have attached the instrumented version below:

// This is the instrumented version of liburing_b3sum_singlethread.

/* SPDX-License-Identifier: MIT */
/*
 * Compile with: gcc -Wall -O3 -D_GNU_SOURCE liburing_b3sum_singlethread_instrumented.c -luring libblake3.a -o liburing_b3sum_singlethread_instrumented
 * For an explanation of how this code works, see my article/blog post: https://1f604.com/b3sum
 *
 * This program is a modified version of the liburing cp program from Shuveb Hussain's io_uring tutorial.
 * Original source code here: https://github.com/axboe/liburing/blob/master/examples/io_uring-cp.c
 * The modifications were made by 1f604.
 *
 * The official io_uring documentation can be seen here:
 *    - https://kernel.dk/io_uring.pdf
 *    - https://kernel-recipes.org/en/2022/wp-content/uploads/2022/06/axboe-kr2022-1.pdf
 *
 * Acronyms: SQ = submission queue, SQE = submission queue entry, CQ = completion queue, CQE = completion queue event
 */
#include "dependencies/rust-blake3-1.3.1/c/blake3.h"
#include "liburing.h"

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/ioctl.h>

// Instrumentation
static int in_flight_count[9000];
static unsigned long num_unfinished_requests = 0;
static int consume_idx = 0;

/* Constants */
static const int ALIGNMENT = 4 * 1024; // Needed only because O_DIRECT requires aligned memory

/* ===============================================
 * ========== Start of global variables ==========
 * ===============================================
 * Declared static because they are only supposed to be visible within this .c file.
 *
 * --- Command line options ---
 * The following variables are set by the user from the command line.
 */
static int g_queuedepth;              /* This variable is not really the queue depth, but is more accurately described as
                                       * "the limit on the number of incomplete requests". Allow me to explain.
                                       * io_uring allows you to have more requests in-flight than the size of your submission
                                       * (and completion) queues. How is this possible? Well, when you call io_uring_submit,
                                       * normally it will submit ALL of the requests in the submission queue, which means that
                                       * when the call returns, your submission queue is now empty, even though the requests
                                       * are still "in-flight" and haven't been completed yet!
                                       * In earlier kernels, you could overflow the completion queue because of this.
                                       * Thus it says in the official io_uring documentation (https://kernel.dk/io_uring.pdf):
                                       *     Since the sqe lifetime is only that of the actual submission of it, it's possible
                                       *     for the application to drive a higher pending request count than the SQ ring size
                                       *     would indicate. The application must take care not to do so, or it could risk
                                       *     overflowing the CQ ring.
                                       * That is to say, the official documentation recommended that applications should ensure
                                       * that the number of in-flight requests does not exceed the size of the submission queue.
                                       * This g_queuedepth variable is therefore a limit on the total number of "incomplete"
                                       * requests, which is the number of requests on the submission queue plus the number of
                                       * requests that are still "in flight".
                                       * See num_unfinished_requests for details on how this is implemented. */
static int g_use_o_direct;            // whether to use O_DIRECT
static int g_process_in_inner_loop;   // whether to process the data inside the inner loop

static int g_use_iosqe_io_drain;      // whether to issue requests with the IOSQE_IO_DRAIN flag
static int g_use_iosqe_io_link;       // whether to issue requests with the IOSQE_IO_LINK flag
//static int g_use_ioring_setup_iopoll; // when I enable either IORING_SETUP_SQPOLL or IORING_SETUP_IOPOLL, on my current system,
//static int g_use_ioring_setup_sqpoll; // it turns my process into an unkillable zombie that uses 100% CPU that never terminates.
                                        // when I was using encrypted LVM, it just gave me error: Operation not supported.
                                        // I decided to not allow users to enable these options because I didn't want them
                                        // to accidentally launch an unkillable never-ending zombie process that uses 100% CPU.
                                        // I observed this problem in fio too when I enabled --hipri on fio, it also turned into
                                        // an unkillable never-ending zombie process that uses 100% CPU.

static size_t g_blocksize;            // This is the size of each buffer in the ringbuf, in bytes.
                                      // It is also the size of each read from the file.
static size_t g_numbufs;              // This is the number of buffers in the ringbuf.

/* --- Non-command line argument global variables --- */
blake3_hasher g_hasher;
blake3_hasher g_hasher2;
static int g_filedescriptor;          // This is the file descriptor of the file we're hashing.
static size_t g_filesize;             // This will be filled in by the function that gets the file size.
static size_t g_num_blocks_in_file;   // The number of blocks in the file, where each block is g_blocksize bytes.
                                      // This will be calculated by a ceiling division of filesize by blocksize.
static size_t g_size_of_last_block;   // The size of the last block in the file. See calculate_numblocks_and_size_of_last_block.
static int producer_head = 0;         // Position of the "producer head". see explanation in my article/blog post
static int consumer_head = 0;         // Position of the "consumer head". see explanation in my article/blog post

enum ItemState {                      // describes the state of a cell in the ring buffer array ringbuf, see my article/blog post for detailed explanation
    AVAILABLE_FOR_CONSUMPTION,        // consumption and processing in the context of this program refers to hashing
    ALREADY_CONSUMED,
    REQUESTED_BUT_NOT_YET_COMPLETED,
};

struct my_custom_data {               // This is the user_data associated with read requests, which are placed on the submission ring.
                                      // In applications using io_uring, the user_data struct is generally used to identify which request a
                                      // completion is for. In the context of this program, this structure is used both to identify which
                                      // block of the file the read syscall had just read, as well as for the producer and consumer to
                                      // communicate with each other, since it holds the cell state.
                                      // This can be thought of as a "cell" in the ring buffer, since it holds the state of the cell as well
                                      // as a pointer to the data (i.e. a block read from the file) that is "in" the cell.
                                      // Note that according to the official io_uring documentation, the user_data struct only needs to be
                                      // valid until the submit is done, not until completion. Basically, when you submit, the kernel makes
                                      // a copy of user_data and returns it to you with the CQE (completion queue entry).
    unsigned char* buf_addr;          // Pointer to the buffer where the read syscall is to place the bytes from the file into.
    size_t nbytes_expected;           // The number of bytes we expect the read syscall to return. This can be smaller than the size of the buffer
                                      // because the last block of the file can be smaller than the other blocks.
    //size_t nbytes_to_request;       // The number of bytes to request. This is always g_blocksize. I made this decision because O_DIRECT requires
                                      // nbytes to be a multiple of filesystem block size, and it's simpler to always just request g_blocksize.
    off_t offset_of_block_in_file;    // The offset of the block in the file that we want the read syscall to place into the memory location
                                      // pointed to by buf_addr.
    enum ItemState state;             // Describes whether the item is available to be hashed, already hashed, or requested but not yet available for hashing.
    int ringbuf_index;                // The slot in g_ringbuf where this "cell" belongs.
                                      // I added this because once we submit a request on submission queue, we lose track of it.
                                      // When we get back a completion, we need an identifier to know which request the completion is for.
                                      // Alternatively, we could use something to map the buf_addr to the ringbuf_index, but this is just simpler.
};

struct my_custom_data* g_ringbuf;     // This is a pointer to an array of my_custom_data structs. These my_custom_data structs can be thought of as the
                                      // "cells" in the ring buffer (each struct contains the cell state), thus the array that this points to can be
                                      // thought of as the "ring buffer" referred to in my article/blog post, so read that to understand how this is used.
                                      // See the allocate_ringbuf function for details on how and where the memory for the ring buffer is allocated.

/* ===============================================
 * =========== End of global variables ===========
 * ===============================================*/

static int setup_io_uring_context(unsigned entries, struct io_uring *ring)
{
    int rc;
    int flags = 0;
    rc = io_uring_queue_init(entries, ring, flags);
    if (rc < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-rc));
        return -1;
    }
    return 0;
}

static int get_file_size(int fd, size_t *size)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    if (S_ISREG(st.st_mode)) {
        *size = st.st_size;
        return 0;
    } else if (S_ISBLK(st.st_mode)) {
        unsigned long long bytes;
        if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
            return -1;
        *size = bytes;
        return 0;
    }
    return -1;
}

static void add_read_request_to_submission_queue(struct io_uring *ring, size_t expected_return_size, off_t fileoffset_to_request)
{
    assert(fileoffset_to_request % g_blocksize == 0);
    int block_number = fileoffset_to_request / g_blocksize; // the number of the block in the file
    /*  We do a modulo to map the file block number to the index in the ringbuf
        e.g. if ring buf_addr has 4 slots, then
            file block 0 -> ringbuf index 0
            file block 1 -> ringbuf index 1
            file block 2 -> ringbuf index 2
            file block 3 -> ringbuf index 3
            file block 4 -> ringbuf index 0
            file block 5 -> ringbuf index 1
            file block 6 -> ringbuf index 2
        And so on.
    */
    int ringbuf_idx = block_number % g_numbufs;
    struct my_custom_data* my_data = &g_ringbuf[ringbuf_idx];
    assert(my_data->ringbuf_index == ringbuf_idx); // The ringbuf_index of a my_custom_data struct should never change.

    my_data->offset_of_block_in_file = fileoffset_to_request;
    assert (my_data->buf_addr); // We don't need to change my_data->buf_addr since we set it to point into the backing buffer at the start of the program.
    my_data->nbytes_expected = expected_return_size;
    my_data->state = REQUESTED_BUT_NOT_YET_COMPLETED; /* At this point:
                                                       *     1. The producer is about to send it off in a request.
                                                       *     2. The consumer shouldn't be trying to read this buffer at this point.
                                                       * So it is okay to set the state to this here.
                                                       */

    struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        puts("ERROR: FAILED TO GET SQE");
        exit(1);
    }

    io_uring_prep_read(sqe, g_filedescriptor, my_data->buf_addr, g_blocksize, fileoffset_to_request);
    // io_uring_prep_read sets sqe->flags to 0, so we need to set the flags AFTER calling it.

    if (g_use_iosqe_io_drain)
        sqe->flags |= IOSQE_IO_DRAIN;
    if (g_use_iosqe_io_link)
        sqe->flags |= IOSQE_IO_LINK;

    io_uring_sqe_set_data(sqe, my_data);
}

static void increment_buffer_index(int* head) // moves the producer or consumer head forward by one position
{
    *head = (*head + 1) % g_numbufs; // wrap around when we reach the end of the ringbuf.
}


static void resume_consumer() // Conceptually, this resumes the consumer "thread".
{                             // As this program is single-threaded, we can think of it as cooperative multitasking.
    in_flight_count[consume_idx] = num_unfinished_requests;
    ++consume_idx;
    while (g_ringbuf[consumer_head].state == AVAILABLE_FOR_CONSUMPTION){
        // Consume the item.
        // The producer has already checked that nbytes_expected is the same as the amount of bytes actually returned.
        // If the read syscall returned something different to nbytes_expected then the program would have terminated with an error message.
        // Therefore it is okay to assume here that nbytes_expected is the same as the amount of actual data in the buffer.
        blake3_hasher_update(&g_hasher, g_ringbuf[consumer_head].buf_addr, g_ringbuf[consumer_head].nbytes_expected);
//        blake3_hasher_update(&g_hasher2, g_ringbuf[consumer_head].buf_addr, g_ringbuf[consumer_head].nbytes_expected);

        // We have finished consuming the item, so mark it as consumed and move the consumer head to point to the next cell in the ringbuffer.
        g_ringbuf[consumer_head].state = ALREADY_CONSUMED;
        increment_buffer_index(&consumer_head);
    }
}


static void producer_thread()
{
    int rc;
    unsigned long num_blocks_left_to_request = g_num_blocks_in_file;
    unsigned long num_blocks_left_to_receive = g_num_blocks_in_file;
    /* A brief note on how the num_unfinished_requests variable is used:
     * As mentioned earlier, in io_uring it is possible to have more requests in-flight than the
     * size of the completion ring. In earlier kernels this could cause the completion queue to overflow.
     * In later kernels there was an option added (IORING_FEAT_NODROP) which, when enabled, means that
     * if a completion event occurs and the completion queue is full, then the kernel will internally
     * store the event until the completion queue has room for more entries.
     * Therefore, on newer kernels, it isn't necessary, strictly speaking, for the application to limit
     * the number of in-flight requests. But, since it is almost certainly the case that an application
     * can submit requests at a faster rate than the system is capable of servicing them, if we don't
     * have some backpressure mechanism, then the application will just keep on submitting more and more
     * requests, which will eventually lead to the system running out of memory.
     * Setting a hard limit on the total number of in-flight requests serves as a backpressure mechanism
     * to prevent the number of requests buffered in the kernel from growing without bound.
     * The implementation is very simple: we increment num_unfinished_requests whenever a request is placed
     * onto the submission queue, and decrement it whenever an entry is removed from the completion queue.
     * Once num_unfinished_requests hits the limit that we set, then we cannot issue any more requests
     * until we receive more completions, therefore the number of new completions that we receive is exactly
     * equal to the number of new requests that we will place, thus ensuring that the number of in-flight
     * requests can never exceed g_queuedepth.
    */

    off_t next_file_offset_to_request = 0;
    struct io_uring uring;
    if (setup_io_uring_context(g_queuedepth, &uring)) {
        puts("FAILED TO SET UP CONTEXT");
        exit(1);
    }
    struct io_uring* ring = &uring;

    while (num_blocks_left_to_receive) { // This loop consists of 3 steps:
                                         // Step 1. Submit read requests.
                                         // Step 2. Retrieve completions.
                                         // Step 3. Run consumer.
        /*
         * Step 1: Make zero or more read requests until g_queuedepth limit is reached, or until the producer head reaches
         *         a cell that is not in the ALREADY_CONSUMED state (see my article/blog post for explanation).
         */
        unsigned long num_unfinished_requests_prev = num_unfinished_requests; // The only purpose of this variable is to keep track of whether
                                                                              // or not we added any new requests to the submission queue.
        while (num_blocks_left_to_request) {
            if (num_unfinished_requests >= g_queuedepth)
                break;
            if (g_ringbuf[producer_head].state != ALREADY_CONSUMED)
                break;

            /* expected_return_size is the number of bytes that we expect this read request to return.
             * expected_return_size will be the block size until the last block of the file.
             * when we get to the last block of the file, expected_return_size will be the size of the last
             * block of the file, which is calculated in calculate_numblocks_and_size_of_last_block
             */
            size_t expected_return_size = g_blocksize;
            if (num_blocks_left_to_request == 1) // if we're at the last block of the file
                expected_return_size = g_size_of_last_block;

            add_read_request_to_submission_queue(ring, expected_return_size, next_file_offset_to_request);
            next_file_offset_to_request += expected_return_size;
            ++num_unfinished_requests;
            --num_blocks_left_to_request;
            // We have added a request for the read syscall to write into the cell in the ringbuffer.
            // The add_read_request_to_submission_queue has already marked it as REQUESTED_BUT_NOT_YET_COMPLETED,
            // so now we just need to move the producer head to point to the next cell in the ringbuffer.
            increment_buffer_index(&producer_head);
        }

        // If we added any read requests to the submission queue, then submit them.
        if (num_unfinished_requests_prev != num_unfinished_requests) {
            rc = io_uring_submit(ring);
            if (rc < 0) {
                fprintf(stderr, "io_uring_submit: %s\n", strerror(-rc));
                exit(1);
            }
        }

        /*
         * Step 2: Remove all the items from the completion queue.
         */
        bool first_iteration = 1;                   // On the first iteration of the loop, we wait for at least one cqe to be available,
                                                    // then remove one cqe.
                                                    // On each subsequent iteration, we try to remove one cqe without waiting.
                                                    // The loop terminates only when there are no more items left in the completion queue,
        while (num_blocks_left_to_receive) {        // Or when we've read in all of the blocks of the file.
            struct io_uring_cqe *cqe;
            if (first_iteration) {                  // On the first iteration we always wait until at least one cqe is available
                rc = io_uring_wait_cqe(ring, &cqe); // This should always succeed and give us one cqe.
                first_iteration = 0;
            } else {
                rc = io_uring_peek_cqe(ring, &cqe); // This will fail once there are no more items left in the completion queue.
                if (rc == -EAGAIN) { // A return code of -EAGAIN means that there are no more items left in the completion queue.
                    break;
                }
            }
            if (rc < 0) {
                fprintf(stderr, "io_uring_peek_cqe: %s\n",
                            strerror(-rc));
                exit(1);
            }
            assert(cqe);

            // At this point we have a cqe, so let's see what our syscall returned.
            struct my_custom_data *data = (struct my_custom_data*) io_uring_cqe_get_data(cqe);

            // Check if the read syscall returned an error
            if (cqe->res < 0) {
                // we're not handling EAGAIN because it should never happen.
                fprintf(stderr, "cqe failed: %s\n",
                        strerror(-cqe->res));
                exit(1);
            }
            // Check if the read syscall returned an unexpected number of bytes.
            if ((size_t)cqe->res != data->nbytes_expected) {
                // We're not handling short reads because they should never happen on a disk-based filesystem.
                if ((size_t)cqe->res < data->nbytes_expected) {
                    puts("panic: short read");
                } else {
                    puts("panic: read returned more data than expected (wtf). Is the file changing while you're reading it??");
                }
                exit(1);
            }

            assert(data->offset_of_block_in_file % g_blocksize == 0);

            // If we reach this point, then it means that there were no errors: the read syscall returned exactly what we expected.
            // Since the read syscall returned, this means it has finished filling in the cell with ONE block of data from the file.
            // This means that the cell can now be read by the consumer, so we need to update the cell state.
            g_ringbuf[data->ringbuf_index].state = AVAILABLE_FOR_CONSUMPTION;
            --num_blocks_left_to_receive; // We received ONE block of data from the file
            io_uring_cqe_seen(ring, cqe); // mark the cqe as consumed, so that its slot can get reused
            --num_unfinished_requests;
            if (g_process_in_inner_loop)
                resume_consumer();
        }
        /* Step 3: Run consumer. This might be thought of as handing over control to the consumer "thread". See my article/blog post. */
        resume_consumer();
    }
    resume_consumer();

    close(g_filedescriptor);
    io_uring_queue_exit(ring);

    // Finalize the hash. BLAKE3_OUT_LEN is the default output length, 32 bytes.
    uint8_t output[BLAKE3_OUT_LEN];
    uint8_t output2[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&g_hasher, output, BLAKE3_OUT_LEN);
    blake3_hasher_finalize(&g_hasher2, output2, BLAKE3_OUT_LEN);

    // Print the hash as hexadecimal.
    printf("BLAKE3 hash: ");
    for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i) {
        printf("%02x", output[i]);
    }
    printf("\n");
}

static void process_cmd_line_args(int argc, char* argv[])
{
    if (argc != 9) {
        printf("%s: infile g_blocksize g_queuedepth g_use_o_direct g_process_in_inner_loop g_numbufs g_use_iosqe_io_drain g_use_iosqe_io_link\n", argv[0]);
        exit(1);
    }
    g_blocksize = atoi(argv[2]) * 1024; // The command line argument is in KiBs
    g_queuedepth = atoi(argv[3]);
    if (g_queuedepth > 32768)
        puts("Warning: io_uring queue depth limit on Kernel 6.1.0 is 32768...");
    g_use_o_direct = atoi(argv[4]);
    g_process_in_inner_loop = atoi(argv[5]);
    g_numbufs = atoi(argv[6]);
    g_use_iosqe_io_drain = atoi(argv[7]);
    g_use_iosqe_io_link = atoi(argv[8]);
}

static void open_and_get_size_of_file(const char* filename)
{
    if (g_use_o_direct){
        g_filedescriptor = open(filename, O_RDONLY | O_DIRECT);
        puts("opening file with O_DIRECT");
    } else {
        g_filedescriptor = open(filename, O_RDONLY);
        puts("opening file without O_DIRECT");
    }
    if (g_filedescriptor < 0) {
        perror("open file");
        exit(1);
    }
    if (get_file_size(g_filedescriptor, &g_filesize)){
        puts("Failed getting file size");
        exit(1);
    }
}

static void calculate_numblocks_and_size_of_last_block()
{
    // this is the mathematically correct way to do ceiling division
    // (assumes no integer overflow)
    g_num_blocks_in_file = (g_filesize + g_blocksize - 1) / g_blocksize;
    // calculate the size of the last block of the file
    if (g_filesize % g_blocksize == 0)
        g_size_of_last_block = g_blocksize;
    else
        g_size_of_last_block = g_filesize % g_blocksize;
}

static void allocate_ringbuf()
{
    // We only make 2 memory allocations in this entire program and they both happen in this function.
    assert(g_blocksize % ALIGNMENT == 0);
    // First, we allocate the entire underlying ring buffer (which is a contiguous block of memory
    // containing all the actual buffers) in a single allocation.
    // This is one big piece of memory which is used to hold the actual data from the file.
    // The buf_addr field in the my_custom_data struct points to a buffer within this ring buffer.
    unsigned char* ptr_to_underlying_ring_buffer;
    // We need aligned memory, because O_DIRECT requires it.
    if (posix_memalign((void**)&ptr_to_underlying_ring_buffer, ALIGNMENT, g_blocksize * g_numbufs)) {
        puts("posix_memalign failed!");
        exit(1);
    }
    // Second, we allocate an array containing all of the my_custom_data structs.
    // This is not an array of pointers, but an array holding all of the actual structs.
    // All the items are the same size, which makes this easy
    g_ringbuf = (struct my_custom_data*) malloc(g_numbufs * sizeof(struct my_custom_data));
    if (!g_ringbuf) {
        puts("malloc failed!");
        exit(1);
    }

    // We partially initialize all of the my_custom_data structs here.
    // (The other fields are initialized in the add_read_request_to_submission_queue function)
    off_t cur_offset = 0;
    for (int i = 0; i < g_numbufs; ++i) {
        g_ringbuf[i].buf_addr = ptr_to_underlying_ring_buffer + cur_offset; // This will never change during the runtime of this program.
        cur_offset += g_blocksize;
        // g_ringbuf[i].bufsize = g_blocksize; // all the buffers are the same size.
        g_ringbuf[i].state = ALREADY_CONSUMED; // We need to set all cells to ALREADY_CONSUMED at the start. See my article/blog post for explanation.
        g_ringbuf[i].ringbuf_index = i; // This will never change during the runtime of this program.
    }
}

int main(int argc, char *argv[])
{
    assert(sizeof(size_t) >= 8); // we want 64 bit size_t for files larger than 4GB...
    process_cmd_line_args(argc, argv); // we need to first parse the command line arguments
    open_and_get_size_of_file(argv[1]);
    calculate_numblocks_and_size_of_last_block();
    allocate_ringbuf();

    // Initialize the hasher.
    blake3_hasher_init(&g_hasher);
    blake3_hasher_init(&g_hasher2);

    // Run the main loop
    producer_thread();

    puts("Number of requests in-flight:");
    for (int i = 0; i < consume_idx; ++i){
        printf("%d, ", in_flight_count[i]);
    }
    printf("\n");
    printf("Num records: %d\n", consume_idx);
}
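
For completeness, here is roughly how the single-threaded version can be built and run (a sketch, not the canonical build command - the .c file name is illustrative, the BLAKE3 source and assembly file names are the ones shipped in the official BLAKE3 C directory, and liburing must be installed):

gcc -O3 -o liburing_b3sum_singlethread liburing_b3sum_singlethread.c blake3.c blake3_dispatch.c blake3_portable.c blake3_sse2_x86-64_unix.S blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S blake3_avx512_x86-64_unix.S -luring
./liburing_b3sum_singlethread 1GB.txt 512 2 1 0 2 0 0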

Other Notes

What is the theoretical speed limit for **any possible hashing program** on my system? To work that out, we only need to know two things: first, how fast a file can be read from the filesystem, and second, how fast the hash can be computed.

On the system I used to test the b3sum programs, the minimum time to read a 1GiB file (1073741824 bytes) was 0.302 seconds according to **fio**. So the fastest that a file can be read from the filesystem is around 3.311GiB/s or 3.555GB/s.
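
To spell out the arithmetic (1GiB = 1073741824 bytes, 1GB = 10^9 bytes):

1073741824 bytes / 0.302 s ≈ 3555436503 bytes/s ≈ 3.311 GiB/s ≈ 3.555 GB/s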

To measure how fast the BLAKE3 hash can be computed, just run the example program twice without dropping the page cache in between - the much shorter second run reflects the pure hashing speed, since the file is already cached in memory. On my system, an already-cached 1GiB file can be hashed in 0.252 seconds, which is roughly 3.968 GiB/s or 4.26 GB/s.
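
If you want to isolate the hashing speed from the filesystem entirely, a minimal sketch like the following (my own illustration, assuming the same blake3.h C API that my program uses) times BLAKE3 over 1GiB of in-memory data:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "blake3.h"

int main(void)
{
    // Hash 1 GiB of in-memory data and time it, as a rough stand-in for
    // the "file is already in the page cache" measurement described above.
    const size_t total = 1024ULL * 1024 * 1024; // 1 GiB
    const size_t chunk = 512 * 1024;            // feed the hasher 512 KiB at a time
    unsigned char* buf = malloc(chunk);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xAB, chunk); // arbitrary contents; BLAKE3 speed doesn't depend on the data

    blake3_hasher hasher;
    blake3_hasher_init(&hasher);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t done = 0; done < total; done += chunk)
        blake3_hasher_update(&hasher, buf, chunk);
    uint8_t out[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("hashed 1 GiB in %.3f s (%.3f GiB/s)\n", secs, 1.0 / secs);
    free(buf);
    return 0;
}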

So we know that on my current system, **the hashing speed is greater than the file read speed**. What does that mean? Consider a simple producer-consumer implementation where a producer thread reads blocks from the filesystem one at a time and hands them to a consumer thread, which then hashes each incoming block. Since the consumer thread can process blocks faster than the producer thread can produce them, the program is bottlenecked on the producer thread - **the program cannot finish before the producer has finished reading all the blocks**.

Therefore, the theoretical speed limit for any b3sum program on this system is 3.311GiB/s or 3.555GB/s - the speed at which the file can be read from the filesystem. Put simply, this means that on my current system, **a 1GiB file cannot be hashed in less than 0.302 seconds.**

Here are the results of running fio with a 128k block size on my current system (i.e., the system on which the benchmarks at the top of this article were run):

$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=100% --io_size=1g --blocksize=128k --ioengine=io_uring --fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting  
TEST: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=io_uring, iodepth=4  
fio-3.35  
Starting 1 process  

TEST: (groupid=0, jobs=1): err= 0: pid=3797: Mon Jul 17 19:31:29 2023  
read: IOPS=27.0k, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)  
slat (usec): min=2, max=118, avg= 8.32, stdev= 2.03  
clat (usec): min=69, max=604, avg=138.31, stdev=33.95  
lat (usec): min=78, max=722, avg=146.63, stdev=34.28  
clat percentiles (usec):  
| 1.00th=[ 122], 5.00th=[ 128], 10.00th=[ 130], 20.00th=[ 130],  
| 30.00th=[ 131], 40.00th=[ 131], 50.00th=[ 131], 60.00th=[ 131],  
| 70.00th=[ 133], 80.00th=[ 133], 90.00th=[ 137], 95.00th=[ 204],  
| 99.00th=[ 334], 99.50th=[ 379], 99.90th=[ 449], 99.95th=[ 482],  
| 99.99th=[ 603]  
lat (usec) : 100=0.02%, 250=98.27%, 500=1.68%, 750=0.02%  
cpu : usr=4.64%, sys=33.77%, ctx=8150, majf=0, minf=141  
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%  
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%  
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%  
issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0  
latency : target=0, window=0, percentile=100.00%, depth=4  

Run status group 0 (all jobs):  
READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec  

The fio results depend on the block size and the I/O depth. I found that a 512KiB block size produced the best results on my current system. 302 milliseconds is the lowest value I have seen from fio.

$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4444: Tue Jul 25 17:39:46 2023
  read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
    slat (usec): min=6, max=259, avg=17.83, stdev= 7.24
    clat (usec): min=162, max=648, avg=276.04, stdev=49.54
     lat (usec): min=175, max=770, avg=293.87, stdev=50.74
    clat percentiles (usec):
     |  1.00th=[  251],  5.00th=[  258], 10.00th=[  260], 20.00th=[  260],
     | 30.00th=[  262], 40.00th=[  262], 50.00th=[  262], 60.00th=[  262],
     | 70.00th=[  262], 80.00th=[  265], 90.00th=[  330], 95.00th=[  367],
     | 99.00th=[  529], 99.50th=[  594], 99.90th=[  619], 99.95th=[  644],
     | 99.99th=[  652]
  lat (usec)   : 250=0.68%, 500=97.71%, 750=1.61%
  cpu          : usr=3.64%, sys=12.91%, ctx=2047, majf=0, minf=265
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec

Disk stats (read/write):
  nvme0n1: ios=1977/0, merge=0/0, ticks=481/0, in_queue=481, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4463: Tue Jul 25 17:39:47 2023
  read: IOPS=6781, BW=3391MiB/s (3555MB/s)(1024MiB/302msec)
    slat (usec): min=5, max=258, avg=13.00, stdev= 7.63
    clat (usec): min=214, max=657, avg=281.06, stdev=49.22
     lat (usec): min=234, max=767, avg=294.06, stdev=50.43
    clat percentiles (usec):
     |  1.00th=[  258],  5.00th=[  262], 10.00th=[  265], 20.00th=[  265],
     | 30.00th=[  265], 40.00th=[  265], 50.00th=[  269], 60.00th=[  269],
     | 70.00th=[  269], 80.00th=[  269], 90.00th=[  334], 95.00th=[  375],
     | 99.00th=[  537], 99.50th=[  603], 99.90th=[  619], 99.95th=[  652],
     | 99.99th=[  660]
  lat (usec)   : 250=0.49%, 500=97.71%, 750=1.81%
  cpu          : usr=0.00%, sys=12.29%, ctx=2048, majf=0, minf=268
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3391MiB/s (3555MB/s), 3391MiB/s-3391MiB/s (3555MB/s-3555MB/s), io=1024MiB (1074MB), run=302-302msec

Disk stats (read/write):
  nvme0n1: ios=1979/0, merge=0/0, ticks=496/0, in_queue=496, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4482: Tue Jul 25 17:39:47 2023
  read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
    slat (usec): min=6, max=262, avg=18.51, stdev= 7.30
    clat (usec): min=223, max=646, avg=275.47, stdev=48.93
     lat (usec): min=243, max=773, avg=293.98, stdev=50.54
    clat percentiles (usec):
     |  1.00th=[  251],  5.00th=[  258], 10.00th=[  260], 20.00th=[  260],
     | 30.00th=[  262], 40.00th=[  262], 50.00th=[  262], 60.00th=[  262],
     | 70.00th=[  262], 80.00th=[  262], 90.00th=[  326], 95.00th=[  367],
     | 99.00th=[  529], 99.50th=[  586], 99.90th=[  611], 99.95th=[  644],
     | 99.99th=[  644]
  lat (usec)   : 250=0.83%, 500=97.56%, 750=1.61%
  cpu          : usr=0.00%, sys=16.89%, ctx=2048, majf=0, minf=266
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec

Disk stats (read/write):
  nvme0n1: ios=1977/0, merge=0/0, ticks=480/0, in_queue=480, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4501: Tue Jul 25 17:39:48 2023
  read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
    slat (usec): min=6, max=260, avg=18.08, stdev= 7.25
    clat (usec): min=230, max=649, avg=275.81, stdev=49.25
     lat (usec): min=251, max=769, avg=293.89, stdev=50.52
    clat percentiles (usec):
     |  1.00th=[  251],  5.00th=[  258], 10.00th=[  260], 20.00th=[  260],
     | 30.00th=[  262], 40.00th=[  262], 50.00th=[  262], 60.00th=[  262],
     | 70.00th=[  262], 80.00th=[  265], 90.00th=[  330], 95.00th=[  367],
     | 99.00th=[  529], 99.50th=[  594], 99.90th=[  619], 99.95th=[  644],
     | 99.99th=[  652]
  lat (usec)   : 250=0.39%, 500=98.00%, 750=1.61%
  cpu          : usr=1.32%, sys=15.23%, ctx=2049, majf=0, minf=266
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec

Disk stats (read/write):
  nvme0n1: ios=1978/0, merge=0/0, ticks=481/0, in_queue=482, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4520: Tue Jul 25 17:39:49 2023
  read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
    slat (usec): min=10, max=297, avg=31.39, stdev= 7.77
    clat (usec): min=177, max=639, avg=262.32, stdev=49.90
     lat (usec): min=229, max=800, avg=293.72, stdev=51.18
    clat percentiles (usec):
     |  1.00th=[  237],  5.00th=[  243], 10.00th=[  245], 20.00th=[  247],
     | 30.00th=[  247], 40.00th=[  247], 50.00th=[  247], 60.00th=[  247],
     | 70.00th=[  247], 80.00th=[  249], 90.00th=[  314], 95.00th=[  355],
     | 99.00th=[  523], 99.50th=[  586], 99.90th=[  603], 99.95th=[  627],
     | 99.99th=[  644]
  lat (usec)   : 250=82.28%, 500=16.11%, 750=1.61%
  cpu          : usr=0.00%, sys=25.83%, ctx=2047, majf=0, minf=266
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec

Disk stats (read/write):
  nvme0n1: ios=1977/0, merge=0/0, ticks=477/0, in_queue=477, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4539: Tue Jul 25 17:39:50 2023
  read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
    slat (usec): min=6, max=263, avg=18.32, stdev= 7.38
    clat (usec): min=163, max=689, avg=275.59, stdev=50.07
     lat (usec): min=177, max=952, avg=293.91, stdev=51.72
    clat percentiles (usec):
     |  1.00th=[  251],  5.00th=[  258], 10.00th=[  260], 20.00th=[  260],
     | 30.00th=[  260], 40.00th=[  262], 50.00th=[  262], 60.00th=[  262],
     | 70.00th=[  262], 80.00th=[  265], 90.00th=[  326], 95.00th=[  367],
     | 99.00th=[  545], 99.50th=[  594], 99.90th=[  644], 99.95th=[  644],
     | 99.99th=[  693]
  lat (usec)   : 250=0.63%, 500=97.75%, 750=1.61%
  cpu          : usr=0.00%, sys=16.89%, ctx=2047, majf=0, minf=267
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec

Disk stats (read/write):
  nvme0n1: ios=1978/0, merge=0/0, ticks=482/0, in_queue=482, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4558: Tue Jul 25 17:39:51 2023
  read: IOPS=6781, BW=3391MiB/s (3555MB/s)(1024MiB/302msec)
    slat (usec): min=6, max=260, avg=17.98, stdev= 7.13
    clat (usec): min=180, max=745, avg=275.94, stdev=50.17
     lat (usec): min=194, max=1006, avg=293.92, stdev=52.07
    clat percentiles (usec):
     |  1.00th=[  251],  5.00th=[  258], 10.00th=[  260], 20.00th=[  260],
     | 30.00th=[  262], 40.00th=[  262], 50.00th=[  262], 60.00th=[  262],
     | 70.00th=[  262], 80.00th=[  265], 90.00th=[  326], 95.00th=[  367],
     | 99.00th=[  545], 99.50th=[  594], 99.90th=[  644], 99.95th=[  660],
     | 99.99th=[  742]
  lat (usec)   : 250=0.63%, 500=97.61%, 750=1.76%
  cpu          : usr=2.66%, sys=13.62%, ctx=2048, majf=0, minf=268
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3391MiB/s (3555MB/s), 3391MiB/s-3391MiB/s (3555MB/s-3555MB/s), io=1024MiB (1074MB), run=302-302msec

Disk stats (read/write):
  nvme0n1: ios=1978/0, merge=0/0, ticks=480/0, in_queue=481, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process

TEST: (groupid=0, jobs=1): err= 0: pid=4577: Tue Jul 25 17:39:52 2023
  read: IOPS=6781, BW=3391MiB/s (3555MB/s)(1024MiB/302msec)
    slat (usec): min=6, max=262, avg=19.10, stdev= 7.35
    clat (usec): min=218, max=654, avg=274.66, stdev=49.05
     lat (usec): min=251, max=807, avg=293.75, stdev=50.54
    clat percentiles (usec):
     |  1.00th=[  249],  5.00th=[  258], 10.00th=[  260], 20.00th=[  260],
     | 30.00th=[  260], 40.00th=[  260], 50.00th=[  260], 60.00th=[  260],
     | 70.00th=[  262], 80.00th=[  262], 90.00th=[  326], 95.00th=[  367],
     | 99.00th=[  537], 99.50th=[  586], 99.90th=[  611], 99.95th=[  644],
     | 99.99th=[  652]
  lat (usec)   : 250=1.03%, 500=97.41%, 750=1.56%
  cpu          : usr=0.00%, sys=17.61%, ctx=2048, majf=0, minf=268
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=3391MiB/s (3555MB/s), 3391MiB/s-3391MiB/s (3555MB/s-3555MB/s), io=1024MiB (1074MB), run=302-302msec

Disk stats (read/write):
  nvme0n1: ios=1979/0, merge=0/0, ticks=480/0, in_queue=481, util=61.04%

Below are the results with the I/O size set to 10GiB, which somehow made things slower.

$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3382MiB/s][r=3382 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3014: Tue Jul 25 00:02:27 2023
 read: IOPS=3370, BW=3371MiB/s (3534MB/s)(10.0GiB/3038msec)
 slat (usec): min=6, max=1227, avg=19.20, stdev=15.52
 clat (usec): min=578, max=22014, avg=9451.40, stdev=1298.66
 lat (usec): min=601, max=22032, avg=9470.60, stdev=1296.89
 clat percentiles (usec):
 |  1.00th=[ 5997],  5.00th=[ 9110], 10.00th=[ 9110], 20.00th=[ 9110],
 | 30.00th=[ 9110], 40.00th=[ 9241], 50.00th=[ 9241], 60.00th=[ 9503],
 | 70.00th=[ 9634], 80.00th=[ 9634], 90.00th=[ 9634], 95.00th=[ 9896],
 | 99.00th=[16319], 99.50th=[19268], 99.90th=[21365], 99.95th=[21627],
 | 99.99th=[21890]
 bw (  MiB/s): min= 3360, max= 3406, per=100.00%, avg=3378.00, stdev=16.15, samples=6
 iops        : min= 3360, max= 3406, avg=3378.00, stdev=16.15, samples=6
 lat (usec)   : 750=0.04%, 1000=0.04%
 lat (msec)   : 2=0.16%, 4=0.39%, 10=95.45%, 20=3.54%, 50=0.39%
 cpu          : usr=0.59%, sys=8.23%, ctx=10225, majf=0, minf=30
 IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
 issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
 READ: bw=3371MiB/s (3534MB/s), 3371MiB/s-3371MiB/s (3534MB/s-3534MB/s), io=10.0GiB (10.7GB), run=3038-3038msec

Disk stats (read/write):
 nvme0n1: ios=39088/0, merge=30/0, ticks=361238/0, in_queue=361238, util=96.76%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3386MiB/s][r=3386 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3033: Tue Jul 25 00:02:39 2023
 read: IOPS=3375, BW=3375MiB/s (3539MB/s)(10.0GiB/3034msec)
 slat (usec): min=6, max=582, avg=30.97, stdev=13.43
 clat (usec): min=514, max=2248, avg=1152.31, stdev=134.69
 lat (usec): min=570, max=2265, avg=1183.28, stdev=134.35
 clat percentiles (usec):
 |  1.00th=[ 1074],  5.00th=[ 1074], 10.00th=[ 1074], 20.00th=[ 1090],
 | 30.00th=[ 1090], 40.00th=[ 1090], 50.00th=[ 1106], 60.00th=[ 1106],
 | 70.00th=[ 1156], 80.00th=[ 1188], 90.00th=[ 1221], 95.00th=[ 1467],
 | 99.00th=[ 1778], 99.50th=[ 1909], 99.90th=[ 2008], 99.95th=[ 2057],
 | 99.99th=[ 2245]
 bw (  MiB/s): min= 3364, max= 3412, per=100.00%, avg=3381.33, stdev=17.00, samples=6
 iops        : min= 3364, max= 3412, avg=3381.33, stdev=17.00, samples=6
 lat (usec)   : 750=0.05%, 1000=0.08%
 lat (msec)   : 2=99.74%, 4=0.14%
 cpu          : usr=0.76%, sys=12.30%, ctx=10240, majf=0, minf=523
 IO depths    : 1=0.1%, 2=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
 READ: bw=3375MiB/s (3539MB/s), 3375MiB/s-3375MiB/s (3539MB/s-3539MB/s), io=10.0GiB (10.7GB), run=3034-3034msec

Disk stats (read/write):
 nvme0n1: ios=39137/0, merge=0/0, ticks=40995/0, in_queue=40995, util=96.76%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=128k --ioengine=io_uring --fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3383MiB/s][r=27.1k IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3071: Tue Jul 25 00:02:57 2023
 read: IOPS=27.0k, BW=3373MiB/s (3537MB/s)(10.0GiB/3036msec)
 slat (usec): min=3, max=120, avg=10.17, stdev= 1.31
 clat (usec): min=81, max=2362, avg=137.30, stdev=37.04
 lat (usec): min=100, max=2372, avg=147.47, stdev=37.04
 clat percentiles (usec):
 |  1.00th=[  120],  5.00th=[  126], 10.00th=[  128], 20.00th=[  129],
 | 30.00th=[  129], 40.00th=[  129], 50.00th=[  129], 60.00th=[  130],
 | 70.00th=[  130], 80.00th=[  130], 90.00th=[  137], 95.00th=[  204],
 | 99.00th=[  306], 99.50th=[  359], 99.90th=[  445], 99.95th=[  478],
 | 99.99th=[  685]
 bw (  MiB/s): min= 3362, max= 3407, per=100.00%, avg=3378.79, stdev=17.59, samples=6
 iops        : min=26896, max=27258, avg=27030.33, stdev=140.69, samples=6
 lat (usec)   : 100=0.01%, 250=98.11%, 500=1.86%, 750=0.02%
 lat (msec)   : 4=0.01%
 cpu          : usr=5.77%, sys=36.54%, ctx=81525, majf=0, minf=138
 IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=81920,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
 READ: bw=3373MiB/s (3537MB/s), 3373MiB/s-3373MiB/s (3537MB/s-3537MB/s), io=10.0GiB (10.7GB), run=3036-3036msec

Disk stats (read/write):
 nvme0n1: ios=78208/2, merge=0/1, ticks=10769/0, in_queue=10770, util=97.03%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=256k --ioengine=io_uring --fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3387MiB/s][r=13.5k IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3090: Tue Jul 25 00:03:06 2023
 read: IOPS=13.5k, BW=3377MiB/s (3541MB/s)(10.0GiB/3032msec)
 slat (usec): min=4, max=181, avg=14.10, stdev= 2.00
 clat (usec): min=186, max=908, avg=281.24, stdev=49.54
 lat (usec): min=221, max=923, avg=295.33, stdev=49.53
 clat percentiles (usec):
 |  1.00th=[  255],  5.00th=[  262], 10.00th=[  265], 20.00th=[  265],
 | 30.00th=[  265], 40.00th=[  265], 50.00th=[  265], 60.00th=[  265],
 | 70.00th=[  265], 80.00th=[  269], 90.00th=[  343], 95.00th=[  375],
 | 99.00th=[  506], 99.50th=[  570], 99.90th=[  644], 99.95th=[  717],
 | 99.99th=[  906]
 bw (  MiB/s): min= 3365, max= 3410, per=100.00%, avg=3383.04, stdev=15.92, samples=6
 iops        : min=13460, max=13642, avg=13532.00, stdev=63.66, samples=6
 lat (usec)   : 250=0.35%, 500=98.51%, 750=1.09%, 1000=0.04%
 cpu          : usr=1.29%, sys=25.87%, ctx=40940, majf=0, minf=267
 IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=40960,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
 READ: bw=3377MiB/s (3541MB/s), 3377MiB/s-3377MiB/s (3541MB/s-3541MB/s), io=10.0GiB (10.7GB), run=3032-3032msec

Disk stats (read/write):
 nvme0n1: ios=39157/2, merge=0/1, ticks=11087/0, in_queue=11088, util=96.90%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3384MiB/s][r=6768 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3109: Tue Jul 25 00:03:14 2023
 read: IOPS=6743, BW=3372MiB/s (3536MB/s)(10.0GiB/3037msec)
 slat (usec): min=4, max=401, avg=11.26, stdev= 2.87
 clat (usec): min=199, max=1699, avg=580.70, stdev=84.16
 lat (usec): min=250, max=2043, avg=591.96, stdev=84.45
 clat percentiles (usec):
 |  1.00th=[  537],  5.00th=[  545], 10.00th=[  545], 20.00th=[  545],
 | 30.00th=[  545], 40.00th=[  545], 50.00th=[  545], 60.00th=[  545],
 | 70.00th=[  545], 80.00th=[  611], 90.00th=[  660], 95.00th=[  734],
 | 99.00th=[  963], 99.50th=[  988], 99.90th=[ 1156], 99.95th=[ 1287],
 | 99.99th=[ 1680]
 bw (  MiB/s): min= 3362, max= 3404, per=100.00%, avg=3377.33, stdev=15.27, samples=6
 iops        : min= 6724, max= 6808, avg=6754.67, stdev=30.53, samples=6
 lat (usec)   : 250=0.01%, 500=0.03%, 750=95.28%, 1000=4.30%
 lat (msec)   : 2=0.38%
 cpu          : usr=1.02%, sys=11.20%, ctx=20478, majf=0, minf=14
 IO depths    : 1=0.1%, 2=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=20480,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
 READ: bw=3372MiB/s (3536MB/s), 3372MiB/s-3372MiB/s (3536MB/s-3536MB/s), io=10.0GiB (10.7GB), run=3037-3037msec

Disk stats (read/write):
 nvme0n1: ios=39091/2, merge=0/1, ticks=21153/3, in_queue=21158, util=96.76%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3388MiB/s][r=6776 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3128: Tue Jul 25 00:03:24 2023
 read: IOPS=6756, BW=3378MiB/s (3543MB/s)(10.0GiB/3031msec)
 slat (usec): min=9, max=311, avg=29.94, stdev= 3.30
 clat (usec): min=198, max=892, avg=264.98, stdev=50.38
 lat (usec): min=234, max=922, avg=294.93, stdev=50.66
 clat percentiles (usec):
 |  1.00th=[  239],  5.00th=[  245], 10.00th=[  247], 20.00th=[  247],
 | 30.00th=[  249], 40.00th=[  249], 50.00th=[  249], 60.00th=[  251],
 | 70.00th=[  251], 80.00th=[  251], 90.00th=[  322], 95.00th=[  359],
 | 99.00th=[  510], 99.50th=[  570], 99.90th=[  652], 99.95th=[  742],
 | 99.99th=[  889]
 bw (  MiB/s): min= 3367, max= 3413, per=100.00%, avg=3384.83, stdev=15.96, samples=6
 iops        : min= 6734, max= 6826, avg=6769.67, stdev=31.91, samples=6
 lat (usec)   : 250=61.26%, 500=37.62%, 750=1.06%, 1000=0.05%
 cpu          : usr=0.76%, sys=24.09%, ctx=20480, majf=0, minf=267
 IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=20480,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
 READ: bw=3378MiB/s (3543MB/s), 3378MiB/s-3378MiB/s (3543MB/s-3543MB/s), io=10.0GiB (10.7GB), run=3031-3031msec

Disk stats (read/write):
 nvme0n1: ios=39179/0, merge=0/0, ticks=9334/0, in_queue=9334, util=96.90%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=0 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1795MiB/s][r=3590 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3147: Tue Jul 25 00:03:34 2023
 read: IOPS=3685, BW=1843MiB/s (1932MB/s)(10.0GiB/5557msec)
 slat (usec): min=28, max=499, avg=74.93, stdev=74.14
 clat (usec): min=30, max=4387, avg=434.37, stdev=184.91
 lat (usec): min=229, max=4499, avg=509.29, stdev=147.50
 clat percentiles (usec):
 |  1.00th=[  137],  5.00th=[  200], 10.00th=[  204], 20.00th=[  210],
 | 30.00th=[  306], 40.00th=[  314], 50.00th=[  461], 60.00th=[  545],
 | 70.00th=[  570], 80.00th=[  635], 90.00th=[  660], 95.00th=[  660],
 | 99.00th=[  725], 99.50th=[  824], 99.90th=[  848], 99.95th=[  848],
 | 99.99th=[  906]
 bw (  MiB/s): min= 1643, max= 1962, per=99.97%, avg=1842.19, stdev=143.95, samples=11
 iops        : min= 3286, max= 3924, avg=3684.36, stdev=287.89, samples=11
 lat (usec)   : 50=0.17%, 100=0.01%, 250=23.12%, 500=27.00%, 750=48.97%
 lat (usec)   : 1000=0.72%
 lat (msec)   : 10=0.01%
 cpu          : usr=0.29%, sys=80.89%, ctx=26899, majf=0, minf=266
 IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=20480,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
 READ: bw=1843MiB/s (1932MB/s), 1843MiB/s-1843MiB/s (1932MB/s-1932MB/s), io=10.0GiB (10.7GB), run=5557-5557msec

Disk stats (read/write):
 nvme0n1: ios=39591/3, merge=0/1, ticks=7997/1, in_queue=7998, util=92.45%

I tried other block sizes and queue depths. I tried block sizes of 64k, 128k, 256k, 512k, and 1024k, and queue depths of 2, 4, 8, 16, 32, and 64, but it was never faster than 302 milliseconds. For reading a 1GiB file, it seems that a 512k block size with a queue depth of 2 is the fastest. I also tried setting --nonvectored to 0 and then to 1, but there was no difference - the runtimes were the same. This suggests that read and readv are identical in terms of performance...

nonvectored=int : [io_uring] [io_uring_cmd]  
 
With this option, fio will use non-vectored read/write commands, where address must contain the address directly. Default is -1.  

I also tried with and without --fixedbufs, and it made no difference.
I also tried --registerfiles and --sqthread_poll, and they did not make any difference either. Trying --hipri created an unkillable, never-ending zombie process that used 100% CPU - I am fairly sure this was because of IORING_SETUP_IOPOLL, because the same thing happened when I enabled IORING_SETUP_IOPOLL in my own program.

Interestingly, my liburing_b3sum_singlethread program was able to process a 10GiB file in just 2.896 seconds, even though `cat` to /dev/null took longer than that.

$ echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 10GB.txt 512 8 1 0 20 0 0
opening file with O_DIRECT
BLAKE3 hash: 65f593d91b9301369724bb8e0f4da385baf46c73a5fb0ac996a5d5d714028ff6

real	0m2.896s
user	0m2.187s
sys	0m0.150s

$ echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 10GB.txt > /dev/null

real	0m2.900s
user	0m0.013s
sys	0m1.844s

Note that these numbers differ from the ones shown at the top of this article. I think my system fluctuates in performance for reasons unknown to me, sometimes faster and sometimes slower, but when making comparisons I always ran the programs being compared within the same time period, to make sure this fluctuation could not affect the rankings.

The Effect of LVM Encryption

I actually did most of my development work on my previous system, which used LVM encryption. Later, I reinstalled Debian with unencrypted ext4, and that is my current system. So the hardware and operating system are exactly the same; the only significant difference is that my old system used LVM encryption and my current system does not.

Here is the performance of `cat` and `dd` to /dev/null on my old system (with LVM encryption):

$ echo 1 > /proc/sys/vm/drop_caches
$ time cat 1GB.txt > /dev/null

real	0m0.540s
user	0m0.000s
sys	0m0.182s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=8K
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.661612 s, 1.6 GB/s

real	0m0.664s
user	0m0.004s
sys	0m0.226s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=16K
65536+0 records in
65536+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.581368 s, 1.8 GB/s

real	0m0.585s
user	0m0.014s
sys	0m0.196s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=32K
32768+0 records in
32768+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.574081 s, 1.9 GB/s

real	0m0.577s
user	0m0.004s
sys	0m0.198s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=64K
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.574562 s, 1.9 GB/s

real	0m0.577s
user	0m0.000s
sys	0m0.198s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=128K
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.563443 s, 1.9 GB/s

real	0m0.566s
user	0m0.004s
sys	0m0.195s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=256K
4096+0 records in
4096+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.58437 s, 1.8 GB/s

real	0m0.587s
user	0m0.000s
sys	0m0.190s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=4M
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.590212 s, 1.8 GB/s

real	0m0.593s
user	0m0.000s
sys	0m0.184s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=32M
32+0 records in
32+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.605234 s, 1.8 GB/s

real	0m0.608s
user	0m0.000s
sys	0m0.230s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=64M
16+0 records in
16+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.627107 s, 1.7 GB/s

real	0m0.630s
user	0m0.000s
sys	0m0.232s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=64K iflag=direct
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.09124 s, 513 MB/s

real	0m2.093s
user	0m0.000s
sys	0m0.200s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=512M iflag=direct
2+0 records in
2+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.648582 s, 1.7 GB/s

real	0m0.652s
user	0m0.000s
sys	0m0.061s

$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=1G iflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.539423 s, 2.0 GB/s

real	0m0.544s
user	0m0.004s
sys	0m0.102s

 

Here are the `dd` results on my current machine:

Command Min Median Max
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=8K | ./example_64K 0.363s 0.3725s 0.381s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=8K | ./example_2M 0.362s 0.37s 0.384s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=16K | ./example_64K 0.35s 0.3555s 0.361s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=16K | ./example_2M 0.346s 0.353s 0.357s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=32K | ./example_64K 0.333s 0.341s 0.349s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=32K | ./example_2M 0.337s 0.346s 0.358s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K | ./example_64K 0.332s 0.3385s 0.342s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K | ./example_2M 0.335s 0.337s 0.345s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128K | ./example_64K 0.351s 0.3675s 0.371s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128K | ./example_2M 0.344s 0.357s 0.505s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=256K | ./example_64K 0.532s 0.55s 0.556s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=256K | ./example_2M 0.53s 0.547s 0.723s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M | ./example_64K 0.542s 0.545s 0.548s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M | ./example_2M 0.542s 0.5445s 0.548s

The results above show that the optimal block size for `dd` is 64KiB. To make sure this was not just because the example program reads 64KiB at a time, I made another version of the example program that reads 2M at a time (easy - just change the 65536 in the example source code to 2 * 1024 * 1024). Their performance (see the table above) was identical at every block size, which makes me think there is no difference between them. (A sketch of such an example program is shown below.)
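
For reference, here is a minimal sketch of what such an example program looks like (modeled on the example program in the official BLAKE3 C repository, which reads stdin in 65536-byte chunks; the 2M variant just uses a 2 * 1024 * 1024 buffer instead):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include "blake3.h"

#define BUF_SIZE 65536 /* change to (2 * 1024 * 1024) for the 2M variant */

int main(void)
{
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);

    static unsigned char buf[BUF_SIZE]; // static so the 2M variant doesn't blow the stack
    ssize_t n;
    while ((n = read(STDIN_FILENO, buf, BUF_SIZE)) > 0)
        blake3_hasher_update(&hasher, buf, (size_t)n); // hash each chunk as it arrives
    if (n < 0) { perror("read"); return 1; }

    uint8_t output[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
    for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i)
        printf("%02x", output[i]);
    printf("\n");
    return 0;
}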

The important takeaway here is that the optimal block size for `dd` is 64KiB. **Larger block sizes were actually slower.** Below are the results on my current system with O_DIRECT.

Command Min Median Max
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=8K iflag=direct | ./example_64K 1.334s 1.343s 1.475s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=16K iflag=direct | ./example_64K 0.924s 0.927s 0.984s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=32K iflag=direct | ./example_64K 0.687s 0.721s 0.721s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K iflag=direct | ./example_64K 0.555s 0.555s 0.56s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K iflag=direct oflag=direct | ./example_64K 0.561s 0.567s 0.573s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128K iflag=direct | ./example_64K 0.56s 0.5745s 0.591s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=256K iflag=direct | ./example_64K 0.572s 0.573s 0.574s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=512K iflag=direct | ./example_64K 0.546s 0.551s 0.552s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=1M iflag=direct | ./example_64K 0.553s 0.556s 0.557s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M iflag=direct | ./example_64K 0.576s 0.5865s 0.594s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M iflag=direct | ./example_2M 0.582s 0.583s 0.583s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128M iflag=direct | ./example_64K 0.565s 0.579s 0.585s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=1G iflag=direct | ./example_64K 0.557s 0.5605s 0.564s
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M iflag=direct oflag=direct | ./example_2M 0.745s 0.754s 0.755s

Interestingly, on my old system, 64K O_DIRECT performance was noticeably slower than on my current system. Also, **the optimal block size with O_DIRECT is significantly larger than without O_DIRECT**. This is a recurring theme - I found it to be true in my non-io_uring multithreaded programs as well as in the io_uring versions. Also interesting is that a bs of 64KiB was optimal when using `dd` to pipe data into my program, but a much larger bs of 2M was optimal when using `dd` to pipe data to /dev/null. It is also noteworthy that `cat` was able to pipe data to /dev/null at the maximum possible speed even though it uses a 128KiB buffer size (when I used `dd` with bs=128KiB, it could not reach maximum speed...).

Also, here are the fio results from my old system:

fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3522MiB/s][r=3522 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=39134: Tue Jul 11 10:51:55 2023
  read: IOPS=3505, BW=3506MiB/s (3676MB/s)(10.0GiB/2921msec)

Run status group 0 (all jobs):
   READ: bw=3506MiB/s (3676MB/s), 3506MiB/s-3506MiB/s (3676MB/s-3676MB/s), io=10.0GiB (10.7GB), run=2921-2921msec

And here are the results I got on my current system:

fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3374MiB/s][r=3374 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=1160: Mon Jul 17 17:27:35 2023
  read: IOPS=3363, BW=3364MiB/s (3527MB/s)(10.0GiB/3044msec)

Run status group 0 (all jobs):
   READ: bw=3364MiB/s (3527MB/s), 3364MiB/s-3364MiB/s (3527MB/s-3527MB/s), io=10.0GiB (10.7GB), run=3044-3044msec

Interestingly, on my old system with LVM encryption, the io_uring version of my program was actually nearly twice as fast as `cat` to /dev/null. And yes, I dropped the page cache before timing. But on my current, unencrypted system, the io_uring version and `cat` to /dev/null are nearly the same speed. My guess is that the encryption added latency, which I mitigated by having multiple requests in flight at a time (using io_uring), whereas the `cat` program could not do that, because it uses synchronous reads, so each read has to complete before the next one can begin. But this is just my guess and I have not verified it. It would also explain why fio with io_uring was faster than `cat` on my old system.

But that does not explain why fio on my old system was faster than on my current system. For that, I have two guesses:

  • My SSD got slower. Do I need to TRIM it?
  • My software got worse. Was the encrypted disk faster because it used LVM?

I think the first guess is extremely unlikely. I have tested my SSD a lot and gotten consistent speed readings from fio. The second guess seems more likely; see: https://lkml.org/lkml/2010/5/7/223

More fio details from when I was using disk encryption

The fio version I used was 3.33.

Here is the command I used (this particular instance uses ioengine=mmap; I varied the --ioengine parameter for the results below):

fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=mmap --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting

Anyway, here is the data I got on my machine:

sync (read): READ: bw=1313MiB/s (1377MB/s), 1313MiB/s-1313MiB/s (1377MB/s-1377MB/s), io=10.0GiB (10.7GB), run=7798-7798msec

psync (pread): READ: bw=1316MiB/s (1380MB/s), 1316MiB/s-1316MiB/s (1380MB/s-1380MB/s), io=10.0GiB (10.7GB), run=7783-7783msec

vsync (readv): READ: bw=1295MiB/s (1357MB/s), 1295MiB/s-1295MiB/s (1357MB/s-1357MB/s), io=2017MiB (2115MB), run=1558-1558msec

pvsync (preadv): READ: bw=1332MiB/s (1397MB/s), 1332MiB/s-1332MiB/s (1397MB/s-1397MB/s), io=10.0GiB (10.7GB), run=7688-7688msec

pvsync2 (preadv2): READ: bw=1304MiB/s (1368MB/s), 1304MiB/s-1304MiB/s (1368MB/s-1368MB/s), io=10.0GiB (10.7GB), run=7851-7851msec

io_uring: READ: bw=3501MiB/s (3671MB/s), 3501MiB/s-3501MiB/s (3671MB/s-3671MB/s), io=10.0GiB (10.7GB), run=2925-2925msec

libaio: READ: bw=3518MiB/s (3689MB/s), 3518MiB/s-3518MiB/s (3689MB/s-3689MB/s), io=10.0GiB (10.7GB), run=2911-2911msec

posixaio (aio_read): READ: bw=3518MiB/s (3689MB/s), 3518MiB/s-3518MiB/s (3689MB/s-3689MB/s), io=10.0GiB (10.7GB), run=2911-2911msec

mmap: READ: bw=664MiB/s (696MB/s), 664MiB/s-664MiB/s (696MB/s-696MB/s), io=10.0GiB (10.7GB), run=15431-15431msec

splice (splice and vmsplice): READ: bw=879MiB/s (921MB/s), 879MiB/s-879MiB/s (921MB/s-921MB/s), io=10.0GiB (10.7GB), run=11653-11653msec

Note that 3.5GB/s is the read speed limit of my NVME drive according to the manufacturer's advertising and third-party tests (which I found online), so I am very confident that the results given by io_uring, libaio, and POSIX aio hit the physical limit of the storage device.

Further Notes

From what I had previously read online, I used to think that O_DIRECT should be faster than not using O_DIRECT, since it is more "direct", i.e., it bypasses the page cache. I found some supporting evidence by running fio with and without O_DIRECT. On my old system, when I ran:

fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting

It gave 3.5GB/s.

But when I ran:

fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32 --direct=0 --numjobs=1 --runtime=60 --group_reporting

It gave only 2.3GB/s.

So O_DIRECT made fio faster on my old system.

On my current system, I got similar results:

$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process

Jobs: 1 (f=1): [R(1)][-.-%][r=3364MiB/s][r=3364 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3178: Tue Jul 25 00:15:53 2023
 read: IOPS=3277, BW=3278MiB/s (3437MB/s)(10.0GiB/3124msec)
 slat (usec): min=6, max=315, avg=18.13, stdev= 8.69
 clat (usec): min=570, max=22048, avg=9463.51, stdev=1283.30
 lat (usec): min=601, max=22075, avg=9481.64, stdev=1282.47
 clat percentiles (usec):
 |  1.00th=[ 6783],  5.00th=[ 9110], 10.00th=[ 9110], 20.00th=[ 9110],
 | 30.00th=[ 9110], 40.00th=[ 9241], 50.00th=[ 9241], 60.00th=[ 9503],
 | 70.00th=[ 9634], 80.00th=[ 9634], 90.00th=[ 9634], 95.00th=[ 9896],
 | 99.00th=[16319], 99.50th=[19268], 99.90th=[21627], 99.95th=[21890],
 | 99.99th=[22152]
 bw (  MiB/s): min= 2786, max= 3412, per=99.98%, avg=3277.33, stdev=241.55, samples=6
 iops        : min= 2786, max= 3412, avg=3277.33, stdev=241.55, samples=6
 lat (usec)   : 750=0.04%, 1000=0.04%
 lat (msec)   : 2=0.16%, 4=0.30%, 10=95.47%, 20=3.60%, 50=0.39%
 cpu          : usr=0.42%, sys=10.53%, ctx=10238, majf=0, minf=536
 IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
 issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
 READ: bw=3278MiB/s (3437MB/s), 3278MiB/s-3278MiB/s (3437MB/s-3437MB/s), io=10.0GiB (10.7GB), run=3124-3124msec

Disk stats (read/write):
 nvme0n1: ios=37911/0, merge=0/0, ticks=350894/0, in_queue=350893, util=93.96%

$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32 --direct=0 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][80.0%][r=2048MiB/s][r=2048 IOPS][eta 00m:01s]
TEST: (groupid=0, jobs=1): err= 0: pid=3198: Tue Jul 25 00:16:40 2023
 read: IOPS=2104, BW=2104MiB/s (2207MB/s)(10.0GiB/4866msec)
 slat (usec): min=28, max=5133, avg=134.55, stdev=233.41
 clat (usec): min=4680, max=31030, avg=13921.79, stdev=1822.86
 lat (usec): min=5963, max=31226, avg=14056.34, stdev=1782.46
 clat percentiles (usec):
 |  1.00th=[ 7767],  5.00th=[11469], 10.00th=[13042], 20.00th=[13435],
 | 30.00th=[13698], 40.00th=[13829], 50.00th=[13960], 60.00th=[14091],
 | 70.00th=[14222], 80.00th=[14484], 90.00th=[14746], 95.00th=[15008],
 | 99.00th=[21890], 99.50th=[24773], 99.90th=[29230], 99.95th=[29492],
 | 99.99th=[30016]
 bw (  MiB/s): min= 1820, max= 2328, per=99.54%, avg=2094.67, stdev=246.01, samples=9
 iops        : min= 1820, max= 2328, avg=2094.67, stdev=246.01, samples=9
 lat (msec)   : 10=2.64%, 20=95.90%, 50=1.46%
 cpu          : usr=0.06%, sys=91.00%, ctx=17435, majf=0, minf=536
 IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
 issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
 READ: bw=2104MiB/s (2207MB/s), 2104MiB/s-2104MiB/s (2207MB/s-2207MB/s), io=10.0GiB (10.7GB), run=4866-4866msec

Disk stats (read/write):
 nvme0n1: ios=39042/0, merge=0/0, ticks=165950/0, in_queue=165950, util=91.21%

So I think O_DIRECT is probably necessary for maximum performance, but it is very sensitive to block size. I remember when I first started using O_DIRECT, I used the same block size (64KiB) for O_DIRECT and non-O_DIRECT, and I found that O_DIRECT made my program slower. Since O_DIRECT's performance is heavily affected by block size, a bad choice of block size makes O_DIRECT perform very poorly, but I found that the best performance could only be achieved with O_DIRECT - you can verify this yourself, because I made the use of O_DIRECT a command-line argument in my program, so you can easily compare the performance with and without it at different block sizes (example invocations below).
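
For example, to compare the two modes at the same block size and queue depth, you can run something like this (the argument order is the one printed by the usage message shown earlier; the numbufs value of 20 is just an illustrative choice):

echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 512 8 1 0 20 0 0   # with O_DIRECT
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 512 8 0 0 20 0 0   # without O_DIRECT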

At first, I was using a 64KiB block size, and I had not expected block size to have such a large effect, so before I started varying the block size I was quite disappointed with O_DIRECT's performance, because it was much worse than non-O_DIRECT. Anyway, so I decided to see whether block size (BS) and queue depth (QD) affected performance, and found that the best BS and QD combination was BS=256, QD=8 without O_DIRECT, while for O_DIRECT the best combination was BS=4096, QD=8.

Table

Buffer size (KiB) Queue depth O_DIRECT Process in inner loop Runtime (seconds)
1 1 0 0 7.134
1 1 1 0 29.881
1 4000 0 0 3.647
1 4000 1 0 29.624
1 4000 0 1 3.619
1 4000 1 1 22.177
4 1 0 1 2.732
4 1 1 1 9.972
4 64 0 1 1.665
4 64 1 1 4.423
4 4000 0 0 1.591
4 4000 1 0 6.452
4 4000 0 1 1.572
4 4000 1 1 6.468
16 1 0 1 1.16
16 1 1 1 4.88
16 64 0 1 1.051
16 64 1 1 1.816
64 1 0 1 0.829
64 1 1 1 3.122
64 2 0 0 1.144
64 2 1 0 1.464
64 4 0 0 0.542
64 4 1 0 0.489
64 8 0 0 0.422
64 8 1 0 0.371 - 0.602
64 8 0 1 0.611
64 8 1 1 0.795
64 16 0 0 0.429
64 16 1 0 0.358 - 0.645
64 16 0 1 0.613
64 16 1 1 0.793
64 32 0 0 0.479
64 32 1 0 0.374 - 0.533
64 64 0 1 0.645
64 64 1 1 0.783
64 4000 0 0 0.388
64 4000 1 0 0.837
64 4000 0 1 0.37
64 4000 1 1 0.877
128 4 0 0 0.494
128 8 0 0 0.324 - 0.373
128 16 0 0 0.333 - 0.359
256 4 0 0 0.415 - 0.442
256 8 0 0 0.322 - 0.359
256 16 0 0 0.322 - 0.343
256 32 0 0 0.331 - 0.351
256 32 0 1 0.405 - 0.428
512 2 0 0 0.832
512 2 1 0 1.295
512 2 0 1 0.894
512 2 1 1 0.748 - 1.383
512 4 0 0 0.51
512 4 1 0 0.321 - 0.349
512 8 0 0 0.4
512 8 1 0 0.320 - 0.355
512 16 0 0 0.397
512 16 1 0 0.320 - 0.337
512 32 0 0 0.398
512 32 1 0 0.319 - 0.334
512 64 0 0 0.398
512 64 1 0 0.325-0.388
512 128 0 0 0.399
512 128 1 0 0.409
512 256 0 0 0.4
512 4000 0 0 0.406
1024 4 0 0 0.442
1024 4 1 0 0.339 - 0.343
1024 8 0 0 0.399
1024 8 1 0 0.320 - 0.321
1024 16 0 0 0.398
1024 16 1 0 0.319 - 0.358
1024 32 0 0 0.4
1024 32 1 0 0.319 - 0.373
1024 64 0 0 0.402
1024 64 1 0 0.320 - 0.443
1024 128 0 0 0.404
1024 128 1 0 0.45
1024 128 0 1 0.406
1024 128 1 1 0.477
2048 4 0 0 0.424
2048 4 1 0 0.341 - 0.361
2048 8 0 0 0.4
2048 8 1 0 0.319 - 0.336
2048 16 0 0 0.4
2048 16 1 0 0.329 - 0.372
2048 32 0 0 0.404
2048 32 1 0 0.402 - 0.410
4096 4 0 0 0.415
4096 4 1 0 0.383 - 0.525
4096 4 0 1 0.644
4096 4 1 1 0.71
4096 8 0 0 0.388
4096 8 1 0 0.309 - 0.343
4096 8 1 1 0.425 - 0.506
4096 16 0 0 0.398
4096 16 1 0 0.318 - 0.356
4096 16 0 1 0.498
4096 16 1 1 0.364
4096 32 0 0 0.402
4096 32 1 0 0.362 - 0.379
4096 64 0 0 0.418
4096 64 1 0 0.388
8192 4 0 0 0.416
8192 4 1 0 0.342 - 0.429
8192 8 0 0 0.387
8192 8 1 0 0.308 - 0.345
8192 8 1 1 0.414 - 0.464
8192 16 0 0 0.4
8192 16 1 0 0.333 - 0.348
8192 32 0 0 0.428
8192 32 1 0 0.351
8192 64 0 0 0.427
8192 64 1 0 0.4
16384 4 0 0 0.42
16384 4 1 0 0.329 - 0.367
16384 8 0 0 0.401
16384 8 1 0 0.327 - 0.338
16384 16 0 0 0.424
16384 16 1 0 0.333 - 0.360
16384 32 0 0 0.434
16384 32 1 0 0.361 - 0.390

The results above are for an older version of my program, which did one (or two) mallocs per read request and did not use a ring buffer.

The interaction between malloc, free, and O_DIRECT

My liburing b3sum programs are modified versions of the original liburing cp program, and the original liburing cp program does one malloc per read request (allocating a per-request buffer), so originally my program did too. Early versions of my liburing b3sum program issued requests for 64KiB blocks at a time, which meant a lot of mallocs when reading a 1GiB file, and it did not use O_DIRECT. I noticed that calling free after finishing with a block improved my program's speed from around 0.339s-0.348s to 0.322s-0.332s, a very noticeable and significant speedup.

Back when I was finding the optimal block sizes for O_DIRECT and non-O_DIRECT, I tried `free` again to see whether it made a difference. I found that after uncommenting the `free()` in the code:

The O_DIRECT version (8192 8 1 0) went from 0.307-0.342s to 0.330-0.355s
The non-O_DIRECT version (256 8 0 0) went from 0.322-0.345s to 0.306-0.327s

Now the non-O_DIRECT version was faster again! I am not sure why calling `free` slowed down the O_DIRECT version of my program but sped up the non-O_DIRECT version. I mean, speeding up the non-O_DIRECT version makes sense, the idea being that the `free()` call does not actually return the memory to the system but instead marks it for reuse by my program, so future `malloc` calls just reuse the same blocks of memory instead of actually asking the kernel for more memory. Actually, while switching from readv to read, I added a second malloc to the function that creates read requests, and that increased the program's runtime from around 0.333s to around 0.340s - at the time I thought it was because read was slower than readv, but now I think it was probably due to the extra malloc. By the way, one surprising thing I ran into while switching from readv to read is that io_uring read (unlike regular read) takes an offset, so it is actually more like pread than the regular read syscall (which does not take an offset) - see the sketch below.
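
To illustrate that point, here is a minimal self-contained sketch (my own illustration using the standard liburing API, with error handling kept minimal) that reads one 4KiB block at a fixed offset - the offset is part of the request, exactly as with pread:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int main(int argc, char* argv[])
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(2, &ring, 0) < 0) { fputs("io_uring_queue_init failed\n", stderr); return 1; }

    char buf[4096];
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    // Unlike the read(2) syscall, the offset (8192 here) is part of the
    // request itself, just as with pread(2).
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 8192);
    io_uring_submit(&ring);

    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res); // number of bytes read, or -errno
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}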

Anyway, so I decided to reduce the number of mallocs in my program to just 2, both at startup. Now I allocate all the buffers at the start with a single malloc, and I get 0.297s for the O_DIRECT version and 0.363s for the non-O_DIRECT version. So allocating all the memory at startup sped up the O_DIRECT version by around 0.01 seconds, while slowing down the non-O_DIRECT version by around 0.057 seconds! I also tried busy-polling (on an atomic variable) instead of using a condition variable, and it made no difference to the runtime (the idea is sketched below). By the way, yes, 0.297s is lower than any number I got on my current system, but fio was also faster on my old system - I suspect it is because I was using LVM, and that somehow made my filesystem faster.
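
The busy-polling idea can be sketched like this (my own minimal illustration using C11 atomics and a one-slot handoff, not the actual code from my multithreaded version) - the waiting side spins on an atomic state flag instead of sleeping on a condition variable:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

// 0 = ALREADY_CONSUMED, 1 = AVAILABLE_FOR_CONSUMPTION
static _Atomic int state = 0;
static int shared_block;

static void* producer(void* arg)
{
    (void)arg;
    for (int i = 0; i < 5; ++i) {
        while (atomic_load_explicit(&state, memory_order_acquire) != 0)
            ; // spin until the consumer has consumed the previous block
        shared_block = i; // "produce" a block
        atomic_store_explicit(&state, 1, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    for (int i = 0; i < 5; ++i) {
        while (atomic_load_explicit(&state, memory_order_acquire) != 1)
            ; // busy-wait; burns a core but avoids wakeup latency
        printf("consumed block %d\n", shared_block);
        atomic_store_explicit(&state, 0, memory_order_release);
    }
    pthread_join(t, NULL);
    return 0;
}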

hdparm Results

hdparm on my current system:

$ hdparm  --direct -t -T /dev/nvme0n1
/dev/nvme0n1:

 Timing O_DIRECT cached reads:   2478 MB in  2.00 seconds = 1238.85 MB/sec
 Timing O_DIRECT disk reads: 7970 MB in  3.00 seconds = 2656.42 MB/sec
 
$ hdparm  --direct -t -T /dev/nvme0n1
/dev/nvme0n1:
 Timing O_DIRECT cached reads:   3400 MB in  2.00 seconds = 1700.47 MB/sec
 Timing O_DIRECT disk reads: 9816 MB in  3.00 seconds = 3271.78 MB/sec
 
$ hdparm  --direct -t -T /dev/nvme0n1
/dev/nvme0n1:
 Timing O_DIRECT cached reads:   3380 MB in  2.00 seconds = 1690.65 MB/sec
 Timing O_DIRECT disk reads: 9834 MB in  3.00 seconds = 3277.73 MB/sec

hdparm on my previous system (with LVM encryption):

$ hdparm  --direct -t -T /dev/nvme0n1p3

/dev/nvme0n1p3:
 Timing O_DIRECT cached reads:   3510 MB in  2.00 seconds = 1755.66 MB/sec
 Timing O_DIRECT disk reads: 9026 MB in  3.00 seconds = 3008.23 MB/sec

hdparm consistently gave much lower readings than fio, which reported 3.5GB/s using io_uring, libaio, and POSIX aio. This is probably because hdparm just uses plain synchronous read(). Here is the source code:

static int read_big_block (int fd, char *buf)
{
	int i, rc;
	if ((rc = read(fd, buf, TIMING_BUF_BYTES)) != TIMING_BUF_BYTES) {
		if (rc) {
			if (rc == -1)
				perror("read() failed");
			else
				fprintf(stderr, "read(%u) returned %u bytes\n", TIMING_BUF_BYTES, rc);
		} else {
			fputs ("read() hit EOF - device too small\n", stderr);
		}
		return 1;
	}
	/* access all sectors of buf to ensure the read fully completed */
	for (i = 0; i < TIMING_BUF_BYTES; i += 512)
		buf[i] &= 1;
	return 0;
}

Is dd really doing read() system calls?

I figured I should prove that `dd` really is doing read system calls with the given block size, so let me show you the ltrace log generated by this command: echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ltrace -S -o ltrace1.out dd if=1GB.txt bs=64K | ltrace -S -o ltrace2.out ./example_64K

$ head -150 ltrace1.out 
SYS_brk(0)                                                                                                                        = 0x56024ac4f000
SYS_mmap(0, 8192, 3, 34)                                                                                                          = 0x7fed83534000
SYS_access("/etc/ld.so.preload", 04)                                                                                              = -2
SYS_openat(0xffffff9c, 0x7fed8355e0b1, 0x80000, 0)                                                                                = 3
SYS_newfstatat(3, 0x7fed8355ec99, 0x7ffe6f8fd8a0, 4096)                                                                           = 0
SYS_mmap(0, 0x63b6, 1, 2)                                                                                                         = 0x7fed8352d000
SYS_close(3)                                                                                                                      = 0
SYS_openat(0xffffff9c, 0x7fed83534140, 0x80000, 0)                                                                                = 3
SYS_read(3, "\177ELF\002\001\001\003", 832)                                                                                       = 832
SYS_pread(3, 0x7ffe6f8fd620, 784, 64)                                                                                             = 784
SYS_newfstatat(3, 0x7fed8355ec99, 0x7ffe6f8fd8a0, 4096)                                                                           = 0
SYS_pread(3, 0x7ffe6f8fd4f0, 784, 64)                                                                                             = 784
SYS_mmap(0, 0x1e0f50, 1, 2050)                                                                                                    = 0x7fed8334c000
SYS_mmap(0x7fed83372000, 0x155000, 5, 2066)                                                                                       = 0x7fed83372000
SYS_mmap(0x7fed834c7000, 0x53000, 1, 2066)                                                                                        = 0x7fed834c7000
SYS_mmap(0x7fed8351a000, 0x6000, 3, 2066)                                                                                         = 0x7fed8351a000
SYS_mmap(0x7fed83520000, 0xcf50, 3, 50)                                                                                           = 0x7fed83520000
SYS_close(3)                                                                                                                      = 0
SYS_mmap(0, 0x3000, 3, 34)                                                                                                        = 0x7fed83349000
SYS_arch_prctl(4098, 0x7fed83349740, 0xffff80127ccb5f30, 34)                                                                      = 0
SYS_set_tid_address(0x7fed83349a10, 0x7fed83349740, 0x7fed835690d0, 34)                                                           = 0x701f0
SYS_set_robust_list(0x7fed83349a20, 24, 0x7fed835690d0, 34)                                                                       = 0
SYS_334(0x7fed8334a060, 32, 0, 0x53053053)                                                                                        = 0
SYS_mprotect(0x7fed8351a000, 16384, 1)                                                                                            = 0
SYS_mprotect(0x56024a719000, 4096, 1)                                                                                             = 0
SYS_mprotect(0x7fed83566000, 8192, 1)                                                                                             = 0
SYS_prlimit64(0, 3, 0, 0x7ffe6f8fe3e0)                                                                                            = 0
SYS_munmap(0x7fed8352d000, 25526)                                                                                                 = 0
getenv("POSIXLY_CORRECT")                                                                                                         = nil
sigemptyset(<>)                                                                                                                   = 0
sigaddset(<9>, SIGUSR1)                                                                                                           = 0
sigaction(SIGINT, nil <unfinished ...>
SYS_rt_sigaction(2, 0, 0x7ffe6f8fe4c0, 8)                                                                                         = 0
<... sigaction resumed> , { 0, <>, 0x5f004554, 0x5f796c7261655f63 })                                                              = 0
sigaddset(<1,9>, SIGINT)                                                                                                          = 0
sigismember(<1,9>, SIGUSR1)                                                                                                       = 1
sigaction(SIGUSR1, { 0x56024a7086b0, <1,9>, 0, 0 } <unfinished ...>
SYS_rt_sigaction(10, 0x7ffe6f8fe420, 0, 8)                                                                                        = 0
<... sigaction resumed> , nil)                                                                                                    = 0
sigismember(<1,9>, SIGINT)                                                                                                        = 1
sigaction(SIGINT, { 0x56024a7086a0, <1,9>, 0, 0 } <unfinished ...>
SYS_rt_sigaction(2, 0x7ffe6f8fe420, 0, 8)                                                                                         = 0
<... sigaction resumed> , nil)                                                                                                    = 0
strrchr("dd", '/')                                                                                                                = nil
setlocale(LC_ALL, "" <unfinished ...>
SYS_318(0x7fed83525478, 8, 1, 0x7fed83361e40)                                                                                     = 8
SYS_brk(0)                                                                                                                        = 0x56024ac4f000
SYS_brk(0x56024ac70000)                                                                                                           = 0x56024ac70000
SYS_openat(0xffffff9c, 0x7fed834eaf50, 0x80000, 0)                                                                                = 3
SYS_newfstatat(3, 0x7fed834e1dd5, 0x7fed8351fb60, 4096)                                                                           = 0
SYS_mmap(0, 0x2e8610, 1, 2)                                                                                                       = 0x7fed83000000
SYS_close(3)                                                                                                                      = 0
<... setlocale resumed> )                                                                                                         = "en_GB.UTF-8"
bindtextdomain("coreutils", "/usr/share/locale")                                                                                  = "/usr/share/locale"
textdomain("coreutils")                                                                                                           = "coreutils"
__cxa_atexit(0x56024a708a20, 0, 0x56024a71a268, 0)                                                                                = 0
getpagesize()                                                                                                                     = 4096
getopt_long(3, 0x7ffe6f8fe7c8, "", 0x56024a719d20, nil)                                                                           = -1
strchr("if=1GB.txt", '=')                                                                                                         = "=1GB.txt"
strchr("bs=64K", '=')                                                                                                             = "=64K"
__errno_location()                                                                                                                = 0x7fed833496c0
__ctype_b_loc()                                                                                                                   = 0x7fed833496d8
strtoumax(0x7ffe6f8ffda0, 0x7ffe6f8fe5d0, 10, 0x7fed8301982c)                                                                     = 64
strchr("bcEGkKMPTwYZ0", 'K')                                                                                                      = "KMPTwYZ0"
strchr("bcEGkKMPTwYZ0", '0')                                                                                                      = "0"
strchr("64K", 'B')                                                                                                                = nil
open("1GB.txt", 0, 00 <unfinished ...>
SYS_openat(0xffffff9c, 0x7ffe6f8ffd95, 0, 0)                                                                                      = 3
<... open resumed> )                                                                                                              = 3
dup2(3, 0 <unfinished ...>
SYS_dup2(3, 0)                                                                                                                    = 0
<... dup2 resumed> )                                                                                                              = 0
__errno_location()                                                                                                                = 0x7fed833496c0
close(3 <unfinished ...>
SYS_close(3)                                                                                                                      = 0
<... close resumed> )                                                                                                             = 0
lseek(0, 0, 1 <unfinished ...>
SYS_lseek(0, 0, 1)                                                                                                                = 0
<... lseek resumed> )                                                                                                             = 0
__errno_location()                                                                                                                = 0x7fed833496c0
dcgettext(0, 0x56024a715475, 5, 0x7fed83444177 <unfinished ...>
SYS_openat(0xffffff9c, 0x7ffe6f8fe0d0, 0x80000, 0)                                                                                = 3
SYS_newfstatat(3, 0x7fed834e1dd5, 0x7ffe6f8fdf00, 4096)                                                                           = 0
SYS_read(3, "# Locale name alias data base.\n#"..., 4096)                                                                         = 2996
SYS_read(3, "", 4096)                                                                                                             = 0
SYS_close(3)                                                                                                                      = 0
SYS_openat(0xffffff9c, 0x56024ac50410, 0, 0)                                                                                      = -2
SYS_openat(0xffffff9c, 0x56024ac50670, 0, 0)                                                                                      = -2
<... dcgettext resumed> )                                                                                                         = 0x56024a715475
clock_gettime(1, 0x7ffe6f8fe540, 0x56024a715475, 0x56024a715475)                                                                  = 0
aligned_alloc(4096, 0x10000, 0, 24)                                                                                               = 0x56024ac53000
read(0 <unfinished ...>
SYS_read(0, "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536)               = 65536
<... read resumed> , "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536)      = 65536
write(1, "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536 <unfinished ...>
SYS_write(1, "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536)              = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536)                         = 65536
<... read resumed> , "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536)                = 65536
write(1, "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536 <unfinished ...>
SYS_write(1, "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536)                        = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "\230\322GS\223", 65536)                                                                                              = 65536
<... read resumed> , "\230\322GS\223", 65536)                                                                                     = 65536
write(1, "\230\322GS\223", 65536 <unfinished ...>
SYS_write(1, "\230\322GS\223", 65536)                                                                                             = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536)             = 65536
<... read resumed> , "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536)    = 65536
write(1, "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536 <unfinished ...>
SYS_write(1, "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536)            = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536)    = 65536
<... read resumed> , "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536) = 65536
write(1, "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536 <unfinished ...>
SYS_write(1, "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536)   = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536)                                                               = 65536
<... read resumed> , "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536)                                                      = 65536
write(1, "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536 <unfinished ...>
SYS_write(1, "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536)                                                              = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "EU\230\b\306", 65536)                                                                                                = 65536
<... read resumed> , "EU\230\b\306", 65536)                                                                                       = 65536
write(1, "EU\230\b\306", 65536 <unfinished ...>
SYS_write(1, "EU\230\b\306", 65536)                                                                                               = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536)          = 65536
<... read resumed> , "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536) = 65536
write(1, "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536 <unfinished ...>
SYS_write(1, "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536)         = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536)                        = 65536
<... read resumed> , "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536)               = 65536
write(1, "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536 <unfinished ...>
SYS_write(1, "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536)                       = 65536
<... write resumed> )                                                                                                             = 65536
read(0 <unfinished ...>
SYS_read(0, "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536)                  = 65536
<... read resumed> , "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536)         = 65536
write(1, "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536 <unfinished ...>
SYS_write(1, "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536)                 = 65536

See? There is no trickery here: `dd` really is performing all of those SYS_read system calls, requesting 64KiB each time, and then writing those 64KiB back out. I also verified that with bs=2M, `dd` requests 2M per read. And just to be safe, I checked `cat` as well: echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ltrace -S -o ltrace1.out cat 1GB.txt | ltrace -S -o ltrace2.out ./example_64K. As expected, `cat` reads the file 128KiB at a time and writes it back out; it really is making read() system calls (SYS_read) with a 128KiB size argument.

In any case, I find it interesting that a bs of 64KiB was optimal when using `dd` to pipe data into my program, but the larger bs of 2M was optimal when piping `dd`'s output to /dev/null. It is also noteworthy that `cat` moved data to /dev/null at the maximum possible speed despite using a 128KiB buffer size (whereas when I used `dd` with bs=128KiB, it could not reach the maximum speed...).

Motivation

So why did I write this program? Well, a few years ago, I was running my own MediaWiki instance and ran into a number of problems (the database would fail to start because of some error, the search engine stopped indexing new pages one day, renaming a page broke links, and so on), so I decided to write my own wiki. Then I thought: "Well, if I'm going to write my own wiki, I should also write a backup tool for it." So I decided to write a backup tool, and then I thought: "But what if the backup gets corrupted?" So I decided to add forward error correction (ECC) to my backups, which is when I looked at par2cmdline and noticed that it was quite slow, so I decided to write my own ECC tool. Then I started reading the PAR3 specification and saw that it uses the BLAKE3 hash. That got me researching hash algorithms, and I found that most non-cryptographic hashes (e.g., xxhash) make no guarantees about collision probability; in fact, I think most non-cryptographic hashes are designed for hash tables, where collisions can be detected and resolved, rather than for detecting file corruption, where there is no way to even know that a collision has occurred. Admittedly, the probability of a corrupted file having the same hash as the original is tiny, but I wanted to make that probability truly negligible so that I would never have to worry about it. So I started looking at cryptographic hashes (md5, sha1, sha2, and so on), and I found that BLAKE3 was the fastest single-threaded cryptographic hash I could find (yes, even though my system has only AVX2 and no AVX512 😭😭😭, BLAKE3 is faster on my AMD64 system than even the openssl implementations of SHA1 and SHA2), and, unlike those other hashes, it is parallelizable, so I decided to use BLAKE3. Anyway, I started implementing my own ECC tool, and the first step was computing the hash of a file. That is why I created this program.
