基于 io_uring 的 b3sum 实现





5.00/5 (2投票s)
基于 io_uring 的 b3sum 实现
摘要
我编写了两个版本的程序,用于计算文件的 BLAKE3 散列值(源代码在此),一个是用 C 编写的单线程版本,一个是用 C++ 编写的多线程版本,它们都使用了 io_uring
。在我的系统上,单线程版本比官方的 Rust b3sum
快大约 25%,并且比 cat 到 /dev/null 的速度稍快,也比我用来测量系统顺序读取速度的 fio 稍快(请参阅附加说明了解我传递给 fio 的参数和其他详细信息)。单线程版本能够以 2.899 秒的速度散列一个 10GiB 的文件,计算下来大约是 3533MiB/s,这大致与我 NVME 驱动器的广告读取速度(“3500MB/s”)相同,这让我相信它是我的系统上速度最快的文件散列程序,并且不可能更快,因为文件散列程序在完成从磁盘读取文件的最后一个块之前无法完成文件的散列。我的多线程实现比我的单线程实现慢大约 1%。
我提供了非正式的证明,证明我的程序两个版本都是正确的(即,将始终返回正确的结果)并且永远不会卡住(即,将最终完成)。非正式证明以及对代码的解释在本文章的代码解释部分提供。
我的程序极其简单、通用,并且可以轻松修改以满足您的需求,只要您的需求是一次读取文件几个块,并在等待下一个块到达时处理这些块——我认为这是一个极其常见的用例,所以我确保它非常容易适应您的需求——如果您想对文件的每个块做不同的事情,您所需要做的就是将对散列函数的调用替换为您自己的数据处理函数调用。
基准测试
对于这些测试,我使用了相同的 1 GiB(或 10 GiB)输入文件,并在每次测试前始终刷新页面缓存,从而确保程序始终从磁盘读取。每个命令都运行了 10 次,我使用了 time
的“real”结果来计算统计数据。我在 Debian 12 系统上运行了这些命令(uname -r 返回“6.1.0-9-amd64”
),使用了 ext4
,没有磁盘加密,也没有 LVM。
命令 | 最小值 | 中位数 | 最大值 |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 1 | 0.404s | 0.4105s | 0.416s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 2 | 0.474s | 0.4755s | 0.481s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 3 | 0.44s | 0.4415s | 0.451s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 4 | 0.443s | 0.4475s | 0.452s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 5 | 0.454s | 0.4585s | 0.462s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 6 | 0.456s | 0.4605s | 0.463s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 7 | 0.461s | 0.4635s | 0.468s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 8 | 0.461s | 0.464s | 0.47s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --no-mmap | 0.381s | 0.386s | 0.394s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./b3sum_linux 1GB.txt --no-mmap | 0.379s | 0.39s | 0.404s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt | ./example | 0.364s | 0.3745s | 0.381s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt > /dev/null | 0.302s | 0.302s | 0.303s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K | ./example | 0.338s | 0.341s | 0.348s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K of=/dev/null | 0.303s | 0.306s | 0.308s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M | ./example | 0.538s | 0.5415s | 0.544s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M of=/dev/null | 0.302s | 0.303s | 0.304s |
fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting | 0.302s | 0.3025s | 0.303s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 512 2 1 0 2 0 0 | 0.301s | 0.301s | 0.302s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 512 2 1 0 2 0 0 | 0.303s | 0.304s | 0.305s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 128 20 0 0 8 0 0 | 0.375s | 0.378s | 0.384s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 128 20 0 0 8 0 0 | 0.304s | 0.305s | 0.307s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time xxhsum 1GB.txt | 0.318s | 0.3205s | 0.325s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 10GB.txt > /dev/null | 2.903s | 2.904s | 2.908s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 10GB.txt 512 4 1 0 4 0 0 | 2.898s | 2.899s | 2.903s |
在上表中,liburing_b3sum_singlethread
和 liburing_b3sum_multithread
是我自己的基于 io_uring
的 b3sum
实现(更多细节如下),我已验证我的 b3sum 实现始终产生与官方 b3sum 实现相同的 BLAKE3 散列输出。1GB.txt 文件是使用此命令生成的
dd if=/dev/urandom of=1GB.txt bs=1G count=1
我使用此命令安装了 b3sum
cargo install b3sum
$ b3sum --version b3sum 1.4.1
我从 BLAKE3 Github Releases 页面下载了 b3sum_linux
程序(它是最新的 Linux 二进制文件)
$ ./b3sum_linux --version b3sum 1.4.1
我按照 BLAKE3 C 仓库中的说明,从 BLAKE3 C 仓库的 example.c 文件编译了示例程序
gcc -O3 -o example example.c blake3.c blake3_dispatch.c blake3_portable.c
\ blake3_sse2_x86-64_unix.S blake3_sse41_x86-64_unix.S
blake3_avx2_x86-64_unix.S \ blake3_avx512_x86-64_unix.S
我使用此命令安装了 xxhsum
apt install xxhash
$ xxhsum --version
xxhsum 0.8.1 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0`
简而言之
从上面的结果可以看出,我的单线程实现比我的多线程实现稍快。
如果单线程版本更快,为什么我还要提到多线程版本?因为,如上表所示,单线程版本需要 O_DIRECT
才能快速运行(控制是否使用 O_DIRECT
的标志是命令行参数中文件名后的第三个数字)。即使没有 O_DIRECT
,多线程版本也很快(如表所示,多线程版本在有 O_DIRECT
的情况下散列 1GiB 文件需要 0.304s
,没有 O_DIRECT
的情况下需要 0.305s
)。为什么有人会关心程序是否使用 O_DIRECT
呢?因为 O_DIRECT
版本不会将文件放入页面缓存(它也不会从页面缓存中逐出文件——它只是绕过了页面缓存)。这对于您只是想尽快处理(例如散列)大量大文件、每个文件只访问一次的用例来说是很有用的,例如,如果您想生成系统上每个文件的散列列表。对于只有一个文件,您想先用此程序散列它,然后立即用另一个程序(例如 xxhsum
)的情况,它就没那么好了。这就是为什么我提到了我的多线程版本——即使不使用 O_DIRECT
,它也很快,不像单线程版本。
一个主要的警告是,我的程序目前使用 BLAKE3 库的 C 版本。BLAKE3
有两个版本——一个 Rust 版本和一个 C 版本,只有 Rust 版本支持多线程——C 版本目前不支持多线程。
与 Rust 实现不同,C 实现目前不支持多线程。此库的未来版本可以通过添加对 OpenMP 或类似库的可选依赖来增加支持。或者,我们可以公开一个低级 API,允许调用者自己实现并发。前者会更方便,出错的可能性更小,但后者将为调用者提供最大可能的控制。最佳选择取决于具体用例,因此如果您有 C 中多线程散列的用例,请在 GitHub 上提交一个 issue 并告知我们。
当您的 IO 速度快于您的单核散列速度时,多线程可能会有益。例如,如果您有一个快速的 NVME,假设您的 NVME 可以进行 7GB/s 的读取,而您的 CPU 单核只能以大约 4GB/s 的速度计算 BLAKE3 散列,那么多线程可能对您有益,因为多线程可以使多个 CPU 核心同时散列文件。因此,最明显的下一步是为 Rust b3sum
实现添加 io_uring
支持,以便 Linux 系统可以从更快的 IO 中受益。另一种选择是为 C BLAKE3 实现添加多线程支持。
(实际上,说了这些,我不确定多线程是否总是更快。在大多数存储设备上,顺序读取比随机读取快得多,而且目前的 C++ 多线程 Rust 实现似乎会遍历文件的各个部分,而不是顺序读取文件。因此,我认为在某些情况下,进行随机读取可能会大大减慢 IO 速度,以至于它实际上比仅仅顺序读取文件并在单核上进行散列要慢。但这显然取决于您拥有的硬件。理想的情况是多核散列与顺序读取,但我不知道 BLAKE3 算法是否允许这样做。如果不行,我想您可以单独散列文件的每个块,得到一个散列列表,然后您可以存储该列表,或者如果您想节省空间,您可以存储该列表的散列。这实际上是我原来的计划,直到我发现我的系统上单核 BLAKE3 比文件读取速度更快。)
所以,我很确定我的程序运行在我的系统的理论速度极限,即,它确实是我的系统上速度最快的文件散列程序,即没有进一步的优化空间了——这一事实比其他任何事情都让我更满意。我写这篇文章是因为我想写一篇解释我的程序,以及我是如何达到这个点的,实际上,我认为,据我所知,没有人使用 io_uring
进行文件散列程序,这真是令人惊讶——它简直是 io_uring
的完美用例,这是我能想到的最低的、最容易实现的东西。
另一个需要注意的点是 io_uring
仅限于 Linux,所以这个程序不能在其他操作系统上运行。
代码解释
我将首先解释单线程版本,然后解释多线程版本。
那么,这是单线程版本
/* SPDX-License-Identifier: MIT */
/*
* Compile with: gcc -Wall -O3 -D_GNU_SOURCE liburing_b3sum_singlethread.c
* -luring libblake3.a -o liburing_b3sum_singlethread
* For an explanation of how this code works,
* see my article/blog post: https://1f604.com/b3sum
*
* This program is a modified version of the liburing
* cp program from Shuveb Hussain's io_uring tutorial.
* Original source code here:
* https://github.com/axboe/liburing/blob/master/examples/io_uring-cp.c
* The modifications were made by 1f604.
*
* The official io_uring documentation can be seen here:
* - https://kernel.dk/io_uring.pdf
* - https://kernel-recipes.org/en/2022/wp-content/uploads/2022/06/axboe-kr2022-1.pdf
*
* Acronyms: SQ = submission queue, SQE = submission queue entry,
* CQ = completion queue, CQE = completion queue event
*/
#include "blake3.h"
#include "liburing.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/ioctl.h>
/* Constants */
static const int ALIGNMENT = 4 * 1024; // Needed only because O_DIRECT
// requires aligned memory
/* ===============================================
* ========== Start of global variables ==========
* ===============================================
* Declared static because they are only supposed to be visible within this .c file.
*
* --- Command line options ---
* The following variables are set by the user from the command line.
*/
static int g_queuedepth; /* This variable is not really the queue depth,
* but is more accurately described as
* "the limit on the number of incomplete requests".
* Allow me to explain.
* io_uring allows you to have more requests
* in-flight than the size of your submission
* (and completion) queues. How is this possible?
* Well, when you call io_uring_submit,
* normally it will submit ALL of the requests
* in the submission queue, which means that
* when the call returns, your submission queue
* is now empty, even though the requests
* are still "in-flight" and haven't
* been completed yet!
* In earlier kernels, you could overflow
* the completion queue because of this.
* Thus it says in the official io_uring
* documentation
* (https://kernel.dk/io_uring.pdf):
* Since the sqe lifetime is only that of
* the actual submission of it, it's possible
* for the application to drive
* a higher pending
* request count than the SQ ring size
* would indicate. The application
* must take care
* not to do so, or it could risk
* overflowing the CQ ring.
* That is to say, the official documentation
* recommended that applications should ensure
* that the number of in-flight requests
* does not exceed the size
* of the submission queue.
* This g_queuedepth variable is therefore a limit
* on the total number of "incomplete"
* requests, which is the number of requests
* on the submission queue plus the number of
* requests that are still "in flight".
* See num_unfinished_requests for details
* on how this is implemented. */
static int g_use_o_direct; // whether to use O_DIRECT
static int g_process_in_inner_loop; // whether to process the data inside
// the inner loop
static int g_use_iosqe_io_drain; // whether to issue requests with
// the IOSQE_IO_DRAIN flag
static int g_use_iosqe_io_link; // whether to issue requests with
// the IOSQE_IO_LINK flag
//static int g_use_ioring_setup_iopoll; // when I enable IORING_SETUP_IOPOLL,
// on my current system, it turns my process
// into an unkillable zombie that uses
// 100% CPU that never terminates.
// when I was using encrypted LVM,
// it just gave me error: Operation not supported.
// I decided to not allow users to enable that
// option because I didn't want them
// to accidentally launch an unkillable
// never-ending zombie process that
// uses 100% CPU.
// I observed this behavior in fio too
// when I enabled --hipri on fio,
// it also turned into an unkillable
// never-ending zombie process
// that uses 100% CPU.
static size_t g_blocksize; // This is the size of each buffer
// in the ringbuf, in bytes.
// It is also the size of each read from the file.
static size_t g_numbufs; // This is the number of buffers in the ringbuf.
/* --- Non-command line argument global variables --- */
blake3_hasher g_hasher;
static int g_filedescriptor; // This is the file descriptor of the file
// we're hashing.
static size_t g_filesize; // This will be filled in by the function
// that gets the file size.
static size_t g_num_blocks_in_file; // The number of blocks in the file,
// where each block is g_blocksize bytes.
// This will be calculated by a ceiling
// division of filesize by blocksize.
static size_t g_size_of_last_block; // The size of the last block in the file.
// See calculate_numblocks_and_size_of_last_block.
static int producer_head = 0; // Position of the "producer head".
// See explanation in my article/blog post
static int consumer_head = 0; // Position of the "consumer head".
// See explanation in my article/blog post
enum ItemState { // describes the state of a cell in the
// ring buffer array ringbuf,
// see my article/blog post for
// detailed explanation
AVAILABLE_FOR_CONSUMPTION, // consumption and processing in the context
// of this program refers to hashing
ALREADY_CONSUMED,
REQUESTED_BUT_NOT_YET_COMPLETED,
};
struct my_custom_data { // This is the user_data associated
// with read requests,
// which are placed on the submission ring.
// In applications using io_uring,
// the user_data struct
// is generally used to identify which request a
// completion is for. In the context
// of this program, this structure is
// used both to identify which
// block of the file the read syscall
// had just read, as well as for the producer
// and consumer to communicate with each other,
// since it holds the cell state.
// This can be thought of as a "cell"
// in the ring buffer,
// since it holds the state of the cell as well
// as a pointer to the data
// (i.e. a block read from the file)
// that is "in" the cell.
// Note that according to the official
// io_uring documentation,
// the user_data struct only needs to be
// valid until the submit is done,
// not until completion.
// Basically, when you submit, the kernel makes
// a copy of user_data and returns it
// to you with the CQE (completion queue entry).
unsigned char* buf_addr; // Pointer to the buffer where the read syscall
// is to place the bytes from the file into.
size_t nbytes_expected; // The number of bytes we expect the read syscall
// to return. This can be
// smaller than the size of the buffer
// because the last block of the file
// can be smaller than the other blocks.
//size_t nbytes_to_request; // The number of bytes to request.
// This is always g_blocksize.
// I made this decision because O_DIRECT requires
// nbytes to be a multiple of filesystem block size,
// and it's simpler to always just
// request g_blocksize.
off_t offset_of_block_in_file; // The offset of the block in the file
// that we want the read syscall
// to place into the memory location
// pointed to by buf_addr.
enum ItemState state; // Describes whether the item is available
// to be hashed, already hashed,
// or requested but not yet available for hashing.
int ringbuf_index; // The slot in g_ringbuf where this "cell" belongs.
// I added this because once we submit a request on
// submission queue, we lose track of it.
// When we get back a completion,
// we need an identifier
// to know which request the completion is for.
// Alternatively, we could use something to map
// the buf_addr to the ringbuf_index,
// but this is just simpler.
};
struct my_custom_data* g_ringbuf; // This is a pointer to an array of
// my_custom_data structs.
// These my_custom_data structs
// can be thought of as the
// "cells" in the ring buffer
// (each struct contains the cell state),
// thus the array that this points to can be
// thought of as the "ring buffer"
// referred to in my article/blog post,
// so read that to understand how this is used.
// See the allocate_ringbuf function for
// details on how and where the
// memory for the ring buffer is allocated.
/* ===============================================
* =========== End of global variables ===========
* ===============================================*/
static int setup_io_uring_context(unsigned entries, struct io_uring *ring)
{
int rc;
int flags = 0;
rc = io_uring_queue_init(entries, ring, flags);
if (rc < 0) {
fprintf(stderr, "queue_init: %s\n", strerror(-rc));
return -1;
}
return 0;
}
static int get_file_size(int fd, size_t *size)
{
struct stat st;
if (fstat(fd, &st) < 0)
return -1;
if (S_ISREG(st.st_mode)) {
*size = st.st_size;
return 0;
} else if (S_ISBLK(st.st_mode)) {
unsigned long long bytes;
if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
return -1;
*size = bytes;
return 0;
}
return -1;
}
static void add_read_request_to_submission_queue
(struct io_uring *ring, size_t expected_return_size, off_t fileoffset_to_request)
{
assert(fileoffset_to_request % g_blocksize == 0);
int block_number = fileoffset_to_request /
g_blocksize; // the number of the block in the file
/* We do a modulo to map the file block number to the index in the ringbuf
e.g. if ring buf_addr has 4 slots, then
file block 0 -> ringbuf index 0
file block 1 -> ringbuf index 1
file block 2 -> ringbuf index 2
file block 3 -> ringbuf index 3
file block 4 -> ringbuf index 0
file block 5 -> ringbuf index 1
file block 6 -> ringbuf index 2
And so on.
*/
int ringbuf_idx = block_number % g_numbufs;
struct my_custom_data* my_data = &g_ringbuf[ringbuf_idx];
assert(my_data->ringbuf_index == ringbuf_idx); // The ringbuf_index of a
// my_custom_data struct
// should never change.
my_data->offset_of_block_in_file = fileoffset_to_request;
assert (my_data->buf_addr); // We don't need to change my_data->buf_addr
// since we set it to point into the backing buffer
// at the start of the program.
my_data->nbytes_expected = expected_return_size;
my_data->state = REQUESTED_BUT_NOT_YET_COMPLETED; /* At this point:
* 1. The producer is about
* to send it off in a request.
* 2. The consumer shouldn't be
* trying to read this
* buffer at this point.
* So it is okay to set the
* state to this here.
*/
struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
if (!sqe) {
puts("ERROR: FAILED TO GET SQE");
exit(1);
}
io_uring_prep_read(sqe, g_filedescriptor, my_data->buf_addr,
g_blocksize, fileoffset_to_request);
// io_uring_prep_read sets sqe->flags to 0,
// so we need to set the flags AFTER calling it.
if (g_use_iosqe_io_drain)
sqe->flags |= IOSQE_IO_DRAIN;
if (g_use_iosqe_io_link)
sqe->flags |= IOSQE_IO_LINK;
io_uring_sqe_set_data(sqe, my_data);
}
static void increment_buffer_index(int* head) // moves the producer or consumer
// head forward by one position
{
*head = (*head + 1) % g_numbufs; // wrap around when we reach the end of the ringbuf.
}
static void resume_consumer() // Conceptually, this resumes the consumer "thread".
{ // As this program is single-threaded,
// we can think of it as cooperative multitasking.
while (g_ringbuf[consumer_head].state == AVAILABLE_FOR_CONSUMPTION){
// Consume the item.
// The producer has already checked that nbytes_expected
// is the same as the amount of bytes actually returned.
// If the read syscall returned something different to
// nbytes_expected then the program would have terminated with an error message.
// Therefore it is okay to assume here that nbytes_expected
// is the same as the amount of actual data in the buffer.
blake3_hasher_update(&g_hasher,
g_ringbuf[consumer_head].buf_addr, g_ringbuf[consumer_head].nbytes_expected);
// We have finished consuming the item, so mark it as consumed and
// move the consumer head to point to the next cell in the ringbuffer.
g_ringbuf[consumer_head].state = ALREADY_CONSUMED;
increment_buffer_index(&consumer_head);
}
}
static void producer_thread()
{
int rc;
unsigned long num_blocks_left_to_request = g_num_blocks_in_file;
unsigned long num_blocks_left_to_receive = g_num_blocks_in_file;
unsigned long num_unfinished_requests = 0;
/* A brief note on how the num_unfinished_requests variable is used:
* As mentioned earlier, in io_uring it is possible to have more
* requests in-flight than the size of the completion ring.
* In earlier kernels this could cause the completion queue to overflow.
* In later kernels there was an option added (IORING_FEAT_NODROP) which,
* when enabled, means that if a completion event occurs
*and the completion queue is full, then the kernel will internally
* store the event until the completion queue has room for more entries.
* Therefore, on newer kernels, it isn't necessary, strictly speaking,
* for the application to limit the number of in-flight requests.
* But, since it is almost certainly the case that an application
* can submit requests at a faster rate than the system is capable of
* servicing them, if we don't have some backpressure mechanism,
* then the application will just keep on submitting more and more
* requests, which will eventually lead to the system running out of memory.
* Setting a hard limit on the total number of in-flight requests serves
* as a backpressure mechanism to prevent the number of requests
* buffered in the kernel from growing without bound.
* The implementation is very simple: we increment num_unfinished_requests
* whenever a request is placed onto the submission queue,
* and decrement it whenever an entry is removed from the completion queue.
* Once num_unfinished_requests hits the limit that we set,
* then we cannot issue any more requests until we receive more completions,
* therefore the number of new completions that we receive is exactly
* equal to the number of new requests that we will place, thus ensuring
* that the number of in-flight requests can never exceed g_queuedepth.
*/
off_t next_file_offset_to_request = 0;
struct io_uring uring;
if (setup_io_uring_context(g_queuedepth, &uring)) {
puts("FAILED TO SET UP CONTEXT");
exit(1);
}
struct io_uring* ring = ů
while (num_blocks_left_to_receive) { // This loop consists of 3 steps:
// Step 1. Submit read requests.
// Step 2. Retrieve completions.
// Step 3. Run consumer.
/*
* Step 1: Make zero or more read requests until g_queuedepth limit is reached,
* or until the producer head reaches
* a cell that is not in the ALREADY_CONSUMED state
* (see my article/blog post for explanation).
*/
unsigned long num_unfinished_requests_prev =
num_unfinished_requests; // The only purpose of this
// variable is to keep
// track of whether or not
// we added any new requests
// to the submission queue.
while (num_blocks_left_to_request) {
if (num_unfinished_requests >= g_queuedepth)
break;
if (g_ringbuf[producer_head].state != ALREADY_CONSUMED)
break;
/* expected_return_size is the number of bytes that we expect
* this read request to return. expected_return_size will be
* the block size until the last block of the file.
* when we get to the last block of the file,
* expected_return_size will be the size of the last
* block of the file, which is calculated in
* calculate_numblocks_and_size_of_last_block
*/
size_t expected_return_size = g_blocksize;
if (num_blocks_left_to_request == 1) // if we're at the last block
// of the file
expected_return_size = g_size_of_last_block;
add_read_request_to_submission_queue(ring, expected_return_size,
next_file_offset_to_request);
next_file_offset_to_request += expected_return_size;
++num_unfinished_requests;
--num_blocks_left_to_request;
// We have added a request for the read syscall to write
// into the cell in the ringbuffer.
// The add_read_request_to_submission_queue has already
// marked it as REQUESTED_BUT_NOT_YET_COMPLETED,
// so now we just need to move the producer head to point
// to the next cell in the ringbuffer.
increment_buffer_index(&producer_head);
}
// If we added any read requests to the submission queue, then submit them.
if (num_unfinished_requests_prev != num_unfinished_requests) {
rc = io_uring_submit(ring);
if (rc < 0) {
fprintf(stderr, "io_uring_submit: %s\n", strerror(-rc));
exit(1);
}
}
/*
* Step 2: Remove all the items from the completion queue.
*/
bool first_iteration = 1; // On the first iteration of the loop,
// we wait for at least one cqe
// to be available,
// then remove one cqe.
// On each subsequent iteration,
// we try to remove one cqe
// without waiting.
// The loop terminates only
// when there are no more items
// left in the completion queue,
while (num_blocks_left_to_receive) { // Or when we've read in all of
// the blocks of the file.
struct io_uring_cqe *cqe;
if (first_iteration) { // On the first iteration
// we always wait until at
// least one cqe is available
rc = io_uring_wait_cqe(ring, &cqe); // This should always succeed
// and give us one cqe.
first_iteration = 0;
} else {
rc = io_uring_peek_cqe(ring, &cqe); // This will fail once there are
// no more items left in the
// completion queue.
if (rc == -EAGAIN) { // A return code of -EAGAIN means that
// there are no more items
// left in the completion queue.
break;
}
}
if (rc < 0) {
fprintf(stderr, "io_uring_peek_cqe: %s\n",
strerror(-rc));
exit(1);
}
assert(cqe);
// At this point we have a cqe, so let's see what our syscall returned.
struct my_custom_data *data =
(struct my_custom_data*) io_uring_cqe_get_data(cqe);
// Check if the read syscall returned an error
if (cqe->res < 0) {
// we're not handling EAGAIN because it should never happen.
fprintf(stderr, "cqe failed: %s\n",
strerror(-cqe->res));
exit(1);
}
// Check if the read syscall returned an unexpected number of bytes.
if ((size_t)cqe->res != data->nbytes_expected) {
// We're not handling short reads because
// they should never happen on a disk-based filesystem.
if ((size_t)cqe->res < data->nbytes_expected) {
puts("panic: short read");
} else {
puts("panic: read returned more data than expected (wtf).
Is the file changing while you're reading it??");
}
exit(1);
}
assert(data->offset_of_block_in_file % g_blocksize == 0);
// If we reach this point, then it means that there were no errors:
// the read syscall returned exactly what we expected.
// Since the read syscall returned, this means it has finished
// filling in the cell with ONE block of data from the file.
// This means that the cell can now be read by the consumer,
// so we need to update the cell state.
g_ringbuf[data->ringbuf_index].state = AVAILABLE_FOR_CONSUMPTION;
--num_blocks_left_to_receive; // We received ONE block of data from the file
io_uring_cqe_seen(ring, cqe); // mark the cqe as consumed,
// so that its slot can get reused
--num_unfinished_requests;
if (g_process_in_inner_loop)
resume_consumer();
}
/* Step 3: Run consumer. This might be thought of as handing over control
to the consumer "thread". See my article/blog post. */
resume_consumer();
}
resume_consumer();
close(g_filedescriptor);
io_uring_queue_exit(ring);
// Finalize the hash. BLAKE3_OUT_LEN is the default output length, 32 bytes.
uint8_t output[BLAKE3_OUT_LEN];
blake3_hasher_finalize(&g_hasher, output, BLAKE3_OUT_LEN);
// Print the hash as hexadecimal.
printf("BLAKE3 hash: ");
for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i) {
printf("%02x", output[i]);
}
printf("\n");
}
static void process_cmd_line_args(int argc, char* argv[])
{
if (argc != 9) {
printf("%s: infile g_blocksize g_queuedepth g_use_o_direct
g_process_in_inner_loop g_numbufs
g_use_iosqe_io_drain g_use_iosqe_io_link\n", argv[0]);
exit(1);
}
g_blocksize = atoi(argv[2]) * 1024; // The command line argument is in KiBs
g_queuedepth = atoi(argv[3]);
if (g_queuedepth > 32768)
puts("Warning: io_uring queue depth limit on Kernel 6.1.0 is 32768...");
g_use_o_direct = atoi(argv[4]);
g_process_in_inner_loop = atoi(argv[5]);
g_numbufs = atoi(argv[6]);
g_use_iosqe_io_drain = atoi(argv[7]);
g_use_iosqe_io_link = atoi(argv[8]);
}
static void open_and_get_size_of_file(const char* filename)
{
if (g_use_o_direct){
g_filedescriptor = open(filename, O_RDONLY | O_DIRECT);
puts("opening file with O_DIRECT");
} else {
g_filedescriptor = open(filename, O_RDONLY);
puts("opening file without O_DIRECT");
}
if (g_filedescriptor < 0) {
perror("open file");
exit(1);
}
if (get_file_size(g_filedescriptor, &g_filesize)){
puts("Failed getting file size");
exit(1);
}
}
static void calculate_numblocks_and_size_of_last_block()
{
// this is the mathematically correct way to do ceiling division
// (assumes no integer overflow)
g_num_blocks_in_file = (g_filesize + g_blocksize - 1) / g_blocksize;
// calculate the size of the last block of the file
if (g_filesize % g_blocksize == 0)
g_size_of_last_block = g_blocksize;
else
g_size_of_last_block = g_filesize % g_blocksize;
}
static void allocate_ringbuf()
{
// We only make 2 memory allocations in this entire program
// and they both happen in this function.
assert(g_blocksize % ALIGNMENT == 0);
// First, we allocate the entire underlying ring buffer
// (which is a contiguous block of memory
// containing all the actual buffers) in a single allocation.
// This is one big piece of memory which is used to hold the
// actual data from the file.
// The buf_addr field in the my_custom_data struct points to a
// buffer within this ring buffer.
unsigned char* ptr_to_underlying_ring_buffer;
// We need aligned memory, because O_DIRECT requires it.
if (posix_memalign((void**)&ptr_to_underlying_ring_buffer,
ALIGNMENT, g_blocksize * g_numbufs)) {
puts("posix_memalign failed!");
exit(1);
}
// Second, we allocate an array containing all of the my_custom_data structs.
// This is not an array of pointers, but an array holding all of the actual structs.
// All the items are the same size, which makes this easy
g_ringbuf = (struct my_custom_data*)
malloc(g_numbufs * sizeof(struct my_custom_data));
// We partially initialize all of the my_custom_data structs here.
// (The other fields are initialized in the
// add_read_request_to_submission_queue function)
off_t cur_offset = 0;
for (int i = 0; i < g_numbufs; ++i) {
g_ringbuf[i].buf_addr = ptr_to_underlying_ring_buffer +
cur_offset; // This will never change during the runtime of this program.
cur_offset += g_blocksize;
// g_ringbuf[i].bufsize = g_blocksize; // all the buffers are the same size.
g_ringbuf[i].state = ALREADY_CONSUMED; // We need to set all cells to
// ALREADY_CONSUMED at the start. See my article/blog post for explanation.
g_ringbuf[i].ringbuf_index = i; // This will never change during the
// runtime of this program.
}
}
int main(int argc, char *argv[])
{
assert(sizeof(size_t) >= 8); // we want 64 bit size_t for files larger than 4GB...
process_cmd_line_args(argc, argv); // we need to first parse the command line
// arguments
open_and_get_size_of_file(argv[1]);
calculate_numblocks_and_size_of_last_block();
allocate_ringbuf();
// Initialize the hasher.
blake3_hasher_init(&g_hasher);
// Run the main loop
producer_thread();
}
它是如何工作的?
基本上,我的程序从磁盘读取文件,然后对其进行散列。更具体地说,我的程序发出文件块的读取请求,然后在返回的文件块到达后对其进行散列。因此,它几乎是使用 io_uring
的最简单的程序。
要真正理解我的程序,读者需要知道 io_uring
是一个异步 API,它由两个环组成:提交队列(SQ)和完成队列(CQ)。它的使用方法很简单:您将您的请求,称为提交队列条目(SQE),放入 SQ,然后调用 submit()
,然后轮询或等待 CQ 以获取完成,也称为完成队列事件(CQE)。关于 io_uring
要记住的重要一点是,默认情况下,提交给 io_uring
的请求可能以任何顺序完成。有两个标志可以用来强制 io_uring
按顺序完成请求(IOSQE_IO_DRAIN
和 IOSQE_IO_LINK
),但它们会大大降低程序的性能,从而失去了使用 io_uring
的初衷(当然,这考虑到它们的工作方式是有意义的——这些标志意味着下一个 sqe 要到前一个完成后才会开始,这几乎挫败了异步 API 的全部意义……实际上,我发现 IOSQE_IO_LINK
甚至不起作用——我仍然得到了乱序的完成——但它确实大大降低了我的程序速度……不过那是早期版本的时候了)——您实际上可以自己轻松检查这一点:我的程序将这些标志作为命令行参数公开。我实际上一开始就尝试使用这些标志,直到我无法获得可接受的性能时,我才最终接受我必须处理乱序的完成。实际上,有一些用例不需要按顺序处理文件块(例如文件复制)。不幸的是,文件散列不是其中之一——在散列文件时,我们通常需要按顺序处理文件块(至少,BLAKE3 C API 不提供按任意顺序散列块的选项)。因此,我们需要一种方法来处理乱序的完成。
所以,我能想到的最简单的方法就是分配一个足够大的数组来在内存中保存整个文件,然后,每当收到一个完成时,我们就将其放入数组中相应的位置。例如,如果一个文件由 100 个块组成,那么我们将分配一个包含 100 个插槽的数组,当我们收到,比如说,第 5 个块的完成时,我们将该完成放入数组的第 5 个插槽中。这样,文件散列函数就可以简单地迭代这个数组,在继续之前等待数组的下一个插槽被填充完成。
为了达到速度极限,我必须再进行一次优化。这项优化是使用固定大小的环形缓冲区,而不是分配一个足够大的单个缓冲区来容纳文件的所有块。除了速度更快之外,它还具有使用更少内存的优点——这实际上是我实现它的初衷——我没想到它会提高我的程序性能,我只实现了它,因为我用来散列大文件的程序导致我的电脑内存不足。因此,我很高兴地发现它实际上提高了我的程序性能。
现在,环形缓冲区的实现有点复杂,所以让我们思考一下这个问题:这个算法的正确性属性是什么?我这么想的
消费者将按顺序且只处理文件中的所有块一次。
按顺序且只处理文件中的所有块一次,这实际上是很容易实现的。棘手的部分是确保它将处理文件中的所有块。我们如何确保它将按顺序且只处理文件中的所有块一次?很简单:通过以下实现:消费者只需迭代数组并查看下一个单元格的状态和块偏移量——如果单元格的状态是“准备好处理”,并且块偏移量是它想要的块的偏移量,那么消费者就会处理它。消费者所需要做的就是内部跟踪它已处理的最后一个块的偏移量,并将其块大小添加到该偏移量以获得它想要的下一个块的偏移量。
但还有另一个我们不能忘记的考虑因素——队列深度。更准确地说,正如我在代码注释中解释的,它本质上是正在进行中的请求的数量。为什么这很重要?因为 io_uring
是一个异步 API——提交会立即返回,所以您可以不断地放入越来越多的请求,提交它们,并且(至少在某些内核版本上)最终系统会耗尽内存。因此,我们需要限制在任何给定时刻可以同时进行的请求的数量,最简单的方法是使用计数器——每当发出一个请求时,我们就递减计数器,每当收到一个完成时,我们就递增计数器。当计数器达到 0 时,我们就停止发出新请求,只等待完成。通过这种方式,我们可以确保同时进行的请求数量永远不会超过计数器的初始大小。
有了这个设计,我们一次就不能发出整个文件的请求。如果文件由,比如说,100 万个块组成,我们不能在程序开始时就发出 100 万个读取请求并只轮询完成,因为我们不能允许在任何给定时刻同时进行的请求数量超过一小部分。相反,我们必须限制自己一次只发送几个块的请求,并在发送任何新请求之前等待至少一个请求完成。
举个例子。假设我们有一个有 5 个插槽的环形缓冲区,以及一个队列深度限制为 3,这意味着一次最多只能有 3 个正在进行的请求。在程序开始时,我们可以发出 3 个请求。那么问题是,我们想发出哪 3 个请求?答案很明显:前 3 个块(对应于前 3 个插槽),并且我们想按块 0、然后块 1、然后块 2 的顺序发出它们。为什么?因为尽管不能保证请求的完成顺序与发出的顺序相同,但内核至少会按发出的顺序尝试这些请求(毕竟它是一个环形缓冲区)。我们希望第一个块首先完成,因为这将允许消费者尽快开始工作——为了效率,我们希望消费者始终处理一些数据,所以我们希望第一个块首先返回(而不是,比如说,第二个块首先返回)。因此,进一步发展这个原理,这意味着我们希望按顺序发出请求,以尽快解除消费者的阻塞。这意味着,如果消费者当前正在等待块 i 变为“可用于处理”,那么生产者必须在发出任何其他块的请求之前发出块 i 的请求。向前推理,生产者也应该在发出块 i+2 的请求之前发出块 i+1 的请求,并且应该在发出块 i+3 的请求之前发出块 i+2 的请求,以此类推。因此,生产者应该按 0、1、2、3、……的顺序发出块的请求,直到最后一个块。
因此,我们现在有了消费者,它基本上有一个“消费头”,当它可以移动时,它会不断地向前移动到环形缓冲区中的下一个单元格,但“生产者头”也是如此,我认为它更准确的名称是“请求者头”。因此,我们有两个指针
- 一个“消费者头”指针,它使消费者能够线性地遍历环形缓冲区,一次一个地顺序处理单元格。
- 一个“请求者头”指针,它使生产者能够线性地遍历环形缓冲区,一次一个地顺序发出单元格的请求。
“请求者头”指向生产者将发出下一个读取请求的单元格——然后读取系统调用将用文件中的块填充该单元格。一旦生产者发出了一个块的请求,它就应该发出下一个块的请求,正如我所解释的,除非它不能。它什么时候不能?有两种情况:第一,如果同时进行的请求数量已达到上限,那么在收到一个完成之前,它不能再发出任何新请求。第二,如果下一个单元格不可用,它就不能发出请求。重要的是要记住,块被映射到环形缓冲区中的单元格,并且发出下一个文件块的请求意味着它需要环形缓冲区中的下一个单元格是“空闲的”,即,可用于填充。现在,如果消费者当前正在读取该单元格,那么显然它不能为该单元格发出请求。但也有其他情况,它不能为该单元格发出请求,例如,如果它已经为该单元格发出了请求,但消费者尚未读取该单元格。我认为通过一个例子来理解这一点更容易。
基本上有两种情况。第一种是队列深度小于环形缓冲区的插槽数,第二种是队列深度大于环形缓冲区的插槽数。让我们逐一分析。请看这个图
首先,假设我们将队列深度设置为 3,环形缓冲区大小设置为 5 个插槽。最初(图 1.),环形缓冲区是空的,所以消费者不能移动,但生产者可以开始发出请求,然后它就这么做了。由于队列深度是 3,这意味着生产者一开始最多可以发出 3 个请求。所以现在(图 2.),我们已经发出了 3 个请求,每个块 0、1 和 2 分别对应插槽 0、1 和 2。这里我使用符号“r”来表示我们已经发出读取请求但尚未返回的插槽。这意味着消费者无法读取它们,因为插槽尚未包含数据。现在,假设插槽 1 和 2 的读取请求已返回,但插槽 0 的读取请求仍未返回(图 3.)。这意味着,尽管插槽 1 和 2 可供读取,但消费者仍然被阻塞,等待插槽 0 可用于处理。但是,由于我们已经收到了块 1 和 2 的完成,这意味着我们可以发出另外 2 个请求,所以我们为尚未发出请求的下一个 2 个块发出请求,即插槽 3 和 4。现在(图 4.),假设插槽 4 的读取调用已返回,但插槽 0 仍在等待。但是现在,正在进行的请求数量只有 2 个,这意味着生产者不受队列深度的限制而无法发出更多读取请求。然而,消费者仍然被阻塞,等待插槽 0 可用于消耗。此时,生产者是否可以覆盖它?现在,生产者想要为块 5 发出读取请求,写入插槽 0,但消费者尚未看到块 0,所以很明显,在消费者处理完它之前,生产者不应该为插槽 0 发出请求。由此,我们可以推断出一个通用原则。早些时候,我们将每个插槽映射到一个块列表,例如,如果环形缓冲区有 5 个插槽,那么我们将插槽 0 映射到块 0、块 5、块 10、块 15 等等。我希望这张图表明,在消费者完成块 0 之前,我们需要确保不要请求块 5,因为它们都映射到同一个插槽,这意味着如果我们请求块 5 在消费者完成块 0 之前,那么我们将覆盖块 0 将要存储的插槽。更广泛地说,通过确保我们仅在消费者完成前一个块之后才为每个插槽请求下一个块,我们确保我们不会覆盖正在处理或尚未处理的块。
所以,这是看待问题的一种方式。另一种看待方式是根据单元格的状态来思考。在图中,我们显示了单元格可能存在的两种状态——“r”状态和“a”状态。现在,我们可以问:生产者要发出对该单元格的请求,该单元格必须处于什么状态?嗯,如果单元格处于“r”状态,那么这意味着生产者已经为该单元格发出了请求,而消费者尚未消耗该单元格(显然,因为它还不可用),这意味着如果我们覆盖该单元格,那么我们将覆盖我们已经请求过的块。我所说的情况是,当生产者移动到下一个单元格,并发现该单元格处于“r”状态时——这只能意味着一件事,那就是生产者已经绕环形缓冲区整整一圈,现在正在查看一个单元格,该单元格正在等待接收我们接下来想要请求的块的“前一个”块,我们已经解释过在这种情况下发出请求是错误的。实际上,我认为这种看待方式——根据单元格状态——更容易理解。那么,有多少种单元格状态?我想到了几种
- 空(程序开始时)
- 已请求但尚未完成
- 读取完成,可供处理
- 正在处理中
- 处理完成
但我们真的需要表示所有这些状态吗?嗯,状态的最终目的是什么?它们是为了确保程序的正确性,那就是
- 消费者不消耗不正确的块,并且只按正确的顺序消耗正确的块,而且只消耗一次
- 消费者不会卡住
为了确保这些正确性属性,我们只需要确保以下几点
- 消费者正好在可以这样做的时候前进
- 生产者正好在可以这样做的时候前进
我使用“正好”这个词来传达两个要求:消费者/生产者不应在不安全时前进(正确性),并且一旦安全就必须前进(效率)。由于这是单元格状态的唯一目的,因此我们可以将状态精简为这 4 种
- 对消费者可用,但对生产者不可用
- 对生产者可用,但对消费者不可用
- 对两者都可用
- 对两者都不可用
事实上,状态 3 不可能存在,因为我们不能让一个单元格同时被消费者和生产者访问——那将是数据竞争(因为生产者在写入单元格,而消费者正在读取它)因此是未定义行为。所以实际上只有 3 种状态
- 对消费者可用,但对生产者不可用(我将称之为“a”状态——“可供处理”)
- 对生产者可用,但对消费者不可用(我将称之为“c”状态——“已消耗”)
- 对两者都不可用(我将称之为“r”状态——“已请求但尚未完成”)
然后,我们可以将前面提到的状态映射到这 3 种状态。
那么,现在我们有了环形缓冲区中单元格的完整生命周期:在程序开始时,所有单元格都是空的,所以消费者无法消耗任何一个,但生产者可以发出对它们的请求,所以最初我们需要将数组中的每个块标记为“c”状态,以便生产者发出请求,但消费者不会尝试消耗它们。一旦生产者发出了对该单元格的请求,该单元格将进入“r”状态,直到读取请求完成。为什么?因为如前所述,如果生产者已经发出了对单元格的读取请求,但读取请求尚未返回,那么生产者不应该覆盖该单元格,消费者也无法处理它,因为它还没有被填充,所以“r”状态在这里是合适的。接下来,读取请求返回,所以我们将单元格更改为“a”状态。为什么?因为在读取请求返回一个单元格后,该单元格现在可以被消费者处理,但我们不希望生产者在消费者完成之前就为该单元格发出请求,因此“a”状态在这里是合适的。当消费者处理单元格时,生产者不应该为它发出请求,所以单元格可以保持“a”状态。当消费者完成单元格时,现在生产者可以为它发出请求,但消费者不应该再次尝试消耗该单元格(因为它已经对文件块进行了散列,再次将其传递给哈希更新函数将导致不正确的结果),因此我们必须将单元格转换为“c”状态。最后,一旦生产者头再次到达单元格,循环就会重新开始。
但是这个设计如何满足我们希望程序具有的正确性属性呢?我们希望有些证明可以证明消费者不会卡住。所以,假设消费者头当前指向插槽 i。什么会阻止它移动到插槽 i+1?唯一的事情就是当前插槽不是“a”状态——如果它是“a”状态,那么消费者将处理它,然后继续下一个插槽。所以我们只需要证明,如果一个插槽处于非“a”状态,它最终将过渡到“a”状态。那么我们如何证明呢?嗯,只有两种非“a”状态:“r”和“c”。如果一个插槽处于“r”状态,那么它只是等待读取系统调用返回,当读取系统调用返回时,状态将变为“a”,所以所有“r”单元格最终都会变为“a”。如果一个插槽处于“c”状态,那么我们需要一些保证,生产者最终会将其转换为“r”状态。这也很容易证明——生产者会不断地遍历数组,将所有“c”转换为“r”,只有在同时进行的请求数量达到上限,或者生产者到达一个不是“c”状态的单元格时才会停止(请注意,这里我谈论的是概念上的高级抽象设计,而不是任何特定的实现)。我们只需要证明,生产者永远不会被一个非“c”单元格挡住而无法到达一个“c”单元格。好吧,通过图表可能更容易看到。
上图显示了一个大小为 5 的环形缓冲区,队列深度为 3(图显示了队列深度小于环形缓冲区大小时发生的情况——它只是意味着请求者循环在更多完成到来之前不会发出更多请求。正如您所见,当生产者头一次遍历环形缓冲区的一个插槽时,每次都请求文件的下一个块,所以消费者跟随生产者头,所以每次消费者处理一个新的插槽时,就是文件的下一个块。如果队列深度大于或等于环形缓冲区的大小,那么情况就更简单了:请求者循环将只为数组中处于“c”状态的每个单元格发出请求。开始时,消费者和生产者头都指向第一个插槽,即插槽 0。现在,此时,消费者头卡住了,无法移动,但生产者头可以移动,所以生产者头先移动——它提交请求,生产者头移动 3 个插槽,直到达到队列深度,然后我们检查完成。现在这里可能发生不同的事情。左侧是如果所有 3 个请求在那时都返回会发生什么。在这种情况下,所有处于“r”状态的插槽现在都处于“a”状态,所以当我们运行消费者时,消费者将愉快地遍历所有这些“a”插槽,在处理它们的同时进行处理,在完成处理后将它们变成“c”插槽,最后,它将到达仍然是“c”状态的插槽 3,所以现在消费者停止并将控制权交还给生产者。在另一个宇宙中,如果插槽 0 仍然处于“r”状态,那么消费者仍然在等待插槽 0,因此将控制权交还给生产者,生产者然后将继续发出插槽 3 和 4 的请求,因为正在进行的请求数量已降至队列深度以下。然后我们请求完成,看到插槽 4 已返回,所以我们将其转换为“a”,我们再次尝试运行消费者,但它仍在等待插槽 0,然后我们回到生产者,它也在等待插槽 0。
所以,我想证明生产者永远不会永久阻塞,这是因为,如图所示,当生产者遍历环形缓冲区时,它会留下处于“r”状态的单元格,这些单元格最终会过渡到“a”状态。同样,当消费者遍历环形缓冲区时,它会留下处于“c”状态的单元格。因此,我们确保“c”单元格和“a”/“r”单元格不会“混合”——消费者头“后面”但生产者头“前面”的所有单元格都是“c”,而消费者头“后面”但生产者头“前面”的所有单元格都是“r”或“a”。因此,如图所示,当生产者头到达处于“r”或“a”状态的插槽时,这意味着它已经追上了消费者头,因为如果它落后于消费者头,它就会看到“c”状态的单元格。因此,这意味着它与消费者头指向的同一个单元格。如果单元格处于“a”状态,那么消费者将处理它并将其转换为“c”状态,如果它处于“r”状态,那么它最终将变为“a”状态,然后消费者将处理它并将其转换为“c”状态。
上面的推理是针对通用的多线程实现,其中生产者和消费者在单独的线程上运行。但我们的单线程环形缓冲区实现呢?实际上,我上面显示的事件顺序与我的单线程环形缓冲区实现中的事件顺序是一致的,它由一个主循环组成,该循环由 3 个子循环组成,它们按顺序执行——一个“请求者循环”发出请求,“完成循环”检查完成,最后是“消费者循环”运行消费者。我想证明这个执行顺序永远不会卡住。所以,首先,请求者循环可以发出零个或多个请求。正如我们所讨论的,它将发出请求,直到同时进行的请求数量达到队列深度,或者直到生产者头到达一个不是“c”状态的单元格。所以这个循环运行得非常快,永远不会卡住。接下来,我们等待直到完成队列至少包含一个完成,然后我们从完成队列中移除所有完成。那么,这个循环会卡住吗?这只有在我们永远等待看到一个完成时才可能发生。而这只有在我们没有任何“r”状态的单元格时才可能发生。为什么?因为如果至少有一个单元格处于“r”状态,那么读取系统调用最终会返回,我们就会得到一个完成事件。那么,在我们等待完成队列时,是否存在没有“r”状态单元格的可能性?
让我们考虑一下发生这种情况所需的条件
- 没有正在进行的请求
- 请求者循环没有发出任何请求
那么,请求者循环没有发出任何请求,但也没有正在进行的请求的情况是什么?只有当生产者头指向的单元格不是“c”状态时才可能。并且由于没有“r”状态的单元格,这意味着该单元格必须处于“a”状态。我之前已经解释过,只有在生产者头追上了消费者头的情况下才可能出现这种情况(因为消费者在消费者头后面留下了一个“c”状态的单元格轨迹)。这意味着生产者和消费者头指向同一个单元格。现在,由于请求者循环没有发出任何请求,这意味着生产者头在请求者循环的开始到结束之间没有移动,所以请求者循环开始时,生产者和消费者头都指向同一个单元格,该单元格处于“a”状态。这意味着上一次消费者循环的运行必须以消费者头指向一个处于“a”状态的单元格结束(它不可能是“r”状态,因为我们只在完成循环中将单元格状态从“r”更新为“a”,所以从消费者循环开始到请求者循环结束,单元格的状态不能从“r”变为“a”。另一种思考方式是,如果消费者循环以消费者头指向一个处于“r”状态的单元格结束,那么该单元格在请求者循环中也会一直处于“r”状态,因为请求者循环不能将“r”单元格转换为“a”单元格,这意味着在进入完成循环时该单元格将处于“r”状态,而我刚刚解释了这里不可能)。因此,这是不可能的,因为消费者将尝试处理所有处于“a”状态的单元格,将它们转换为“c”状态,因此,消费者头不可能指向消费者循环结束时的“a”状态单元格。
因此,我们已经证明完成循环不可能卡住。最后一个循环,消费者循环,也卡不住,因为它只是遍历数组,处理单元格,直到遇到一个不是“c”状态的单元格,此时它就终止了。
所以,我们已经确定,我们的单线程环形缓冲区实现永远不会卡住。那么正确性呢?嗯,因为我们使用状态来确保正确性,所以我们只需要确保状态始终是正确的。在单线程实现中,这三个循环按顺序执行。这确保了状态始终是正确的。当请求者循环终止时,它触及的单元格,即转换为“r”状态的单元格,确实处于“r”状态。当完成循环终止时,它触及的单元格,即转换为“a”状态的单元格,确实处于“a”状态。当消费者循环终止时,它触及的单元格,即转换为“c”状态的单元格,确实处于“c”状态。由于循环按顺序运行,因此可以保证从每个循环的角度来看,单元格状态始终是正确的,因此,例如,消费者可以确信处于“a”状态的单元格确实可以供它处理。在多线程实现中,需要同步来防止生产者线程和消费者线程同时访问单元格状态,因为这将是数据竞争,因此是未定义行为。
现在,关于重新发出请求呢?例如,如果读取请求失败,或者您收到了一个短读,那么您可能需要重新发出读取请求(使用更新的参数)。现在,在我的程序中,消费者按顺序消耗,而生产者也按顺序发出请求,但如果您想重新发出请求,那么这意味着发出一个您之前已经请求过的块的请求。但这实际上完全没问题。为什么可以?因为我的程序实际上不关心请求发出的顺序。它只关心生产者和消费者头是否一次一个单元格地在数组中移动,单元格是否映射到正确的块,以及单元格的状态是否始终正确。也就是说,如果一个单元格过渡到“a”状态,那么消费者处理它就必须是没问题的。这实际上与读取请求返回错误或不完整读取并且您想重新发出请求的用例完全一致——在这种情况下,您只是不将单元格转换为“a”状态,而是将其保留在“r”状态并重新发出请求。从消费者和生产者的角度来看,这看起来就像单元格比平时更长地停留在“r”状态。您可以在完成循环中实现这一点——如果系统调用返回值(cqe 中的 res 字段)不符合预期,那么您可以立即在那里重新发出请求。
概述
以下是其工作原理的简单概述
- 最初,生产者将开始遍历数组,向前移动“请求者头”,并且它留下的插槽最初都标记为“已请求但尚未完成”,并且它们将随着时间的推移以不确定的顺序变为“可供消耗”。
- 消费者随后跟进,一旦下一个插槽变为“可供消耗”,就向前移动“消费者头”,并留下标记为“已消耗”的插槽轨迹。
- 生产者最终会绕回到数组的开头,它会等待数组的第一个插槽变为“已消耗”状态,因为仅有的另外 2 种状态是“已请求但尚未完成”和“可供消耗”,并且它不能为处于这些状态的插槽发出请求,因为它们在语义上是“尚未消耗”。
因此,生产者不断等待下一个插槽变为“已消耗”,而消费者不断等待下一个插槽变为“可供消耗”。请求者留下“已请求但尚未完成”的插槽轨迹,这些插槽最终将变为“可供消耗”,而消费者留下“已消耗”的插槽轨迹。
这是一个图示
因此,在上图中,P 箭头表示生产者“请求者头”的位置(它在概念上是生产者——不一定是实际线程),而 C 箭头表示消费者的“消费者头”。正如您所见,只要生产者头前面有一个“c”(“已消耗”)插槽,其“请求者头”就会向前移动,而只要消费者头前面有一个“a”(“可供消耗”)插槽,其“消费者头”也会向前移动。因此,我们可以看到,最终,请求者头“后面”的所有插槽都将变为“a”,而消费者头“后面”的所有插槽将立即标记为“c”。因此,我们拥有这样一个属性:消费者头“前面”但生产者头“后面”的所有插槽最终都会变为“a”,而消费者头“后面”但生产者头“前面”的所有插槽始终是“c”。开始时所有插槽都标记为“c”,所以生产者头会先移动,而消费者头不能移动。然后,当生产者留下“r”的轨迹时,一些“r”会变成“a”,所以消费者头也会开始移动。因此,我们建立了主要的循环,其中生产者向前移动,消耗“c”并留下变为“a”的“r”,而消费者则在后面跟随,消耗“a”并将它们变回“c”。
多线程实现
在本节中,我将解释多线程实现与单线程实现有何不同,以及为什么我认为它是正确的且不会卡住。要理解本节,您需要阅读前面的部分,其中解释了单线程实现的工作原理以及为什么我认为它是正确的且不会卡住。
这是我程序的多线程版本的代码
/* SPDX-License-Identifier: MIT */
/*
* Compile with: g++ -Wall -O3 -D_GNU_SOURCE liburing_b3sum_multithread.cc
* -luring libblake3.a -o liburing_b3sum_multithread
* For an explanation of how this code works,
* see my article/blog post: https://1f604.com/b3sum
*
* Note: the comments in this program are copied unchanged from the single-threaded
* implementation in order to make it easier for a reader using `diff` to see
* the code changes from the single-thread version
*
* This program is a modified version of the liburing cp program
* from Shuveb Hussain's io_uring tutorial.
* Original source code here:
* https://github.com/axboe/liburing/blob/master/examples/io_uring-cp.c
* The modifications were made by 1f604.
*
* The official io_uring documentation can be seen here:
* - https://kernel.dk/io_uring.pdf
* - https://kernel-recipes.org/en/2022/wp-content/uploads/2022/06/axboe-kr2022-1.pdf
*
* Acronyms: SQ = submission queue, SQE = submission queue entry,
* CQ = completion queue, CQE = completion queue event
*/
#include "blake3.h"
#include "liburing.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/ioctl.h>
// Multithreading stuff
#include <atomic>
#include <thread>
/* Constants */
static const int ALIGNMENT = 4 * 1024; // Needed only because O_DIRECT
// requires aligned memory
/* ===============================================
* ========== Start of global variables ==========
* ===============================================
* Declared static because they are only supposed to be visible within this .c file.
*
* --- Command line options ---
* The following variables are set by the user from the command line.
*/
static int g_queuedepth; /* This variable is not really the queue depth,
* but is more accurately described as
* "the limit on the number of incomplete requests".
* Allow me to explain. io_uring allows you to
* have more requests in-flight than the
* size of your submission
* (and completion) queues. How is this possible?
* Well, when you call io_uring_submit,
* normally it will submit ALL of the requests
* in the submission queue, which means that
* when the call returns, your submission queue
* is now empty, even though the requests
* are still "in-flight" and haven't been
* completed yet!
* In earlier kernels, you could overflow
* the completion queue because of this.
* Thus it says in the official io_uring
* documentation
* (https://kernel.dk/io_uring.pdf):
* Since the sqe lifetime is only that of the
* actual submission of it, it's possible
* for the application to drive a higher pending
* request count than the SQ ring size would
* indicate. The application must take care
* not to do so, or it could risk
* overflowing the CQ ring.
* That is to say, the official documentation
* recommended that applications should ensure
* that the number of in-flight requests does not
* exceed the size of the submission queue.
* This g_queuedepth variable is therefore a limit
* on the total number of "incomplete"
* requests, which is the number of requests
* on the submission queue plus the number of
* requests that are still "in flight".
* See num_unfinished_requests for details
* on how this is implemented. */
static int g_use_o_direct; // whether to use O_DIRECT
static int g_process_in_inner_loop; // whether to process the data
// inside the inner loop
static int g_use_iosqe_io_drain; // whether to issue requests with the
// IOSQE_IO_DRAIN flag whether to issue
static int g_use_iosqe_io_link; // requests with the IOSQE_IO_LINK flag
//static int g_use_ioring_setup_iopoll; // when I enable either IORING_SETUP_SQPOLL or
// IORING_SETUP_IOPOLL, on my current system,
//static int g_use_ioring_setup_sqpoll; // it turns my process into an unkillable zombie
// that uses 100% CPU that never terminates.
// when I was using encrypted LVM,
// it just gave me error:
// Operation not supported.
// I decided to not allow users to enable
// these options because I didn't want them
// to accidentally launch an unkillable
// never-ending zombie process that
// uses 100% CPU.
// I observed this problem in fio too when
// I enabled --hipri on fio, it also turned into
// an unkillable never-ending zombie process
// that uses 100% CPU.
static size_t g_blocksize; // This is the size of each buffer in the
// ringbuf, in bytes.
// It is also the size of each read from the file.
static size_t g_numbufs; // This is the number of buffers in the ringbuf.
/* --- Non-command line argument global variables --- */
blake3_hasher g_hasher;
static int g_filedescriptor; // This is the file descriptor of the
// file we're hashing. This will be filled in
static size_t g_filesize; // by the function that gets the file size.
static size_t g_num_blocks_in_file; // The number of blocks in the file,
// where each block is g_blocksize bytes.
// This will be calculated by a ceiling
// division of filesize by blocksize.
static size_t g_size_of_last_block; // The size of the last block in the file.
// See calculate_numblocks_and_size_of_last_block.
static int producer_head = 0; // Position of the "producer head".
// See explanation in my article/blog post
static int consumer_head = 0; // Position of the "consumer head".
// See explanation in my article/blog post
#define AVAILABLE_FOR_CONSUMPTION 1
#define ALREADY_CONSUMED 2
#define REQUESTED_BUT_NOT_YET_COMPLETED 3
struct my_custom_data { // This is the user_data associated
// with read requests,
// which are placed on the submission ring.
// In applications using io_uring,
// the user_data struct is generally used to
// identify which request a completion is for.
// In the context of this program,
// this structure is used both to identify which
// block of the file the read syscall had just read,
// as well as for the producer and consumer to
// communicate with each other, since it holds
// the cell state. This can be thought of
// as a "cell" in the ring buffer,
// since it holds the state of the cell as well
// as a pointer to the data (i.e. a block read
// from the file) that is "in" the cell.
// Note that according to the official
// io_uring documentation, the user_data struct
// only needs to be valid until the submit is done,
// not until completion.
// Basically, when you submit, the kernel makes
// a copy of user_data and returns it to you
// with the CQE (completion queue entry).
unsigned char* buf_addr; // Pointer to the buffer where the read syscall
// is to place the bytes from the file into.
size_t nbytes_expected; // The number of bytes we expect the read syscall to return.
// This can be smaller than the size of the buffer
// because the last block of the file
// can be smaller than the other blocks.
//size_t nbytes_to_request; // The number of bytes to request.
// This is always g_blocksize.
// I made this decision because O_DIRECT requires
// nbytes to be a multiple of filesystem block size,
// and it's simpler to always just request
// g_blocksize.
off_t offset_of_block_in_file; // The offset of the block in the file that
// we want the read syscall to place into
// the memory location
// pointed to by buf_addr.
std::atomic_int state; // Describes whether the item is available
// to be hashed, already hashed,
// or requested but not yet available for hashing.
int ringbuf_index; // The slot in g_ringbuf where this "cell" belongs.
// I added this because once we submit a request
// on submission queue, we lose track of it.
// When we get back a completion,
// we need an identifier to know which request
// the completion is for.
// Alternatively, we could use something to map
// the buf_addr to the ringbuf_index,
// but this is just simpler.
};
// multithreading function
static void update_cell_state(struct my_custom_data* data, int new_state)
{ // State updates need to be
// synchronized because both threads look at the state
// In effect they communicate via state updates.
// both the consumer and producer threads block/wait for state change to continue
data->state = new_state;
}
struct my_custom_data* g_ringbuf; // This is a pointer to an array of my_custom_data structs.
// These my_custom_data structs can be thought of as the
// "cells" in the ring buffer (each struct contains the cell state),
// thus the array that this points to can be
// thought of as the "ring buffer" referred to
// in my article/blog post, so read that to understand how this is used.
// See the allocate_ringbuf function for details on
// how and where the memory for the ring buffer is allocated.
/* ===============================================
* =========== End of global variables ===========
* ===============================================*/
static int setup_io_uring_context(unsigned entries, struct io_uring *ring)
{
int rc;
int flags = 0;
rc = io_uring_queue_init(entries, ring, flags);
if (rc < 0) {
fprintf(stderr, "queue_init: %s\n", strerror(-rc));
return -1;
}
return 0;
}
static int get_file_size(int fd, size_t *size)
{
struct stat st;
if (fstat(fd, &st) < 0)
return -1;
if (S_ISREG(st.st_mode)) {
*size = st.st_size;
return 0;
} else if (S_ISBLK(st.st_mode)) {
unsigned long long bytes;
if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
return -1;
*size = bytes;
return 0;
}
return -1;
}
static void add_read_request_to_submission_queue
(struct io_uring *ring, size_t expected_return_size, off_t fileoffset_to_request)
{
assert(fileoffset_to_request % g_blocksize == 0);
int block_number = fileoffset_to_request / g_blocksize; // the number of the block in the file
/* We do a modulo to map the file block number to the index in the ringbuf
e.g. if ring buf_addr has 4 slots, then
file block 0 -> ringbuf index 0
file block 1 -> ringbuf index 1
file block 2 -> ringbuf index 2
file block 3 -> ringbuf index 3
file block 4 -> ringbuf index 0
file block 5 -> ringbuf index 1
file block 6 -> ringbuf index 2
And so on.
*/
int ringbuf_idx = block_number % g_numbufs;
struct my_custom_data* my_data = &g_ringbuf[ringbuf_idx];
assert(my_data->ringbuf_index == ringbuf_idx); // The ringbuf_index of
// a my_custom_data struct should never change.
my_data->offset_of_block_in_file = fileoffset_to_request;
assert (my_data->buf_addr); // We don't need to change my_data->buf_addr
// since we set it to point into the backing buffer at the start of the program.
my_data->nbytes_expected = expected_return_size;
update_cell_state(my_data, REQUESTED_BUT_NOT_YET_COMPLETED);
/*my_data->state = REQUESTED_BUT_NOT_YET_COMPLETED; /* At this point:
* 1. The producer is about
* to send it off in a request.
* 2. The consumer shouldn't be
* trying to read this buffer
* at this point.
* So it is okay to set the
* state to this here.
*/
struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
if (!sqe) {
puts("ERROR: FAILED TO GET SQE");
exit(1);
}
io_uring_prep_read(sqe, g_filedescriptor, my_data->buf_addr,
g_blocksize, fileoffset_to_request);
// io_uring_prep_read sets sqe->flags to 0,
// so we need to set the flags AFTER calling it.
if (g_use_iosqe_io_drain)
sqe->flags |= IOSQE_IO_DRAIN;
if (g_use_iosqe_io_link)
sqe->flags |= IOSQE_IO_LINK;
io_uring_sqe_set_data(sqe, my_data);
}
static void increment_buffer_index(int* head) // moves the producer or consumer
// head forward by one position
{
*head = (*head + 1) % g_numbufs; // wrap around when we reach the end of the ringbuf.
}
static void consumer_thread() // Conceptually, this resumes the consumer "thread".
{ // As this program is single-threaded,
// we can think of it as cooperative multitasking.
for (int i = 0; i < g_num_blocks_in_file; ++i){
while (g_ringbuf[consumer_head].state
!= AVAILABLE_FOR_CONSUMPTION) {} // busywait
// Consume the item.
// The producer has already checked that nbytes_expected is
// the same as the amount of bytes actually returned.
// If the read syscall returned something different to nbytes_expected
// then the program would have terminated with an error message.
// Therefore it is okay to assume here that nbytes_expected
// is the same as the amount of actual data in the buffer.
blake3_hasher_update(&g_hasher, g_ringbuf[consumer_head].buf_addr,
g_ringbuf[consumer_head].nbytes_expected);
// We have finished consuming the item, so mark it as consumed
// and move the consumer head to point to the next cell in the ringbuffer.
update_cell_state(&g_ringbuf[consumer_head], ALREADY_CONSUMED);
increment_buffer_index(&consumer_head);
}
// Finalize the hash. BLAKE3_OUT_LEN is the default output length, 32 bytes.
uint8_t output[BLAKE3_OUT_LEN];
blake3_hasher_finalize(&g_hasher, output, BLAKE3_OUT_LEN);
// Print the hash as hexadecimal.
printf("BLAKE3 hash: ");
for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i) {
printf("%02x", output[i]);
}
printf("\n");
}
static void producer_thread()
{
int rc;
unsigned long num_blocks_left_to_request = g_num_blocks_in_file;
unsigned long num_blocks_left_to_receive = g_num_blocks_in_file;
unsigned long num_unfinished_requests = 0;
/* A brief note on how the num_unfinished_requests variable is used:
* As mentioned earlier, in io_uring it is possible
* to have more requests in-flight than the size of the completion ring.
* In earlier kernels this could cause the completion queue to overflow.
* In later kernels there was an option added (IORING_FEAT_NODROP) which,
* when enabled, means that if a completion event occurs
* and the completion queue is full, then the kernel will internally
* store the event until the completion queue has room for more entries.
* Therefore, on newer kernels, it isn't necessary, strictly speaking,
* for the application to limit the number of in-flight requests.
* But, since it is almost certainly the case that an application
* can submit requests at a faster rate than the system is capable
* of servicing them, if we don't have some backpressure mechanism,
* then the application will just keep on submitting more and more
* requests, which will eventually lead to the system running out of memory.
* Setting a hard limit on the total number of in-flight
* requests serves as a backpressure mechanism to prevent the number
* of requests buffered in the kernel from growing without bound.
* The implementation is very simple: we increment
* num_unfinished_requests whenever a request is placed
* onto the submission queue, and decrement it whenever
* an entry is removed from the completion queue.
* Once num_unfinished_requests hits the limit that we set,
* then we cannot issue any more requests until we receive more completions,
* therefore the number of new completions that we receive is exactly
* equal to the number of new requests that we will place,
* thus ensuring that the number of in-flight
* requests can never exceed g_queuedepth.
*/
off_t next_file_offset_to_request = 0;
struct io_uring uring;
if (setup_io_uring_context(g_queuedepth, &uring)) {
puts("FAILED TO SET UP CONTEXT");
exit(1);
}
struct io_uring* ring = ů
while (num_blocks_left_to_receive) { // This loop consists of 3 steps:
// Step 1. Submit read requests.
// Step 2. Retrieve completions.
// Step 3. Run consumer.
/*
* Step 1: Make zero or more read requests until g_queuedepth limit is reached,
* or until the producer head reaches
* a cell that is not in the ALREADY_CONSUMED state
* (see my article/blog post for explanation).
*/
unsigned long num_unfinished_requests_prev =
num_unfinished_requests; // The only purpose of this variable is to keep track of whether
// or not we added any new requests to the submission queue.
while (num_blocks_left_to_request) {
if (num_unfinished_requests >= g_queuedepth)
break;
// wait for state to change
if (g_ringbuf[producer_head].state != ALREADY_CONSUMED)
break;
/* expected_return_size is the number of bytes that we expect this read request to return.
* expected_return_size will be the block size until the last block of the file.
* when we get to the last block of the file, expected_return_size will be the size of the last
* block of the file, which is calculated in calculate_numblocks_and_size_of_last_block
*/
size_t expected_return_size = g_blocksize;
if (num_blocks_left_to_request == 1) // if we're at the last block of the file
expected_return_size = g_size_of_last_block;
add_read_request_to_submission_queue(ring, expected_return_size, next_file_offset_to_request);
next_file_offset_to_request += expected_return_size;
++num_unfinished_requests;
--num_blocks_left_to_request;
// We have added a request for the read syscall to write into the cell in the ringbuffer.
// The add_read_request_to_submission_queue has already marked it as REQUESTED_BUT_NOT_YET_COMPLETED,
// so now we just need to move the producer head to point to the next cell in the ringbuffer.
increment_buffer_index(&producer_head);
}
// If we added any read requests to the submission queue, then submit them.
if (num_unfinished_requests_prev != num_unfinished_requests) {
rc = io_uring_submit(ring);
if (rc < 0) {
fprintf(stderr, "io_uring_submit: %s\n", strerror(-rc));
exit(1);
}
}
/*
* Step 2: Remove all the items from the completion queue.
*/
bool first_iteration = 0; // On the first iteration of the loop,
// we wait for at least one cqe to be available,
// then remove one cqe.
// On each subsequent iteration,
// we try to remove one cqe without waiting.
// The loop terminates only when there are
// no more items left in the completion queue,
while (num_blocks_left_to_receive) { // Or when we've read in all of the blocks of the file.
struct io_uring_cqe *cqe;
if (first_iteration) { // On the first iteration,
// we always wait until at least one cqe is available
rc = io_uring_wait_cqe(ring, &cqe); // This should always succeed and give us one cqe.
first_iteration = 0;
} else {
rc = io_uring_peek_cqe(ring, &cqe); // This will fail once there are
// no more items left in the completion queue.
if (rc == -EAGAIN) { // A return code of -EAGAIN means that there are
// no more items left in the completion queue.
break;
}
}
if (rc < 0) {
fprintf(stderr, "io_uring_peek_cqe: %s\n",
strerror(-rc));
exit(1);
}
assert(cqe);
// At this point we have a cqe, so let's see what our syscall returned.
struct my_custom_data *data = (struct my_custom_data*) io_uring_cqe_get_data(cqe);
// Check if the read syscall returned an error
if (cqe->res < 0) {
// we're not handling EAGAIN because it should never happen.
fprintf(stderr, "cqe failed: %s\n",
strerror(-cqe->res));
exit(1);
}
// Check if the read syscall returned an unexpected number of bytes.
if ((size_t)cqe->res != data->nbytes_expected) {
// We're not handling short reads because they should never happen on a disk-based filesystem.
if ((size_t)cqe->res < data->nbytes_expected) {
puts("panic: short read");
} else {
puts("panic: read returned more data than expected (wtf).
Is the file changing while you're reading it??");
}
exit(1);
}
assert(data->offset_of_block_in_file % g_blocksize == 0);
// If we reach this point, then it means that there were no errors:
// the read syscall returned exactly what we expected.
// Since the read syscall returned, this means it has finished
// filling in the cell with ONE block of data from the file.
// This means that the cell can now be read by the consumer,
// so we need to update the cell state.
update_cell_state(&g_ringbuf[data->ringbuf_index], AVAILABLE_FOR_CONSUMPTION);
--num_blocks_left_to_receive; // We received ONE block of data from the file
io_uring_cqe_seen(ring, cqe); // mark the cqe as consumed, so that its slot can get reused
--num_unfinished_requests;
//if (g_process_in_inner_loop)
// resume_consumer();
}
/* Step 3: Run consumer. This might be thought of as handing over control
to the consumer "thread". See my article/blog post. */
//resume_consumer();
}
//resume_consumer();
close(g_filedescriptor);
io_uring_queue_exit(ring);
}
static void process_cmd_line_args(int argc, char* argv[])
{
if (argc != 9) {
printf("%s: infile g_blocksize g_queuedepth g_use_o_direct
g_process_in_inner_loop g_numbufs g_use_iosqe_io_drain g_use_iosqe_io_link\n", argv[0]);
exit(1);
}
g_blocksize = atoi(argv[2]) * 1024; // The command line argument is in KiBs
g_queuedepth = atoi(argv[3]);
if (g_queuedepth > 32768)
puts("Warning: io_uring queue depth limit on Kernel 6.1.0 is 32768...");
g_use_o_direct = atoi(argv[4]);
g_process_in_inner_loop = atoi(argv[5]);
g_numbufs = atoi(argv[6]);
g_use_iosqe_io_drain = atoi(argv[7]);
g_use_iosqe_io_link = atoi(argv[8]);
}
static void open_and_get_size_of_file(const char* filename)
{
if (g_use_o_direct){
g_filedescriptor = open(filename, O_RDONLY | O_DIRECT);
puts("opening file with O_DIRECT");
} else {
g_filedescriptor = open(filename, O_RDONLY);
puts("opening file without O_DIRECT");
}
if (g_filedescriptor < 0) {
perror("open file");
exit(1);
}
if (get_file_size(g_filedescriptor, &g_filesize)){
puts("Failed getting file size");
exit(1);
}
}
static void calculate_numblocks_and_size_of_last_block()
{
// this is the mathematically correct way to do ceiling division
// (assumes no integer overflow)
g_num_blocks_in_file = (g_filesize + g_blocksize - 1) / g_blocksize;
// calculate the size of the last block of the file
if (g_filesize % g_blocksize == 0)
g_size_of_last_block = g_blocksize;
else
g_size_of_last_block = g_filesize % g_blocksize;
}
static void allocate_ringbuf()
{
// We only make 2 memory allocations in this entire program and they both happen in this function.
assert(g_blocksize % ALIGNMENT == 0);
// First, we allocate the entire underlying ring buffer (which is a contiguous block of memory
// containing all the actual buffers) in a single allocation.
// This is one big piece of memory which is used to hold the actual data from the file.
// The buf_addr field in the my_custom_data struct points to a buffer within this ring buffer.
unsigned char* ptr_to_underlying_ring_buffer;
// We need aligned memory, because O_DIRECT requires it.
if (posix_memalign((void**)&ptr_to_underlying_ring_buffer, ALIGNMENT, g_blocksize * g_numbufs)) {
puts("posix_memalign failed!");
exit(1);
}
// Second, we allocate an array containing all of the my_custom_data structs.
// This is not an array of pointers, but an array holding all of the actual structs.
// All the items are the same size, which makes this easy
g_ringbuf = (struct my_custom_data*) malloc(g_numbufs * sizeof(struct my_custom_data));
// We partially initialize all of the my_custom_data structs here.
// (The other fields are initialized in the add_read_request_to_submission_queue function)
off_t cur_offset = 0;
for (int i = 0; i < g_numbufs; ++i) {
g_ringbuf[i].buf_addr = ptr_to_underlying_ring_buffer +
cur_offset; // This will never change during the runtime of this program.
cur_offset += g_blocksize;
// g_ringbuf[i].bufsize = g_blocksize; // all the buffers are the same size.
g_ringbuf[i].state = ALREADY_CONSUMED; // We need to set all cells to
// ALREADY_CONSUMED at the start. See my article/blog post for explanation.
g_ringbuf[i].ringbuf_index = i; // This will never change during the runtime of this program.
}
}
int main(int argc, char *argv[])
{
assert(sizeof(size_t) >= 8); // we want 64 bit size_t for files larger than 4GB...
process_cmd_line_args(argc, argv); // we need to first parse the command line arguments
open_and_get_size_of_file(argv[1]);
calculate_numblocks_and_size_of_last_block();
allocate_ringbuf();
// Initialize the hasher.
blake3_hasher_init(&g_hasher);
// Run the threads
std::thread t1(producer_thread), t2(consumer_thread);
t1.join();
t2.join();
}
单线程和多线程版本之间只有几个区别,其中两个至关重要,我将解释。好吧,让我们从不重要的区别开始。首先,多线程版本使用了 std::thread
(在 C++11 中添加),所以它必须使用 g++
编译,g++
实际上也可以编译 C 代码……所以您不妨使用 g++ 来编译单线程和多线程版本。所以,新增的两个包含是 <atomic
和 <thread>
。嗯,既然我们使用 std::thread
进行多线程,那么包含 <thread>
是显而易见的。但为什么是 <atomic
?为什么不是 <mutex
或 <condition_variable
?那是因为我们使用原子变量而不是互斥量或条件变量进行同步。
两个关键更改之一是 my_custom_data
结构中 state
字段的类型从 enum
更改为 std::atomic_int
。这是绝对必要的,否则生产者可能会在消费者“同时”读取它(反之亦然)。有关“同时”在此上下文中确切含义的正式定义,请参阅我的博客文章:https://1f604.blogspot.com/2023/06/atomic-weapons-part-1.html),这是一个数据竞争。它需要是原子的,因为您需要消费者线程“看到”生产者线程对内存所做的所有更改,在这种情况下,生产者线程正在用文件中的数据填充缓冲区。原子变量解决了这个问题,因为对原子变量的写入是 store/release,而对原子变量的读取是 load/acquire,并且可以保证,如果线程 A 写入一个原子变量,线程 B 读取同一个原子变量,那么当线程 B 读取原子变量并看到某个值时,线程 A 在将该值写入该原子变量之前所做的一切都将立即对 B 可见。在这种情况下,这意味着,一旦消费者线程读取了状态,它就会“看到”生产者在写入 state
变量之前填充的所有块。更具体地说,如果线程 A 仅在确保读取系统调用已完成写入该单元格之后才将单元格的 state
设置为“a”(即,该单元格现在已填充文件块),那么如果线程 B 看到单元格的 state
是“a”,那么可以保证线程 B 也会看到读取系统调用对单元格所做的更改,即完全写入的数据。另一种思考方式是,生产者和消费者线程通过 state
变量在概念上相互通信——这是共享内存,两个线程都会读写它,因此访问它必须同步,否则两个线程可以同时访问它,这是一个数据竞争,是未定义行为。
现在,如前所述,消费者线程需要等待消费者头指向的单元格变为“a”状态才能继续。在单线程版本中,它只是一个在遇到非“a”单元格时退出的循环。但现在它在一个单独的线程中,我们可以终于用消费者线程概念的直接实现来替换消费者循环,这是一个循环,一旦单元格变为“a”状态就会消耗当前单元格,然后向前移动到下一个单元格并重复。但这要求我们现在实现一个“等待”,而以前在消费者循环中没有任何等待,因为消费者需要等待单元格变为“a”状态才能处理它。无论如何,使用原子变量来实现等待非常简单——您只需要编写一个正文为空的 while 循环,如下所示:while(current_cell.state is not "a") {}
。这被称为“忙等待”,虽然据说对性能相当好,但在“分配足够内存来保存整个文件”版本的程序中,我没有注意到忙等待版本和条件变量等待版本之间的性能差异,所以我怀疑忙等待在此处不会带来任何性能优势。相反,我选择使用它,因为它更容易理解。
我程序的另一个关键更改是从 bool first_iteration = 1;
到 bool first_iteration = 0;
。为什么这是一个关键的更改?因为没有它,程序可能会卡住。如何?回想一下,在我们的单线程程序中,完成循环在第一次迭代中等待一个完成。现在我们已经将生产者函数转换为一个线程,保留这个等待可能会导致程序卡住。让我来演示一下。
上面的图显示了一个具有 2 个插槽的环形缓冲区。开始时,所有单元格都处于“c”状态。现在消费者被阻塞,等待插槽 0 变为“a”状态,而生产者启动请求者循环并遍历数组,将所有处于“a”状态的单元格变为“r”状态。所以现在,单元格 0 和 1 都处于“r”状态。接下来发生的是,由于生产者现在无法发出新请求,它进入完成循环并被阻塞,等待一个完成事件。到目前为止一切都很好。接下来,一个针对插槽 0 的完成到来,所以生产者将该单元格的状态设置为“a”,这意味着消费者现在已解除阻塞,因此消费者开始处理单元格 0。现在生产者已解除阻塞并可以继续,所以它返回到请求者循环并发现无事可做,所以它然后返回到完成循环并等待下一个完成。现在消费者仍在处理单元格 0,生产者正在等待下一个完成。此时,消费者可能会先完成单元格的处理,或者完成事件可能会先到达。假设完成事件先到达。然后生产者会将单元格 1 转换为“a”状态,这样现在两个单元格都处于“a”状态。然后生产者将返回到请求者循环,然后,它会发现它无法推进生产者头,因为单元格 0 仍然处于“a”状态(消费者尚未完成),所以它将退出请求者循环并开始完成循环,等待下一个完成事件。问题就出在这里:现在两个单元格都处于“a”状态,这意味着没有更多的请求在进行中,因此不会有完成事件!消费者将完成单元格 0 的处理,将其转换为“c”状态,但生产者仍在等待一个完成!然后消费者将完成单元格 1 的处理,将其转换为“c”状态,但生产者仍在等待一个永远不会到达的完成,因为没有正在进行的请求!!!所以,这就是为什么在完成循环的第一次迭代中移除等待是绝对至关重要的,否则程序会卡住。
那么,这个程序正确吗?让我们考虑正确性属性,即消费者必须按顺序且一次只处理文件中的所有块。为了证明这一点,只需证明消费者只会在可以处理它的单元格时才处理它就足够了。我们使用单元格状态来保证这个属性,我们说过,如果状态始终是正确的,那么程序就是正确的。让我们考虑一下状态始终是正确的意味着什么。
从消费者的角度来看,这意味着无论何时消费者看到一个单元格处于“a”状态,那么它总是意味着消费者处理它确实是安全的。一个单元格是安全的但不是“a”状态是可以的,因为消费者将简单地不消耗它,而且不会发生坏事。所以从消费者的角度来看,正确性仅仅是,如果一个单元格处于“a”状态,那么处理它确实是安全的,生产者通过首先获取完成事件然后写入状态来更新它来保证这一点,从而确保当消费者看到单元格状态已更改为“a”时,消费者也会看到生产者对单元格所做的更改(即,用文件中的一个块填充它)。
从生产者的角度来看,这意味着无论何时生产者看到一个单元格处于“c”状态,那么它总是意味着为该单元格发出请求确实是安全的。一个单元格可以安全地发出请求但不是“c”状态是可以的,因为生产者将简单地什么都不做,所以不会发生坏事。所以从生产者的角度来看,正确性仅仅是,如果一个单元格处于“c”状态,那么为它发出请求确实是安全的,消费者通过首先处理单元格然后将单元格状态更新为“c”来保证这一点——它只在完成处理单元格后才更新单元格状态,从而确保每当生产者看到单元格状态已更改为“c”时,这意味着消费者已经完成了该块,并且不再需要该单元格中的数据。
因此,通过确保单元格状态始终是正确的,我们确保消费者只会在可以处理该块时处理该块,并且生产者只会在可以发出请求时才发出请求。也就是说,消费者只会在可以读取块时读取块,生产者只会在可以写入块时写入块。
现在,关于最终终止属性呢?我们想证明程序永远不会卡住。让我们从消费者开始,因为如果我们能证明消费者永远不会卡住,那么就足以证明程序永远不会卡住。消费者线程非常简单——它只是等待状态变为“a”。所以消费者卡住的唯一方法是它正在等待的单元格永远不会变成“a”——这意味着它永远停留在“r”或永远停留在“c”。生产者负责将单元格转换为“a”状态(从“r”或“c”状态),这意味着消费者只有在生产者卡住时才会卡住。所以现在我们必须看看生产者。
生产者会卡住吗?生产者线程由两个循环组成:一个请求者循环,后跟一个完成循环。请求者循环会将请求放在 SQ 上,直到它不能再在 SQ 上放置更多请求为止(要么是因为未完成请求的数量已达到队列深度,要么是因为生产者头指向的单元格不是“c”状态),此时它将终止并提交请求。然后生产者线程将进入完成循环,它将遍历完成队列中的完成项,移除它们并将相应的单元格转换为“a”状态,直到完成队列中没有更多完成项,此时完成循环将终止。主要区别在于,如前所述,单线程版本将在完成循环的第一次迭代中等待一个完成,而多线程版本则不会。在多线程版本中,如果完成循环的第一次迭代中完成队列中没有完成项,那么完成循环将立即终止而无需等待。请求者循环也是如此(这总是情况)——如果它不能放置请求,那么请求者循环将立即终止。
生产者基本上有两个任务:首先,它不断检查生产者头指向的单元格,等待它变为“c”状态,以便它可以发出请求,从而完成将生产者头指向的单元格从“c”状态转换为“r”状态的任务,其次,它不断检查完成项,从而完成将单元格从“r”状态转换为“a”状态的任务。没有“等待”,生产者一直在连续执行这两个循环。这两个循环中没有阻塞指令,只是“发出请求直到不能再发出”然后“从完成队列中弹出直到不能再弹出”。因此,很容易看出每个循环都在有限的时间内运行——请求者循环最终总是会终止,因为最终生产者将不再能够推进生产者头(要么因为它到达了一个不是“c”状态的单元格,要么因为它达到了未完成请求的限制)——要看到这一点,想象一个无限快的消费者,因此所有“a”单元格一旦消费者通过它们,就会立即变成“c”单元格。即使在这种情况下,请求者循环仍会终止,因为它需要运行完成循环才能将“r”单元格转换为“a”单元格,这意味着最终环形缓冲区将被“r”单元格填充,从而迫使请求者循环终止。完成循环将始终终止,因为在任何给定时刻,同时进行的请求数量总是有限的,并且在完成循环中不能发出新请求,因此完成循环的每次完整迭代都会将同时进行的请求数量减少 1,所以最终不会有同时进行的请求,因此完成队列将为空,完成循环将终止。因此,请求者循环和完成循环都保证终止,这意味着生产者将始终连续执行这两个任务——它不能卡住其中一个任务。
那么,生产者在什么情况下会卡住?嗯,它实际上不会被永久阻塞在等待上,因为它一直在连续运行这两个循环。这意味着我们可以确信处于“r”状态的单元格最终都会过渡到“a”状态,因为这就是生产者第二个循环所做的。生产者拥有的另一项任务是将“c”状态的单元格转换为“r”状态,而且我们已经在此处确立,每当生产者头指向的单元格过渡到“c”状态时,它最终都保证会过渡到“r”状态,因为这就是生产者的第一个循环所做的。
那么,生产者卡住的唯一方法是生产者头始终等待它指向的单元格转换为“c”状态。现在,如果单元格处于“r”状态,那么如上所述,它最终会过渡到“a”状态,因此我们可以简单地考虑单元格处于“a”状态的情况。实际上,这适用于整个环形缓冲区——因为“r”单元格最终总是会变成“a”单元格,所以我们应该考虑生产者头卡住的任何情况,就好像环形缓冲区中的所有单元格都处于“a”状态或“c”状态一样。以下是这种情况的一个例子。
在上述情况中,生产者卡在等待 P 下的单元格转换为“c”状态,而消费者卡在等待 C 下的单元格转换为“a”状态。两个头都必须卡住,因为如果一个头没有卡住,它最终会移动到另一个头并解除另一个头的阻塞。例如,如果消费者没有卡住,那么它最终会到达 P 并将 P 下的单元格转换为“c”状态。同样,如果生产者没有卡住,那么它最终会到达 C 并将 C 下的单元格转换为“r”状态,而“r”状态最终会变成“a”状态。因此,为了让其中一个卡住,另一个也必须卡住。也就是说,为了让消费者卡住,生产者也必须卡住,为了让生产者卡住,消费者也必须卡住。因此,为了让程序卡住,我们必须有一个生产者和消费者同时卡住的情况。
因此,为了证明程序不会卡住,我们只需要证明生产者和消费者同时卡住是不可能的。让我们从程序的开头开始。在程序开始时,所有单元格都处于“c”状态,只有生产者头可以移动。然后生产者将开始遍历环形缓冲区,将“c”单元格转换为“r”单元格,再转换为“a”单元格,消费者头将跟随。现在,在什么情况下生产者会卡住?当生产者头指向一个处于“r”或“a”状态的单元格时,它就会卡住。现在,由于生产者是唯一能够将单元格转换为“r”或“a”状态的实体,并且环形缓冲区最初所有单元格都处于“c”状态,并且生产者一次移动一个单元格,将它接触到的每个单元格转换为“r”状态(然后变为“a”状态),这意味着生产者被阻塞在一个处于“r”或“a”状态的单元格上的唯一可能性是生产者已经绕环形缓冲区整整一圈,即整个环形缓冲区现在已充满处于“r”或“a”状态的单元格。为了形象地说明这一点,想象一下,开始时所有单元格都处于“c”状态,并且随着生产者向前移动,它就像一条蛇,身体/尾巴由处于“r”状态的单元格组成,然后变成“a”状态,而消费者就像另一条蛇,身体/尾巴由处于“c”状态的单元格组成,消费者蛇跟在生产者蛇后面。如果生产者比消费者快,最终它会“吃掉”所有这些“c”状态的单元格,并追上自己的尾巴……这是它能够到达“a”或“r”状态单元格的唯一方式。
因此,由于环形缓冲区的初始配置(所有单元格都处于“c”状态),生产者卡住的唯一可能情况是环形缓冲区中的所有单元格都处于“r”或“a”状态,我们知道所有“r”单元格最终都会过渡到“a”单元格,所以最终环形缓冲区将完全充满“a”单元格。但是,在这种状态下,消费者不可能卡住,因为它什么都不在等待!因此,在生产者卡住的唯一可能情况下,消费者没有卡住,因此永远不会出现生产者和消费者同时卡住的情况,并且由于我们已经确定程序不会卡住,除非生产者和消费者同时卡住,这意味着程序永远不会卡住。
实际有多少请求正在进行中?
一个自然的问题是,队列深度和环形缓冲区大小的好尺寸是多少?我发现非常小的数字,在 2-4 的范围内,对于单线程 O_DIRECT
版本来说效果很好。我编写了一个单线程环形缓冲区实现的仪器版本,并发现当队列深度和环形缓冲区大小为 4 时,进入消费者循环时正在进行的请求数量总是 3,这完全符合逻辑——在我的机器上,散列速度大于文件读取速度,这意味着一旦读取系统调用返回块,消费者就会立即消耗该块。因此,如果我们发出接下来的 4 个块的请求,并等待至少一个块返回,结果是,一旦第一个块返回,完成循环就会结束,消费者循环就会运行,消耗该块。然后我们回到请求者循环,它发出 1 个请求,然后再次等待完成队列。由于我们一收到块就完成了等待,磁盘刚刚开始读取文件的下一个块,而 CPU 就开始处理该块了。由于 CPU 处理块的速度比磁盘返回块的速度快得多,因此当 CPU 完成处理该块时,磁盘尚未完成读取文件的下一个块,因此我们必须再次等待。因此,正在进行的请求数量总是比队列深度和环形缓冲区大小少一,所以如果队列深度和环形缓冲区大小为 4,它总是 3,如果队列深度和环形缓冲区大小为 2,它总是 1。
$ echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./instru 1GB.txt 512 4 1 0 4 0 0
opening file with O_DIRECT
BLAKE3 hash: 90e7144c7a091d505102e433fb4f2160d1d56d7f5b39291206819e5648c1e639
Number of requests in-flight:
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 0, 0,
Num records: 2049
real 0m0.302s
user 0m0.237s
sys 0m0.022s
我想一个自然的后续问题是:如果 CPU 比磁盘慢,我们会看到什么?假设队列深度和环形缓冲区大小是 4。那么最初,请求者循环发出 4 个请求,完成循环得到一个块,所以现在有 3 个请求正在进行中。现在,消费者循环运行,但当消费者完成后,假设磁盘已经检索了文件的下一个 2 个块,所以现在只有一个请求在进行中。然后请求者循环再次运行并发出 1 个新请求,所以现在只有 2 个请求在进行中。这次完成循环返回 2 个块而不是 1 个,现在消费者必须消耗这两个块。当消费者完成处理这两个块时,磁盘已经完成了读取的另外 2 个正在进行的块,并花费了一些时间空闲。但是,当消费者循环开始时,有 2 个块正在进行中,尽管当消费者循环结束时,这两个请求早已完成。事实上,当我运行代码并故意减慢消费者速度时,我看到在消费者循环开始时,正在进行的请求数量大多是 2(偶尔是 3)。
$ echo 1 > /proc/sys/vm/drop_caches; sleep 1;
time ./liburing_b3sum_singlethread_instrumented 1GB.txt 512 4 1 0 4 0 0
opening file with O_DIRECT
BLAKE3 hash: 90e7144c7a091d505102e433fb4f2160d1d56d7f5b39291206819e5648c1e639
Number of requests in-flight:
3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0,
Num records: 1042
real 0m0.431s
user 0m0.414s
sys 0m0.016s
附件中是经过仪器化的版本
// This is the instrumented version of liburing_b3sum_singlethread.
/* SPDX-License-Identifier: MIT */
/*
* Compile with: gcc -Wall -O3 -D_GNU_SOURCE liburing_b3sum_singlethread_instrumented.c
* -luring libblake3.a -o liburing_b3sum_singlethread_instrumented
* For an explanation of how this code works, see my article/blog post: https://1f604.com/b3sum
*
* This program is a modified version of the liburing cp program
* from Shuveb Hussain's io_uring tutorial.
* Original source code here: https://github.com/axboe/liburing/blob/master/examples/io_uring-cp.c
* The modifications were made by 1f604.
*
* The official io_uring documentation can be seen here:
* - https://kernel.dk/io_uring.pdf
* - https://kernel-recipes.org/en/2022/wp-content/uploads/2022/06/axboe-kr2022-1.pdf
*
* Acronyms: SQ = submission queue, SQE = submission queue entry,
* CQ = completion queue, CQE = completion queue event
*/
#include "dependencies/rust-blake3-1.3.1/c/blake3.h"
#include "liburing.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/ioctl.h>
// Instrumentation
static int in_flight_count[9000];
static unsigned long num_unfinished_requests = 0;
static int consume_idx = 0;
/* Constants */
static const int ALIGNMENT = 4 * 1024; // Needed only because O_DIRECT requires aligned memory
/* ===============================================
* ========== Start of global variables ==========
* ===============================================
* Declared static because they are only supposed to be visible within this .c file.
*
* --- Command line options ---
* The following variables are set by the user from the command line.
*/
static int g_queuedepth; /* This variable is not really the queue depth,
* but is more accurately described as
* "the limit on the number of incomplete requests".
* Allow me to explain.
* io_uring allows you to have more requests
* in-flight than the size of your submission
* (and completion) queues. How is this possible?
* Well, when you call io_uring_submit,
* normally it will submit ALL of the requests
* in the submission queue, which means that
* when the call returns, your submission queue is now empty,
* even though the requests
* are still "in-flight" and haven't been completed yet!
* In earlier kernels, you could overflow the
* completion queue because of this.
* Thus it says in the official io_uring documentation
* (https://kernel.dk/io_uring.pdf):
* Since the sqe lifetime is only that of the
* actual submission of it, it's possible
* for the application to drive a higher pending
* request count than the SQ ring size
* would indicate. The application must take care
* not to do so, or it could risk
* overflowing the CQ ring.
* That is to say, the official documentation recommended
* that applications should ensure that the number of in-flight
* requests does not exceed the size of the submission queue.
* This g_queuedepth variable is therefore a limit on the
* total number of "incomplete" requests,
* which is the number of requests on the submission queue plus the number of
* requests that are still "in flight".
* See num_unfinished_requests for details on how this is implemented. */
static int g_use_o_direct; // whether to use O_DIRECT
static int g_process_in_inner_loop; // whether to process the data inside the inner loop
static int g_use_iosqe_io_drain; // whether to issue requests with the IOSQE_IO_DRAIN flag
static int g_use_iosqe_io_link; // whether to issue requests with the IOSQE_IO_LINK flag
//static int g_use_ioring_setup_iopoll; // when I enable either IORING_SETUP_SQPOLL or IORING_SETUP_IOPOLL,
// on my current system,
//static int g_use_ioring_setup_sqpoll; // it turns my process into an unkillable zombie that uses
// 100% CPU that never terminates.
// when I was using encrypted LVM, it just gave me error:
// Operation not supported. I decided to not allow users
// to enable these options because I didn't want them
// to accidentally launch an unkillable never-ending zombie
// process that uses 100% CPU. I observed this problem in fio
// too when I enabled --hipri on fio, it also turned into
// an unkillable never-ending zombie process that uses 100% CPU.
static size_t g_blocksize; // This is the size of each buffer in the ringbuf, in bytes.
// It is also the size of each read from the file.
static size_t g_numbufs; // This is the number of buffers in the ringbuf.
/* --- Non-command line argument global variables --- */
blake3_hasher g_hasher;
blake3_hasher g_hasher2;
static int g_filedescriptor; // This is the file descriptor of the file we're hashing.
static size_t g_filesize; // This will be filled in by the function that gets the file size.
static size_t g_num_blocks_in_file; // The number of blocks in the file, where each block is g_blocksize bytes.
// This will be calculated by a ceiling division of filesize by blocksize.
static size_t g_size_of_last_block; // The size of the last block in the file. See calculate_numblocks_and_size_of_last_block.
static int producer_head = 0; // Position of the "producer head". see explanation in my article/blog post
static int consumer_head = 0; // Position of the "consumer head". see explanation in my article/blog post
enum ItemState { // describes the state of a cell in the ring buffer array ringbuf,
// see my article/blog post for detailed explanation
AVAILABLE_FOR_CONSUMPTION, // consumption and processing in the context of this program refers to hashing
ALREADY_CONSUMED,
REQUESTED_BUT_NOT_YET_COMPLETED,
};
struct my_custom_data { // This is the user_data associated with read requests,
// which are placed on the submission ring.
// In applications using io_uring, the user_data struct is
// generally used to identify which request a completion is for.
// In the context of this program, this structure is used both to identify which
// block of the file the read syscall had just read, as well as
// for the producer and consumer to communicate with each other,
// since it holds the cell state. This can be thought of as a "cell"
// in the ring buffer, since it holds the state of the cell as well
// as a pointer to the data (i.e. a block read from the file)
// that is "in" the cell.
// Note that according to the official io_uring documentation,
// the user_data struct only needs to be valid until the submit is done,
// not until completion. Basically, when you submit, the kernel makes
// a copy of user_data and returns it to you with the CQE (completion queue entry).
unsigned char* buf_addr; // Pointer to the buffer where the read syscall is to place
// the bytes from the file into. The number of bytes we expect
size_t nbytes_expected; //the read syscall to return. This can be smaller than the size of the buffer
// because the last block of the file can be smaller than the other blocks.
//size_t nbytes_to_request; // The number of bytes to request. This is always g_blocksize.
// I made this decision because O_DIRECT requires
// nbytes to be a multiple of filesystem block size,
// and it's simpler to always just request g_blocksize.
off_t offset_of_block_in_file; // The offset of the block in the file that we want
// the read syscall to place into the memory location
// pointed to by buf_addr.
enum ItemState state; // Describes whether the item is available to be hashed,
// already hashed, or requested but not yet available for hashing.
int ringbuf_index; // The slot in g_ringbuf where this "cell" belongs.
// I added this because once we submit a request on submission queue,
// we lose track of it.
// When we get back a completion, we need an identifier to know
// which request the completion is for.
// Alternatively, we could use something to map
// the buf_addr to the ringbuf_index, but this is just simpler.
};
struct my_custom_data* g_ringbuf; // This is a pointer to an array of my_custom_data structs.
// These my_custom_data structs can be thought of as the
// "cells" in the ring buffer (each struct contains the cell state),
// thus the array that this points to can be thought of as the
// "ring buffer" referred to in my article/blog post,
// so read that to understand how this is used.
// See the allocate_ringbuf function for details on
// how and where the memory for the ring buffer is allocated.
/* ===============================================
* =========== End of global variables ===========
* ===============================================*/
static int setup_io_uring_context(unsigned entries, struct io_uring *ring)
{
int rc;
int flags = 0;
rc = io_uring_queue_init(entries, ring, flags);
if (rc < 0) {
fprintf(stderr, "queue_init: %s\n", strerror(-rc));
return -1;
}
return 0;
}
static int get_file_size(int fd, size_t *size)
{
struct stat st;
if (fstat(fd, &st) < 0)
return -1;
if (S_ISREG(st.st_mode)) {
*size = st.st_size;
return 0;
} else if (S_ISBLK(st.st_mode)) {
unsigned long long bytes;
if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
return -1;
*size = bytes;
return 0;
}
return -1;
}
static void add_read_request_to_submission_queue
(struct io_uring *ring, size_t expected_return_size, off_t fileoffset_to_request)
{
assert(fileoffset_to_request % g_blocksize == 0);
int block_number = fileoffset_to_request / g_blocksize; // the number of the block in the file
/* We do a modulo to map the file block number to the index in the ringbuf
e.g. if ring buf_addr has 4 slots, then
file block 0 -> ringbuf index 0
file block 1 -> ringbuf index 1
file block 2 -> ringbuf index 2
file block 3 -> ringbuf index 3
file block 4 -> ringbuf index 0
file block 5 -> ringbuf index 1
file block 6 -> ringbuf index 2
And so on.
*/
int ringbuf_idx = block_number % g_numbufs;
struct my_custom_data* my_data = &g_ringbuf[ringbuf_idx];
assert(my_data->ringbuf_index == ringbuf_idx); // The ringbuf_index of a
// my_custom_data struct should never change.
my_data->offset_of_block_in_file = fileoffset_to_request;
assert (my_data->buf_addr); // We don't need to change my_data->buf_addr
// since we set it to point into the backing buffer at the start of the program.
my_data->nbytes_expected = expected_return_size;
my_data->state = REQUESTED_BUT_NOT_YET_COMPLETED; /* At this point:
* 1. The producer is about
* to send it off in a request.
* 2. The consumer shouldn't be
* trying to read this buffer at this point.
* So it is okay to set the state to this here.
*/
struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
if (!sqe) {
puts("ERROR: FAILED TO GET SQE");
exit(1);
}
io_uring_prep_read(sqe, g_filedescriptor, my_data->buf_addr, g_blocksize, fileoffset_to_request);
// io_uring_prep_read sets sqe->flags to 0, so we need to set the flags AFTER calling it.
if (g_use_iosqe_io_drain)
sqe->flags |= IOSQE_IO_DRAIN;
if (g_use_iosqe_io_link)
sqe->flags |= IOSQE_IO_LINK;
io_uring_sqe_set_data(sqe, my_data);
}
static void increment_buffer_index(int* head) // moves the producer or consumer head forward by one position
{
*head = (*head + 1) % g_numbufs; // wrap around when we reach the end of the ringbuf.
}
static void resume_consumer() // Conceptually, this resumes the consumer "thread".
{ // As this program is single-threaded,
// we can think of it as cooperative multitasking.
in_flight_count[consume_idx] = num_unfinished_requests;
++consume_idx;
while (g_ringbuf[consumer_head].state == AVAILABLE_FOR_CONSUMPTION){
// Consume the item.
// The producer has already checked that nbytes_expected
// is the same as the amount of bytes actually returned.
// If the read syscall returned something different to
// nbytes_expected then the program would have terminated with an error message.
// Therefore it is okay to assume here that nbytes_expected
// is the same as the amount of actual data in the buffer.
blake3_hasher_update(&g_hasher,
g_ringbuf[consumer_head].buf_addr, g_ringbuf[consumer_head].nbytes_expected);
// blake3_hasher_update(&g_hasher2,
// g_ringbuf[consumer_head].buf_addr, g_ringbuf[consumer_head].nbytes_expected);
// We have finished consuming the item, so mark it as consumed
// and move the consumer head to point to the next cell in the ringbuffer.
g_ringbuf[consumer_head].state = ALREADY_CONSUMED;
increment_buffer_index(&consumer_head);
}
}
static void producer_thread()
{
int rc;
unsigned long num_blocks_left_to_request = g_num_blocks_in_file;
unsigned long num_blocks_left_to_receive = g_num_blocks_in_file;
/* A brief note on how the num_unfinished_requests variable is used:
* As mentioned earlier, in io_uring it is possible to have more requests
* in-flight than the size of the completion ring.
* In earlier kernels this could cause the completion queue to overflow.
* In later kernels there was an option added (IORING_FEAT_NODROP) which,
* when enabled, means that if a completion event occurs
* and the completion queue is full, then the kernel will internally
* store the event until the completion queue has room for more entries.
* Therefore, on newer kernels, it isn't necessary, strictly speaking,
* for the application to limit the number of in-flight requests.
* But, since it is almost certainly the case that an application
* can submit requests at a faster rate than the system is capable of
* servicing them, if we don't have some backpressure mechanism,
* then the application will just keep on submitting more and more
* requests, which will eventually lead to the system running out of memory.
* Setting a hard limit on the total number of in-flight requests serves
* as a backpressure mechanism to prevent the number of requests
* buffered in the kernel from growing without bound.
* The implementation is very simple: we increment num_unfinished_requests
* whenever a request is placed onto the submission queue,
* and decrement it whenever an entry is removed from the completion queue.
* Once num_unfinished_requests hits the limit that we set, then we cannot
* issue any more requests until we receive more completions,
* therefore the number of new completions that we receive is exactly
* equal to the number of new requests that we will place,
* thus ensuring that the number of in-flight
* requests can never exceed g_queuedepth.
*/
off_t next_file_offset_to_request = 0;
struct io_uring uring;
if (setup_io_uring_context(g_queuedepth, &uring)) {
puts("FAILED TO SET UP CONTEXT");
exit(1);
}
struct io_uring* ring = ů
while (num_blocks_left_to_receive) { // This loop consists of 3 steps:
// Step 1. Submit read requests.
// Step 2. Retrieve completions.
// Step 3. Run consumer.
/*
* Step 1: Make zero or more read requests until g_queuedepth
* limit is reached, or until the producer head reaches
* a cell that is not in the ALREADY_CONSUMED state
* (see my article/blog post for explanation).
*/
unsigned long num_unfinished_requests_prev =
num_unfinished_requests; // The only purpose of this variable is to keep track
// of whether or not we added any new requests
// to the submission queue.
while (num_blocks_left_to_request) {
if (num_unfinished_requests >= g_queuedepth)
break;
if (g_ringbuf[producer_head].state != ALREADY_CONSUMED)
break;
/* expected_return_size is the number of bytes that we expect this read request to return.
* expected_return_size will be the block size until the last block of the file.
* when we get to the last block of the file, expected_return_size will be the size of the last
* block of the file, which is calculated in calculate_numblocks_and_size_of_last_block
*/
size_t expected_return_size = g_blocksize;
if (num_blocks_left_to_request == 1) // if we're at the last block of the file
expected_return_size = g_size_of_last_block;
add_read_request_to_submission_queue(ring, expected_return_size, next_file_offset_to_request);
next_file_offset_to_request += expected_return_size;
++num_unfinished_requests;
--num_blocks_left_to_request;
// We have added a request for the read syscall to write into the cell in the ringbuffer.
// The add_read_request_to_submission_queue has already marked it as REQUESTED_BUT_NOT_YET_COMPLETED,
// so now we just need to move the producer head to point to the next cell in the ringbuffer.
increment_buffer_index(&producer_head);
}
// If we added any read requests to the submission queue, then submit them.
if (num_unfinished_requests_prev != num_unfinished_requests) {
rc = io_uring_submit(ring);
if (rc < 0) {
fprintf(stderr, "io_uring_submit: %s\n", strerror(-rc));
exit(1);
}
}
/*
* Step 2: Remove all the items from the completion queue.
*/
bool first_iteration = 1; // On the first iteration of the loop,
// we wait for at least one cqe to be available,
// then remove one cqe.
// On each subsequent iteration,
// we try to remove one cqe without waiting.
// The loop terminates only when there are
// no more items left in the completion queue,
while (num_blocks_left_to_receive) { // Or when we've read in all of the blocks of the file.
struct io_uring_cqe *cqe;
if (first_iteration) { // On the first iteration we always wait
// until at least one cqe is available
rc = io_uring_wait_cqe(ring, &cqe); // This should always succeed and give us one cqe.
first_iteration = 0;
} else {
rc = io_uring_peek_cqe(ring, &cqe); // This will fail once there are
// no more items left in the completion queue.
if (rc == -EAGAIN) { // A return code of -EAGAIN means that
// there are no more items left in the
// completion queue.
break;
}
}
if (rc < 0) {
fprintf(stderr, "io_uring_peek_cqe: %s\n",
strerror(-rc));
exit(1);
}
assert(cqe);
// At this point we have a cqe, so let's see what our syscall returned.
struct my_custom_data *data = (struct my_custom_data*)
io_uring_cqe_get_data(cqe);
// Check if the read syscall returned an error
if (cqe->res < 0) {
// we're not handling EAGAIN because it should never happen.
fprintf(stderr, "cqe failed: %s\n",
strerror(-cqe->res));
exit(1);
}
// Check if the read syscall returned an unexpected number of bytes.
if ((size_t)cqe->res != data->nbytes_expected) {
// We're not handling short reads because
// they should never happen on a disk-based filesystem.
if ((size_t)cqe->res < data->nbytes_expected) {
puts("panic: short read");
} else {
puts("panic: read returned more data than expected (wtf).
Is the file changing while you're reading it??");
}
exit(1);
}
assert(data->offset_of_block_in_file % g_blocksize == 0);
// If we reach this point, then it means that there were no errors:
// the read syscall returned exactly what we expected.
// Since the read syscall returned, this means it has finished
// filling in the cell with ONE block of data from the file.
// This means that the cell can now be read by the consumer,
// so we need to update the cell state.
g_ringbuf[data->ringbuf_index].state = AVAILABLE_FOR_CONSUMPTION;
--num_blocks_left_to_receive; // We received ONE block of data from the file
io_uring_cqe_seen(ring, cqe); // mark the cqe as consumed,
// so that its slot can get reused
--num_unfinished_requests;
if (g_process_in_inner_loop)
resume_consumer();
}
/* Step 3: Run consumer. This might be thought of as handing over
control to the consumer "thread". See my article/blog post. */
resume_consumer();
}
resume_consumer();
close(g_filedescriptor);
io_uring_queue_exit(ring);
// Finalize the hash. BLAKE3_OUT_LEN is the default output length, 32 bytes.
uint8_t output[BLAKE3_OUT_LEN];
uint8_t output2[BLAKE3_OUT_LEN];
blake3_hasher_finalize(&g_hasher, output, BLAKE3_OUT_LEN);
blake3_hasher_finalize(&g_hasher2, output2, BLAKE3_OUT_LEN);
// Print the hash as hexadecimal.
printf("BLAKE3 hash: ");
for (size_t i = 0; i < BLAKE3_OUT_LEN; ++i) {
printf("%02x", output[i]);
}
printf("\n");
}
static void process_cmd_line_args(int argc, char* argv[])
{
if (argc != 9) {
printf("%s: infile g_blocksize g_queuedepth g_use_o_direct
g_process_in_inner_loop g_numbufs
g_use_iosqe_io_drain g_use_iosqe_io_link\n", argv[0]);
exit(1);
}
g_blocksize = atoi(argv[2]) * 1024; // The command line argument is in KiBs
g_queuedepth = atoi(argv[3]);
if (g_queuedepth > 32768)
puts("Warning: io_uring queue depth limit on Kernel 6.1.0 is 32768...");
g_use_o_direct = atoi(argv[4]);
g_process_in_inner_loop = atoi(argv[5]);
g_numbufs = atoi(argv[6]);
g_use_iosqe_io_drain = atoi(argv[7]);
g_use_iosqe_io_link = atoi(argv[8]);
}
static void open_and_get_size_of_file(const char* filename)
{
if (g_use_o_direct){
g_filedescriptor = open(filename, O_RDONLY | O_DIRECT);
puts("opening file with O_DIRECT");
} else {
g_filedescriptor = open(filename, O_RDONLY);
puts("opening file without O_DIRECT");
}
if (g_filedescriptor < 0) {
perror("open file");
exit(1);
}
if (get_file_size(g_filedescriptor, &g_filesize)){
puts("Failed getting file size");
exit(1);
}
}
static void calculate_numblocks_and_size_of_last_block()
{
// this is the mathematically correct way to do ceiling division
// (assumes no integer overflow)
g_num_blocks_in_file = (g_filesize + g_blocksize - 1) / g_blocksize;
// calculate the size of the last block of the file
if (g_filesize % g_blocksize == 0)
g_size_of_last_block = g_blocksize;
else
g_size_of_last_block = g_filesize % g_blocksize;
}
static void allocate_ringbuf()
{
// We only make 2 memory allocations in this entire program
// and they both happen in this function.
assert(g_blocksize % ALIGNMENT == 0);
// First, we allocate the entire underlying ring buffer
// (which is a contiguous block of memory
// containing all the actual buffers) in a single allocation.
// This is one big piece of memory which is used to hold
// the actual data from the file.
// The buf_addr field in the my_custom_data struct points to a buffer
// within this ring buffer.
unsigned char* ptr_to_underlying_ring_buffer;
// We need aligned memory, because O_DIRECT requires it.
if (posix_memalign((void**)&ptr_to_underlying_ring_buffer,
ALIGNMENT, g_blocksize * g_numbufs)) {
puts("posix_memalign failed!");
exit(1);
}
// Second, we allocate an array containing all of the my_custom_data structs.
// This is not an array of pointers, but an array holding all of the actual structs.
// All the items are the same size, which makes this easy
g_ringbuf = (struct my_custom_data*)
malloc(g_numbufs * sizeof(struct my_custom_data));
// We partially initialize all of the my_custom_data structs here.
// (The other fields are initialized in the
// add_read_request_to_submission_queue function)
off_t cur_offset = 0;
for (int i = 0; i < g_numbufs; ++i) {
g_ringbuf[i].buf_addr = ptr_to_underlying_ring_buffer +
cur_offset; // This will never change during the runtime of this program.
cur_offset += g_blocksize;
// g_ringbuf[i].bufsize = g_blocksize; // all the buffers are the same size.
g_ringbuf[i].state = ALREADY_CONSUMED; // We need to set all cells to
// ALREADY_CONSUMED at the start. See my article/blog post for explanation.
g_ringbuf[i].ringbuf_index = i; // This will never change during
// the runtime of this program.
}
}
int main(int argc, char *argv[])
{
assert(sizeof(size_t) >= 8); // we want 64 bit size_t for files larger than 4GB...
process_cmd_line_args(argc, argv); // we need to first parse the
// command line arguments
open_and_get_size_of_file(argv[1]);
calculate_numblocks_and_size_of_last_block();
allocate_ringbuf();
// Initialize the hasher.
blake3_hasher_init(&g_hasher);
blake3_hasher_init(&g_hasher2);
// Run the main loop
producer_thread();
puts("Number of requests in-flight:");
for (int i = 0; i < consume_idx; ++i){
printf("%d, ", in_flight_count[i]);
}
printf("\n");
printf("Num records: %d\n", consume_idx);
}
其他说明
我的系统上任何可能的文件哈希程序的理论速度极限是多少?要计算这一点,我们只需要知道两件事:第一,文件从文件系统中读取的速度有多快,第二,哈希计算的速度有多快。
在我用于测试 b3sum 程序的系统中,一个 1GiB 的文件(1073741824 字节)从文件系统中读取的最短时间是 0.302 秒(根据fio。有关更多详细信息,请参见下文),因此文件从文件系统中读取的最快速度约为 3.311GiB/s 或 3.555GB/s。
要衡量 BLAKE3 哈希的计算速度,只需运行示例程序两次,而不刷新页面缓存——第二次运行时长会大大缩短,反映出哈希速度,因为文件已经缓存到内存中了。在我的系统中,一个已缓存到内存中的 1GiB 文件可以在 0.252 秒内完成哈希,这大约是 3.968 GiB/s 或 4.26 GB/s。
因此,我们知道在我的系统中,哈希速度大于文件读取速度。这意味着什么?考虑一个简单的生产者-消费者实现,其中生产者线程一次读取一个块的文件系统中的文件,并将块交给消费者线程,消费者线程然后哈希每个传入的块。由于消费者线程消耗块的速度比生产者线程生成块的速度快,这意味着程序受限于生产者——在生产者完成读取所有块之前,程序无法完成。
因此,在该系统上,任何 b3sum 程序的理论速度极限是 3.311GiB/s 或 3.555GB/s——文件从文件系统中读取的速度。简单来说,这意味着在我的系统上,一个 1GiB 的文件不可能在不到 0.302 秒的时间内完成哈希。
这是我在当前系统上(也是本文开头基准测试所使用的系统)使用 128k 块大小时运行 fio 的结果。
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=100%
--io_size=1g --blocksize=128k --ioengine=io_uring --fsync=10000 --iodepth=4
--direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB,
(T) 128KiB-128KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=3797: Mon Jul 17 19:31:29 2023
read: IOPS=27.0k, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
slat (usec): min=2, max=118, avg= 8.32, stdev= 2.03
clat (usec): min=69, max=604, avg=138.31, stdev=33.95
lat (usec): min=78, max=722, avg=146.63, stdev=34.28
clat percentiles (usec):
| 1.00th=[ 122], 5.00th=[ 128], 10.00th=[ 130], 20.00th=[ 130],
| 30.00th=[ 131], 40.00th=[ 131], 50.00th=[ 131], 60.00th=[ 131],
| 70.00th=[ 133], 80.00th=[ 133], 90.00th=[ 137], 95.00th=[ 204],
| 99.00th=[ 334], 99.50th=[ 379], 99.90th=[ 449], 99.95th=[ 482],
| 99.99th=[ 603]
lat (usec) : 100=0.02%, 250=98.27%, 500=1.68%, 750=0.02%
cpu : usr=4.64%, sys=33.77%, ctx=8150, majf=0, minf=141
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s),
io=1024MiB (1074MB), run=303-303msec
fio 的结果取决于块大小和 I/O 深度。我发现使用 512KiB 的块大小在我的当前系统上得到了最好的结果。302ms 是我从 fio 看到的最低值。
$ fio --name TEST --eta-newline=5s --filename=temp.file
--rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring
--fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4444: Tue Jul 25 17:39:46 2023
read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
slat (usec): min=6, max=259, avg=17.83, stdev= 7.24
clat (usec): min=162, max=648, avg=276.04, stdev=49.54
lat (usec): min=175, max=770, avg=293.87, stdev=50.74
clat percentiles (usec):
| 1.00th=[ 251], 5.00th=[ 258], 10.00th=[ 260], 20.00th=[ 260],
| 30.00th=[ 262], 40.00th=[ 262], 50.00th=[ 262], 60.00th=[ 262],
| 70.00th=[ 262], 80.00th=[ 265], 90.00th=[ 330], 95.00th=[ 367],
| 99.00th=[ 529], 99.50th=[ 594], 99.90th=[ 619], 99.95th=[ 644],
| 99.99th=[ 652]
lat (usec) : 250=0.68%, 500=97.71%, 750=1.61%
cpu : usr=3.64%, sys=12.91%, ctx=2047, majf=0, minf=265
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s
(3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec
Disk stats (read/write):
nvme0n1: ios=1977/0, merge=0/0, ticks=481/0, in_queue=481, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g
--io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2
--direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB,
(W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4463: Tue Jul 25 17:39:47 2023
read: IOPS=6781, BW=3391MiB/s (3555MB/s)(1024MiB/302msec)
slat (usec): min=5, max=258, avg=13.00, stdev= 7.63
clat (usec): min=214, max=657, avg=281.06, stdev=49.22
lat (usec): min=234, max=767, avg=294.06, stdev=50.43
clat percentiles (usec):
| 1.00th=[ 258], 5.00th=[ 262], 10.00th=[ 265], 20.00th=[ 265],
| 30.00th=[ 265], 40.00th=[ 265], 50.00th=[ 269], 60.00th=[ 269],
| 70.00th=[ 269], 80.00th=[ 269], 90.00th=[ 334], 95.00th=[ 375],
| 99.00th=[ 537], 99.50th=[ 603], 99.90th=[ 619], 99.95th=[ 652],
| 99.99th=[ 660]
lat (usec) : 250=0.49%, 500=97.71%, 750=1.81%
cpu : usr=0.00%, sys=12.29%, ctx=2048, majf=0, minf=268
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3391MiB/s (3555MB/s), 3391MiB/s-3391MiB/s (3555MB/s-3555MB/s), io=1024MiB (1074MB), run=302-302msec
Disk stats (read/write):
nvme0n1: ios=1979/0, merge=0/0, ticks=496/0, in_queue=496, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g
--io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2
--direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4482: Tue Jul 25 17:39:47 2023
read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
slat (usec): min=6, max=262, avg=18.51, stdev= 7.30
clat (usec): min=223, max=646, avg=275.47, stdev=48.93
lat (usec): min=243, max=773, avg=293.98, stdev=50.54
clat percentiles (usec):
| 1.00th=[ 251], 5.00th=[ 258], 10.00th=[ 260], 20.00th=[ 260],
| 30.00th=[ 262], 40.00th=[ 262], 50.00th=[ 262], 60.00th=[ 262],
| 70.00th=[ 262], 80.00th=[ 262], 90.00th=[ 326], 95.00th=[ 367],
| 99.00th=[ 529], 99.50th=[ 586], 99.90th=[ 611], 99.95th=[ 644],
| 99.99th=[ 644]
lat (usec) : 250=0.83%, 500=97.56%, 750=1.61%
cpu : usr=0.00%, sys=16.89%, ctx=2048, majf=0, minf=266
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec
Disk stats (read/write):
nvme0n1: ios=1977/0, merge=0/0, ticks=480/0, in_queue=480, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g
--io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2
--direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4501: Tue Jul 25 17:39:48 2023
read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
slat (usec): min=6, max=260, avg=18.08, stdev= 7.25
clat (usec): min=230, max=649, avg=275.81, stdev=49.25
lat (usec): min=251, max=769, avg=293.89, stdev=50.52
clat percentiles (usec):
| 1.00th=[ 251], 5.00th=[ 258], 10.00th=[ 260], 20.00th=[ 260],
| 30.00th=[ 262], 40.00th=[ 262], 50.00th=[ 262], 60.00th=[ 262],
| 70.00th=[ 262], 80.00th=[ 265], 90.00th=[ 330], 95.00th=[ 367],
| 99.00th=[ 529], 99.50th=[ 594], 99.90th=[ 619], 99.95th=[ 644],
| 99.99th=[ 652]
lat (usec) : 250=0.39%, 500=98.00%, 750=1.61%
cpu : usr=1.32%, sys=15.23%, ctx=2049, majf=0, minf=266
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec
Disk stats (read/write):
nvme0n1: ios=1978/0, merge=0/0, ticks=481/0, in_queue=482, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000
--iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4520: Tue Jul 25 17:39:49 2023
read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
slat (usec): min=10, max=297, avg=31.39, stdev= 7.77
clat (usec): min=177, max=639, avg=262.32, stdev=49.90
lat (usec): min=229, max=800, avg=293.72, stdev=51.18
clat percentiles (usec):
| 1.00th=[ 237], 5.00th=[ 243], 10.00th=[ 245], 20.00th=[ 247],
| 30.00th=[ 247], 40.00th=[ 247], 50.00th=[ 247], 60.00th=[ 247],
| 70.00th=[ 247], 80.00th=[ 249], 90.00th=[ 314], 95.00th=[ 355],
| 99.00th=[ 523], 99.50th=[ 586], 99.90th=[ 603], 99.95th=[ 627],
| 99.99th=[ 644]
lat (usec) : 250=82.28%, 500=16.11%, 750=1.61%
cpu : usr=0.00%, sys=25.83%, ctx=2047, majf=0, minf=266
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec
Disk stats (read/write):
nvme0n1: ios=1977/0, merge=0/0, ticks=477/0, in_queue=477, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000
--iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4539: Tue Jul 25 17:39:50 2023
read: IOPS=6759, BW=3380MiB/s (3544MB/s)(1024MiB/303msec)
slat (usec): min=6, max=263, avg=18.32, stdev= 7.38
clat (usec): min=163, max=689, avg=275.59, stdev=50.07
lat (usec): min=177, max=952, avg=293.91, stdev=51.72
clat percentiles (usec):
| 1.00th=[ 251], 5.00th=[ 258], 10.00th=[ 260], 20.00th=[ 260],
| 30.00th=[ 260], 40.00th=[ 262], 50.00th=[ 262], 60.00th=[ 262],
| 70.00th=[ 262], 80.00th=[ 265], 90.00th=[ 326], 95.00th=[ 367],
| 99.00th=[ 545], 99.50th=[ 594], 99.90th=[ 644], 99.95th=[ 644],
| 99.99th=[ 693]
lat (usec) : 250=0.63%, 500=97.75%, 750=1.61%
cpu : usr=0.00%, sys=16.89%, ctx=2047, majf=0, minf=267
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3380MiB/s (3544MB/s), 3380MiB/s-3380MiB/s (3544MB/s-3544MB/s), io=1024MiB (1074MB), run=303-303msec
Disk stats (read/write):
nvme0n1: ios=1978/0, merge=0/0, ticks=482/0, in_queue=482, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000
--iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4558: Tue Jul 25 17:39:51 2023
read: IOPS=6781, BW=3391MiB/s (3555MB/s)(1024MiB/302msec)
slat (usec): min=6, max=260, avg=17.98, stdev= 7.13
clat (usec): min=180, max=745, avg=275.94, stdev=50.17
lat (usec): min=194, max=1006, avg=293.92, stdev=52.07
clat percentiles (usec):
| 1.00th=[ 251], 5.00th=[ 258], 10.00th=[ 260], 20.00th=[ 260],
| 30.00th=[ 262], 40.00th=[ 262], 50.00th=[ 262], 60.00th=[ 262],
| 70.00th=[ 262], 80.00th=[ 265], 90.00th=[ 326], 95.00th=[ 367],
| 99.00th=[ 545], 99.50th=[ 594], 99.90th=[ 644], 99.95th=[ 660],
| 99.99th=[ 742]
lat (usec) : 250=0.63%, 500=97.61%, 750=1.76%
cpu : usr=2.66%, sys=13.62%, ctx=2048, majf=0, minf=268
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3391MiB/s (3555MB/s),
3391MiB/s-3391MiB/s (3555MB/s-3555MB/s), io=1024MiB (1074MB), run=302-302msec
Disk stats (read/write):
nvme0n1: ios=1978/0, merge=0/0, ticks=480/0, in_queue=481, util=61.04%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=1g --blocksize=512k --ioengine=io_uring
--fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
TEST: (groupid=0, jobs=1): err= 0: pid=4577: Tue Jul 25 17:39:52 2023
read: IOPS=6781, BW=3391MiB/s (3555MB/s)(1024MiB/302msec)
slat (usec): min=6, max=262, avg=19.10, stdev= 7.35
clat (usec): min=218, max=654, avg=274.66, stdev=49.05
lat (usec): min=251, max=807, avg=293.75, stdev=50.54
clat percentiles (usec):
| 1.00th=[ 249], 5.00th=[ 258], 10.00th=[ 260], 20.00th=[ 260],
| 30.00th=[ 260], 40.00th=[ 260], 50.00th=[ 260], 60.00th=[ 260],
| 70.00th=[ 262], 80.00th=[ 262], 90.00th=[ 326], 95.00th=[ 367],
| 99.00th=[ 537], 99.50th=[ 586], 99.90th=[ 611], 99.95th=[ 644],
| 99.99th=[ 652]
lat (usec) : 250=1.03%, 500=97.41%, 750=1.56%
cpu : usr=0.00%, sys=17.61%, ctx=2048, majf=0, minf=268
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2048,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3391MiB/s (3555MB/s),
3391MiB/s-3391MiB/s (3555MB/s-3555MB/s), io=1024MiB (1074MB), run=302-302msec
Disk stats (read/write):
nvme0n1: ios=1979/0, merge=0/0, ticks=480/0, in_queue=481, util=61.04%
以下是将 I/O 大小设置为 10GiB 的结果,但不知何故这使得速度变慢了。
$ fio --name TEST --eta-newline=5s --filename=temp.file
--rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring
--fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3382MiB/s][r=3382 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3014: Tue Jul 25 00:02:27 2023
read: IOPS=3370, BW=3371MiB/s (3534MB/s)(10.0GiB/3038msec)
slat (usec): min=6, max=1227, avg=19.20, stdev=15.52
clat (usec): min=578, max=22014, avg=9451.40, stdev=1298.66
lat (usec): min=601, max=22032, avg=9470.60, stdev=1296.89
clat percentiles (usec):
| 1.00th=[ 5997], 5.00th=[ 9110], 10.00th=[ 9110], 20.00th=[ 9110],
| 30.00th=[ 9110], 40.00th=[ 9241], 50.00th=[ 9241], 60.00th=[ 9503],
| 70.00th=[ 9634], 80.00th=[ 9634], 90.00th=[ 9634], 95.00th=[ 9896],
| 99.00th=[16319], 99.50th=[19268], 99.90th=[21365], 99.95th=[21627],
| 99.99th=[21890]
bw ( MiB/s): min= 3360, max= 3406, per=100.00%, avg=3378.00, stdev=16.15, samples=6
iops : min= 3360, max= 3406, avg=3378.00, stdev=16.15, samples=6
lat (usec) : 750=0.04%, 1000=0.04%
lat (msec) : 2=0.16%, 4=0.39%, 10=95.45%, 20=3.54%, 50=0.39%
cpu : usr=0.59%, sys=8.23%, ctx=10225, majf=0, minf=30
IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=3371MiB/s (3534MB/s), 3371MiB/s-3371MiB/s (3534MB/s-3534MB/s), io=10.0GiB (10.7GB), run=3038-3038msec
Disk stats (read/write):
nvme0n1: ios=39088/0, merge=30/0, ticks=361238/0, in_queue=361238, util=96.76%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring
--fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3386MiB/s][r=3386 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3033: Tue Jul 25 00:02:39 2023
read: IOPS=3375, BW=3375MiB/s (3539MB/s)(10.0GiB/3034msec)
slat (usec): min=6, max=582, avg=30.97, stdev=13.43
clat (usec): min=514, max=2248, avg=1152.31, stdev=134.69
lat (usec): min=570, max=2265, avg=1183.28, stdev=134.35
clat percentiles (usec):
| 1.00th=[ 1074], 5.00th=[ 1074], 10.00th=[ 1074], 20.00th=[ 1090],
| 30.00th=[ 1090], 40.00th=[ 1090], 50.00th=[ 1106], 60.00th=[ 1106],
| 70.00th=[ 1156], 80.00th=[ 1188], 90.00th=[ 1221], 95.00th=[ 1467],
| 99.00th=[ 1778], 99.50th=[ 1909], 99.90th=[ 2008], 99.95th=[ 2057],
| 99.99th=[ 2245]
bw ( MiB/s): min= 3364, max= 3412, per=100.00%, avg=3381.33, stdev=17.00, samples=6
iops : min= 3364, max= 3412, avg=3381.33, stdev=17.00, samples=6
lat (usec) : 750=0.05%, 1000=0.08%
lat (msec) : 2=99.74%, 4=0.14%
cpu : usr=0.76%, sys=12.30%, ctx=10240, majf=0, minf=523
IO depths : 1=0.1%, 2=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
READ: bw=3375MiB/s (3539MB/s), 3375MiB/s-3375MiB/s (3539MB/s-3539MB/s), io=10.0GiB (10.7GB), run=3034-3034msec
Disk stats (read/write):
nvme0n1: ios=39137/0, merge=0/0, ticks=40995/0, in_queue=40995, util=96.76%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=128k --ioengine=io_uring
--fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB,
(T) 128KiB-128KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3383MiB/s][r=27.1k IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3071: Tue Jul 25 00:02:57 2023
read: IOPS=27.0k, BW=3373MiB/s (3537MB/s)(10.0GiB/3036msec)
slat (usec): min=3, max=120, avg=10.17, stdev= 1.31
clat (usec): min=81, max=2362, avg=137.30, stdev=37.04
lat (usec): min=100, max=2372, avg=147.47, stdev=37.04
clat percentiles (usec):
| 1.00th=[ 120], 5.00th=[ 126], 10.00th=[ 128], 20.00th=[ 129],
| 30.00th=[ 129], 40.00th=[ 129], 50.00th=[ 129], 60.00th=[ 130],
| 70.00th=[ 130], 80.00th=[ 130], 90.00th=[ 137], 95.00th=[ 204],
| 99.00th=[ 306], 99.50th=[ 359], 99.90th=[ 445], 99.95th=[ 478],
| 99.99th=[ 685]
bw ( MiB/s): min= 3362, max= 3407, per=100.00%, avg=3378.79, stdev=17.59, samples=6
iops : min=26896, max=27258, avg=27030.33, stdev=140.69, samples=6
lat (usec) : 100=0.01%, 250=98.11%, 500=1.86%, 750=0.02%
lat (msec) : 4=0.01%
cpu : usr=5.77%, sys=36.54%, ctx=81525, majf=0, minf=138
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=81920,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
READ: bw=3373MiB/s (3537MB/s), 3373MiB/s-3373MiB/s
(3537MB/s-3537MB/s), io=10.0GiB (10.7GB), run=3036-3036msec
Disk stats (read/write):
nvme0n1: ios=78208/2, merge=0/1, ticks=10769/0, in_queue=10770, util=97.03%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=256k --ioengine=io_uring
--fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB,
(T) 256KiB-256KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3387MiB/s][r=13.5k IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3090: Tue Jul 25 00:03:06 2023
read: IOPS=13.5k, BW=3377MiB/s (3541MB/s)(10.0GiB/3032msec)
slat (usec): min=4, max=181, avg=14.10, stdev= 2.00
clat (usec): min=186, max=908, avg=281.24, stdev=49.54
lat (usec): min=221, max=923, avg=295.33, stdev=49.53
clat percentiles (usec):
| 1.00th=[ 255], 5.00th=[ 262], 10.00th=[ 265], 20.00th=[ 265],
| 30.00th=[ 265], 40.00th=[ 265], 50.00th=[ 265], 60.00th=[ 265],
| 70.00th=[ 265], 80.00th=[ 269], 90.00th=[ 343], 95.00th=[ 375],
| 99.00th=[ 506], 99.50th=[ 570], 99.90th=[ 644], 99.95th=[ 717],
| 99.99th=[ 906]
bw ( MiB/s): min= 3365, max= 3410, per=100.00%, avg=3383.04, stdev=15.92, samples=6
iops : min=13460, max=13642, avg=13532.00, stdev=63.66, samples=6
lat (usec) : 250=0.35%, 500=98.51%, 750=1.09%, 1000=0.04%
cpu : usr=1.29%, sys=25.87%, ctx=40940, majf=0, minf=267
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=40960,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
READ: bw=3377MiB/s (3541MB/s), 3377MiB/s-3377MiB/s
(3541MB/s-3541MB/s), io=10.0GiB (10.7GB), run=3032-3032msec
Disk stats (read/write):
nvme0n1: ios=39157/2, merge=0/1, ticks=11087/0, in_queue=11088, util=96.90%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=512k --ioengine=io_uring
--fsync=10000 --iodepth=4 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=4
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3384MiB/s][r=6768 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3109: Tue Jul 25 00:03:14 2023
read: IOPS=6743, BW=3372MiB/s (3536MB/s)(10.0GiB/3037msec)
slat (usec): min=4, max=401, avg=11.26, stdev= 2.87
clat (usec): min=199, max=1699, avg=580.70, stdev=84.16
lat (usec): min=250, max=2043, avg=591.96, stdev=84.45
clat percentiles (usec):
| 1.00th=[ 537], 5.00th=[ 545], 10.00th=[ 545], 20.00th=[ 545],
| 30.00th=[ 545], 40.00th=[ 545], 50.00th=[ 545], 60.00th=[ 545],
| 70.00th=[ 545], 80.00th=[ 611], 90.00th=[ 660], 95.00th=[ 734],
| 99.00th=[ 963], 99.50th=[ 988], 99.90th=[ 1156], 99.95th=[ 1287],
| 99.99th=[ 1680]
bw ( MiB/s): min= 3362, max= 3404, per=100.00%, avg=3377.33, stdev=15.27, samples=6
iops : min= 6724, max= 6808, avg=6754.67, stdev=30.53, samples=6
lat (usec) : 250=0.01%, 500=0.03%, 750=95.28%, 1000=4.30%
lat (msec) : 2=0.38%
cpu : usr=1.02%, sys=11.20%, ctx=20478, majf=0, minf=14
IO depths : 1=0.1%, 2=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=20480,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
READ: bw=3372MiB/s (3536MB/s), 3372MiB/s-3372MiB/s
(3536MB/s-3536MB/s), io=10.0GiB (10.7GB), run=3037-3037msec
Disk stats (read/write):
nvme0n1: ios=39091/2, merge=0/1, ticks=21153/3, in_queue=21158, util=96.76%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=512k --ioengine=io_uring
--fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3388MiB/s][r=6776 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3128: Tue Jul 25 00:03:24 2023
read: IOPS=6756, BW=3378MiB/s (3543MB/s)(10.0GiB/3031msec)
slat (usec): min=9, max=311, avg=29.94, stdev= 3.30
clat (usec): min=198, max=892, avg=264.98, stdev=50.38
lat (usec): min=234, max=922, avg=294.93, stdev=50.66
clat percentiles (usec):
| 1.00th=[ 239], 5.00th=[ 245], 10.00th=[ 247], 20.00th=[ 247],
| 30.00th=[ 249], 40.00th=[ 249], 50.00th=[ 249], 60.00th=[ 251],
| 70.00th=[ 251], 80.00th=[ 251], 90.00th=[ 322], 95.00th=[ 359],
| 99.00th=[ 510], 99.50th=[ 570], 99.90th=[ 652], 99.95th=[ 742],
| 99.99th=[ 889]
bw ( MiB/s): min= 3367, max= 3413, per=100.00%, avg=3384.83, stdev=15.96, samples=6
iops : min= 6734, max= 6826, avg=6769.67, stdev=31.91, samples=6
lat (usec) : 250=61.26%, 500=37.62%, 750=1.06%, 1000=0.05%
cpu : usr=0.76%, sys=24.09%, ctx=20480, majf=0, minf=267
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=20480,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=3378MiB/s (3543MB/s), 3378MiB/s-3378MiB/s (3543MB/s-3543MB/s), io=10.0GiB (10.7GB), run=3031-3031msec
Disk stats (read/write):
nvme0n1: ios=39179/0, merge=0/0, ticks=9334/0, in_queue=9334, util=96.90%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=512k --ioengine=io_uring
--fsync=10000 --iodepth=2 --direct=0 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB,
(T) 512KiB-512KiB, ioengine=io_uring, iodepth=2
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1795MiB/s][r=3590 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3147: Tue Jul 25 00:03:34 2023
read: IOPS=3685, BW=1843MiB/s (1932MB/s)(10.0GiB/5557msec)
slat (usec): min=28, max=499, avg=74.93, stdev=74.14
clat (usec): min=30, max=4387, avg=434.37, stdev=184.91
lat (usec): min=229, max=4499, avg=509.29, stdev=147.50
clat percentiles (usec):
| 1.00th=[ 137], 5.00th=[ 200], 10.00th=[ 204], 20.00th=[ 210],
| 30.00th=[ 306], 40.00th=[ 314], 50.00th=[ 461], 60.00th=[ 545],
| 70.00th=[ 570], 80.00th=[ 635], 90.00th=[ 660], 95.00th=[ 660],
| 99.00th=[ 725], 99.50th=[ 824], 99.90th=[ 848], 99.95th=[ 848],
| 99.99th=[ 906]
bw ( MiB/s): min= 1643, max= 1962, per=99.97%, avg=1842.19, stdev=143.95, samples=11
iops : min= 3286, max= 3924, avg=3684.36, stdev=287.89, samples=11
lat (usec) : 50=0.17%, 100=0.01%, 250=23.12%, 500=27.00%, 750=48.97%
lat (usec) : 1000=0.72%
lat (msec) : 10=0.01%
cpu : usr=0.29%, sys=80.89%, ctx=26899, majf=0, minf=266
IO depths : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=20480,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2
Run status group 0 (all jobs):
READ: bw=1843MiB/s (1932MB/s), 1843MiB/s-1843MiB/s
(1932MB/s-1932MB/s), io=10.0GiB (10.7GB), run=5557-5557msec
Disk stats (read/write):
nvme0n1: ios=39591/3, merge=0/1, ticks=7997/1, in_queue=7998, util=92.45%
我还尝试了其他块大小和队列深度。我尝试了块大小 64k、128k、256k、512k、1024k,以及队列深度 2、4、8、16、32、64,但从未快于 302ms。对于读取 1Gib 文件,似乎 512k 块大小和队列深度为 2 是最快的。我还尝试将 --nonvectored 设置为 0 和 1,都没有区别。运行时间相同。这表明 read 和 readv 在性能上是相同的……
nonvectored=int : [io_uring] [io_uring_cmd]
With this option, fio will use non-vectored read/write commands, where address must contain the address directly. Default is -1.
我还尝试了启用和禁用 --fixedbufs
,结果也没有任何区别。
我还尝试了 --registerfiles
和 --sqthread_poll
,它们也没有任何区别。尝试使用 --hipri
会创建一个无法终止的永不结束的僵尸进程,该进程使用 100% 的 CPU——我非常确定这是因为 IORING_SETUP_IOPOLL,因为当我我的程序中启用 IORING_SETUP_IOPOLL 时,发生了同样的事情。
有趣的是,我的 liburing_b3sum_singlethread
程序能够在短短 2.896 秒内处理一个 10GiB 的文件,尽管 cat 到 /dev/null 比这还要慢。
$ echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 10GB.txt 512 8 1 0 20 0 0
opening file with O_DIRECT
BLAKE3 hash: 65f593d91b9301369724bb8e0f4da385baf46c73a5fb0ac996a5d5d714028ff6
real 0m2.896s
user 0m2.187s
sys 0m0.150s
$ echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 10GB.txt > /dev/null
real 0m2.900s
user 0m0.013s
sys 0m1.844s
请注意,这些数字与本文开头显示的数字不同。我认为我的系统出于某种我未知的原因,性能会波动,有时快,有时比平时慢,但进行比较时,我总是在同一时间段运行程序,以确保这种波动不会影响排名。
LVM 加密的影響
事实上,我大部分的项目开发都是在我以前的系统上完成的,那个系统使用了 LVM 加密。后来,我重新安装了 Debian,使用了非加密的 ext4,这就是我现在的系统。所以硬件和操作系统是完全相同的,唯一的显著区别是我的旧系统使用了 LVM 加密,而我的新系统没有。
这是我旧系统(使用 LVM 加密)上 cat 和 dd 到 /dev/null 的性能。
$ echo 1 > /proc/sys/vm/drop_caches
$ time cat 1GB.txt > /dev/null
real 0m0.540s
user 0m0.000s
sys 0m0.182s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=8K
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.661612 s, 1.6 GB/s
real 0m0.664s
user 0m0.004s
sys 0m0.226s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=16K
65536+0 records in
65536+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.581368 s, 1.8 GB/s
real 0m0.585s
user 0m0.014s
sys 0m0.196s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=32K
32768+0 records in
32768+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.574081 s, 1.9 GB/s
real 0m0.577s
user 0m0.004s
sys 0m0.198s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=64K
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.574562 s, 1.9 GB/s
real 0m0.577s
user 0m0.000s
sys 0m0.198s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=128K
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.563443 s, 1.9 GB/s
real 0m0.566s
user 0m0.004s
sys 0m0.195s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=256K
4096+0 records in
4096+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.58437 s, 1.8 GB/s
real 0m0.587s
user 0m0.000s
sys 0m0.190s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=4M
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.590212 s, 1.8 GB/s
real 0m0.593s
user 0m0.000s
sys 0m0.184s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=32M
32+0 records in
32+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.605234 s, 1.8 GB/s
real 0m0.608s
user 0m0.000s
sys 0m0.230s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=64M
16+0 records in
16+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.627107 s, 1.7 GB/s
real 0m0.630s
user 0m0.000s
sys 0m0.232s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=64K iflag=direct
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.09124 s, 513 MB/s
real 0m2.093s
user 0m0.000s
sys 0m0.200s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=512M iflag=direct
2+0 records in
2+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.648582 s, 1.7 GB/s
real 0m0.652s
user 0m0.000s
sys 0m0.061s
$ echo 1 > /proc/sys/vm/drop_caches
$ time dd if=1GB.txt of=/dev/null bs=1G iflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.539423 s, 2.0 GB/s
real 0m0.544s
user 0m0.004s
sys 0m0.102s
以下是我当前机器上 dd 的结果。
命令 | 最小值 | 中位数 | 最大值 |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=8K | ./example_64K | 0.363s | 0.3725s | 0.381s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=8K | ./example_2M | 0.362s | 0.37s | 0.384s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=16K | ./example_64K | 0.35s | 0.3555s | 0.361s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=16K | ./example_2M | 0.346s | 0.353s | 0.357s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=32K | ./example_64K | 0.333s | 0.341s | 0.349s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=32K | ./example_2M | 0.337s | 0.346s | 0.358s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K | ./example_64K | 0.332s | 0.3385s | 0.342s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K | ./example_2M | 0.335s | 0.337s | 0.345s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128K | ./example_64K | 0.351s | 0.3675s | 0.371s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128K | ./example_2M | 0.344s | 0.357s | 0.505s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=256K | ./example_64K | 0.532s | 0.55s | 0.556s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=256K | ./example_2M | 0.53s | 0.547s | 0.723s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M | ./example_64K | 0.542s | 0.545s | 0.548s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M | ./example_2M | 0.542s | 0.5445s | 0.548s |
以上结果表明,在我的系统上,dd 的最佳 bs 为 64KiB。为了确保这不是因为示例程序一次读取 64KiB,我创建了另一个版本的示例程序,该程序一次读取 2M(很简单,只需将 example.c 源代码中的 65536 改为 2 * 1024 * 1024)。它们在所有块大小下的性能(见上表)都相同,这让我认为这完全没有区别。
这里要强调的重要一点是,dd 的最佳块大小是 64KiB。更大的块大小速度更慢。以下是我当前系统上使用 O_DIRECT
的结果。
命令 | 最小值 | 中位数 | 最大值 |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=8K iflag=direct | ./example_64K | 1.334s | 1.343s | 1.475s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=16K iflag=direct | ./example_64K | 0.924s | 0.927s | 0.984s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=32K iflag=direct | ./example_64K | 0.687s | 0.721s | 0.721s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K iflag=direct | ./example_64K | 0.555s | 0.555s | 0.56s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K iflag=direct oflag=direct | ./example_64K | 0.561s | 0.567s | 0.573s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128K iflag=direct | ./example_64K | 0.56s | 0.5745s | 0.591s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=256K iflag=direct | ./example_64K | 0.572s | 0.573s | 0.574s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=512K iflag=direct | ./example_64K | 0.546s | 0.551s | 0.552s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=1M iflag=direct | ./example_64K | 0.553s | 0.556s | 0.557s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M iflag=direct | ./example_64K | 0.576s | 0.5865s | 0.594s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M iflag=direct | ./example_2M | 0.582s | 0.583s | 0.583s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=128M iflag=direct | ./example_64K | 0.565s | 0.579s | 0.585s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=1G iflag=direct | ./example_64K | 0.557s | 0.5605s | 0.564s |
echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M iflag=direct oflag=direct | ./example_2M | 0.745s | 0.754s | 0.755s |
有趣的是,在我的旧系统上,64K O_DIRECT
的速度明显比在新系统上慢。此外,使用 O_DIRECT 时,最佳块大小明显大于不使用 O_DIRECT 时。这是一个反复出现的主题——我也发现它对于非 io_uring 的多线程程序以及 io_uring 版本都成立。同样有趣的是,dd 使用 O_DIRECT 比不使用 O_DIRECT 的 dd 慢得多。我还发现,当使用 dd 将数据通过管道传输到我的程序时,bs 为 64KiB 是最佳的,但当使用 dd 将数据通过管道传输到 /dev/null 时,则使用更大的 bs,即 2M 是最佳的。还值得注意的是,cat 能够以最大可能的速度将数据通过管道传输到 /dev/null,尽管它使用了 128KiB 的缓冲区大小(当我使用 bs=128KiB 的 dd 时,它无法达到最大速度……)。
总之,这是我在旧系统上运行 fio 的结果。
fio --name TEST --eta-newline=5s --filename=temp.file
--rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring
--fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3522MiB/s][r=3522 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=39134: Tue Jul 11 10:51:55 2023
read: IOPS=3505, BW=3506MiB/s (3676MB/s)(10.0GiB/2921msec)
Run status group 0 (all jobs):
READ: bw=3506MiB/s (3676MB/s), 3506MiB/s-3506MiB/s
(3676MB/s-3676MB/s), io=10.0GiB (10.7GB), run=2921-2921msec
以下是我当前系统上获得的结果。
fio --name TEST --eta-newline=5s --filename=temp.file
--rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring
--fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3374MiB/s][r=3374 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=1160: Mon Jul 17 17:27:35 2023
read: IOPS=3363, BW=3364MiB/s (3527MB/s)(10.0GiB/3044msec)
Run status group 0 (all jobs):
READ: bw=3364MiB/s (3527MB/s), 3364MiB/s-3364MiB/s
(3527MB/s-3527MB/s), io=10.0GiB (10.7GB), run=3044-3044msec
有趣的是,在我旧的、使用 LVM 加密的系统上,我的程序的 io_uring
版本实际上比 cat 到 /dev/null 快了近一倍。是的,我在计时之前刷新了页面缓存。但在我当前、非加密的系统上,io_uring
版本和 cat 到 /dev/null 的速度几乎相同。我的猜测是加密增加了延迟,而我通过 io_uring 同时进行多个请求来缓解了这个问题,但 cat 程序无法做到这一点,因为它使用的是同步读取,必须等上一个读取完成后下一个才能开始。不过这只是我的猜测,尚未验证。但这也能解释为什么在我旧系统上,使用 io_uring
的 fio 比 cat 快。
但这并不能解释为什么我旧系统上的 fio 比当前系统上的快。对此我有两个猜测。
- 我的 SSD 速度不知何故变慢了。也许我需要 TRIM?
- 我的软件不知何故变差了。也许加密磁盘因为使用了 LVM 而更快?
我认为第一个猜测的可能性很小。我曾对 SSD 进行过大量测试,并从 fio 获得了一致的速度读数。第二个猜测似乎更合理,请参见:https://lkml.org/lkml/2010/5/7/223
在使用磁盘加密时,fio 的更多详细信息。
我使用的 fio 版本是 3.33。
这是我使用的命令。
fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=1024k --ioengine=mmap --fsync=10000 --iodepth=32
--direct=1 --numjobs=1 --runtime=60 --group_reporting
总之,以下是我在我机器上获得的数据。
sync (read): READ: bw=1313MiB/s (1377MB/s),
1313MiB/s-1313MiB/s (1377MB/s-1377MB/s), io=10.0GiB (10.7GB), run=7798-7798msec
psync (pread): READ: bw=1316MiB/s (1380MB/s),
1316MiB/s-1316MiB/s (1380MB/s-1380MB/s), io=10.0GiB (10.7GB), run=7783-7783msec
vsync (readv): READ: bw=1295MiB/s (1357MB/s),
1295MiB/s-1295MiB/s (1357MB/s-1357MB/s), io=2017MiB (2115MB), run=1558-1558msec
pvsync (preadv): READ: bw=1332MiB/s (1397MB/s),
1332MiB/s-1332MiB/s (1397MB/s-1397MB/s), io=10.0GiB (10.7GB), run=7688-7688msec
pvsync2 (preadv2): READ: bw=1304MiB/s (1368MB/s),
1304MiB/s-1304MiB/s (1368MB/s-1368MB/s), io=10.0GiB (10.7GB), run=7851-7851msec
io_uring: READ: bw=3501MiB/s (3671MB/s),
3501MiB/s-3501MiB/s (3671MB/s-3671MB/s), io=10.0GiB (10.7GB), run=2925-2925msec
libaio: READ: bw=3518MiB/s (3689MB/s),
3518MiB/s-3518MiB/s (3689MB/s-3689MB/s), io=10.0GiB (10.7GB), run=2911-2911msec
posixaio (aio_read): READ: bw=3518MiB/s (3689MB/s),
3518MiB/s-3518MiB/s (3689MB/s-3689MB/s), io=10.0GiB (10.7GB), run=2911-2911msec
mmap: READ: bw=664MiB/s (696MB/s),
664MiB/s-664MiB/s (696MB/s-696MB/s), io=10.0GiB (10.7GB), run=15431-15431msec
splice (splice and vmsplice): READ: bw=879MiB/s (921MB/s),
879MiB/s-879MiB/s (921MB/s-921MB/s), io=10.0GiB (10.7GB), run=11653-11653msec
请注意,3.5GB/s 是我 NVME 驱动器宣传的读取速度限制,并且经过第三方测试(我在网上找到的)验证,因此我非常确定 io_uring、libaio 和 posix aio 提供的数据已经达到了我存储设备的物理极限。
进一步说明
根据我在线阅读的一些资料,我曾认为 O_DIRECT
应该比不使用 O_DIRECT
更快,因为它更“直接”,即绕过了页面缓存。我在运行 fio
时进行了有无 O_DIRECT
的测试,发现了一些支持证据。在我的旧系统上,当我运行
fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000
--iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
它给出了 3.5GB/s。
但是当我运行
fio --name TEST --eta-newline=5s --filename=temp.file --rw=read
--size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000
--iodepth=32 --direct=0 --numjobs=1 --runtime=60 --group_reporting
它只给出了 2.3GB/s。
所以在我的旧系统上,O_DIRECT
使 fio 运行得更快。
在我的当前系统上,我得到了类似的结果。
$ fio --name TEST --eta-newline=5s --filename=temp.file
--rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=io_uring
--fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3364MiB/s][r=3364 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3178: Tue Jul 25 00:15:53 2023
read: IOPS=3277, BW=3278MiB/s (3437MB/s)(10.0GiB/3124msec)
slat (usec): min=6, max=315, avg=18.13, stdev= 8.69
clat (usec): min=570, max=22048, avg=9463.51, stdev=1283.30
lat (usec): min=601, max=22075, avg=9481.64, stdev=1282.47
clat percentiles (usec):
| 1.00th=[ 6783], 5.00th=[ 9110], 10.00th=[ 9110], 20.00th=[ 9110],
| 30.00th=[ 9110], 40.00th=[ 9241], 50.00th=[ 9241], 60.00th=[ 9503],
| 70.00th=[ 9634], 80.00th=[ 9634], 90.00th=[ 9634], 95.00th=[ 9896],
| 99.00th=[16319], 99.50th=[19268], 99.90th=[21627], 99.95th=[21890],
| 99.99th=[22152]
bw ( MiB/s): min= 2786, max= 3412, per=99.98%, avg=3277.33, stdev=241.55, samples=6
iops : min= 2786, max= 3412, avg=3277.33, stdev=241.55, samples=6
lat (usec) : 750=0.04%, 1000=0.04%
lat (msec) : 2=0.16%, 4=0.30%, 10=95.47%, 20=3.60%, 50=0.39%
cpu : usr=0.42%, sys=10.53%, ctx=10238, majf=0, minf=536
IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=3278MiB/s (3437MB/s), 3278MiB/s-3278MiB/s
(3437MB/s-3437MB/s), io=10.0GiB (10.7GB), run=3124-3124msec
Disk stats (read/write):
nvme0n1: ios=37911/0, merge=0/0, ticks=350894/0, in_queue=350893, util=93.96%
$ fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g
--io_size=10g --blocksize=1024k --ioengine=io_uring --fsync=10000 --iodepth=32
--direct=0 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [R(1)][80.0%][r=2048MiB/s][r=2048 IOPS][eta 00m:01s]
TEST: (groupid=0, jobs=1): err= 0: pid=3198: Tue Jul 25 00:16:40 2023
read: IOPS=2104, BW=2104MiB/s (2207MB/s)(10.0GiB/4866msec)
slat (usec): min=28, max=5133, avg=134.55, stdev=233.41
clat (usec): min=4680, max=31030, avg=13921.79, stdev=1822.86
lat (usec): min=5963, max=31226, avg=14056.34, stdev=1782.46
clat percentiles (usec):
| 1.00th=[ 7767], 5.00th=[11469], 10.00th=[13042], 20.00th=[13435],
| 30.00th=[13698], 40.00th=[13829], 50.00th=[13960], 60.00th=[14091],
| 70.00th=[14222], 80.00th=[14484], 90.00th=[14746], 95.00th=[15008],
| 99.00th=[21890], 99.50th=[24773], 99.90th=[29230], 99.95th=[29492],
| 99.99th=[30016]
bw ( MiB/s): min= 1820, max= 2328, per=99.54%, avg=2094.67, stdev=246.01, samples=9
iops : min= 1820, max= 2328, avg=2094.67, stdev=246.01, samples=9
lat (msec) : 10=2.64%, 20=95.90%, 50=1.46%
cpu : usr=0.06%, sys=91.00%, ctx=17435, majf=0, minf=536
IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=2104MiB/s (2207MB/s), 2104MiB/s-2104MiB/s
(2207MB/s-2207MB/s), io=10.0GiB (10.7GB), run=4866-4866msec
Disk stats (read/write):
nvme0n1: ios=39042/0, merge=0/0, ticks=165950/0, in_queue=165950, util=91.21%
因此,我认为 O_DIRECT
可能是达到最大性能所必需的,但它确实对块大小非常敏感。我记得当我刚开始使用 O_DIRECT
时,我对 O_DIRECT
和非 O_DIRECT
使用相同的块大小(64KiB),我发现 O_DIRECT
使我的程序变慢了。由于 O_DIRECT
的性能受块大小影响很大,选择错误的块大小会导致 O_DIRECT
的性能非常差,但我发现只有使用 O_DIRECT
才能获得最佳性能——您也可以自己验证这一点,因为我在我的程序中将是否使用 O_DIRECT
设置为了命令行参数,因此您可以轻松地比较在不同块大小下是否使用 O_DIRECT
的性能。
一开始,我使用的是 64KiB 的块大小,并且我没有预料到块大小会造成如此大的差异,所以最初我对 O_DIRECT
的性能相当失望,它比非 O_DIRECT
的性能差很多,直到我开始更改块大小。总之,我决定看看块大小(BS)和队列深度(QD)是否会影响性能,发现最佳的 BS 和 QD 组合是非 O_DIRECT
的 BS=256, QD=8;而对于 O_DIRECT
,最佳组合是 BS=4096, QD=8。
表格
缓冲区大小(KB) | 队列深度 | O_DIRECT | 循环调用进程 | 运行时间(秒) |
1 | 1 | 0 | 0 | 7.134 |
1 | 1 | 1 | 0 | 29.881 |
1 | 4000 | 0 | 0 | 3.647 |
1 | 4000 | 1 | 0 | 29.624 |
1 | 4000 | 0 | 1 | 3.619 |
1 | 4000 | 1 | 1 | 22.177 |
4 | 1 | 0 | 1 | 2.732 |
4 | 1 | 1 | 1 | 9.972 |
4 | 64 | 0 | 1 | 1.665 |
4 | 64 | 1 | 1 | 4.423 |
4 | 4000 | 0 | 0 | 1.591 |
4 | 4000 | 1 | 0 | 6.452 |
4 | 4000 | 0 | 1 | 1.572 |
4 | 4000 | 1 | 1 | 6.468 |
16 | 1 | 0 | 1 | 1.16 |
16 | 1 | 1 | 1 | 4.88 |
16 | 64 | 0 | 1 | 1.051 |
16 | 64 | 1 | 1 | 1.816 |
64 | 1 | 0 | 1 | 0.829 |
64 | 1 | 1 | 1 | 3.122 |
64 | 2 | 0 | 0 | 1.144 |
64 | 2 | 1 | 0 | 1.464 |
64 | 4 | 0 | 0 | 0.542 |
64 | 4 | 1 | 0 | 0.489 |
64 | 8 | 0 | 0 | 0.422 |
64 | 8 | 1 | 0 | 0.371 - 0.602 |
64 | 8 | 0 | 1 | 0.611 |
64 | 8 | 1 | 1 | 0.795 |
64 | 16 | 0 | 0 | 0.429 |
64 | 16 | 1 | 0 | 0.358 - 0.645 |
64 | 16 | 0 | 1 | 0.613 |
64 | 16 | 1 | 1 | 0.793 |
64 | 32 | 0 | 0 | 0.479 |
64 | 32 | 1 | 0 | 0.374 - 0.533 |
64 | 64 | 0 | 1 | 0.645 |
64 | 64 | 1 | 1 | 0.783 |
64 | 4000 | 0 | 0 | 0.388 |
64 | 4000 | 1 | 0 | 0.837 |
64 | 4000 | 0 | 1 | 0.37 |
64 | 4000 | 1 | 1 | 0.877 |
128 | 4 | 0 | 0 | 0.494 |
128 | 8 | 0 | 0 | 0.324 - 0.373 |
128 | 16 | 0 | 0 | 0.333 - 0.359 |
256 | 4 | 0 | 0 | 0.415 - 0.442 |
256 | 8 | 0 | 0 | 0.322 - 0.359 |
256 | 16 | 0 | 0 | 0.322 - 0.343 |
256 | 32 | 0 | 0 | 0.331 - 0.351 |
256 | 32 | 0 | 1 | 0.405 - 0.428 |
512 | 2 | 0 | 0 | 0.832 |
512 | 2 | 1 | 0 | 1.295 |
512 | 2 | 0 | 1 | 0.894 |
512 | 2 | 1 | 1 | 0.748 - 1.383 |
512 | 4 | 0 | 0 | 0.51 |
512 | 4 | 1 | 0 | 0.321 - 0.349 |
512 | 8 | 0 | 0 | 0.4 |
512 | 8 | 1 | 0 | 0.320 - 0.355 |
512 | 16 | 0 | 0 | 0.397 |
512 | 16 | 1 | 0 | 0.320 - 0.337 |
512 | 32 | 0 | 0 | 0.398 |
512 | 32 | 1 | 0 | 0.319 - 0.334 |
512 | 64 | 0 | 0 | 0.398 |
512 | 64 | 1 | 0 | 0.325-0.388 |
512 | 128 | 0 | 0 | 0.399 |
512 | 128 | 1 | 0 | 0.409 |
512 | 256 | 0 | 0 | 0.4 |
512 | 4000 | 0 | 0 | 0.406 |
1024 | 4 | 0 | 0 | 0.442 |
1024 | 4 | 1 | 0 | 0.339 - 0.343 |
1024 | 8 | 0 | 0 | 0.399 |
1024 | 8 | 1 | 0 | 0.320 - 0.321 |
1024 | 16 | 0 | 0 | 0.398 |
1024 | 16 | 1 | 0 | 0.319 - 0.358 |
1024 | 32 | 0 | 0 | 0.4 |
1024 | 32 | 1 | 0 | 0.319 - 0.373 |
1024 | 64 | 0 | 0 | 0.402 |
1024 | 64 | 1 | 0 | 0.320 - 0.443 |
1024 | 128 | 0 | 0 | 0.404 |
1024 | 128 | 1 | 0 | 0.45 |
1024 | 128 | 0 | 1 | 0.406 |
1024 | 128 | 1 | 1 | 0.477 |
2048 | 4 | 0 | 0 | 0.424 |
2048 | 4 | 1 | 0 | 0.341 - 0.361 |
2048 | 8 | 0 | 0 | 0.4 |
2048 | 8 | 1 | 0 | 0.319 - 0.336 |
2048 | 16 | 0 | 0 | 0.4 |
2048 | 16 | 1 | 0 | 0.329 - 0.372 |
2048 | 32 | 0 | 0 | 0.404 |
2048 | 32 | 1 | 0 | 0.402 - 0.410 |
4096 | 4 | 0 | 0 | 0.415 |
4096 | 4 | 1 | 0 | 0.383 - 0.525 |
4096 | 4 | 0 | 1 | 0.644 |
4096 | 4 | 1 | 1 | 0.71 |
4096 | 8 | 0 | 0 | 0.388 |
4096 | 8 | 1 | 0 | 0.309 - 0.343 |
4096 | 8 | 1 | 1 | 0.425 - 0.506 |
4096 | 16 | 0 | 0 | 0.398 |
4096 | 16 | 1 | 0 | 0.318 - 0.356 |
4096 | 16 | 0 | 1 | 0.498 |
4096 | 16 | 1 | 1 | 0.364 |
4096 | 32 | 0 | 0 | 0.402 |
4096 | 32 | 1 | 0 | 0.362 - 0.379 |
4096 | 64 | 0 | 0 | 0.418 |
4096 | 64 | 1 | 0 | 0.388 |
8192 | 4 | 0 | 0 | 0.416 |
8192 | 4 | 1 | 0 | 0.342 - 0.429 |
8192 | 8 | 0 | 0 | 0.387 |
8192 | 8 | 1 | 0 | 0.308 - 0.345 |
8192 | 8 | 1 | 1 | 0.414 - 0.464 |
8192 | 16 | 0 | 0 | 0.4 |
8192 | 16 | 1 | 0 | 0.333 - 0.348 |
8192 | 32 | 0 | 0 | 0.428 |
8192 | 32 | 1 | 0 | 0.351 |
8192 | 64 | 0 | 0 | 0.427 |
8192 | 64 | 1 | 0 | 0.4 |
16384 | 4 | 0 | 0 | 0.42 |
16384 | 4 | 1 | 0 | 0.329 - 0.367 |
16384 | 8 | 0 | 0 | 0.401 |
16384 | 8 | 1 | 0 | 0.327 - 0.338 |
16384 | 16 | 0 | 0 | 0.424 |
16384 | 16 | 1 | 0 | 0.333 - 0.360 |
16384 | 32 | 0 | 0 | 0.434 |
16384 | 32 | 1 | 0 | 0.361 - 0.390 |
以上结果是我程序的旧版本,该版本为每次读取请求都进行 malloc(或 2 次)并且不使用环形缓冲区。
malloc、free 和 O_DIRECT 之间的交互
我的 liburing b3sum 程序是原始 liburing cp 程序的修改版本,原始 liburing cp 程序为每次读取请求都进行 malloc(分配每个请求的缓冲区),所以我的程序最初也是这样。我的 liburing b3sum 程序的早期版本一次发出 64KiB 块的请求,这意味着在读取 1GiB 文件时进行了大量的 malloc,并且没有使用 O_DIRECT
,我注意到在我完成后调用 free 来释放这些块,使我的程序从大约 0.339s-0.348s 加速到 0.322s-0.332s,这是一个极其明显且显著的速度提升。
在我找到 O_DIRECT
和非 O_DIRECT
的最佳块大小后,我又尝试使用 free
来查看是否有区别。我发现取消注释代码中的 free() 后,
The O_DIRECT version (8192 8 1 0) went from 0.307-0.342s to 0.330-0.355s
The non-O_DIRECT version (256 8 0 0) went from 0.322-0.345s to 0.306-0.327s
现在非 O_DIRECT
版本又更快了!我不确定为什么调用 free 会减慢我的程序的 O_DIRECT
版本但加速非 O_DIRECT
版本。我的意思是,这使得非 O_DIRECT
版本加速是有道理的,想法是调用 free() 实际上并没有将内存返回给系统,而是将其标记为我的程序使用,这样未来的 malloc 调用就会简单地重用同一块内存,而不是实际向内核请求更多内存。实际上,在从 readv 切换到 read 的过程中,我在创建读取请求的函数中添加了第二个 malloc,这使得程序的运行时间从大约 0.333s 增加到大约 0.340s——当时我以为是因为 read 比 readv 慢,但现在我认为这可能是由于额外的 malloc。顺便说一句,当我从 readv 切换到 read 时,一个令我惊讶的事情是 io_uring read
(与常规 read 不同)接受一个偏移量,所以它实际上更像是 pread 而不是常规的 read 系统调用(它不接受偏移量)。
总之,我决定将我的程序中的 malloc 调用次数减少到启动时仅 2 次。现在我将在开始时分配所有缓冲区,现在 O_DIRECT
版本的运行时间为 0.297 秒,而 non-O_DIRECT
版本的运行时间为 0.363 秒。所以一次性分配所有内存将 O_DIRECT
版本加速了约 0.01 秒,但将非 O_DIRECT
版本减慢了约 0.057 秒!我还尝试了忙轮询(使用原子操作)而不是条件变量,这并没有对运行时间产生任何影响。顺便说一句,是的,0.297 秒低于我在当前系统上获得的任何数字,但 fio 在我旧系统上也有更快的速度——我怀疑这是因为我使用了 LVM,这在某种程度上使我的文件系统更快。
hdparm 结果
我当前系统上的 hdparm
$ hdparm --direct -t -T /dev/nvme0n1
/dev/nvme0n1:
Timing O_DIRECT cached reads: 2478 MB in 2.00 seconds = 1238.85 MB/sec
Timing O_DIRECT disk reads: 7970 MB in 3.00 seconds = 2656.42 MB/sec
$ hdparm --direct -t -T /dev/nvme0n1
/dev/nvme0n1:
Timing O_DIRECT cached reads: 3400 MB in 2.00 seconds = 1700.47 MB/sec
Timing O_DIRECT disk reads: 9816 MB in 3.00 seconds = 3271.78 MB/sec
$ hdparm --direct -t -T /dev/nvme0n1
/dev/nvme0n1:
Timing O_DIRECT cached reads: 3380 MB in 2.00 seconds = 1690.65 MB/sec
Timing O_DIRECT disk reads: 9834 MB in 3.00 seconds = 3277.73 MB/sec
我以前的系统(使用 LVM 加密)上的 hdparm
$ hdparm --direct -t -T /dev/nvme0n1p3
/dev/nvme0n1p3:
Timing O_DIRECT cached reads: 3510 MB in 2.00 seconds = 1755.66 MB/sec
Timing O_DIRECT disk reads: 9026 MB in 3.00 seconds = 3008.23 MB/sec
hdparm 一致地给出比 fio 低得多的读数,fio 使用 io_uring、libaio 和 POSIX aio 报告了 3.5GB/s。这可能是因为 hdparm 只使用普通的同步 read()。这是源代码。
static int read_big_block (int fd, char *buf)
{
int i, rc;
if ((rc = read(fd, buf, TIMING_BUF_BYTES)) != TIMING_BUF_BYTES) {
if (rc) {
if (rc == -1)
perror("read() failed");
else
fprintf(stderr, "read(%u) returned %u bytes\n", TIMING_BUF_BYTES, rc);
} else {
fputs ("read() hit EOF - device too small\n", stderr);
}
return 1;
}
/* access all sectors of buf to ensure the read fully completed */
for (i = 0; i < TIMING_BUF_BYTES; i += 512)
buf[i] &= 1;
return 0;
}
dd 真的在进行 read() 系统调用吗?
我想我需要证明 dd 确实是在进行带有 bs 大小的 read 系统调用,所以让我给您展示一下此命令生成的 ltrace 日志:echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ltrace -S -o ltrace1.out dd if=1GB.txt bs=64K | ltrace -S -o ltrace2.out ./example_64K
$ head -150 ltrace1.out
SYS_brk(0) = 0x56024ac4f000
SYS_mmap(0, 8192, 3, 34) = 0x7fed83534000
SYS_access("/etc/ld.so.preload", 04) = -2
SYS_openat(0xffffff9c, 0x7fed8355e0b1, 0x80000, 0) = 3
SYS_newfstatat(3, 0x7fed8355ec99, 0x7ffe6f8fd8a0, 4096) = 0
SYS_mmap(0, 0x63b6, 1, 2) = 0x7fed8352d000
SYS_close(3) = 0
SYS_openat(0xffffff9c, 0x7fed83534140, 0x80000, 0) = 3
SYS_read(3, "\177ELF\002\001\001\003", 832) = 832
SYS_pread(3, 0x7ffe6f8fd620, 784, 64) = 784
SYS_newfstatat(3, 0x7fed8355ec99, 0x7ffe6f8fd8a0, 4096) = 0
SYS_pread(3, 0x7ffe6f8fd4f0, 784, 64) = 784
SYS_mmap(0, 0x1e0f50, 1, 2050) = 0x7fed8334c000
SYS_mmap(0x7fed83372000, 0x155000, 5, 2066) = 0x7fed83372000
SYS_mmap(0x7fed834c7000, 0x53000, 1, 2066) = 0x7fed834c7000
SYS_mmap(0x7fed8351a000, 0x6000, 3, 2066) = 0x7fed8351a000
SYS_mmap(0x7fed83520000, 0xcf50, 3, 50) = 0x7fed83520000
SYS_close(3) = 0
SYS_mmap(0, 0x3000, 3, 34) = 0x7fed83349000
SYS_arch_prctl(4098, 0x7fed83349740, 0xffff80127ccb5f30, 34) = 0
SYS_set_tid_address(0x7fed83349a10, 0x7fed83349740, 0x7fed835690d0, 34) = 0x701f0
SYS_set_robust_list(0x7fed83349a20, 24, 0x7fed835690d0, 34) = 0
SYS_334(0x7fed8334a060, 32, 0, 0x53053053) = 0
SYS_mprotect(0x7fed8351a000, 16384, 1) = 0
SYS_mprotect(0x56024a719000, 4096, 1) = 0
SYS_mprotect(0x7fed83566000, 8192, 1) = 0
SYS_prlimit64(0, 3, 0, 0x7ffe6f8fe3e0) = 0
SYS_munmap(0x7fed8352d000, 25526) = 0
getenv("POSIXLY_CORRECT") = nil
sigemptyset(<>) = 0
sigaddset(<9>, SIGUSR1) = 0
sigaction(SIGINT, nil <unfinished ...>
SYS_rt_sigaction(2, 0, 0x7ffe6f8fe4c0, 8) = 0
<... sigaction resumed> , { 0, <>, 0x5f004554, 0x5f796c7261655f63 }) = 0
sigaddset(<1,9>, SIGINT) = 0
sigismember(<1,9>, SIGUSR1) = 1
sigaction(SIGUSR1, { 0x56024a7086b0, <1,9>, 0, 0 } <unfinished ...>
SYS_rt_sigaction(10, 0x7ffe6f8fe420, 0, 8) = 0
<... sigaction resumed> , nil) = 0
sigismember(<1,9>, SIGINT) = 1
sigaction(SIGINT, { 0x56024a7086a0, <1,9>, 0, 0 } <unfinished ...>
SYS_rt_sigaction(2, 0x7ffe6f8fe420, 0, 8) = 0
<... sigaction resumed> , nil) = 0
strrchr("dd", '/') = nil
setlocale(LC_ALL, "" <unfinished ...>
SYS_318(0x7fed83525478, 8, 1, 0x7fed83361e40) = 8
SYS_brk(0) = 0x56024ac4f000
SYS_brk(0x56024ac70000) = 0x56024ac70000
SYS_openat(0xffffff9c, 0x7fed834eaf50, 0x80000, 0) = 3
SYS_newfstatat(3, 0x7fed834e1dd5, 0x7fed8351fb60, 4096) = 0
SYS_mmap(0, 0x2e8610, 1, 2) = 0x7fed83000000
SYS_close(3) = 0
<... setlocale resumed> ) = "en_GB.UTF-8"
bindtextdomain("coreutils", "/usr/share/locale") = "/usr/share/locale"
textdomain("coreutils") = "coreutils"
__cxa_atexit(0x56024a708a20, 0, 0x56024a71a268, 0) = 0
getpagesize() = 4096
getopt_long(3, 0x7ffe6f8fe7c8, "", 0x56024a719d20, nil) = -1
strchr("if=1GB.txt", '=') = "=1GB.txt"
strchr("bs=64K", '=') = "=64K"
__errno_location() = 0x7fed833496c0
__ctype_b_loc() = 0x7fed833496d8
strtoumax(0x7ffe6f8ffda0, 0x7ffe6f8fe5d0, 10, 0x7fed8301982c) = 64
strchr("bcEGkKMPTwYZ0", 'K') = "KMPTwYZ0"
strchr("bcEGkKMPTwYZ0", '0') = "0"
strchr("64K", 'B') = nil
open("1GB.txt", 0, 00 <unfinished ...>
SYS_openat(0xffffff9c, 0x7ffe6f8ffd95, 0, 0) = 3
<... open resumed> ) = 3
dup2(3, 0 <unfinished ...>
SYS_dup2(3, 0) = 0
<... dup2 resumed> ) = 0
__errno_location() = 0x7fed833496c0
close(3 <unfinished ...>
SYS_close(3) = 0
<... close resumed> ) = 0
lseek(0, 0, 1 <unfinished ...>
SYS_lseek(0, 0, 1) = 0
<... lseek resumed> ) = 0
__errno_location() = 0x7fed833496c0
dcgettext(0, 0x56024a715475, 5, 0x7fed83444177 <unfinished ...>
SYS_openat(0xffffff9c, 0x7ffe6f8fe0d0, 0x80000, 0) = 3
SYS_newfstatat(3, 0x7fed834e1dd5, 0x7ffe6f8fdf00, 4096) = 0
SYS_read(3, "# Locale name alias data base.\n#"..., 4096) = 2996
SYS_read(3, "", 4096) = 0
SYS_close(3) = 0
SYS_openat(0xffffff9c, 0x56024ac50410, 0, 0) = -2
SYS_openat(0xffffff9c, 0x56024ac50670, 0, 0) = -2
<... dcgettext resumed> ) = 0x56024a715475
clock_gettime(1, 0x7ffe6f8fe540, 0x56024a715475, 0x56024a715475) = 0
aligned_alloc(4096, 0x10000, 0, 24) = 0x56024ac53000
read(0 <unfinished ...>
SYS_read(0, "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536) = 65536
<... read resumed> , "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536) = 65536
write(1, "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536 <unfinished ...>
SYS_write(1, "\277\377#IX0=\251\353\201\343v\345(\221\032I\253\333\v2#\370J\304\352\300\342\016(\006\255"..., 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536) = 65536
<... read resumed> , "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536) = 65536
write(1, "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536 <unfinished ...>
SYS_write(1, "y]mq\351\312\36167\341\023\336\220e\242\265G\020VY]\352\022Z/\241-C\357\270K\237"..., 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "\230\322GS\223", 65536) = 65536
<... read resumed> , "\230\322GS\223", 65536) = 65536
write(1, "\230\322GS\223", 65536 <unfinished ...>
SYS_write(1, "\230\322GS\223", 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536) = 65536
<... read resumed> , "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536) = 65536
write(1, "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536 <unfinished ...>
SYS_write(1, "\251\252-\322p\365\3644\237\333\272\024\335\226\262\337\353\263+a\305\224Bg2\357]U\240G\2625"..., 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536) = 65536
<... read resumed> , "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536) = 65536
write(1, "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536 <unfinished ...>
SYS_write(1, "\330K\253,\3775\362\272\206\213>\272\217\233\234\027\367\372Q\342Y\223c\234\301\241r\311\362\237\001V"..., 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536) = 65536
<... read resumed> , "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536) = 65536
write(1, "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536 <unfinished ...>
SYS_write(1, "*WG\314\356\253\230\0319\256E\004\022\214\343", 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "EU\230\b\306", 65536) = 65536
<... read resumed> , "EU\230\b\306", 65536) = 65536
write(1, "EU\230\b\306", 65536 <unfinished ...>
SYS_write(1, "EU\230\b\306", 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536) = 65536
<... read resumed> , "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536) = 65536
write(1, "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536 <unfinished ...>
SYS_write(1, "Q\366\034&\357\020\0371Ln\254\347\202\260Q\254\343\242\203\254f\375\2164\022\376\253FJ\320W\262"..., 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536) = 65536
<... read resumed> , "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536) = 65536
write(1, "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536 <unfinished ...>
SYS_write(1, "}n\t7\226\001\306Q\3029}\3408l!\351\026\177F\037x\237BQ\2242\215\030\021\342\364F"..., 65536) = 65536
<... write resumed> ) = 65536
read(0 <unfinished ...>
SYS_read(0, "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536) = 65536
<... read resumed> , "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536) = 65536
write(1, "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536 <unfinished ...>
SYS_write(1, "\rx\213\343\021\213M\243\345Vl\304+G6J\250\202\372\266`k\354W\364>\233\337Q\323\037\253"..., 65536) = 65536
看到了吗?这里没有任何猫腻——dd 确实进行了所有这些 SYS_read 系统调用,请求 64KiB 的数据,然后将其写出。我还验证了当您使用 bs=2M 时,dd 会一次请求 2M 的数据。为了保险起见,我也看了 cat:echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ltrace -S -o ltrace1.out cat 1GB.txt | ltrace -S -o ltrace2.out ./example_64K
,并发现,正如预期的那样,cat 一次读取文件 128KiB 并将其写出。它确实使用 128KiB 作为参数进行了这些 read() 系统调用(SYS_read)。总之,我发现当使用 dd 将数据通过管道传输到我的程序时,bs 为 64KiB 是最佳的,但当使用 dd 将数据通过管道传输到 /dev/null 时,使用更大的 bs,即 2M 是最佳的。还值得注意的是,cat 能够以最大可能的速度将数据通过管道传输到 /dev/null,尽管它使用了 128KiB 的缓冲区大小(当我使用 bs=128KiB 的 dd 时,它无法达到最大速度……)。
动机
那么我为什么写这个程序呢?嗯,几年前,我运行自己的 MediaWiki 实例,并遇到了一些问题(例如,数据库启动失败并出现一些错误,搜索引擎有一天停止索引新页面,重命名页面导致链接损坏等),所以我决定写自己的维基,并且我想:“好吧,如果我要写自己的维基,我也应该为我的维基写一个备份工具”,所以我决定写一个备份工具,然后我想:“但是如果备份损坏了怎么办?”,所以我决定包含前向错误校正(ECC)与我的备份,这时我查看了 par2cmdline 并注意到它相当慢,所以我决定写我自己的 ECC 工具,然后我开始阅读 PAR3 规范,并看到它使用了 BLAKE3 哈希。所以我开始查看哈希算法,我发现大多数非加密哈希算法(例如 xxhash)在碰撞概率上没有保证,实际上我认为大多数非加密哈希算法设计用于哈希表,其中碰撞可以被检测和解决,而不是用于检测文件损坏,因为根本无法知道是否发生了碰撞。诚然,损坏的文件与原始文件具有相同哈希的概率很小,但我希望使该概率真正可以忽略不计,这样我就不必担心它。所以我开始查看加密哈希——md5、sha1、sha2、sha3 等等,我发现 BLAKE3 是我能找到的最快的单线程加密哈希(是的,尽管我的系统有 sha_ni 但没有 AVX512——只有 AVX2 😭😭😭,BLAKE3 甚至比 OpenSSL 的 SHA1 和 SHA2 实现还要快),而且它不像其他哈希那样是可并行化的,所以我决定采用 BLAKE3。总之,我开始实现我自己的 ECC 工具,第一步是计算文件的哈希值。这就是我创建这个程序的原因。