File I/O and String Performance Comparison of C#, C++, and C (Three Ways)





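File and string processing benchmarks comparing different languages and approaches
Introduction
I have always been interested in comparing the performance of different programming languages, and of different approaches to basic programming problems. So I wrote the same application in C#, C++, and C, and drew conclusions about which languages and approaches to use in which situations.
The Benchmark
The problem I came up with is a program that loads an input CSV file into an array of data structures, then uses those data structures to write the data back out to an output CSV file. It is a basic benchmark of file I/O and data structure serialization.
I chose Unicode-encoded text files (UTF-16, which .NET calls Encoding.Unicode) for the CSV data because C# is a Unicode language and C/C++ handle Unicode data well these days. Each line of CSV text has seven data fields of typical demographic data: first name, last name, address line 1, address line 2, city, state, and zip code. To keep things simple, each field is limited to 127 Unicode characters.
The CSV generation program (the "gen" project in the attached code) writes 100,000 lines of this random CSV data to a file on the desktop.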
// C# script to generate the data used by the test programs
using System.Text;
// These constants are shared with the C program that uses fixed-length fields
const int FIELD_LEN = 128;
const int FIELD_CHAR_LEN = FIELD_LEN - 1;
const int RECORD_COUNT = 100 * 1000;
var rnd = new Random(0); // same output each gen
string rnd_str()
{
StringBuilder sb = new StringBuilder(FIELD_CHAR_LEN);
int rnd_len = rnd.Next(1, FIELD_CHAR_LEN);
for (int r = 1; r <= rnd_len; ++r)
sb.Append((char)((sbyte)'A' + rnd.Next(0, 25)));
return sb.ToString();
}
string output_file_path =
Path.Combine
(
Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
"recordio.csv"
);
if (output_file_path == null)
throw new NullReferenceException("output_file_path");
using (var output = new StreamWriter(output_file_path, false, Encoding.Unicode))
{
for (int r = 1; r <= RECORD_COUNT; ++r)
output.Write($"{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()}\n");
}
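Let's cut to the chase: what are the results?
Here are the times for the load and write phases of the different programs. Each program runs its load/write cycle four times. For each program I took the best load time and the best write time across all runs, the "best of the best".
Approach / Language | Load (ms) | Write (ms) |
Approach 1: C# loop (net) | 317 | 178 |
Approach 2: C# batch (net2) | 223 | 353 |
Approach 3: C++ loop (class) | 2,489 | 1,379 |
Approach 4: C batch (struct) | 107 | 147 |
Approach 5: C++ batch (class2) | 202 | 136 |
Approach 6: C/C++ batch (recordio) | 108 | 149 |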
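The C++ and C listings in this article call a `timer` helper (`timer t; t.report("...")`) that is never shown. As a rough guess at its shape, and not the author's actual code, the sketch below uses `std::chrono::steady_clock` to time each phase and adds a hypothetical `best_of` function that picks the smallest time from a set of runs, which is how a "best of the best" figure could be derived.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the article's `timer` helper (not shown there).
class timer
{
public:
    using clock_type = std::chrono::steady_clock;

    timer() : m_start(clock_type::now()) {}

    // Return the milliseconds elapsed since construction or the previous
    // lap, then restart the clock for the next phase.
    long long lap_ms()
    {
        const clock_type::time_point now = clock_type::now();
        const long long ms = static_cast<long long>(
            std::chrono::duration_cast<std::chrono::milliseconds>(now - m_start).count());
        m_start = now;
        return ms;
    }

    // Print the lap time with a label and return it so callers can keep it.
    long long report(const char* label)
    {
        const long long ms = lap_ms();
        std::printf("%s: %lld ms\n", label, ms);
        return ms;
    }

private:
    clock_type::time_point m_start;
};

// Smallest time among the recorded runs, or -1 if no runs were recorded.
long long best_of(const std::vector<long long>& runs_ms)
{
    if (runs_ms.empty())
        return -1;
    return *std::min_element(runs_ms.begin(), runs_ms.end());
}
```

Collecting the value returned by `report` for each of the four iterations and passing the load times and write times separately to `best_of` would give one best load figure and one best write figure per program.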
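Conclusions and Takeaways
The C# programs (loop and batch) are clean, readable code with good performance. The loop program uses StreamReader / StreamWriter and is intuitive and easy to develop and maintain. The batch program uses the File class functions ReadAllLines / WriteAllLines and is much faster than the C# loop program at reading (LINQ included), but slower at writing. Given this, you would use ReadAllLines / LINQ for loading and StreamWriter for writing.
The big news is that Approach 3, the C++ loop (class), has some really serious problems. They come down to the std::getline call when loading and the stream output when writing; the cost of the other code is small. I would be very interested in having people reproduce these numbers and report how to fix these performance problems.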
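The C batch program handily beats the others at loading data because it uses fixed-size strings packed together in structs that are stored sequentially in a single array, so data locality is excellent. We are reading 90 MB of CSV in about 100 ms. Wow! For some reason, the C program is slower at writing the data to the output file; the code looks clean, and I am not sure why.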
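The 90 MB figure can be checked from the generator's constants. The generator calls `rnd.Next(1, FIELD_CHAR_LEN)`, whose upper bound is exclusive, so field lengths fall between 1 and 126 characters. The sketch below estimates the file size under the assumption that those lengths are uniformly distributed and that every character takes two bytes; the function name and parameters are made up for this illustration, and the two-byte byte-order mark is ignored.

```cpp
#include <cstddef>

// Expected size in bytes of the generated CSV, assuming field lengths are
// uniform over [min_len, max_len] and each character takes bytes_per_char
// bytes. Hypothetical helper, not part of the article's code.
double expected_csv_bytes(std::size_t records, std::size_t fields,
                          std::size_t min_len, std::size_t max_len,
                          std::size_t bytes_per_char)
{
    // Mean length of one field, e.g. (1 + 126) / 2 = 63.5 characters.
    const double mean_field_len = (min_len + max_len) / 2.0;

    // Each record has (fields - 1) commas plus one newline: `fields` extra
    // characters in total.
    const double chars_per_record = fields * mean_field_len + fields;

    return static_cast<double>(records) * chars_per_record * bytes_per_char;
}
```

Calling `expected_csv_bytes(100000, 7, 1, 126, 2)` gives 90,300,000 bytes: 451.5 characters per record on average, two bytes each, times 100,000 records.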
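The "C++ batch" program, inspired by the C program, is nowhere near C at reading the data, but it beats every other program at writing it. You would choose C for reading and C++ for writing.
Approach 6, recordio (rhymes with rodeo), is a hybrid of the C and C++ batch approaches wrapped in a reusable package. At 108 / 149 it beats C#'s best of 223 / 178. The difference in write performance is not significant, but the 2x improvement in load performance cannot be ignored. Fixed-length strings packed into a single vector of structs are something no C++ wstring or C# string storage can match. The code is clean, both in the implementation and in how the test driver shows the reusable class being used, so give it a try.
My overall recommendation is to do this sort of thing in C/C++, using structs with fixed-width fields just as you would imagine the data being stored in a database, and to take advantage of recordio or something like it for excellent file and string I/O.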
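To make the fixed-width layout concrete, here is a minimal sketch of such a record and of what it costs in memory. It swaps the article's `wchar_t` for `char16_t` so that the sizes are the same on every platform; `fixed_info` and `footprint_bytes` are illustrative names, not part of recordio.

```cpp
#include <cstddef>
#include <type_traits>

constexpr std::size_t FIELD_LEN = 128; // 127 characters plus a terminator

// Fixed-width record: every field is stored inline, so an array of these is
// one contiguous block with no per-string heap allocations.
struct fixed_info
{
    char16_t firstname[FIELD_LEN];
    char16_t lastname[FIELD_LEN];
    char16_t address1[FIELD_LEN];
    char16_t address2[FIELD_LEN];
    char16_t city[FIELD_LEN];
    char16_t state[FIELD_LEN];
    char16_t zipcode[FIELD_LEN];
};

// The sketch assumes 16-bit code units and a plain, copyable layout.
static_assert(sizeof(char16_t) == 2, "sketch assumes 2-byte char16_t");
static_assert(std::is_trivially_copyable<fixed_info>::value,
              "record must be safe to copy as raw bytes");

// Bytes needed to hold record_count records in one array.
constexpr std::size_t footprint_bytes(std::size_t record_count)
{
    return record_count * sizeof(fixed_info);
}
```

Each record occupies 7 x 128 x 2 = 1,792 bytes, so 100,000 records take 179,200,000 bytes, roughly twice the size of the CSV file itself. That is the trade-off behind the locality: unused space in every short field in exchange for a layout with no pointers to chase.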
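If you are working in C#, use File.ReadAllLines and LINQ to load the data and StreamWriter to write it back out, and you get respectable performance with great productivity and safety.
Here is more detail on the different approaches.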
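Approach 1: C# Loop
In the attached code, this is the "net" project. This is probably the most intuitive language and approach. You use StreamReader to slurp in the data, then use StreamWriter to write it all out.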
// C# performance test program
using System.Diagnostics;
using System.Text;
var sw = Stopwatch.StartNew();
string input_file_path =
Path.Combine
(
Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
"recordio.csv"
);
if (input_file_path == null)
throw new NullReferenceException("input_file_path");
for (int iter = 1; iter <= 4; ++iter)
{
// Read lines into a list of objects
var nets = new List<info>();
using (var input = new StreamReader(input_file_path, Encoding.Unicode))
{
while (true)
{
string? line = input.ReadLine();
if (line == null)
break;
else
nets.Add(new info(line.Split(',')));
}
}
Console.WriteLine($".NET load took {sw.ElapsedMilliseconds} ms");
sw.Restart();
// Write the objects to an output CSV file
using (var output = new StreamWriter("output.csv", false, Encoding.Unicode))
{
foreach (var cur in nets)
output.Write($"{cur.firstname},{cur.lastname},{cur.address1},{cur.address2},{cur.city},{cur.state},{cur.zipcode}\n");
}
Console.WriteLine($".NET write took {sw.ElapsedMilliseconds} ms");
sw.Restart();
}
// NOTE: Using struct did not change performance, probably because the
// contents of the strings are not stored consecutively, so
// any data locality with the array of info objects is irrelevant
class info
{
public info(string[] parts)
{
firstname = parts[0];
lastname = parts[1];
address1 = parts[2];
address2 = parts[3];
city = parts[4];
state = parts[5];
zipcode = parts[6];
}
public string firstname;
public string lastname;
public string address1;
public string address2;
public string city;
public string state;
public string zipcode;
}
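Approach 2: C# Batch
In the attached source, this is the "net2" project. You might say to yourself, "Looping is tedious. I can use File class functions like ReadAllLines to do things in batch. I trust .NET!" It is certainly less code... Note that the LINQ Select in this listing is deferred: the Split calls and the info construction do not run until nets is enumerated during the write phase, so the load time reported by this program covers only ReadAllLines.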
// C# performance test program
using System.Diagnostics;
using System.Runtime.ConstrainedExecution;
using System.Text;
var sw = Stopwatch.StartNew();
string input_file_path =
Path.Combine
(
Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
"recordio.csv"
);
if (input_file_path == null)
throw new NullReferenceException("input_file_path");
for (int iter = 1; iter <= 4; ++iter)
{
// Read CSV file into a list of objects
var nets =
File.ReadAllLines(input_file_path, Encoding.Unicode)
.Select(line => new info(line.Split(',')));
Console.WriteLine($".NET 2 load took {sw.ElapsedMilliseconds} ms");
sw.Restart();
// Write the objects to an output CSV file
int count = nets.Count();
string[] strs = new string[count];
int idx = 0;
foreach (var cur in nets)
strs[idx++] = $"{cur.firstname},{cur.lastname},{cur.address1},{cur.address2},{cur.city},{cur.state},{cur.zipcode}\n";
File.WriteAllLines("output.csv", strs, Encoding.Unicode);
Console.WriteLine($".NET 2 write took {sw.ElapsedMilliseconds} ms");
sw.Restart();
}
Approach 3: C++ Loop
C++ has come a long way with Unicode file I/O and streams. It is now easy to write C++ code that rivals C#'s loop Approach 1 in readability and simplicity.
// C++ loop performance test program
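#include <codecvt> // deprecated since C++17
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <vector>
// NOTE: timer, FIELD_COUNT, and FIELD_CHAR_LEN are not defined in this listing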
// Our record type, just a bunch of wstrings
struct info
{
info(const std::vector<std::wstring>& parts)
: firstname(parts[0])
, lastname(parts[1])
, address1(parts[2])
, address2(parts[3])
, city(parts[4])
, state(parts[5])
, zipcode(parts[6])
{
}
std::wstring firstname;
std::wstring lastname;
std::wstring address1;
std::wstring address2;
std::wstring city;
std::wstring state;
std::wstring zipcode;
};
// Split a string by a separator, returning a vector of substrings
std::vector<std::wstring> split(const std::wstring& str, const wchar_t seperator)
{
std::vector<std::wstring> retVal;
retVal.reserve(FIELD_COUNT); // cheat...
std::wstring acc;
acc.reserve(FIELD_CHAR_LEN); // ...a little
for (wchar_t c : str)
{
if (c == seperator)
{
retVal.push_back(acc);
acc.clear();
}
else
acc.push_back(c);
}
if (!acc.empty())
retVal.push_back(acc);
return retVal;
}
int main(int argc, char* argv[])
{
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
// Read the file into a vector of line strings
std::vector<std::wstring> lines;
{
std::wifstream input(argv[1], std::ios::binary);
input.imbue(std::locale(input.getloc(), new std::codecvt_utf16<wchar_t,
0x10ffff, std::codecvt_mode(std::consume_header | std::little_endian)>));
if (!input)
{
std::cout << "Opening input file failed\n";
return 1;
}
std::wstring line;
while (std::getline(input, line))
lines.push_back(line);
}
// Process the lines into a vector of structs
std::vector<info> infos;
infos.reserve(lines.size());
for (const auto& line : lines)
infos.emplace_back(split(line, ','));
t.report("class load ");
// Write the structs to an output CSV file
{
std::wofstream output("output.csv", std::ios::binary);
output.imbue(std::locale(output.getloc(), new std::codecvt_utf16<wchar_t,
0x10ffff, std::codecvt_mode(std::generate_header | std::little_endian)>));
if (!output)
{
std::cout << "Opening output file failed\n";
return 1;
}
for (const auto& record : infos)
{
output
<< record.firstname << ','
<< record.lastname << ','
<< record.address1 << ','
<< record.address2 << ','
<< record.city << ','
<< record.state << ','
<< record.zipcode << '\n';
}
}
t.report("class write");
}
}
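For the file load step, a quick timing check shows that all of the time is spent in the std::getline() call. For the file write step, all of the time is spent in the output loop, and where else would it be? This mystery is left as an exercise for the reader. What is wrong with this simple code?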
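One direction a reader might try, offered as an untested assumption and not as a measured fix, is to bypass the wide stream and its codecvt facet entirely: read the file as raw bytes and decode the UTF-16LE code units by hand. The sketch below shows only the decoding and line-splitting step on an in-memory byte buffer. `split_utf16le_lines` is a hypothetical name; it uses `char16_t` instead of `wchar_t`, passes surrogate pairs through as plain code units, and does not strip carriage returns.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-16LE byte buffer into lines of char16_t code units.
// A leading byte-order mark (FF FE) is skipped if present.
std::vector<std::u16string> split_utf16le_lines(const std::string& bytes)
{
    std::vector<std::u16string> lines;
    std::u16string current;
    std::size_t pos = 0;

    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
    {
        pos = 2;
    }

    // Combine each pair of bytes (low byte first) into one code unit.
    for (; pos + 1 < bytes.size(); pos += 2)
    {
        const char16_t unit = static_cast<char16_t>(
            static_cast<unsigned char>(bytes[pos]) |
            (static_cast<unsigned char>(bytes[pos + 1]) << 8));

        if (unit == u'\n')
        {
            lines.push_back(current);
            current.clear();
        }
        else
        {
            current.push_back(unit);
        }
    }

    // Keep a final line that has no trailing newline.
    if (!current.empty())
        lines.push_back(current);

    return lines;
}
```

The bytes could come from a `std::ifstream` opened in binary mode and read in one call, much as Approach 5 does. Whether this closes the gap seen in Approach 3 would have to be measured.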
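Approach 4: C Batch
If we are willing to load the entire text into memory, we can play tricks by slicing and dicing the data in place, using fixed-length string buffers and character-level string operations that take advantage of data locality and unsafe operations. What fun!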
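// Written in a C style, but compiled as C++ (references, nullptr, std::max)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> // int64_t
#include <wchar.h> // wcslen
#include <algorithm> // std::max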
const size_t FIELD_LEN = 128;
const size_t FIELD_CHAR_LEN = FIELD_LEN - 1;
const size_t FIELD_COUNT = 7;
const size_t RECORD_LEN = std::max(FIELD_COUNT * FIELD_LEN + 1, size_t(1024));
// Struct with fixed char array fields
struct info
{
wchar_t firstname[FIELD_LEN];
wchar_t lastname[FIELD_LEN];
wchar_t address1[FIELD_LEN];
wchar_t address2[FIELD_LEN];
wchar_t city[FIELD_LEN];
wchar_t state[FIELD_LEN];
wchar_t zipcode[FIELD_LEN];
};
// Read a comma-delimited string out of a buffer
void read_str(const wchar_t*& input, wchar_t* output)
{
size_t copied = 0;
while (*input && *input != ',')
*output++ = *input++;
*output = '\0';
if (*input == ',')
++input;
}
// Initialize a record using a buffer of text
void set_record(info& record, const wchar_t* buffer)
{
read_str(buffer, record.firstname);
read_str(buffer, record.lastname);
read_str(buffer, record.address1);
read_str(buffer, record.address2);
read_str(buffer, record.city);
read_str(buffer, record.state);
read_str(buffer, record.zipcode);
}
// Output a record to a buffer of text
wchar_t* add_to_buffer(const wchar_t* input, wchar_t* output, wchar_t separator)
{
while (*input)
*output++ = *input++;
*output++ = separator;
return output;
}
int64_t output_record(const info& record, wchar_t* buffer)
{
const wchar_t* original = buffer;
buffer = add_to_buffer(record.firstname, buffer, ',');
buffer = add_to_buffer(record.lastname, buffer, ',');
buffer = add_to_buffer(record.address1, buffer, ',');
buffer = add_to_buffer(record.address2, buffer, ',');
buffer = add_to_buffer(record.city, buffer, ',');
buffer = add_to_buffer(record.state, buffer, ',');
buffer = add_to_buffer(record.zipcode, buffer, '\n');
return buffer - original;
}
int main(int argc, char* argv[])
{
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
// Open input file
FILE* input_file = nullptr;
if (fopen_s(&input_file, argv[1], "rb") != 0)
{
printf("Opening input file failed\n");
return 1;
}
// Compute file length
fseek(input_file, 0, SEEK_END);
int file_len = ftell(input_file);
fseek(input_file, 0, SEEK_SET);
// Read file into memory
wchar_t* file_contents = (wchar_t*)malloc(file_len + 2);
if (file_contents == nullptr)
{
printf("Allocating input buffer failed\n");
return 1;
}
if (fread(file_contents, file_len, 1, input_file) != 1)
{
printf("Reading input file failed\n");
return 1;
}
size_t char_len = file_len / 2;
file_contents[char_len] = '\0';
fclose(input_file);
input_file = nullptr;
// Compute record count and delineate the line strings
size_t record_count = 0;
for (size_t idx = 0; idx < char_len; ++idx)
{
if (file_contents[idx] == '\n')
{
++record_count;
file_contents[idx] = '\0';
}
}
// Allocate record array
info* records = (info*)malloc(record_count * sizeof(info));
if (records == nullptr)
{
printf("Allocating records list failed\n");
return 1;
}
// Process memory text into records
wchar_t* cur_str = file_contents;
wchar_t* end_str = cur_str + file_len / 2;
size_t record_idx = 0;
while (cur_str < end_str)
{
set_record(records[record_idx++], cur_str);
cur_str += wcslen(cur_str) + 1;
}
if (record_idx != record_count)
{
printf("Record counts differ: idx: %d - count: %d\n",
(int)record_idx, (int)record_count);
return 1;
}
t.report("struct load ");
// Write output file
// RECORD_LEN counts characters, so scale by sizeof(wchar_t) to get bytes
wchar_t* file_output = (wchar_t*)malloc(record_count * RECORD_LEN * sizeof(wchar_t));
if (file_output == nullptr)
{
printf("Allocating file output buffer failed\n");
return 1;
}
size_t output_len = 0;
for (size_t r = 0; r < record_count; ++r)
{
int new_output = output_record(records[r], file_output + output_len);
if (new_output < 0)
{
printf("Writing to output buffer failed\n");
return 1;
}
output_len += new_output;
}
FILE* output_file = nullptr;
if (fopen_s(&output_file, "output.csv", "wb") != 0)
{
printf("Opening output file failed\n");
return 1;
}
if (fwrite(file_output, output_len * 2, 1, output_file) != 1)
{
printf("Writing output file failed\n");
return 1;
}
fclose(output_file);
output_file = nullptr;
t.report("struct write");
// Clean up
free(file_contents);
file_contents = nullptr;
free(records);
records = nullptr;
free(file_output);
file_output = nullptr;
}
return 0;
}
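The read_str helper in the listing above copies characters until it reaches a comma or the end of the line, without checking how much room the destination has. That is fine for the generated test data, where every field fits in FIELD_LEN, but not for arbitrary input. A bounds-checked variant might look like the sketch below; `read_str_bounded` and its `capacity` parameter are illustrative additions, not part of the article's code.

```cpp
#include <cstddef>

// Copy one comma-delimited field from `input` into `output`, storing at most
// capacity - 1 characters plus a terminator. `input` is always advanced past
// the whole field and its trailing comma, even when the copy is truncated.
// Returns true if the field fit, false if it was truncated.
bool read_str_bounded(const wchar_t*& input, wchar_t* output, std::size_t capacity)
{
    if (capacity == 0)
        return false; // no room even for the terminator; nothing is consumed

    std::size_t copied = 0;
    bool truncated = false;

    while (*input && *input != L',')
    {
        if (copied + 1 < capacity)
            output[copied++] = *input;
        else
            truncated = true; // keep scanning so the next field starts cleanly
        ++input;
    }
    output[copied] = L'\0';

    if (*input == L',')
        ++input;

    return !truncated;
}
```

In set_record, each call would then pass FIELD_LEN as the capacity, for example `read_str_bounded(buffer, record.firstname, FIELD_LEN)`. The extra comparison per character has some cost, which would need to be measured against the numbers reported here.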
Approach 5: C++ Batch
Perhaps that pile of C code put you off a bit. Can we apply the same batch approach in C++?
// C++ batch test program
#include <codecvt>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
// Our record type, just a bunch of wstrings
struct info
{
info(const std::vector<wchar_t*>& parts)
: firstname(parts[0])
, lastname(parts[1])
, address1(parts[2])
, address2(parts[3])
, city(parts[4])
, state(parts[5])
, zipcode(parts[6])
{
}
std::wstring firstname;
std::wstring lastname;
std::wstring address1;
std::wstring address2;
std::wstring city;
std::wstring state;
std::wstring zipcode;
};
void parse_parts(wchar_t* buffer, wchar_t separator, std::vector<wchar_t*>& ret_val)
{
ret_val.clear();
ret_val.push_back(buffer); // start at the beginning
while (*buffer)
{
if (*buffer == separator)
{
*buffer = '\0';
if (*(buffer + 1))
ret_val.push_back(buffer + 1);
}
++buffer;
}
}
int main(int argc, char* argv[])
{
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
// Read the file into memory
std::vector<char> file_contents;
{
std::ifstream file(argv[1], std::ios::binary | std::ios::ate);
std::streamsize size = file.tellg();
file.seekg(0, std::ios::beg);
file_contents.resize(size + 2); // save room for null termination
if (!file.read(file_contents.data(), size))
{
std::cout << "Reading file failed\n";
return 1;
}
// null terminate
file_contents.push_back(0);
file_contents.push_back(0);
}
// Get the lines out of the data
std::vector<wchar_t*> line_pointers;
parse_parts(reinterpret_cast<wchar_t*>
(file_contents.data()), '\n', line_pointers);
// Process the lines into data structures
std::vector<info> infos;
infos.reserve(line_pointers.size());
std::vector<wchar_t*> line_parts;
for (wchar_t* line : line_pointers)
{
parse_parts(line, ',', line_parts);
infos.emplace_back(line_parts);
}
t.report("C++ 2 load");
// Write the structs to an output CSV file
std::wstring output_str;
output_str.reserve(file_contents.size() / 2);
for (const auto& record : infos)
{
output_str += record.firstname;
output_str += ',';
output_str += record.lastname;
output_str += ',';
output_str += record.address1;
output_str += ',';
output_str += record.address2;
output_str += ',';
output_str += record.city;
output_str += ',';
output_str += record.state;
output_str += ',';
output_str += record.zipcode;
output_str += '\n';
}
std::ofstream output_file("output.csv", std::ios::binary);
if (!output_file)
{
std::cout << "Opening output file failed\n";
return 1;
}
output_file.write(reinterpret_cast<const char*>
(output_str.c_str()), output_str.size() * 2);
output_file.close();
t.report("C++ 2 write");
}
}
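Approach 6: C/C++ Hybrid
Let's take the best of Approaches 4 and 5 and build a reusable class with the potential for real-world use.
First, the reusable class. It is templated on the record type, so the record class can define its own fields and record length limit.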
// C/C++ hybrid file / string I/O class
namespace recordio
{
template<class record>
class lineio
{
public:
static void load(const char* inputFilePath, std::vector<record>& records)
{
// Initialize our output
records.clear();
// Open input file
FILE* input_file = nullptr;
if (fopen_s(&input_file, inputFilePath, "rb") != 0)
{
throw std::runtime_error("Opening input file failed");
}
// Compute file length
fseek(input_file, 0, SEEK_END);
int file_len = ftell(input_file);
fseek(input_file, 0, SEEK_SET);
// Read file into memory
size_t char_len = file_len / 2;
std::unique_ptr<wchar_t[]> file_contents(new wchar_t[file_len / 2 + 1]);
if (fread(reinterpret_cast<void*>
(file_contents.get()), file_len, 1, input_file) != 1)
{
throw std::runtime_error("Reading input file failed\n");
}
file_contents[char_len] = '\0';
fclose(input_file);
input_file = nullptr;
// Compute record count and delineate the line strings
size_t record_count = 0;
for (size_t idx = 0; idx < char_len; ++idx)
{
if (file_contents[idx] == '\n')
{
++record_count;
file_contents[idx] = '\0';
}
}
records.reserve(record_count);
// Process memory text into records
wchar_t* cur_str = file_contents.get();
wchar_t* end_str = cur_str + file_len / 2;
while (cur_str < end_str)
{
records.emplace_back(cur_str);
cur_str += wcslen(cur_str) + 1;
}
}
static void write(const char* outputFilePath, const std::vector<record>& records)
{
std::wstring output_str;
output_str.reserve(record::max_record_length * records.size());
for (const auto& cur : records)
{
cur.get_record_str(output_str);
output_str += '\n';
}
// Write output file
std::ofstream output_file(outputFilePath, std::ios::binary);
if (!output_file)
{
throw std::runtime_error("Opening output file failed");
}
output_file.write(reinterpret_cast<const char*>
(output_str.c_str()), output_str.size() * 2);
}
// (the read_str helper used by record types is not shown in this excerpt)
};
}
The "derived" record type has a bit of work to do, but it is pretty simple.
// C/C++ hybrid class and test driver
// Struct with fixed char array fields
// Looks familiar...
struct info
{
info() {}
info(const wchar_t* str)
{
recordio::lineio<info>::read_str(str, firstname);
recordio::lineio<info>::read_str(str, lastname);
recordio::lineio<info>::read_str(str, address1);
recordio::lineio<info>::read_str(str, address2);
recordio::lineio<info>::read_str(str, city);
recordio::lineio<info>::read_str(str, state);
recordio::lineio<info>::read_str(str, zipcode);
}
void get_record_str(std::wstring& str) const
{
str += firstname;
str += ',';
str += lastname;
str += ',';
str += address1;
str += ',';
str += address2;
str += ',';
str += city;
str += ',';
str += state;
str += ',';
str += zipcode;
}
const static size_t max_field_length = FIELD_LEN;
const static size_t max_record_length = RECORD_LEN;
wchar_t firstname[FIELD_LEN];
wchar_t lastname[FIELD_LEN];
wchar_t address1[FIELD_LEN];
wchar_t address2[FIELD_LEN];
wchar_t city[FIELD_LEN];
wchar_t state[FIELD_LEN];
wchar_t zipcode[FIELD_LEN];
};
int main(int argc, char* argv[])
{
if (argc != 2)
{
printf("Usage: recordio <input CSV file path>\n");
return 0;
}
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
std::vector<info> records;
recordio::lineio<info>::load(argv[1], records);
t.report("recordio load ");
recordio::lineio<info>::write("output.csv", records);
t.report("recordio write ");
}
printf("All done.\n");
return 0;
}
That's it! I look forward to your comments!
History
- September 25, 2022: Initial version