C#, C++, and C File I/O and String Performance Comparison (Three Ways)

Michael Sydney Balloni


September 26, 2022

CPOL


Evaluating file and string processing benchmarks across different languages and approaches

Download source code - 29.2 KB

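Introduction

I have long been interested in comparing the performance of different programming languages, and of different approaches to basic programming problems, so I developed applications in C#, C++, and C and drew conclusions about which languages and approaches to use in which circumstances.

The Benchmark

The problem I came up with is for a program to load an input CSV file into an array of data structures, then use those data structures to write the data back out to an output CSV file. It is a basic file I/O and data structure serialization benchmark.

I chose Unicode-encoded text (UTF-16, which .NET calls Encoding.Unicode) for the CSV data because C# is a Unicode language and C/C++ handle Unicode data well at this point. Each line of CSV text has seven data fields of typical demographic data: first name, last name, address line 1, address line 2, city, state, and zip code. To keep things simple, each field is limited to 127 Unicode characters.

The CSV generation program (the "gen" project in the attached code) writes 100,000 lines of this random CSV data to a file on the desktop.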

// C# script to generate the data used by the test programs
using System.Text;

// These constants are shared with the C program that uses fixed-length fields
const int FIELD_LEN = 128;
const int FIELD_CHAR_LEN = FIELD_LEN - 1;

const int RECORD_COUNT = 100 * 1000;

var rnd = new Random(0); // same output each gen

string rnd_str()
{
    StringBuilder sb = new StringBuilder(FIELD_CHAR_LEN);
    // Random.Next's upper bound is exclusive, so lengths are 1..126...
    int rnd_len = rnd.Next(1, FIELD_CHAR_LEN);
    for (int r = 1; r <= rnd_len; ++r)
        // ...and letters are 'A'..'Y'
        sb.Append((char)((sbyte)'A' + rnd.Next(0, 25)));
    return sb.ToString();
}

string output_file_path = 
    Path.Combine
    (
        Environment.GetFolderPath(Environment.SpecialFolder.Desktop), 
        "recordio.csv"
    );
if (output_file_path == null)
    throw new NullReferenceException("output_file_path");

using (var output = new StreamWriter(output_file_path, false, Encoding.Unicode))
{
    for (int r = 1; r <= RECORD_COUNT; ++r)
        output.Write($"{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()}\n");
}

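Let's cut to the chase: what are the results?

Here are the timings for the load and write phases of the different programs. Each program runs its load/write cycle four times. I took the best load time and the best write time across all runs of each program, the "best of the best."

Approach / Language	Load (ms)	Write (ms)
Approach 1: C# loop (net)	317	178
Approach 2: C# batch (net2)	223	353
Approach 3: C++ loop (class)	2,489	1,379
Approach 4: C batch (struct)	107	147
Approach 5: C++ batch (class2)	202	136
Approach 6: C/C++ hybrid (recordio)	108	149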
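
The timer class used by the test programs is not shown in the listings, so the following is only a sketch of how a "best of the best" figure could be collected. It assumes a hypothetical helper named best_of_ms, which is not part of the attached code, that runs one phase several times with std::chrono::steady_clock and keeps the fastest run.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper (not the article's timer class): runs a callable
// `runs` times and returns the fastest wall-clock time in milliseconds.
template <class Fn>
std::int64_t best_of_ms(int runs, Fn&& fn)
{
    assert(runs > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < runs; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        fn(); // the phase being measured, e.g., one load or one write
        const auto stop = std::chrono::steady_clock::now();
        const auto ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        best = std::min<std::int64_t>(best, ms);
    }
    return best;
}
```

Called as best_of_ms(4, load_phase), with load_phase standing for any callable, this would yield one cell of the table above. Taking the minimum rather than the mean filters out runs disturbed by a cold file cache or other machine activity.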

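Conclusions and Takeaways

The C# programs, loop and batch, are clean, easy to read, and perform well. The loop program uses StreamReader / StreamWriter and is intuitive and easy to develop and maintain. The batch program uses the File class functions ReadAllLines / WriteAllLines and is much faster than the C# loop program at reading (223 ms vs. 317 ms, LINQ included), but slower at writing (353 ms vs. 178 ms). Given this, you would use ReadAllLines / LINQ for loading and StreamWriter for writing.

The big news is that Approach 3, C++ loop (class), has some very serious problems. It comes down to the std::getline call for loading and the stream output for writing; the other code costs little. I would be interested in folks reproducing these numbers and reporting how to solve these performance problems.

The C batch program handily beats the others at loading data because it uses fixed-size strings packed together in a struct and stored sequentially in a single array, so data locality is excellent. We are reading 90 MB of CSV in about 100 ms. Wow! For some reason, the C program is slightly slower than the C++ batch program at writing the data to the output file (147 ms vs. 136 ms); the code looks clean, and I am not sure why.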
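
To make the data locality point concrete, here is a minimal sketch of the fixed-width layout. It assumes char16_t in place of the article's wchar_t so that each character is 2 bytes on every platform (wchar_t is 4 bytes on most Unix systems); the names fixed_info and record_stride are hypothetical and not part of the attached code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t FIELD_LEN = 128; // 127 characters plus the terminator
constexpr std::size_t FIELD_COUNT = 7;

// Assumption: char16_t stands in for the article's wchar_t.
struct fixed_info
{
    char16_t firstname[FIELD_LEN];
    char16_t lastname[FIELD_LEN];
    char16_t address1[FIELD_LEN];
    char16_t address2[FIELD_LEN];
    char16_t city[FIELD_LEN];
    char16_t state[FIELD_LEN];
    char16_t zipcode[FIELD_LEN];
};

// All text lives inside the struct itself: no pointers, no padding.
static_assert(sizeof(fixed_info) == FIELD_COUNT * FIELD_LEN * sizeof(char16_t),
              "fields are expected to be packed back to back");

// Byte distance between two consecutive records in a vector.
inline std::size_t record_stride(const std::vector<fixed_info>& v)
{
    assert(v.size() >= 2);
    return static_cast<std::size_t>(reinterpret_cast<const char*>(&v[1]) -
                                    reinterpret_cast<const char*>(&v[0]));
}
```

With 2-byte characters each record occupies 1,792 bytes, and record N+1 starts exactly where record N ends. A struct of std::wstring members, by contrast, generally keeps only small bookkeeping data in the struct and stores longer strings in separate heap allocations.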

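Inspired by the C program, the C++ batch program is nowhere near as fast as C at reading data (202 ms vs. 107 ms), but it beats every other program at writing data (136 ms). You would choose C for reading and C++ for writing.

Approach 6, recordio (rhymes with rodeo), is a hybrid of the C and C++ batch approaches wrapped in a reusable package. It comes in at 108 / 149, beating C#'s best of 223 / 178. The difference in write performance is not significant. The 2x improvement in load performance cannot be ignored. It comes from fixed-length strings packed into a single vector of structs, something no C++ wstring or C# string storage can match. The code is clean, both in the implementation and in the test driver's illustration of how to use the reusable class, so give it a try.

My overall recommendation is to do this sort of thing in C/C++, using structs with fixed-width fields, just as you would imagine the data being stored in a database, and to take advantage of recordio or something similar for great file and string I/O.

If you are working in C#, use File.ReadAllLines and LINQ to load the data and StreamWriter to write it back out, and get respectable performance with great productivity and safety.

Here is more detail on the different approaches.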

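Approach 1: C# Loop

In the attached code, this is the "net" project. This is probably the most intuitive language and approach. You use StreamReader to slurp up the data, then StreamWriter to write it all out.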

// C# performance test program
using System.Diagnostics;
using System.Text;

var sw = Stopwatch.StartNew();

string input_file_path =
    Path.Combine
    (
        Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
        "recordio.csv"
    );
if (input_file_path == null)
    throw new NullReferenceException("input_file_path");

for (int iter = 1; iter <= 4; ++iter)
{
    // Read lines into a list of objects
    var nets = new List<info>();
    using (var input = new StreamReader(input_file_path, Encoding.Unicode))
    {
        while (true)
        {
            string? line = input.ReadLine();
            if (line == null)
                break;
            else
                nets.Add(new info(line.Split(',')));
        }
    }
    Console.WriteLine($".NET load  took {sw.ElapsedMilliseconds} ms");
    sw.Restart();

    // Write the objects to an output CSV file
    using (var output = new StreamWriter("output.csv", false, Encoding.Unicode))
    {
        foreach (var cur in nets)
            output.Write($"{cur.firstname},{cur.lastname},{cur.address1},{cur.address2},{cur.city},{cur.state},{cur.zipcode}\n");
    }
    Console.WriteLine($".NET write took {sw.ElapsedMilliseconds} ms");
    sw.Restart();
}

// NOTE: Using struct did not change performance, probably because the 
//       contents of the strings are not stored consecutively, so
//       any data locality with the array of info objects is irrelevant
class info 
{
    public info(string[] parts)
    {
        firstname = parts[0];
        lastname = parts[1];

        address1 = parts[2];
        address2 = parts[3];

        city = parts[4];
        state = parts[5];
        zipcode = parts[6];
    }

    public string firstname;
    public string lastname;
    public string address1;
    public string address2;

    public string city;
    public string state;
    public string zipcode;
}

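Approach 2: C# Batch

In the attached source code, this is the "net2" project. You might say to yourself, "Loops are tedious. I can use File class functions like ReadAllLines to do things in batches. I trust .NET!" It is certainly less code...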

// C# performance test program
using System.Diagnostics;
using System.Text;

var sw = Stopwatch.StartNew();

string input_file_path =
    Path.Combine
    (
        Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
        "recordio.csv"
    );
if (input_file_path == null)
    throw new NullReferenceException("input_file_path");

for (int iter = 1; iter <= 4; ++iter)
{
    // Read CSV file into a list of objects
    // NOTE: Select is lazily evaluated, so only ReadAllLines runs here;
    //       the Split calls and info construction run when nets is
    //       enumerated below (once by Count() and again by foreach)
    var nets =
        File.ReadAllLines(input_file_path, Encoding.Unicode)
        .Select(line => new info(line.Split(',')));
    Console.WriteLine($".NET 2 load  took {sw.ElapsedMilliseconds} ms");
    sw.Restart();

    // Write the objects to an output CSV file
    int count = nets.Count();
    string[] strs = new string[count];
    int idx = 0;
    foreach (var cur in nets)
        strs[idx++] = $"{cur.firstname},{cur.lastname},{cur.address1},{cur.address2},{cur.city},{cur.state},{cur.zipcode}\n";
    // NOTE: WriteAllLines appends its own line terminator after the "\n" above
    File.WriteAllLines("output.csv", strs, Encoding.Unicode);
    Console.WriteLine($".NET 2 write took {sw.ElapsedMilliseconds} ms");
    sw.Restart();
}

Approach 3: C++ Loop

C++ has come a long way with Unicode file I/O and streams. It is now easy to write C++ code that rivals C#'s loop Approach 1 in readability and simplicity.

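// C++ loop performance test program
// NOTE: timer, FIELD_COUNT, and FIELD_CHAR_LEN are used below but not defined
//       in this listing; presumably they come from the attached code
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <vector>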

// Our record type, just a bunch of wstrings
struct info
{
	info(const std::vector<std::wstring>& parts)
		: firstname(parts[0])
		, lastname(parts[1])
		, address1(parts[2])
		, address2(parts[3])
		, city(parts[4])
		, state(parts[5])
		, zipcode(parts[6])
	{
	}

	std::wstring firstname;
	std::wstring lastname;
	std::wstring address1;
	std::wstring address2;
	std::wstring city;
	std::wstring state;
	std::wstring zipcode;
};

// Split a string by a separator, returning a vector of substrings
std::vector<std::wstring> split(const std::wstring& str, const wchar_t seperator)
{
	std::vector<std::wstring> retVal;
	retVal.reserve(FIELD_COUNT); // cheat...

	std::wstring acc;
	acc.reserve(FIELD_CHAR_LEN); // ...a little

	for (wchar_t c : str)
	{
		if (c == seperator)
		{
			retVal.push_back(acc);
			acc.clear();
		}
		else
			acc.push_back(c);
	}

	if (!acc.empty())
		retVal.push_back(acc);

	return retVal;
}

int main(int argc, char* argv[])
{
	timer t;

	for (int iter = 1; iter <= 4; ++iter)
	{
		// Read the file into a vector of line strings
		std::vector<std::wstring> lines;
		{
			std::wifstream input(argv[1], std::ios::binary);
			input.imbue(std::locale(input.getloc(), new std::codecvt_utf16<wchar_t, 
			0x10ffff, std::codecvt_mode(std::consume_header | std::little_endian)>));
			if (!input)
			{
				std::cout << "Opening input file failed\n";
				return 1;
			}

			std::wstring line;
			while (std::getline(input, line))
				lines.push_back(line);
		}

		// Process the lines into a vector of structs
		std::vector<info> infos;
		infos.reserve(lines.size());
		for (const auto& line : lines)
			infos.emplace_back(split(line, ','));
		t.report("class load ");

		// Write the structs to an output CSV file
		{
			std::wofstream output("output.csv", std::ios::binary);
			output.imbue(std::locale(output.getloc(), new std::codecvt_utf16<wchar_t, 
			0x10ffff, std::codecvt_mode(std::generate_header | std::little_endian)>));
			if (!output)
			{
				std::cout << "Opening output file failed\n";
				return 1;
			}
			for (const auto& record : infos)
			{
				output
					<< record.firstname << ','
					<< record.lastname << ','
					<< record.address1 << ','
					<< record.address2 << ','
					<< record.city << ','
					<< record.state << ','
					<< record.zipcode << '\n';
			}
		}
		t.report("class write");
	}
}

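For the file load step, a quick timing check shows that all the time is spent in the std::getline() calls. For the file write step, all the time is spent in the output loop; where else would it be? This mystery is left as an exercise for the reader. What is wrong with this simple code?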
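
One avenue a reader might explore, offered only as an unmeasured sketch: keep the line-by-line result of Approach 3 but bypass the codecvt facet by decoding the UTF-16LE bytes by hand once the file is in memory. The helper name split_utf16le_lines is hypothetical, char16_t stands in for wchar_t, and whether this removes the bottleneck on a given compiler is an assumption to verify with the timer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-16LE byte buffer into lines without
// going through a stream locale facet.
std::vector<std::u16string> split_utf16le_lines(const std::vector<unsigned char>& bytes)
{
    assert(bytes.size() % 2 == 0); // UTF-16 code units are two bytes each

    // Skip the byte order mark (FF FE) if one is present
    std::size_t pos = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        pos = 2;

    std::vector<std::u16string> lines;
    std::u16string line;
    for (; pos + 1 < bytes.size(); pos += 2)
    {
        // Little-endian: low byte first, high byte second
        const char16_t c = static_cast<char16_t>(bytes[pos] | (bytes[pos + 1] << 8));
        if (c == u'\n')
        {
            lines.push_back(line);
            line.clear();
        }
        else
            line.push_back(c);
    }

    // Keep a final line that has no trailing newline
    if (!line.empty())
        lines.push_back(line);

    return lines;
}
```

The bytes themselves could come from a single binary read of the whole file, as Approach 5 does.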

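Approach 4: C Batch

If we are willing to load the entire text into memory, then we can play tricks, slicing and dicing the data in place, and take advantage of fixed-length string buffers and character-level string operations that exploit data locality and unsafe operations. What fun!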

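// NOTE: despite the "C" label, this listing uses C++ features (std::max,
//       references, nullptr), so it must be compiled as C++
#include <algorithm>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>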

const size_t FIELD_LEN = 128;
const size_t FIELD_CHAR_LEN = FIELD_LEN - 1;

const size_t FIELD_COUNT = 7;

const size_t RECORD_LEN = std::max(FIELD_COUNT * FIELD_LEN + 1, size_t(1024));

// Struct with fixed char array fields 
struct info
{
	wchar_t firstname[FIELD_LEN];
	wchar_t lastname[FIELD_LEN];

	wchar_t address1[FIELD_LEN];
	wchar_t address2[FIELD_LEN];

	wchar_t city[FIELD_LEN];
	wchar_t state[FIELD_LEN];

	wchar_t zipcode[FIELD_LEN];
};

// Read a comma-delimited string out of a buffer
// NOTE: there is no bounds check; this relies on the generator's
//       127-character field limit
void read_str(const wchar_t*& input, wchar_t* output)
{
	while (*input && *input != ',')
		*output++ = *input++;

	*output = '\0';

	if (*input == ',')
		++input;
}

// Initialize a record using a buffer of text
void set_record(info& record, const wchar_t* buffer)
{
	read_str(buffer, record.firstname);
	read_str(buffer, record.lastname);
	read_str(buffer, record.address1);
	read_str(buffer, record.address2);
	read_str(buffer, record.city);
	read_str(buffer, record.state);
	read_str(buffer, record.zipcode);
}

// Output a record to a buffer of text
wchar_t* add_to_buffer(const wchar_t* input, wchar_t* output, wchar_t separator)
{
	while (*input)
		*output++ = *input++;
		
	*output++ = separator;
	
	return output;
}
int64_t output_record(const info& record, wchar_t* buffer)
{
	const wchar_t* original = buffer;

	buffer = add_to_buffer(record.firstname, buffer, ',');
	buffer = add_to_buffer(record.lastname, buffer, ',');
	buffer = add_to_buffer(record.address1, buffer, ',');
	buffer = add_to_buffer(record.address2, buffer, ',');
	buffer = add_to_buffer(record.city, buffer, ',');
	buffer = add_to_buffer(record.state, buffer, ',');
	buffer = add_to_buffer(record.zipcode, buffer, '\n');

	return buffer - original;
}

int main(int argc, char* argv[])
{
	timer t;

	for (int iter = 1; iter <= 4; ++iter)
	{
		// Open input file
		FILE* input_file = nullptr;
		if (fopen_s(&input_file, argv[1], "rb") != 0)
		{
			printf("Opening input file failed\n");
			return 1;
		}

		// Compute file length
		fseek(input_file, 0, SEEK_END);
		int file_len = ftell(input_file);
		fseek(input_file, 0, SEEK_SET);

		// Read file into memory
		wchar_t* file_contents = (wchar_t*)malloc(file_len + 2);
		if (file_contents == nullptr)
		{
			printf("Allocating input buffer failed\n");
			return 1;
		}
		if (fread(file_contents, file_len, 1, input_file) != 1)
		{
			printf("Reading input file failed\n");
			return 1;
		}
		size_t char_len = file_len / 2;
		file_contents[char_len] = '\0';
		fclose(input_file);
		input_file = nullptr;

		// Compute record count and delineate the line strings
		size_t record_count = 0;
		for (size_t idx = 0; idx < char_len; ++idx)
		{
			if (file_contents[idx] == '\n')
			{
				++record_count;
				file_contents[idx] = '\0';
			}
		}

		// Allocate record array
		info* records = (info*)malloc(record_count * sizeof(info));
		if (records == nullptr)
		{
			printf("Allocating records list failed\n");
			return 1;
		}

		// Process memory text into records
		wchar_t* cur_str = file_contents;
		wchar_t* end_str = cur_str + file_len / 2;
		size_t record_idx = 0;
		while (cur_str < end_str)
		{
			set_record(records[record_idx++], cur_str);
			cur_str += wcslen(cur_str) + 1;
		}
		if (record_idx != record_count)
		{
			printf("Record counts differ: idx: %d - count: %d\n", 
				   (int)record_idx, (int)record_count);
			return 1;
		}
		t.report("struct load ");

		// Write output file
		// RECORD_LEN counts characters, so scale by the character size
		wchar_t* file_output = (wchar_t*)malloc(record_count * RECORD_LEN * sizeof(wchar_t));
		if (file_output == nullptr)
		{
			printf("Allocating file output buffer failed\n");
			return 1;
		}
		size_t output_len = 0;
		for (size_t r = 0; r < record_count; ++r)
		{
			int new_output = output_record(records[r], file_output + output_len);
			if (new_output < 0)
			{
				printf("Writing to output buffer failed\n");
				return 1;
			}
			output_len += new_output;
		}
		FILE* output_file = nullptr;
		if (fopen_s(&output_file, "output.csv", "wb") != 0)
		{
			printf("Opening output file failed\n");
			return 1;
		}
		if (fwrite(file_output, output_len * 2, 1, output_file) != 1)
		{
			printf("Writing output file failed\n");
			return 1;
		}
		fclose(output_file);
		output_file = nullptr;
		t.report("struct write");

		// Clean up
		free(file_contents);
		file_contents = nullptr;

		free(records);
		records = nullptr;

		free(file_output);
		file_output = nullptr;
	}

	return 0;
}
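
The in-place slicing used above can be isolated on a small in-memory buffer. The sketch below is a hedged variation rather than the attached code: it assumes char16_t in place of wchar_t, and the hypothetical read_field adds the bounds check that read_str omits, returning false instead of overrunning the destination field.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t FIELD_LEN = 128; // 127 characters plus the terminator

// Replace every '\n' in buffer[0..len) with '\0' and return the line count,
// turning one big buffer into a sequence of null-terminated line strings.
inline std::size_t delineate_lines(char16_t* buffer, std::size_t len)
{
    assert(buffer != nullptr);
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i)
    {
        if (buffer[i] == u'\n')
        {
            buffer[i] = u'\0';
            ++count;
        }
    }
    return count;
}

// Copy one comma-delimited field and advance `input` past it.
// Returns false, without writing past the array, if the field is too long.
inline bool read_field(const char16_t*& input, char16_t (&output)[FIELD_LEN])
{
    std::size_t copied = 0;
    while (*input && *input != u',')
    {
        if (copied == FIELD_LEN - 1)
            return false;
        output[copied++] = *input++;
    }
    output[copied] = u'\0';

    if (*input == u',')
        ++input;
    return true;
}
```

The extra comparison per character costs a little speed, which is the trade the original listing declines to make because the generator guarantees the field limit.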

Approach 5: C++ Batch

Maybe that pile of C code put you off a bit. Can we apply the same batch approach in C++?

// C++ batch test program 
#include <codecvt>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Our record type, just a bunch of wstrings
struct info
{
	info(const std::vector<wchar_t*>& parts)
		: firstname(parts[0])
		, lastname(parts[1])
		, address1(parts[2])
		, address2(parts[3])
		, city(parts[4])
		, state(parts[5])
		, zipcode(parts[6])
	{
	}

	std::wstring firstname;
	std::wstring lastname;
	std::wstring address1;
	std::wstring address2;
	std::wstring city;
	std::wstring state;
	std::wstring zipcode;
};

void parse_parts(wchar_t* buffer, wchar_t separator, std::vector<wchar_t*>& ret_val)
{
	ret_val.clear();

	ret_val.push_back(buffer); // start at the beginning

	while (*buffer)
	{
		if (*buffer == separator)
		{
			*buffer = '\0';
			if (*(buffer + 1))
				ret_val.push_back(buffer + 1);
		}
		++buffer;
	}
}

int main(int argc, char* argv[])
{
	timer t;

	for (int iter = 1; iter <= 4; ++iter)
	{
		// Read the file into memory
		std::vector<char> file_contents;
		{
			std::ifstream file(argv[1], std::ios::binary | std::ios::ate);
			std::streamsize size = file.tellg();
			file.seekg(0, std::ios::beg);
			file_contents.resize(size + 2); // save room for null termination
			if (!file.read(file_contents.data(), size))
			{
				std::cout << "Reading file failed\n";
				return 1;
			}

			// null terminate
			file_contents.push_back(0);
			file_contents.push_back(0);
		}

		// Get the lines out of the data
		std::vector<wchar_t*> line_pointers;
		parse_parts(reinterpret_cast<wchar_t*>
                   (file_contents.data()), '\n', line_pointers);

		// Process the lines into data structures
		std::vector<info> infos;
		infos.reserve(line_pointers.size());
		std::vector<wchar_t*> line_parts;
		for (wchar_t* line : line_pointers)
		{
			parse_parts(line, ',', line_parts);
			infos.emplace_back(line_parts);
		}
		t.report("C++ 2 load");

		// Write the structs to an output CSV file
		std::wstring output_str;
		output_str.reserve(file_contents.size() / 2);
		for (const auto& record : infos)
		{
			output_str += record.firstname;
			output_str += ',';
			output_str += record.lastname;
			output_str += ',';
			output_str += record.address1;
			output_str += ',';
			output_str += record.address2;
			output_str += ',';
			output_str += record.city;
			output_str += ',';
			output_str += record.state;
			output_str += ',';
			output_str += record.zipcode;
			output_str += '\n';
		}
		std::ofstream output_file("output.csv", std::ios::binary);
		if (!output_file)
		{
			std::cout << "Opening output file failed\n";
			return 1;
		}
		output_file.write(reinterpret_cast<const char*>
                         (output_str.c_str()), output_str.size() * 2);
		output_file.close();
		t.report("C++ 2 write");
	}
}

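Approach 6: C/C++ Hybrid

Let's take the best of Approaches 4 and 5 and create a reusable class with real potential for practical use.

First, the reusable class. It is templated on the record type, so the record class can define its fields and its record length limit.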

// C/C++ hybrid file / string I/O class
namespace recordio
{
template<class record>
class lineio
{
public:
static void load(const char* inputFilePath, std::vector<record>& records)
{
	// Initialize our output
	records.clear();

	// Open input file
	FILE* input_file = nullptr;
	if (fopen_s(&input_file, inputFilePath, "rb") != 0)
	{
		throw std::runtime_error("Opening input file failed");
	}

	// Compute file length
	fseek(input_file, 0, SEEK_END);
	int file_len = ftell(input_file);
	fseek(input_file, 0, SEEK_SET);

	// Read file into memory
	size_t char_len = file_len / 2;
	std::unique_ptr<wchar_t[]> file_contents(new wchar_t[file_len / 2 + 1]);
	if (fread(reinterpret_cast<void*>
       (file_contents.get()), file_len, 1, input_file) != 1)
	{
		throw std::runtime_error("Reading input file failed\n");
	}
	file_contents[char_len] = '\0';
	fclose(input_file);
	input_file = nullptr;

	// Compute record count and delineate the line strings
	size_t record_count = 0;
	for (size_t idx = 0; idx < char_len; ++idx)
	{
		if (file_contents[idx] == '\n')
		{
			++record_count;
			file_contents[idx] = '\0';
		}
	}

	records.reserve(record_count);

	// Process memory text into records
	wchar_t* cur_str = file_contents.get();
	wchar_t* end_str = cur_str + file_len / 2;
	while (cur_str < end_str)
	{
		records.emplace_back(cur_str);
		cur_str += wcslen(cur_str) + 1;
	}
}

static void write(const char* outputFilePath, const std::vector<record>& records)
{
	std::wstring output_str;
	output_str.reserve(record::max_record_length * records.size());
	for (const auto& cur : records)
	{
		cur.get_record_str(output_str);
		output_str += '\n';
	}

	// Write output file
	std::ofstream output_file(outputFilePath, std::ios::binary);
	if (!output_file)
	{
		throw std::runtime_error("Opening output file failed");
	}
	output_file.write(reinterpret_cast<const char*>
                     (output_str.c_str()), output_str.size() * 2);
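}

// NOTE: read_str, which the record type below calls, is not shown in this listing
};
} // namespace recordio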

The "derived" record type has to do a bit of work, but it is very simple.

// C/C++ hybrid class and test driver
// Struct with fixed char array fields
// Looks familiar...
struct info
{
	info() {}

	info(const wchar_t* str)
	{
		recordio::lineio<info>::read_str(str, firstname);
		recordio::lineio<info>::read_str(str, lastname);
		recordio::lineio<info>::read_str(str, address1);
		recordio::lineio<info>::read_str(str, address2);
		recordio::lineio<info>::read_str(str, city);
		recordio::lineio<info>::read_str(str, state);
		recordio::lineio<info>::read_str(str, zipcode);
	}

	void get_record_str(std::wstring& str) const
	{
		str += firstname;
		str += ',';
		str += lastname;
		str += ',';
		str += address1;
		str += ',';
		str += address2;
		str += ',';
		str += city;
		str += ',';
		str += state;
		str += ',';
		str += zipcode;
	}

	const static size_t max_field_length = FIELD_LEN;
	const static size_t max_record_length = RECORD_LEN;

	wchar_t firstname[FIELD_LEN];
	wchar_t lastname[FIELD_LEN];

	wchar_t address1[FIELD_LEN];
	wchar_t address2[FIELD_LEN];

	wchar_t city[FIELD_LEN];
	wchar_t state[FIELD_LEN];

	wchar_t zipcode[FIELD_LEN];
};

int main(int argc, char* argv[])
{
	if (argc != 2)
	{
		printf("Usage: recordio <input CSV file path>\n");
		return 0;
	}

	timer t;

	for (int iter = 1; iter <= 4; ++iter)
	{
		std::vector<info> records;
		recordio::lineio<info>::load(argv[1], records);
		t.report("recordio load ");

		recordio::lineio<info>::write("output.csv", records);
		t.report("recordio write ");
	}

	printf("All done.\n");
	return 0;
}
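
To show what the template expects of a record type, here is a small sketch using a hypothetical two-field record, pair_record, and an in-memory analogue of the write loop, join_records. Neither name exists in the attached code, and std::wstring members are used only for brevity; the article's advice of fixed-width fields still stands for performance.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record following the same contract as info: a constructor
// taking one line of text, get_record_str(), and max_record_length.
struct pair_record
{
    explicit pair_record(const wchar_t* line)
    {
        // Everything before the first comma is the key, the rest is the value
        const wchar_t* p = line;
        while (*p && *p != L',')
            key += *p++;
        if (*p == L',')
            ++p;
        value = p;
    }

    void get_record_str(std::wstring& str) const
    {
        str += key;
        str += L',';
        str += value;
    }

    static const std::size_t max_record_length = 64;

    std::wstring key;
    std::wstring value;
};

// In-memory analogue of the write loop: one reserve, then append each record.
template <class record>
std::wstring join_records(const std::vector<record>& records)
{
    std::wstring out;
    out.reserve(record::max_record_length * records.size());
    for (const auto& cur : records)
    {
        cur.get_record_str(out);
        out += L'\n';
    }
    return out;
}
```

Any type that provides those three members can be plugged in, which is what lets the file handling stay in one reusable class while each record type owns its own field layout.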

That's it! I look forward to your comments!

History

September 25, 2022: Initial version