Handling Simple Text Files in C/C++






Recent questions about reading ANSI versus Unicode text prompted me to write the following.
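This code takes a block of text read from a text file and preprocesses it into the format the program requires. The source text may be ANSI, UTF-8, Unicode (UTF-16 little-endian), or Unicode big-endian. The code below converts the text to Unicode or to UTF-8, depending on whether the project is compiled for Unicode or for MBCS support.

The source encoding is identified by the byte order mark (BOM) at the start of the buffer. If there is no BOM, the data is assumed to be plain ANSI, although other tools and Win32 API functions can help with the determination, as described in the MSDN article Unicode and Character Sets.

The character types and byte order marks are defined as follows: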
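- ANSI: no signature; single-byte characters in the range 0x00 to 0x7F.
- UTF-8: signature = 3 bytes, 0xEF 0xBB 0xBF, followed by multi-byte characters as described in the UTF Info reference.
- UTF-16 LE (little-endian), used by Windows and other operating systems and commonly called "Unicode": signature = 2 bytes, 0xFF 0xFE (or one word, 0xFEFF, when read on a little-endian machine), followed by 16-bit words: 0x0000 to 0x007F for the normal ASCII characters 0-127, and 0x0080 and above for the extended set.
- UTF-16 BE (big-endian), used on Macintosh operating systems: signature = 2 bytes, 0xFE 0xFF (or one word, 0xFFFE, when read on a little-endian machine), followed by words as for UTF-16 LE but with the two bytes of each word reversed.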
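The signature checks in the list above can be sketched as a small, portable function. This is only an illustration: `TextEncoding`, `DetectBom`, and `BomLength` are hypothetical names that do not appear in the article's code, and the sketch compares individual bytes rather than 16-bit words so that it does not depend on the byte order of the machine running it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for this sketch; they are not part of the article's code.
enum class TextEncoding { AnsiOrUnknown, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer and report which signature, if any, is present.
TextEncoding DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    // UTF-8 signature: EF BB BF
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return TextEncoding::Utf8;

    if (size >= 2)
    {
        // UTF-16 little-endian signature: FF FE
        if (data[0] == 0xFF && data[1] == 0xFE)
            return TextEncoding::Utf16LE;

        // UTF-16 big-endian signature: FE FF
        if (data[0] == 0xFE && data[1] == 0xFF)
            return TextEncoding::Utf16BE;
    }

    // No signature: the article treats this case as plain ANSI.
    return TextEncoding::AnsiOrUnknown;
}

// Number of signature bytes to skip before the text itself starts.
std::size_t BomLength(TextEncoding encoding)
{
    switch (encoding)
    {
    case TextEncoding::Utf8:
        return 3;
    case TextEncoding::Utf16LE:
    case TextEncoding::Utf16BE:
        return 2;
    default:
        break;
    }
    return 0;
}
```

Checking byte by byte also makes it easy to require a minimum buffer size before each comparison, so a very short buffer is never read past its end.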
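The input buffer must end with null bytes to mark the end of the text block (even if it is ANSI). In addition, the calling routine is responsible for disposing of both buffers when they are no longer needed.

#include <windows.h>
#include <string.h>
#include <wchar.h>

PTSTR Normalise(PBYTE pBuffer)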
{
    PTSTR ptText;   // pointer to the text: char* or wchar_t*, depending on the UNICODE setting
    PWSTR pwStr;    // pointer to a wchar_t buffer
    int nLength;    // length of the output text, including the terminator

    // obtain a wide character pointer to check BOMs
    pwStr = reinterpret_cast<PWSTR>(pBuffer);

    // check if the first word is a Unicode Byte Order Mark
    if (*pwStr == 0xFFFE || *pwStr == 0xFEFF)
    {
        // Yes, this is Unicode data
        if (*pwStr++ == 0xFFFE)
        {
            // BOM says this is Big Endian so we need
            // to swap bytes in each word of the text
            while (*pwStr)
            {
                // swap bytes in each word of the buffer
                *pwStr = static_cast<WCHAR>((*pwStr >> 8) | (*pwStr << 8));
                ++pwStr;
            }
            // point back to the start of the text (just past the 2-byte BOM)
            pwStr = reinterpret_cast<PWSTR>(pBuffer + 2);
        }
#if !defined(UNICODE)
        // This is a non-Unicode project so we need
        // to convert wide characters to multi-byte
        // get calculated buffer size (includes the terminator)
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, NULL, 0, NULL, NULL);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to multi-byte characters
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, ptText, nLength, NULL, NULL);
#else
        // if Unicode, then copy the input text to a new output buffer
        nLength = static_cast<int>(wcslen(pwStr)) + 1;
        ptText = new WCHAR[nLength];
        // adjust to size in bytes
        nLength *= static_cast<int>(sizeof(WCHAR));
        memcpy_s(ptText, nLength, pwStr, nLength);
#endif
    }
    else
    {
        // The text data is UTF-8 or ANSI.
        // Skip the UTF-8 BOM, if present, so that it is not
        // carried into the output text in either build.
        if (memcmp(pBuffer, "\xEF\xBB\xBF", 3) == 0)
        {
            // UTF-8
            pBuffer += 3;
        }
#if defined(UNICODE)
        // This is a Unicode project so we need to convert
        // multi-byte or ANSI characters to Unicode.
        // ANSI text limited to 0x00-0x7F is also valid UTF-8.
        // get calculated buffer size (includes the terminator)
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, NULL, 0);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to Unicode characters
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, ptText, nLength);
#else
        // This is a non-Unicode project so the UTF-8/ANSI text
        // only needs to be copied to a new output buffer
        nLength = static_cast<int>(strlen(reinterpret_cast<PCSTR>(pBuffer))) + 1;
        ptText = new char[nLength];
        memcpy_s(ptText, nLength, pBuffer, nLength);
#endif
    }
    // return pointer to the (possibly converted) text buffer.
    return ptText;
}
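
`Normalise` expects a buffer that already ends with null bytes, so the code that reads the file has to add them. One possible way to do that, assuming standard C++ streams are acceptable, is sketched below. `LoadTextBlock` is a hypothetical helper, not part of the article; it appends an extra zero byte to odd-length files so that a scan in 16-bit words always meets a zero word inside the buffer.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file as raw bytes and terminate it with zero bytes.
// A file that cannot be opened yields a buffer holding only the terminator.
std::vector<unsigned char> LoadTextBlock(const std::string& path)
{
    assert(!path.empty());

    std::vector<unsigned char> buffer;
    std::ifstream file(path, std::ios::binary);
    if (file)
    {
        // Copy every byte of the file without any text-mode translation.
        buffer.assign(std::istreambuf_iterator<char>(file),
                      std::istreambuf_iterator<char>());
    }

    const bool oddLength = (buffer.size() % 2) != 0;

    // Two zero bytes terminate both narrow and wide text.
    buffer.push_back(0);
    buffer.push_back(0);

    // With an odd number of data bytes the last data byte pairs with the first
    // zero, so one more zero is needed to complete a zero 16-bit word.
    if (oddLength)
    {
        buffer.push_back(0);
    }
    return buffer;
}
```

With a buffer loaded this way, `buffer.data()` could be passed to `Normalise`. The returned text is allocated with `new[]`, so it would be released with `delete[]`, while the vector frees the input buffer on its own.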