Handling simple text files in C/C++

Richard MacCutchan

4.33/5 (12 votes)

20 Jan 2012

CPOL

2 min read

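Recent questions about reading ANSI versus Unicode text prompted me to write the following.

This code takes a block of text that has been read from a text file and preprocesses it into the format the program needs. The source text can be ANSI, UTF-8, Unicode or Unicode big endian. The code below converts the text to Unicode or UTF-8, depending on whether the project is compiled with Unicode or MBCS support. The source encoding is identified by the byte order mark (BOM) at the start of the buffer. If there is no BOM, the data is assumed to be plain ANSI, although other tools and Win32 API functions can help to determine the encoding, as described in the MSDN topic Unicode and Character Sets.

The character types and byte order marks are defined as follows: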

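ANSI: no signature; single-byte characters in the range 0x00 to 0x7F.
UTF-8: signature = 3 bytes: 0xEF 0xBB 0xBF, followed by multi-byte characters as described in the UTF information link.
UTF-16 LE (little endian): used on Windows and other operating systems, and commonly referred to as "Unicode". Signature = 2 bytes: 0xFF 0xFE (or 1 word 0xFEFF), followed by words: 0x0000 to 0x007F for the normal 0-127 ASCII characters, and 0x0080 to 0xFDFF for the extended set.
UTF-16 BE (big endian): this is used on Macintosh operating systems. Signature = 2 bytes: 0xFE 0xFF (or 1 word 0xFFFE when read on a little-endian machine), followed by words as for UTF-16 LE, but with the bytes reversed.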
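
The signatures listed above can be checked byte by byte, which avoids any dependence on the byte order of the machine doing the check. The following is only a minimal sketch: the names `TextEncoding` and `DetectBom` are hypothetical and are not part of the article's code, and a buffer without a BOM is simply reported as "ANSI or unknown".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for this sketch; they do not appear in the article's code.
enum class TextEncoding { AnsiOrUnknown, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer and report which BOM, if any, is present.
// Comparing individual bytes keeps the result independent of host byte order.
TextEncoding DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return TextEncoding::Utf8;      // UTF-8 signature
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return TextEncoding::Utf16LE;   // little endian signature
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return TextEncoding::Utf16BE;   // big endian signature

    // No signature: assumed to be ANSI, as in the article.
    return TextEncoding::AnsiOrUnknown;
}
```

Passing the size explicitly means the check never reads past the end of a very short buffer.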

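Following MilanA's comment below, I have modified the code so that it always returns a newly allocated buffer, even when no conversion takes place. The input buffer into which the text is read must be followed by two null bytes to mark the end of the text block (even if it is ANSI). In addition, the calling routine is responsible for disposing of both buffers when they are no longer needed.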
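
The requirement for two trailing null bytes can be met when the file is loaded. The following is a rough sketch using only the standard library; the helper name `LoadWithDoubleNull` is an assumption made for illustration and is not part of the article's code.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into memory and append two zero
// bytes, so the block is terminated whether it holds 8-bit or 16-bit text.
std::vector<unsigned char> LoadWithDoubleNull(const std::string& path)
{
    std::vector<unsigned char> buffer;

    // Binary mode so that no newline translation disturbs the raw bytes.
    std::ifstream in(path, std::ios::binary);
    if (in)
    {
        buffer.assign(std::istreambuf_iterator<char>(in),
                      std::istreambuf_iterator<char>());
    }

    // The two terminating null bytes required by the conversion routine.
    // If the file cannot be opened, only these two bytes are returned.
    buffer.push_back(0);
    buffer.push_back(0);
    return buffer;
}
```

With a buffer loaded this way, a call might look like `PTSTR text = Normalise(buffer.data());`, followed by `delete[] text;` once the text is no longer needed. The vector releases the input buffer itself, so only the returned buffer has to be freed by hand.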

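#include <windows.h>    // PTSTR, PBYTE, WideCharToMultiByte, MultiByteToWideChar
#include <string.h>     // memcmp, strlen, memcpy_s
#include <wchar.h>      // wcslen

PTSTR Normalise(PBYTE pBuffer)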
{
    PTSTR			ptText;		// pointer to the text char* or wchar_t* depending on UNICODE setting
    PWSTR			pwStr;		// pointer to a wchar_t buffer
    int				nLength;	// a useful integer variable
    
    // obtain a wide character pointer to check BOMs
    pwStr = reinterpret_cast<PWSTR>(pBuffer);
    
    // check if the first word is a Unicode Byte Order Mark
    if (*pwStr == 0xFFFE || *pwStr == 0xFEFF)
    {
        // Yes, this is Unicode data
        if (*pwStr++ == 0xFFFE)
        {
            // BOM says this is Big Endian so we need
            // to swap bytes in each word of the text
            while (*pwStr)
            {
                // swap bytes in each word of the buffer
                WCHAR	wcTemp = *pwStr >> 8;
                wcTemp |= *pwStr << 8;
                *pwStr = wcTemp;
                ++pwStr;
            }
            // point back to the start of the text
            pwStr = reinterpret_cast<PWSTR>(pBuffer + 2);
        }
#if !defined(UNICODE)
        // This is a non-Unicode project so we need
        // to convert wide characters to multi-byte
        
        // get calculated buffer size
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, NULL, 0, NULL, NULL);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to multi-byte characters
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, ptText, nLength, NULL, NULL);
#else
        nLength = static_cast<int>(wcslen(pwStr)) + 1;  // if Unicode, then copy the input text
        ptText = new WCHAR[nLength];                    // to a new output buffer
        nLength *= sizeof(WCHAR);                       // adjust to size in bytes
        memcpy_s(ptText, nLength, pwStr, nLength);
#endif
    }
    else
    {
        // The text data is UTF-8 or Ansi: skip the UTF-8 BOM, if present,
        // so that it is not carried over into the output text
        if (memcmp(pBuffer, "\xEF\xBB\xBF", 3) == 0)
        {
            // UTF-8
            pBuffer += 3;
        }
#if defined(UNICODE)
        // This is a Unicode project so we need to convert
        // multi-byte or Ansi characters to Unicode.
        // Note: CP_UTF8 suits UTF-8 and 7-bit ASCII; text in another
        // ANSI code page would need that code page (e.g. CP_ACP) instead.
        
        // get calculated buffer size
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, NULL, 0);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to Unicode characters
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, ptText, nLength);
#else
        // This is a non-Unicode project so we just need
        // to copy the text that follows the BOM (if any)
        nLength = static_cast<int>(strlen(reinterpret_cast<PSTR>(pBuffer))) + 1;  // if UTF-8/ANSI, then copy the input text
        ptText = new char[nLength];                                               // to a new output buffer
        memcpy_s(ptText, nLength, pBuffer, nLength);
#endif
    }
    
    // return pointer to the (possibly converted) text buffer.
    return ptText;
}
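
The byte-swapping step for big endian text can be tried in isolation, without the Windows headers. The sketch below uses `std::uint16_t` in place of `WCHAR`; the function names `SwapBytes` and `SwapUntilNull` are hypothetical and do not appear in the listing above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: exchange the high and low bytes of one 16-bit unit.
inline std::uint16_t SwapBytes(std::uint16_t value)
{
    return static_cast<std::uint16_t>((value >> 8) | (value << 8));
}

// Hypothetical helper: swap every 16-bit unit in place until the terminating
// zero word is reached, and return the number of units that were swapped.
std::size_t SwapUntilNull(std::uint16_t* text)
{
    std::size_t count = 0;
    while (text[count] != 0)
    {
        text[count] = SwapBytes(text[count]);
        ++count;
    }
    return count;
}
```

As in the listing, the loop relies on a zero word marking the end of the text, which is why the input buffer needs its two trailing null bytes.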