Upgrading an STL-based application to use Unicode

Taka Muraoka

17 Jul 2003

CPOL

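The problems a developer faces when upgrading an STL-based application to use Unicode, and how to solve them.

Introduction

I recently upgraded a reasonably large program to use Unicode instead of single-byte characters. Apart from a few legacy modules, I had dutifully used the t- functions and wrapped all my string literals and character constants in _T() macros, secure in the knowledge that when the time came to switch to Unicode, all I would have to do was define UNICODE and _UNICODE and everything would Just Work (tm).

Man, was I ever wrong :((

So, I am writing this article as therapy for the past two weeks of work, and in the hope that it will save others some of the pain and suffering I went through. Sigh...

The basics

In theory, writing code that can be compiled using either single- or double-byte characters is straightforward. I was going to write a section on the basics, but Chris Maunder has already done that. The techniques he describes are widely known, so we will get straight into the meat of this article.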

Wide file I/O

There are wide versions of the usual stream classes, and it is easy to define t-style macros to manage them:

#ifdef _UNICODE
    #define tofstream wofstream 
    #define tstringstream wstringstream
    // etc...
#else 
    #define tofstream ofstream 
    #define tstringstream stringstream
    // etc...
#endif // _UNICODE

which you would use like this:

tofstream testFile( "test.txt" ) ; 
testFile << _T("ABC") ;

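Now, you would expect the above code to produce a 3-byte file when compiled using single-byte characters and a 6-byte file when using double-byte characters. Except you don't. You get a 3-byte file for both. WTH is going on?!

It turns out that the C++ standard dictates that wide streams must convert double-byte characters to single-byte characters when writing to a file. So in the example above, the wide string L"ABC" (which is 6 bytes long) gets converted to a narrow string (3 bytes) before it is written to the file. And if that wasn't bad enough, how this conversion is done is implementation-dependent.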
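If you want to see the narrowing for yourself, here is a minimal sketch that uses only standard C++ (no t- macros). The function name bytesWrittenByWofstream is made up for this illustration, and the stream is explicitly given the classic "C" locale so that the result does not depend on the global locale.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Writes `text` through a std::wofstream that uses the classic "C" locale,
// then reports how many bytes actually ended up in the file.
// nb: hypothetical helper, written only for this illustration.
long bytesWrittenByWofstream(const char* path, const std::wstring& text)
{
    {
        std::wofstream out;
        out.imbue(std::locale::classic());
        out.open(path, std::ios::out | std::ios::binary);
        out << text;
    }   // the stream is flushed and closed at the end of this scope

    // reopen the file positioned at its end to find out how big it is
    std::ifstream in(path, std::ios::in | std::ios::binary | std::ios::ate);
    return static_cast<long>(in.tellg());
}
```

Calling bytesWrittenByWofstream("test.txt", L"ABC") should report 3 rather than 3 * sizeof(wchar_t), since each wide character is narrowed to one byte on its way to the file.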

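I haven't been able to find a definitive explanation of why things were specified like this. My best guess is that a file is, by definition, considered to be a stream of (single-byte) characters, and allowing stuff to be written 2 bytes at a time would break that abstraction. Right or wrong, this causes serious problems. For example, you can't write binary data to a wofstream because the class will try to narrow it first (usually failing miserably) before writing it out.

This was particularly problematic for me because I had a lot of functions that looked like this: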

void outputStuff( tostream& os )
{
    // output stuff to the stream
    os << ....
}

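The above code would work fine (i.e., it would stream out wide characters) if it was passed a tstringstream object, but it would give weird results if it was passed a tofstream object (because everything was getting narrowed).

Wide file I/O: the solution

Stepping through the STL in the debugger (what joy!) revealed that wofstream invokes a std::codecvt object to narrow the output data just before it is written out to the file. std::codecvt objects are responsible for converting strings from one character set to another, and C++ requires that two be provided as standard: one that converts chars to chars (i.e., effectively does nothing) and one that converts wchar_t's to chars. The latter was the one causing me so much grief.

The solution: write a new codecvt-derived class that converts wchar_t's to wchar_t's (i.e., does nothing) and attach it to the wofstream object. When the wofstream tries to convert the data it is about to write out, it will invoke my new codecvt object, which does nothing, and the data will be written out unchanged.

A bit of digging around on Google Groups turned up some code written by P. J. Plauger (the author of the STL that ships with MSVC), but I had problems getting it to compile with Stlport 4.5.3. This is the version I finally hacked together: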

#include <locale>

// nb: MSVC6+Stlport can't handle "std::"
// appearing in the NullCodecvtBase typedef.
using std::codecvt ; 
typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;

class NullCodecvt
    : public NullCodecvtBase
{

public:
    typedef wchar_t _E ;
    typedef char _To ;
    typedef mbstate_t _St ;

    explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }

protected:
    // nb: the parameter types must match the base class declarations
    //  exactly, otherwise these will not override the virtual functions.
    virtual result do_in( _St& _State ,
                   const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
                   _E* _F2 , _E* _L2 , _E*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_out( _St& _State ,
                   const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
                   _To* _F2 , _To* _L2 , _To*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_unshift( _St& _State , 
            _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
    {
        return noconv ;
     }
    virtual int do_length( _St& _State , const _To* _F1 , 
           const _To* _L1 , size_t _N2 ) const _THROW0()
    {
        return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
    }
    virtual bool do_always_noconv() const _THROW0()
    {
        return true ;
    }
    virtual int do_max_length() const _THROW0()
    {
        return 2 ;
    }
    virtual int do_encoding() const _THROW0()
    {
        return 2 ;
    }
} ;

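You can see that the functions that are supposed to do the conversion actually do nothing, and return noconv to indicate that.

The only thing left to do is to instantiate one of these and connect it to the wofstream object. With MSVC, you are supposed to use the (non-standard) _ADDFAC() macro to imbue objects with a locale, but it didn't want to work with my new NullCodecvt class, so I ripped out the guts of the macro and wrote a new one that did: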

#define IMBUE_NULL_CODECVT( outputFile ) \
{ \
    NullCodecvt* pNullCodecvt = new NullCodecvt ; \
    locale loc = locale::classic() ; \
    loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
    (outputFile).imbue( loc ) ; \
}

So, the example code given above that didn't work properly can now be written like this:

tofstream testFile ;
IMBUE_NULL_CODECVT( testFile ) ;
testFile.open( "test.txt" , ios::out | ios::binary ) ; 
testFile << _T("ABC") ;

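It is important that the file stream object be imbued with the new codecvt object before it is opened. The file must also be opened in binary mode. If it isn't, every time the file sees a wide character that has the value 10 in its high or low byte, it will perform CR/LF translation, which is definitely not what you want. If you really want a CR/LF sequence, you will have to insert it explicitly using "\r\n" instead of std::endl.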
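The macro above relies on MSVC internals (_Addfac, _Getcat), and I am not certain that every library's wide file buffer writes raw wide characters when a facet reports noconv. As a hedged alternative, here is a sketch of a different technique that sticks to standard C++11: a facet that really does convert, by copying each wchar_t's bytes out unchanged in native byte order, attached with the standard std::locale(other, facet) constructor. The names RawWideCodecvt and writeRawWide are made up for this illustration and are not part of the original code.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical facet: stores every wchar_t as its raw bytes (native order).
class RawWideCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit RawWideCodecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* fromEnd,
                  const wchar_t*& fromNext, char* to, char* toEnd,
                  char*& toNext) const override
    {
        fromNext = from;
        toNext = to;
        while (fromNext != fromEnd) {
            // stop early if the output buffer cannot hold a whole character
            if (static_cast<std::size_t>(toEnd - toNext) < sizeof(wchar_t))
                return partial;
            std::memcpy(toNext, fromNext, sizeof(wchar_t));
            ++fromNext;
            toNext += sizeof(wchar_t);
        }
        return ok;
    }
    result do_in(std::mbstate_t&, const char* from, const char* fromEnd,
                 const char*& fromNext, wchar_t* to, wchar_t* toEnd,
                 wchar_t*& toNext) const override
    {
        fromNext = from;
        toNext = to;
        while (static_cast<std::size_t>(fromEnd - fromNext) >= sizeof(wchar_t)) {
            if (toNext == toEnd)
                return partial;
            std::memcpy(toNext, fromNext, sizeof(wchar_t));
            fromNext += sizeof(wchar_t);
            ++toNext;
        }
        // leftover bytes mean we stopped in the middle of a character
        return (fromNext == fromEnd) ? ok : partial;
    }
    result do_unshift(std::mbstate_t&, char* to, char*, char*& toNext) const override
    {
        toNext = to;        // stateless encoding: nothing to write
        return noconv;
    }
    int do_length(std::mbstate_t&, const char* from, const char* fromEnd,
                  std::size_t max) const override
    {
        std::size_t avail = static_cast<std::size_t>(fromEnd - from) / sizeof(wchar_t);
        return static_cast<int>((avail < max ? avail : max) * sizeof(wchar_t));
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return static_cast<int>(sizeof(wchar_t)); }
    int do_encoding() const noexcept override { return static_cast<int>(sizeof(wchar_t)); }
};

// Writes `text` through the facet and returns the resulting file size.
long writeRawWide(const char* path, const std::wstring& text)
{
    {
        std::wofstream out;
        // imbue before opening; the locale takes ownership of the facet
        out.imbue(std::locale(std::locale::classic(), new RawWideCodecvt));
        out.open(path, std::ios::out | std::ios::binary);
        out << text;
    }
    std::ifstream in(path, std::ios::in | std::ios::binary | std::ios::ate);
    return static_cast<long>(in.tellg());
}
```

With this facet, writing L"ABC" should produce 3 * sizeof(wchar_t) bytes. Note that sizeof(wchar_t) is 2 with MSVC but is often 4 elsewhere, so files written this way are not portable between platforms.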

wchar_t problems

wchar_t is the type used for wide characters, and MSVC 6 defines it like this:

typedef unsigned short wchar_t ;

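Unfortunately, because it is a typedef instead of a real C++ type, defining it like this has one serious flaw: you can't overload on it. Look at the following code: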

TCHAR ch = _T('A') ;
tcout << ch << endl ;

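With narrow strings, this does what you would expect: it prints the letter A. With wide strings, it prints 65. The compiler decides that you are streaming out an unsigned short and prints it as a numeric value instead of a wide character. Aaargh!!! There is no solution for this other than going through your entire code base, looking for places where you stream out individual characters, and fixing them. I wrote a little function to make it a bit more obvious what was going on: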

#ifdef _UNICODE
    // NOTE: Can't stream out wchar_t's - convert to a string first!
    inline std::wstring toStreamTchar( wchar_t ch ) 
            { return std::wstring(&ch,1) ; }
#else 
    // NOTE: It's safe to stream out narrow char's directly.
    inline char toStreamTchar( char ch ) { return ch ; }
#endif // _UNICODE    

TCHAR ch = _T('A') ;
tcout << toStreamTchar(ch) << endl ;
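On a compiler where wchar_t is a distinct built-in type, the problem can still be reproduced with a stand-in typedef. The following is only a sketch: WideUnit, streamAsTypedef and streamAsString are hypothetical names that play the roles of the typedef'd wchar_t, the broken code and the workaround respectively.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a wchar_t that is only a typedef.
typedef unsigned short WideUnit;

// Streams the character the "obvious" way. Overload resolution picks the
// unsigned short inserter, so the numeric value is what gets written.
std::wstring streamAsTypedef(WideUnit ch)
{
    std::wostringstream os;
    os << ch;
    return os.str();
}

// Streams the character as a one-element wide string instead, which is
// the same idea as toStreamTchar() above.
std::wstring streamAsString(wchar_t ch)
{
    std::wostringstream os;
    os << std::wstring(1, ch);
    return os.str();
}
```

Passing L'A' to the first function yields the text "65", while the second yields "A".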

Wide exception classes

Most C++ programs will be using exceptions to handle error conditions. Unfortunately, std::exception is defined like this:

class std::exception
{
    // ...
    virtual const char *what() const throw() ;
} ;

and can only handle narrow error messages. I only ever throw exceptions that I have defined myself or std::runtime_error, so I wrote a wide version of std::runtime_error like this:

class wruntime_error
    : public std::runtime_error
{

public:                 // --- PUBLIC INTERFACE ---

// constructors:
                        wruntime_error( const std::wstring& errorMsg ) ;
// copy/assignment:
                        wruntime_error( const wruntime_error& rhs ) ;
    wruntime_error&     operator=( const wruntime_error& rhs ) ;
// destructor:
    virtual             ~wruntime_error() ;

// exception methods:
    const std::wstring& errorMsg() const ;

private:                // --- DATA MEMBERS ---

// data members:
    std::wstring        mErrorMsg ; ///< Exception error message.
    
} ;

#ifdef _UNICODE
    #define truntime_error wruntime_error
#else 
    #define truntime_error runtime_error
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wruntime_error::wruntime_error( const wstring& errorMsg )
    : runtime_error( toNarrowString(errorMsg) )
    , mErrorMsg(errorMsg)
{
    // NOTE: We give the runtime_error base the narrow version of the 
    //  error message. This is what will get shown if what() is called.
    //  The wruntime_error inserter or errorMsg() should be used to get 
    //  the wide version.
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::wruntime_error( const wruntime_error& rhs )
    : runtime_error( toNarrowString(rhs.errorMsg()) )
    , mErrorMsg(rhs.errorMsg())
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error&
wruntime_error::operator=( const wruntime_error& rhs )
{
    // copy the wruntime_error
    runtime_error::operator=( rhs ) ; 
    mErrorMsg = rhs.mErrorMsg ; 

    return *this ; 
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::~wruntime_error()
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

const wstring& wruntime_error::errorMsg() const { return mErrorMsg ; }

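(toNarrowString() is a little helper function that converts a wide string to a narrow string, and is given below.) wruntime_error simply keeps a copy of the wide error message itself and gives a narrow version to the base std::exception in case somebody calls what(). As for the exception classes that I define myself, I modified them to look like this: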

class MyExceptionClass : public std::truntime_error
{
public:
    MyExceptionClass( const std::tstring& errorMsg ) : 
                            std::truntime_error(errorMsg) { } 
} ;

The final problem was that I had lots of code that looked like this:

try
{
    // do something...
}
catch( exception& xcptn )
{
    tstringstream buf ;
    buf << _T("An error has occurred: ") << xcptn ; 
    AfxMessageBox( buf.str().c_str() ) ;
}

where I had defined an inserter for std::exception like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    // NOTE: toTstring() converts a string to a tstring - defined below
    os << toTstring( xcptn.what() ) ;

    return os ;
}

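The problem is that my inserter called what(), which only returns the narrow version of the error message. But if the error message contains foreign characters, I would like to see them in the error dialog! So I rewrote the inserter to look like this: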

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    if ( const wruntime_error* p = 
            dynamic_cast<const wruntime_error*>(&xcptn) )
        os << p->errorMsg() ; 
    else 
        os << toTstring( xcptn.what() ) ;

    return os ;
}

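Now it detects whether it has been given a wide exception class and, if so, streams out the wide error message. Otherwise, it falls back to using the standard (narrow) error message. Even though I might exclusively use truntime_error-derived classes in my app, this latter case is still important since the STL or other third-party libraries might throw a std::exception-derived error.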
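The same detect-and-fall-back idea can be tried out without MFC or the Windows conversion APIs. The sketch below is an assumption-laden miniature, not the code from this article: WideError, narrowLossy and describe are hypothetical names, the narrowing simply replaces anything outside ASCII with '?', and the widening of what() assumes the narrow message is plain ASCII.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical lossy narrowing: non-ASCII characters become '?'.
std::string narrowLossy(const std::wstring& w)
{
    std::string s;
    for (wchar_t c : w)
        s += (static_cast<unsigned long>(c) < 0x80) ? static_cast<char>(c) : '?';
    return s;
}

// Miniature wide exception: the base class gets the narrow text, and the
// wide text is kept alongside it.
class WideError : public std::runtime_error
{
public:
    explicit WideError(const std::wstring& msg)
        : std::runtime_error(narrowLossy(msg)), wideMsg_(msg) {}
    const std::wstring& errorMsg() const { return wideMsg_; }
private:
    std::wstring wideMsg_;
};

// Returns the best available wide description of any std::exception.
std::wstring describe(const std::exception& e)
{
    // prefer the wide message if the exception carries one
    if (const WideError* w = dynamic_cast<const WideError*>(&e))
        return w->errorMsg();
    // otherwise widen what() character by character (ASCII assumption)
    const std::string narrow = e.what();
    return std::wstring(narrow.begin(), narrow.end());
}
```

A catch block would then call describe(xcptn) and display the result, whatever kind of exception was thrown.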

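Other miscellaneous problems

Q100639: If you are writing an MFC app that uses Unicode, you need to specify wWinMainCRTStartup as your entry point (on the Link page of your project options).
Many Windows functions accept a buffer in which to return their results. The buffer size is usually specified in characters, not bytes. So while the following code will work fine when compiled using single-byte characters: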
```
// get our EXE name 
TCHAR buf[ _MAX_PATH+1 ] ; 
GetModuleFileName( NULL , buf , sizeof(buf) ) ;
```
it will be wrong for double-byte characters. The call to GetModuleFileName() needs to be written like this:
```
GetModuleFileName( NULL , buf , sizeof(buf)/sizeof(TCHAR) ) ;
```
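If you are processing a file byte by byte, you need to test for WEOF, not EOF.
HttpSendRequest() accepts a string that specifies additional headers to attach to an HTTP request before it is sent. The ANSI version accepts a string length of -1 to mean that the header string is NULL-terminated. The Unicode version requires the string length to be provided explicitly. Don't ask me why.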
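For the buffer-size item above, one way to avoid repeating the sizeof(buf)/sizeof(TCHAR) arithmetic is a small template that returns the number of elements in an array. This is only a sketch: countOf is a made-up name, and it assumes a compiler that can deduce array bounds in templates, which may rule out older ones.

```cpp
#include <cstddef>

// Returns the number of elements in a built-in array, whatever the element
// type. Unlike sizeof(buf), the result is a character count for both char
// and wchar_t buffers, and it will not compile if given a pointer.
template <typename T, std::size_t N>
inline std::size_t countOf(T (&)[N])
{
    return N;
}
```

The GetModuleFileName() call could then pass countOf(buf) as the buffer size (with a cast to the parameter's type if the compiler warns about it).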

Miscellaneous useful stuff

Finally, some little helper functions that you might find useful if you are doing this kind of work.

extern std::wstring toWideString( const char* pStr , int len=-1 ) ; 
inline std::wstring toWideString( const std::string& str )
{
    return toWideString(str.c_str(),str.length()) ;
}
inline std::wstring toWideString( const wchar_t* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::wstring(pStr,len) ;
}
inline std::wstring toWideString( const std::wstring& str )
{
    return str ;
}
extern std::string toNarrowString( const wchar_t* pStr , int len=-1 ) ; 
inline std::string toNarrowString( const std::wstring& str )
{
    return toNarrowString(str.c_str(),str.length()) ;
}
inline std::string toNarrowString( const char* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::string(pStr,len) ;
}
inline std::string toNarrowString( const std::string& str )
{
    return str ;
}

#ifdef _UNICODE
    inline TCHAR toTchar( char ch )
    {
        return (wchar_t)ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return ch ;
    }
    inline std::tstring toTstring( const std::string& s )
    {
        return toWideString(s) ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return toWideString(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return (len < 0) ? p : std::wstring(p,len) ;
    }
#else 
    inline TCHAR toTchar( char ch )
    {
        return ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return (ch >= 0 && ch <= 0xFF) ? (char)ch : '?' ;
    } 
    inline std::tstring toTstring( const std::string& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return (len < 0) ? p : std::string(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return toNarrowString(s) ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return toNarrowString(p,len) ;
    }
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wstring 
toWideString( const char* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

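    // figure out how many wide characters we are going to get 
    // nb: resolve the length first so that the terminating NUL is never
    //  converted (the output buffer below has no room for it)
    if ( len == -1 )
        len = static_cast<int>( strlen(pStr) ) ; 
    int nChars = MultiByteToWideChar( CP_ACP , 0 , pStr , len , NULL , 0 ) ; 
    if ( nChars == 0 )
        return L"" ;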

    // convert the narrow string to a wide string 
    // nb: slightly naughty to write directly into the string like this
    wstring buf ;
    buf.resize( nChars ) ; 
    MultiByteToWideChar( CP_ACP , 0 , pStr , len , 
        const_cast<wchar_t*>(buf.c_str()) , nChars ) ; 

    return buf ;
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

string 
toNarrowString( const wchar_t* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

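    // figure out how many narrow characters we are going to get 
    // nb: resolve the length first so that the terminating NUL is never
    //  converted (the output buffer below has no room for it)
    if ( len == -1 )
        len = static_cast<int>( wcslen(pStr) ) ; 
    int nChars = WideCharToMultiByte( CP_ACP , 0 , 
             pStr , len , NULL , 0 , NULL , NULL ) ; 
    if ( nChars == 0 )
        return "" ;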

    // convert the wide string to a narrow string
    // nb: slightly naughty to write directly into the string like this
    string buf ;
    buf.resize( nChars ) ;
    WideCharToMultiByte( CP_ACP , 0 , pStr , len , 
          const_cast<char*>(buf.c_str()) , nChars , NULL , NULL ) ; 

    return buf ; 
}
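The helpers above depend on the Windows conversion APIs. If you need something similar elsewhere, the following sketch uses the standard C functions std::mbstowcs and std::wcstombs instead. It is a different technique with different behavior: the conversion follows the current C locale rather than CP_ACP, strings with embedded NULs are cut short, and an unconvertible string yields an empty result. The names toWideStd and toNarrowStd are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Converts a narrow (multibyte) string to a wide string using the current
// C locale. Returns an empty string if the input cannot be converted.
std::wstring toWideStd(const std::string& s)
{
    if (s.empty())
        return std::wstring();
    // every wide character comes from at least one byte of input
    std::vector<wchar_t> buf(s.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(buf.data(), n);
}

// Converts a wide string to a narrow (multibyte) string using the current
// C locale. Returns an empty string if the input cannot be converted.
std::string toNarrowStd(const std::wstring& w)
{
    if (w.empty())
        return std::string();
    // MB_CUR_MAX is the most bytes one character can need in this locale
    std::vector<char> buf(w.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), w.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();
    return std::string(buf.data(), n);
}
```

In the default "C" locale these will only handle plain ASCII reliably, so a real program would probably call std::setlocale() first.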