FormatString - Smart String Formatting

Ivo Beltchev

25 Nov 2006

CPOL

Smart string formatting and other string utilities

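Introduction

This article is about string formatting. The standard way to format strings in C is the old sprintf function. It has a number of flaws and is showing its age. C++ and STL introduced iostreams and the << operator. They are convenient for simple tasks, but their formatting features are clumsy and underpowered.

On the other hand, there is the .NET Framework with its String class and the formatting function String.Format. It is safer and easier to use than sprintf, but it is available only from managed code. This article shows the main problems of sprintf and offers an alternative that can be used from native C++ code.

What's wrong with sprintf?

sprintf is prone to buffer overflows

sprintf comes in several versions that offer different degrees of protection against buffer overflow. The basic sprintf has no protection at all. It will happily write past the end of the given buffer, most likely crashing the program. The _snprintf function will not write past the buffer, but it also will not add a terminating zero if there is not enough space. The program will not crash immediately, but is likely to crash later. The newer sprintf_s function fixes the buffer overflow problem, but it is only available in Visual Studio 2005 and later.

String.Format allocates its own output buffer from the managed heap and can grow it as large as needed.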
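
To make the truncation problem concrete, here is a minimal sketch of a bounded formatting helper built on the standard std::vsnprintf. Unlike the older _snprintf, the standard function always writes a terminating zero when the buffer size is greater than zero. The name BoundedFormat is made up for this illustration and is not part of FormatString. The helper is still not type-safe, which is the next problem discussed.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not part of FormatString): formats into a fixed buffer
// and reports whether the whole result fit.
// Returns true if nothing was cut off, false if the output was truncated.
inline bool BoundedFormat(char *buf, std::size_t size, const char *format, ...)
{
    assert(buf != nullptr && size > 0);
    va_list args;
    va_start(args, format);
    // std::vsnprintf writes at most size-1 characters plus a terminating zero
    // and returns the length the complete result would have had.
    int needed = std::vsnprintf(buf, size, format, args);
    va_end(args);
    return needed >= 0 && static_cast<std::size_t>(needed) < size;
}
```

A caller can check the return value and fall back to a larger buffer when the result did not fit. In either case the buffer holds a valid zero-terminated string.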

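sprintf is not type-safe

The sprintf function uses the ellipsis syntax (...) to accept a variable number of arguments. The downside is that the function has no direct information about the types of its arguments and cannot validate them. It assumes that the number and types of the arguments match the format string. This can lead to bugs that are hard to find. For example: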

std::string userName("user1");
int userData=0;

// These will compile and often run, but will produce wrong results

// the types of the arguments don't match the format
sprintf(buf,"user %d, data %s",userName.c_str(),userData);

// the string is missing .c_str()
sprintf(buf,"user %s, data %d",userName,userData);

In String.Format, the format of an argument is optional. If the argument is a string, it is printed as a string; if it is a number, it is printed as a number.

// The .NET equivalent:
String.Format("user {0}, data {1}",userName,userData);

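sprintf has localization problems

The sprintf function requires the arguments to appear in exactly the same order as the format specifiers. The bad news is that word order differs between languages, so the program has to supply the arguments in a different order for each language. For example: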

// English
sprintf(buf,"The population of %s is %d people.","New York",20000000);
// But maybe in some other language it has to be:
sprintf(buf,"%d people live in %s.",20000000,"New York"); // the order is different

String.Format wins here too. Its format items explicitly state which argument to use, and they can appear in any order.

// The .NET equivalent - same code can be used for both languages,
// just the formatting string needs to change:
String.Format("The population of {0} is {1} people.","New York",20000000);
String.Format("{1} people live in {0}.","New York",20000000);

The FormatString function

The FormatString function is a smart, type-safe alternative to sprintf that can be used from native C++ code. It is used like this:

FormatString(buffer, buffer_size_in_characters, format, arguments...);

There are two versions of the function, one for char and one for wchar_t.

The format string contains items similar to those of String.Format:

{index[,width][:format][@comment]}

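index is the zero-based index in the argument list. If index is past the last argument, FormatString will assert.

width is the optional width of the result. If width is less than zero, the result is left-aligned. width can have the form '*<index>'. Then <index> must be the index of another argument in the list, which supplies the width value.

format is the optional format of the result. The available formats depend on the argument type. If format is not supported for the given argument, FormatString will assert.

comment is ignored. It can be a hint describing the meaning of the argument, or an example that helps when localizing the format string.

The result of FormatString always fits in the provided buffer and is always zero-terminated. Special cases, such as the buffer ending in the middle of a double-byte character or in the middle of a surrogate pair, are also handled.

Since the { and } characters are used to define format items, they must be escaped as {{ and }} in the format string.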
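
The item syntax and the escaping rule can be illustrated with a small stand-alone sketch. This is not the FormatString implementation. ExpandItems is a hypothetical function that takes arguments already converted to strings, reads only the index of each item, and skips the width, format and comment parts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration, not the FormatString implementation.
// Expands "{index...}" items from a list of pre-converted strings and
// handles the "{{" and "}}" escapes. Everything after the index is ignored.
inline std::string ExpandItems(const std::string &format,
                               const std::vector<std::string> &args)
{
    std::string out;
    for (std::size_t i = 0; i < format.size(); ++i)
    {
        char c = format[i];
        bool doubled = (i + 1 < format.size() && format[i + 1] == c);
        if ((c == '{' || c == '}') && doubled)
        {
            out += c;                       // "{{" -> "{", "}}" -> "}"
            ++i;
        }
        else if (c == '{')
        {
            std::size_t end = format.find('}', i);
            if (end == std::string::npos) break;    // malformed item: stop
            // Read the leading decimal index.
            std::size_t index = 0, p = i + 1;
            bool hasDigit = false;
            while (p < end && format[p] >= '0' && format[p] <= '9')
            {
                index = index * 10 + static_cast<std::size_t>(format[p] - '0');
                ++p;
                hasDigit = true;
            }
            // The real function asserts on a bad index; the sketch skips it.
            if (hasDigit && index < args.size()) out += args[index];
            i = end;                        // continue after the closing '}'
        }
        else out += c;
    }
    return out;
}
```

With the arguments "New York" and "20000000", the format "{1} people live in {0}." produces the reordered sentence, and "{{0}} is {0@city name}" produces "{0} is New York".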

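Available formats

For 8, 16, 32 and 64-bit integers, including 32 and 64-bit pointers

c - a character. It is an ANSI or UNICODE character, depending on the type of the format string
d[+][0] - signed integer. '+' forces a + sign for positive values. '0' adds leading zeros
u[0] - unsigned integer. '0' adds leading zeros
x[0] - lowercase hexadecimal integer. '0' adds leading zeros
X[0] - uppercase hexadecimal integer. '0' adds leading zeros
n - localized number (uses GetNumberFormat, but with no fractional digits)
f - localized file size (uses StrFormatByteSize)
k - localized file size in KB (uses StrFormatKBSize)
t[<number>] - localized time interval in milliseconds (uses StrFromTimeInterval, with an optional number of significant digits between 1 and 6)

The default format is 'd' for signed integers and 'u' for unsigned integers.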

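For floats and doubles

f[<number>] - fixed point (optional number of fractional digits)
f*<index> - fixed point. <index> is the index of another argument that supplies the number of fractional digits
e or E - exponential format. Supports the same fractional-digit options as 'f'
g or G - picks the shorter of 'f' and 'e'/'E'. The same fractional-digit rules apply
$ - localized currency (uses GetCurrencyFormat)
n[<number>] or n*<index> - localized number (uses GetNumberFormat, with an optional number of fractional digits)

The default format for floats and doubles is 'f'.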

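For ANSI strings, including std::string

The char version of FormatString supports no formats for ANSI strings. The wchar_t version supports:

<number> - the code page to use when converting the ANSI string to UNICODE
*<index> - the index of another argument that supplies the code page

If no code page is given, the default (CP_ACP) is used.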

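For UNICODE strings, including std::wstring

The wchar_t version of FormatString supports no formats for UNICODE strings. The char version supports:

<number> - the code page to use when converting the UNICODE string to ANSI
*<index> - the index of another argument that supplies the code page

If no code page is given, the default (CP_ACP) is used.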

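For SYSTEMTIME (passed as const SYSTEMTIME &)

d[l/f][format] - short date format (uses GetDateFormat). 'l' converts the time from UTC to local time. 'f' is the same as 'l', but uses the file system rules *. format is an optional format passed to GetDateFormat
D[l/f][format] - long date format
t[l/f][format] - time format without seconds (uses GetTimeFormat)
T[l/f][format] - time format

* 'l' converts from UTC to local time with SystemTimeToTzSpecificLocalTime. 'f' uses FileTimeToLocalFileTime. The difference is that FileTimeToLocalFileTime applies the current daylight saving setting instead of the setting in effect on the given date. This is incorrect, but more consistent with the way Windows displays local file times. If STR_USE_WIN32_TIME is not defined, the localtime function is used whether 'l' or 'f' is specified. The results of localtime are consistent with the file system (and with FileTimeToLocalFileTime). To read why the file system behaves this way, see The Old New Thing: Why Daylight Savings Time is nonintuitive.

The default format for SYSTEMTIME is 'd'.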

Examples

char buf[100];

// The order of the arguments can change
FormatString(buf,100,"{1} people live in {0}.","New York",20000000);
    -> 20000000 people live in New York.

// Signed values are printed as signed
FormatString(buf,100,"{0}",-1);
    -> -1

// Unsigned values are printed as unsigned
FormatString(buf,100,"{0}",(unsigned int)-1);
    -> 4294967295

// The same argument can be used more than once
FormatString(buf,100,"{0}, 0x{0,8:X0}",1);
    -> 1, 0x00000001

// UNICODE text can be converted to ANSI
FormatString(buf,100,"{0}",L"test");
    -> test

// Localized integer number
FormatString(buf,100,"{0:n}",12345678);
    -> 12,345,678

// Time interval
FormatString(buf,100,"{0:t3}",12345678);
    -> 3 hr, 25 min

// Floating point number
FormatString(buf,100,"{0}",12345.678);
    -> 12345.678000

// Localized floating point number
FormatString(buf,100,"{0:n*1}",12345.678,2);
    -> 12,345.68

// Show current time
SYSTEMTIME st;
GetSystemTime(&st);
FormatString(buf,100,"{0:dl}  {0:tl}",st);
    -> 11/25/2006  1:26 PM

// Use custom date format
FormatString(buf,100,"{0:ddddd',' MMM dd yy}",st);
    -> Saturday, Nov 25 06

How it works

The FormatString function has 10 optional arguments arg1, ... arg10 of type const CFormatArg &, as shown here:

class CFormatArg
{
public:
    CFormatArg( void );
    CFormatArg( char x );
    CFormatArg( unsigned char x );
    CFormatArg( short x );
    CFormatArg( unsigned short x );
    ..........
    
    enum
    {
        TYPE_NONE=0,
        TYPE_INT=1,
        TYPE_UINT=2,
        .....
    };

    union
    {
        int i;
        __int64 i64;
        double d;
        const char *s;
        const wchar_t *ws;
        const SYSTEMTIME *t;
    };
    int type;
    static CFormatArg s_Null;
};

int FormatString( char *string, int len, const char *format,
    const CFormatArg &arg1=CFormatArg::s_Null, ...,
    const CFormatArg &arg10=CFormatArg::s_Null );

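The CFormatArg class has a constructor for each supported type. Each constructor sets the type member according to the type of its parameter. When FormatString is called with actual arguments, a temporary CFormatArg object is created for each one, storing the argument's value and type. Any argument that is not supplied defaults to CFormatArg::s_Null. FormatString can then determine how many arguments were provided and access their types and values.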
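
The following reduced sketch shows the same technique in isolation. The Arg class and the DescribeArgs function are hypothetical stand-ins written for this illustration. They support three types and three arguments instead of the full set, and the assumption that an unsupplied argument is recognized by a "none" type tag is mine, not confirmed by the library source.

```cpp
#include <cassert>
#include <string>

// Hypothetical, reduced stand-in for CFormatArg: each implicit constructor
// records the type of the value it was given.
class Arg
{
public:
    enum Type { TYPE_NONE, TYPE_INT, TYPE_DOUBLE, TYPE_STRING };

    Arg() : type(TYPE_NONE), i(0) {}
    Arg(int x) : type(TYPE_INT), i(x) {}
    Arg(double x) : type(TYPE_DOUBLE), d(x) {}
    Arg(const char *x) : type(TYPE_STRING), s(x) {}

    Type type;
    union { int i; double d; const char *s; };
};

// Returns one letter per supplied argument: 'i', 'd' or 's'.
// Unsupplied arguments default to a TYPE_NONE object, which ends the list.
inline std::string DescribeArgs(const Arg &a1 = Arg(), const Arg &a2 = Arg(),
                                const Arg &a3 = Arg())
{
    const Arg *list[] = { &a1, &a2, &a3 };
    std::string result;
    for (const Arg *a : list)
    {
        if (a->type == Arg::TYPE_NONE) break;
        result += (a->type == Arg::TYPE_INT) ? 'i'
                : (a->type == Arg::TYPE_DOUBLE) ? 'd' : 's';
    }
    return result;
}
```

A call such as DescribeArgs("user1", 0) yields "si": the compiler picks the constructor from the static type of each argument, so the callee knows the real types without relying on the format string.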

Dynamically allocated strings

Often you don't want a fixed-size buffer but a dynamically allocated one. Use the FormatStringAlloc function instead:

char *string=FormatStringAlloc(allocator, format, arguments );

The first argument is an object with a virtual member function that is responsible for allocating and growing the string buffer:

class CFormatStringAllocator
{
public:
    virtual bool Realloc( void *&ptr, int size );

    static CFormatStringAllocator g_DefaultAllocator;
};

bool CFormatStringAllocator::Realloc( void *&ptr, int size )
{
    void *res=realloc(ptr,size);
    if (ptr && !res) free(ptr);
    ptr=res;
    return res!=NULL;
}

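The Realloc member function must reallocate the buffer pointed to by ptr to the given size (in bytes) and set ptr to the new address. The allocator is called roughly every 256 characters to grow the buffer. On the first call to Realloc, ptr is NULL. If an error occurs, Realloc must free the memory pointed to by ptr and either return false or throw an exception. If Realloc returns false, FormatStringAlloc stops and returns NULL.

The default allocator uses the realloc function of the C runtime heap. To free the returned string, call free(string). You can write your own allocator that uses a different heap or some other way of allocating memory. See the CString class section below for an example.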
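
As a sketch of what a custom allocator might look like, the code below derives a size-limited allocator that follows the Realloc contract described above. AllocatorBase repeats the shape of the article's CFormatStringAllocator only so that the example compiles on its own. LimitedAllocator and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Stand-in with the same shape as CFormatStringAllocator, repeated here
// only so the sketch is self-contained.
class AllocatorBase
{
public:
    virtual ~AllocatorBase() {}
    virtual bool Realloc(void *&ptr, int size)
    {
        void *res = std::realloc(ptr, static_cast<std::size_t>(size));
        if (ptr && !res) std::free(ptr);
        ptr = res;
        return res != NULL;
    }
};

// Hypothetical allocator that refuses to grow past a fixed limit
// and counts how many times it was called.
class LimitedAllocator : public AllocatorBase
{
public:
    explicit LimitedAllocator(int maxBytes) : m_Max(maxBytes), m_Calls(0) {}

    bool Realloc(void *&ptr, int size) override
    {
        ++m_Calls;
        if (size > m_Max)
        {
            std::free(ptr);     // contract: release the buffer on failure
            ptr = NULL;
            return false;
        }
        return AllocatorBase::Realloc(ptr, size);
    }

    int m_Max;      // largest allowed buffer size in bytes
    int m_Calls;    // number of Realloc calls so far
};
```

Passing such an object as the first argument of FormatStringAlloc would, under the contract above, make the function return NULL once the formatted text needs more than the allowed number of bytes.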

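Output to a stream

Often you don't want the formatted string in a buffer but sent to a file, the text console, the Visual Studio debug window, and so on. Use the FormatStringOut function instead: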

bool success=FormatStringOut(output, format, arguments );

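The first argument is an object with a virtual member function that is responsible for outputting portions of the result. There are separate classes for char and wchar_t: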

// char version
class CFormatStringOutA
{
public:
    virtual bool Output( const char *text, int len );

    static CFormatStringOutA g_DefaultOut;
};

bool CFormatStringOutA::Output( const char *text, int len )
{
    for (int i=0;i<len;i++)
        if (putchar(text[i])==EOF) return false;
    return true;
}

// wchar_t version
class CFormatStringOutW
{
public:
    virtual bool Output( const wchar_t *text, int len );

    static CFormatStringOutW g_DefaultOut;
};

bool CFormatStringOutW::Output( const wchar_t *text, int len )
{
    for (int i=0;i<len;i++)
        if (putwchar(text[i])==WEOF) return false;
    return true;
}

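The Output member function is called with each portion of the result. The len parameter is the number of characters. Note that the text is not guaranteed to be zero-terminated. If an error occurs, Output must return false or throw an exception. If Output returns false, FormatStringOut stops and returns false.

The default implementation simply sends the text to the console with putchar/putwchar. You can write your own output classes for iostream, FILE*, Win32 HANDLE, and so on.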
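
A custom output class could look roughly like the sketch below, which collects the portions into a std::string. OutBase mirrors the Output signature of CFormatStringOutA only to keep the example self-contained. StringCollector is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in mirroring the Output interface of CFormatStringOutA,
// repeated here only so the sketch compiles on its own.
class OutBase
{
public:
    virtual ~OutBase() {}
    virtual bool Output(const char *text, int len) = 0;
};

// Hypothetical output class that appends every portion to a std::string.
class StringCollector : public OutBase
{
public:
    bool Output(const char *text, int len) override
    {
        if (len < 0) return false;      // report bad input as an error
        // Use the explicit length: the text may not be zero-terminated.
        m_Text.append(text, static_cast<std::size_t>(len));
        return true;
    }

    std::string m_Text;     // everything received so far
};
```

The important detail is that Output relies on len and never on a terminating zero, as the contract above requires.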

Extras

Support for FILETIME, time_t and OLE time

The CFormatTime class derives from CFormatArg and lets you pass date/time values in these different formats. Use it like this:

time_t t=time(NULL);
FormatString(buf, 100, "local time: {0:dl}  {0:tl}", CFormatTime(t));
    -> local time: 11/25/2006  1:26 PM

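You can create your own classes derived from CFormatArg to support more data types or to add more format options.

Passing the CFormatArg argument list to other functions

FormatString.h defines 3 macros that can be used with argument lists: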

FORMAT_STRING_ARGS_H
FORMAT_STRING_ARGS_CPP and
FORMAT_STRING_ARGS_PASS

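You can use them to create other functions that take a variable argument list and call FormatString. For example, let's create a MessageBox function that can format its message: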

// in your header file
int MessageBox( HWND parent, UINT type, LPCTSTR caption,
        LPCTSTR format, FORMAT_STRING_ARGS_H );

// in your cpp file
int MessageBox( HWND parent, UINT type, LPCTSTR caption,
        LPCTSTR format, FORMAT_STRING_ARGS_CPP )
{
    TCHAR *text=FormatStringAlloc(CFormatStringAllocator::g_DefaultAllocator,
            format,
            FORMAT_STRING_ARGS_PASS);
    int res=MessageBox(parent,text,caption,type);
    free(text);
    return res;
}

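Calling with no variable arguments

If FormatString and its sibling functions are called with no variable arguments, the format string is copied directly to the output. In the example above, you can call MessageBox(parent, type, caption, text) and the text is shown in the message box as is, without being parsed for format items.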

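The CString class

The sample source code provides the simple string container classes CStringA and CStringW. The strings stored in them keep a reference count in the 4 bytes before the first character. When an object is copied, the string itself is not copied; only the reference count is incremented (so-called copy-on-write with reference counting). When a string object is destroyed, the reference count is decremented, and if it reaches 0, the memory is freed. The reference count is modified with InterlockedIncrement and InterlockedDecrement for thread safety.

In ANSI configurations the CString type is set to CStringA, and in UNICODE configurations to CStringW. This lets you use the configuration-dependent CString and still mix ANSI and UNICODE types when you need to.

The CString classes have a Format member function that formats a string and assigns the result to the object. It calls FormatStringAlloc with a special allocator that requests 4 more bytes than are actually needed, to hold the reference count. The CString classes also define a cast operator to CFormatArg, so they can be passed directly as arguments to FormatString.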

CString s;
s.Format(_T("{0}"),"test");
FormatStringOut(CFormatStringOutA::g_DefaultOut,"s=\"{0}\"\n",s);
    -> s="test"

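CString behaves much like the ATL/MFC strings and is here only to demonstrate a custom memory allocator for FormatStringAlloc and the CFormatArg cast operator. To use these classes in a real application you will probably want to add more functionality, such as comparison operators, conversion operators/constructors between CStringA and CStringW, string manipulation functions, and so on. Or simply use an existing class such as std::string or ATL::CString.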
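
The reference-counting scheme described in this section can be sketched portably as follows. This is not the CStringA source. SharedText is a hypothetical class, and it uses std::atomic in place of InterlockedIncrement and InterlockedDecrement so that it compiles outside Windows. The counter sits in a small header directly before the first character, as described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical sketch of a shared string buffer with the reference count
// stored immediately before the characters.
class SharedText
{
public:
    explicit SharedText(const char *text) : m_Data(NULL)
    {
        std::size_t len = std::strlen(text);
        void *block = std::malloc(sizeof(Header) + len + 1);
        if (!block) throw std::bad_alloc();
        new (block) Header();                       // count starts at 1
        m_Data = static_cast<char*>(block) + sizeof(Header);
        std::memcpy(m_Data, text, len + 1);
    }
    SharedText(const SharedText &other) : m_Data(other.m_Data)
    {
        header()->count.fetch_add(1);               // share, don't copy
    }
    SharedText &operator=(const SharedText &other)
    {
        if (m_Data != other.m_Data)
        {
            other.header()->count.fetch_add(1);
            Release();
            m_Data = other.m_Data;
        }
        return *this;
    }
    ~SharedText() { Release(); }

    const char *c_str() const { return m_Data; }
    int RefCount() const { return header()->count.load(); }

private:
    struct Header { std::atomic<int> count{1}; };

    Header *header() const
    {
        return reinterpret_cast<Header*>(m_Data - sizeof(Header));
    }
    void Release()
    {
        // fetch_sub returns the previous value: 1 means we were the last owner
        if (header()->count.fetch_sub(1) == 1)
        {
            Header *h = header();
            h->~Header();
            std::free(h);
        }
    }

    char *m_Data;   // points at the first character, just past the header
};
```

Copying a SharedText only bumps the counter, so both objects return the same pointer from c_str(). The sketch leaves out modification, which is where a copy-on-write class would first make a private copy.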

StringUtils.h

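The source file contains a set of string utilities that can be used independently of FormatString. Most of them are wrappers around the system string functions. The functions come in pairs, one for ANSI and one for UNICODE, like this: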

inline int Strlen( const char *str ) { return (int)strlen(str); }
inline int Strlen( const wchar_t *str ) { return (int)wcslen(str); }
int Strcpy( char *dst, int size, const char *src );
int Strcpy( wchar_t *dst, int size, const wchar_t *src );

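The advantage over _tcslen and _tcscpy is that you can easily mix ANSI and UNICODE code and always use the same function names.

Other wrappers provide safe versions of functions like strncpy, sprintf and strcat that never write past the provided buffer and always zero-terminate the result. They compile cleanly under VC 6.0, VS 2003 and VS 2005.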
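
To illustrate the idea of paired, bounded wrappers, here is a minimal sketch of an overloaded copy function. SafeCopy is a hypothetical name, and returning the number of characters copied is an assumption. The article does not state what the real Strcpy returns. Unlike FormatString, this sketch does not take care to avoid cutting a double-byte character or a surrogate pair in half.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical bounded copy for char strings. size is the buffer size in
// characters. Returns the number of characters copied (an assumption made
// for this sketch), not counting the terminating zero.
inline int SafeCopy(char *dst, int size, const char *src)
{
    assert(dst && src && size > 0);
    int len = static_cast<int>(std::strlen(src));
    if (len > size - 1) len = size - 1;     // truncate to fit
    std::memcpy(dst, src, static_cast<std::size_t>(len));
    dst[len] = 0;                           // always zero-terminate
    return len;
}

// Same function name for wchar_t strings, selected by overload resolution.
inline int SafeCopy(wchar_t *dst, int size, const wchar_t *src)
{
    assert(dst && src && size > 0);
    int len = static_cast<int>(std::wcslen(src));
    if (len > size - 1) len = size - 1;
    std::wmemcpy(dst, src, static_cast<std::size_t>(len));
    dst[len] = 0;
    return len;
}
```

Because the two functions differ only in their parameter types, calling code can use SafeCopy for both ANSI and UNICODE buffers without the _t macros.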

Output to STL strings

These functions output the formatted result to an STL string:

std::string FormatStdString( const char *format, ... );
std::wstring FormatStdString( const wchar_t *format, ... );
void FormatStdString( std::string &string, const char *format, ... );
void FormatStdString( std::wstring &string, const wchar_t *format, ... );

Output to STL streams

You can output formatted strings to STL streams like this:

stream << StdStreamOut(format, parameters) << ...;

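Source code

To use the source code, just add the .h and .cpp files to your project:

StringUtils.h/StringUtils.cpp - a set of string helper functions. They can be used on their own
FormatString.h/FormatString.cpp - the string formatting functionality. Requires StringUtils
CString.h/CString.cpp - the string container classes. Requires StringUtils and FormatString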

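Configuring the source code

StringUtils.h defines several macros that can be used to enable or disable parts of the functionality:

STR_USE_WIN32_CONV - if defined, the code uses the Win32 functions WideCharToMultiByte and MultiByteToWideChar to convert between char and wchar_t strings. Otherwise it uses wcstombs and mbstowcs. The advantage of the Win32 functions is that they support conversion between Unicode and different code pages (including UTF8).
STR_USE_WIN32_NLS - if defined, the FormatString functions use Win32 functionality to format numbers, dates and times. Otherwise they try to emulate it to a degree.
STR_USE_WIN32_TIME - if defined, the FormatString functions support the time_t, SYSTEMTIME, FILETIME and DATE types. Otherwise only time_t is supported.
STR_USE_WIN32_DBCS - if defined, the code uses IsDBCSLeadByte to handle DBCS characters. Otherwise it uses isleadbyte.
STR_USE_STL - if defined, the FormatString functions support std::string and std::wstring as input arguments. In addition, FormatStdString and StdStreamOut are defined, which output to std::string, std::wstring, std::ostream and std::wostream.

With these macros you can enable only the features that you need and that your compiler or platform supports.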

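History

November 2006 - first version
- char and wchar_t implementations of FormatString
- support for number, string and time formats
- formatting to a fixed-size buffer, a dynamically allocated buffer and output streams
December 2006 - better portability and more features
- added the configuration macros
- added support for STL strings and streams
- added support for different sizes of wchar_t
- more robust handling of number formats, thanks to a suggestion by Mihai Nita
February 2007
- added a conversion from UTC to local time that is consistent with the file system (for file times)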