A Convenient Tokenizer Function Using the STL

Joerg Wiedenmann

March 1, 2006

CPOL

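A convenient and customizable tokenizer function that works with STL strings.

Introduction

This article presents a tokenizer function that offers a highly customizable way to split strings. I wrote it because std::string does not provide a way to split its contents effectively, and I did not want to use another class for the job. The function is implemented with the methods that std::string itself provides.

Background

For my CSV class (which I will cover in another article) I needed a function that splits a string into a series of tokens. When I searched Google for "tokenizer", the only useful thing I found was the boost::tokenizer class. After studying it a little, I decided to implement my own function, because I did not want to define types for the various TokenizerFunction models. I do like the features the Boost class offers, though, and implemented some of them in my function.

Features

All delimiter, quote, and escape characters are 100% customizable.
More than one character can be specified for each delimiter group.
Quoted text is protected from being tokenized.
Single characters can be escaped to protect them.
Delimiters can optionally be kept as tokens.

Using the code

To use the function, pass it an input string, a vector that will receive the output, and the delimiters. Optionally, you can also pass quote and/or escape characters.

The default delimiters are the common CSV delimiters (space, tab, comma, colon, semicolon). The default quote characters are the double quote (") and the single quote ('), and the default escape character is the backslash (\). By default, no delimiter characters are preserved.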

Function prototype

void tokenize ( const string& str, vector<string>& result,
  const string& delimiters, const string& delimiters_preserve,
  const string& quote, const string& esc );

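`str`	The input string. This is the raw string that will be tokenized.
`result`	The tokens. This vector receives all generated tokens.
`delimiters`	The delimiters used to split the input string. Default: the common CSV delimiters (space, tab, comma, colon, semicolon).
`delimiters_preserve`	Delimiters used to split the input string that also appear as tokens in the result. No default characters.
`quote`	The quote characters. A matching pair of quote characters protects the enclosed text. Default: " and '
`esc`	The escape characters. Each one protects the single character that follows it. Default: backslash (\)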

Example

#include <iostream>
#include <string>
#include <vector>
#include "tokenizer.h"

using namespace std;

int main()
{
    // A string whose contents will be tokenized.
    string input;

    // define the characters that will break the string
    string delimiter = ",\t";  // use comma and tab

    // define the characters that will break the string
    // and generate tokens themselves
    string keep_delim = ";:";  // use semicolon and colon

    // define the characters that will protect the enclosed text
    string quote = "\'\"";  // use single quote and double quote

    // define the characters that will protect the following
    // character
    string esc = "\\#";  // use backslash and the hash sign

    // vector that receives the tokens for input
    vector<string> tokens;

    tokenize ( input, tokens, delimiter, keep_delim, quote, esc );

    // to use the tokens, define a token-iterator
    vector<string>::iterator token;

    // and simply iterate through the tokens
    for ( token = tokens.begin(); tokens.end() != token; ++token )
    {
        cout << *token << endl;
    }

    return 0;
}

Demo application

Running the demo application as is produces the following output:

Demo application for the tokenizer function.
The tokens are in []:

This;string,is for      demonstration.

[This]
[string]
[is]
[for]
[demonstration.]

Delimiters can be preserved: sqrt(17 * (20 + a))
[sqrt]  [(]     [17]    [*]     [(]     [20]    [+]     [a]     [)]     [)]

"This;string;contains;quoted;text";and;escaped\;characters.

[This;string;contains;quoted;text]
[and]
[escaped;characters.]

You can also pass arguments, or edit and use the included batch file, to run the demo on a file:

TokenizerDemo   filename [delimiters] [preserved delimiters] 
         [quote chars] [escape chars]

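All arguments are optional, but you cannot skip one. For example, if you need escape characters but no quote characters, you must pass an empty argument for the quote characters, like this: "".
To use the space as a delimiter, you must quote it. For example, to use comma, semicolon, space, and colon: ",; :".
The " character must also be quoted, like this: """.
Only the first 15 lines of the file are processed.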

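How it works

In essence, the string is iterated character by character, and each character is appended to a token string. Whenever a character belongs to the delimiters, the token string is stored in the result list and cleared for the next token. Special cases such as quotes are checked along the way.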
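
The core idea can be sketched in isolation. The following is a minimal illustration only, not the function from this article: it ignores quotes, escapes, and preserved delimiters, and the name split_simple is hypothetical. It assumes a C++11 compiler.

```cpp
#include <string>
#include <vector>

// Minimal sketch of the append-and-clear idea. split_simple is a
// hypothetical name; quotes, escapes and preserved delimiters are
// deliberately left out.
std::vector<std::string> split_simple(const std::string& str,
                                      const std::string& delimiters)
{
    std::vector<std::string> result;
    std::string token;

    for (char ch : str) {
        if (delimiters.find(ch) != std::string::npos) {
            // Delimiter: store the finished token, if any, and start over.
            if (!token.empty()) {
                result.push_back(token);
                token.clear();
            }
        } else {
            // Ordinary character: grow the current token.
            token.push_back(ch);
        }
    }

    // Keep whatever is left after the last delimiter.
    if (!token.empty()) {
        result.push_back(token);
    }
    return result;
}
```

Calling split_simple("This;string,is for", ";, ") yields the four tokens This, string, is, and for. Consecutive delimiters produce no empty tokens, because a token is stored only when it is not empty.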

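Implementation details

The first part of the function clears the result vector and initializes the variables that hold the current position in the string, the quote state, and the current token. The second part is the loop that performs the splitting, and the third part adds the remaining token, if there is one, to the result.

The loop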

For every character in the string
    Test if it is an escape character
        If yes, skip all other tests
    Test if it is a quote character
        If yes, skip all other tests
    Test if it is a delimiter
        Token is complete
    Test if it is a delimiter which should be preserved
        Token is complete
        flag the delimiter to be added

    Append the character to the current token if it isn't 
    a special one.
    If the token is complete and not empty
        add the token to the results
    If the delimiter is preserved
        add it to the results

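The loop walks through the string one character at a time. It runs several tests on each character to decide how to handle it. Before any test runs, the character is assumed not to be a special one.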

string::size_type len = str.length();
while ( len > pos ) {
    ch = str.at(pos);
    delimiter = 0;

    bool add_char = true;

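After the character has been extracted from the string, it is checked against the group of escape characters. If it matches, the position is advanced by one to fetch the next character, provided there is at least one more. No further tests are needed, because the escaped character is added to the current token whatever it is.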

if ( string::npos != esc.find_first_of(ch) ) {
    ++pos;
    if ( pos < len ) {
        ch = str.at(pos);
        add_char = true;
    } else {
        add_char = false;
    }
    escaped = true;
}
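
To see the escape rule on its own, here is a small stand-alone sketch. The helper name unescape is hypothetical and not part of the article's code; it applies only the escape test from the fragment above and copies everything else unchanged.

```cpp
#include <string>

// Sketch of the escape rule in isolation. unescape is a hypothetical
// helper: an escape character is dropped and the character after it
// is copied without any further inspection.
std::string unescape(const std::string& str, const std::string& esc)
{
    std::string out;
    const std::string::size_type len = str.length();

    for (std::string::size_type pos = 0; pos < len; ++pos) {
        char ch = str[pos];
        if (esc.find(ch) != std::string::npos) {
            ++pos;               // skip the escape character itself
            if (pos >= len) {
                break;           // trailing escape character: nothing to protect
            }
            ch = str[pos];       // the protected character
        }
        out.push_back(ch);
    }
    return out;
}
```

With the escape characters backslash and hash, the input a\;b#,c becomes a;b,c. An escape character at the very end of the input is simply dropped, which mirrors the add_char = false branch above.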

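Next, if the character belongs to the group of quote characters, the function checks whether a quote is already open. If no quote is open, the open-quote state is set; if one is open and the character matches the quote that opened it, the state is cleared. While a quote is open, no delimiter checks are made, and every special character is added to the current token.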

if ( false == escaped ) {
    if ( string::npos != quote.find_first_of(ch) ) {
        if ( false == quoted ) {
            quoted = true;
            current_quote = ch;
            add_char = false;
        } else if ( current_quote == ch ) {
            quoted = false;
            current_quote = 0;
            add_char = false;
        }
    }
}
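
The quote state machine can also be tried in isolation. The sketch below uses a hypothetical helper named strip_quotes, which is not part of the article's code. It removes matching quote characters and keeps everything in between, including a different quote character that appears inside an open quote.

```cpp
#include <string>

// Sketch of the quote state machine on its own. strip_quotes is a
// hypothetical helper: it drops the opening quote and its matching
// closing quote, and copies all other characters.
std::string strip_quotes(const std::string& str, const std::string& quote)
{
    std::string out;
    bool quoted = false;      // is a quote currently open?
    char current_quote = 0;   // the character that opened it

    for (char ch : str) {
        bool add_char = true;
        if (quote.find(ch) != std::string::npos) {
            if (!quoted) {
                quoted = true;            // opening quote
                current_quote = ch;
                add_char = false;
            } else if (current_quote == ch) {
                quoted = false;           // matching closing quote
                current_quote = 0;
                add_char = false;
            }
            // A different quote character inside an open quote is kept.
        }
        if (add_char) {
            out.push_back(ch);
        }
    }
    return out;
}
```

For example, with " and ' as quote characters, the input "it's" comes out as it's: the single quote is copied because the open quote is a double quote.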

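If the character matches none of the groups above, it is checked against the delimiters. If it is a delimiter and the token string is not empty, the token is marked as complete.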

if ( false == escaped && false == quoted ) {
    if ( string::npos != delimiters.find_first_of(ch) ) {
        if ( false == token.empty() ) {
            token_complete = true;
        }
        add_char = false;
    }
}

If the delimiter is one that should be preserved, the add_delimiter flag records this:

bool add_delimiter = false;
if ( false == escaped && false == quoted ) {
    if ( string::npos != delimiters_preserve.find_first_of(ch) ) {
        if ( false == token.empty() ) {
            token_complete = true;
        }
        add_char = false;
        delimiter = ch;
        add_delimiter = true;
    }
}

If the character is not a special one, it is appended to the end of the token string.

if ( true == add_char ) {
    token.push_back( ch );
}

If the token is not empty and is marked as complete, it is added to the result and reset for the next token.

if ( true == token_complete && false == token.empty() ) {
    result.push_back( token );
    
    token.clear();
    token_complete = false;
}

If the delimiter has been flagged to be preserved, it is added to the result as a token of its own.

if ( true == add_delimiter ) {
    string delim_token;
    delim_token.push_back( delimiter );
    
    result.push_back( delim_token );
}

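After the loop, one token may still be pending. This happens when the input string does not end with a delimiter, because the token-complete flag is set only in the delimiter tests, or when a quote was left open. Whatever the reason, if the token buffer is not empty, it is added to the result.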
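
Putting the fragments together, the whole function might look roughly like the sketch below. This is a reconstruction under assumptions, not the author's source file: the name tokenize_sketch is hypothetical, and the variable declarations, the per-iteration reset of escaped, and the final ++pos are filled in because the fragments above do not show them.

```cpp
#include <string>
#include <vector>

// Reconstruction sketch assembled from the fragments in this article.
// Assumptions: the declarations, the reset of `escaped` on every
// iteration, and the ++pos at the end of the loop are not shown in
// the article and are guessed here.
void tokenize_sketch(const std::string& str,
                     std::vector<std::string>& result,
                     const std::string& delimiters,
                     const std::string& delimiters_preserve,
                     const std::string& quote,
                     const std::string& esc)
{
    // Part 1: clear the result and initialize the state.
    result.clear();
    std::string::size_type pos = 0;
    const std::string::size_type len = str.length();
    bool quoted = false;
    char current_quote = 0;
    bool token_complete = false;
    std::string token;

    // Part 2: the loop.
    while (pos < len) {
        char ch = str[pos];
        bool add_char = true;
        bool escaped = false;
        bool add_delimiter = false;

        // Escape character: take the next character as it is.
        if (esc.find(ch) != std::string::npos) {
            ++pos;
            if (pos < len) {
                ch = str[pos];
            } else {
                add_char = false;
            }
            escaped = true;
        }

        // Quote character: open or close the quoted state.
        if (!escaped && quote.find(ch) != std::string::npos) {
            if (!quoted) {
                quoted = true;
                current_quote = ch;
                add_char = false;
            } else if (current_quote == ch) {
                quoted = false;
                current_quote = 0;
                add_char = false;
            }
        }

        // Delimiters are only honored outside quotes and escapes.
        if (!escaped && !quoted) {
            if (delimiters.find(ch) != std::string::npos) {
                if (!token.empty()) { token_complete = true; }
                add_char = false;
            }
            if (delimiters_preserve.find(ch) != std::string::npos) {
                if (!token.empty()) { token_complete = true; }
                add_char = false;
                add_delimiter = true;
            }
        }

        if (add_char) {
            token.push_back(ch);
        }
        if (token_complete && !token.empty()) {
            result.push_back(token);
            token.clear();
            token_complete = false;
        }
        if (add_delimiter) {
            result.push_back(std::string(1, ch));
        }
        ++pos;
    }

    // Part 3: add the remaining token, if there is one.
    if (!token.empty()) {
        result.push_back(token);
    }
}
```

Called with a space as delimiter and ()*+ as preserved delimiters, the sketch splits sqrt(17 * (20 + a)) into the ten tokens shown in the demo output: sqrt, (, 17, *, (, 20, +, a, ), and ).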

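Points of interest

This is the second implementation. In the original function, I put all special characters into one string and used the string::find_first_of method to get the position of the next one. That turned out to be inconvenient, because I had to check for and handle exceptions such as quotes and escape characters with great care.

After thinking about it for a few minutes, I realized I could iterate over the string character by character inside the function and test whether each character belongs to one of the special character groups. The difference between the two approaches is that the first gave me the start and end positions of a substring to copy into the token string, whereas the second simply appends characters to the token string and clears it when a delimiter is found.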
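
For comparison, the position-based first approach could look something like the following sketch. It is an illustration only and not the original first version, which is not shown in this article. The name split_by_position is hypothetical, and quotes and escapes are left out, since handling them is exactly what made this approach awkward.

```cpp
#include <string>
#include <vector>

// Sketch of the position-based approach: find the next delimiter with
// find_first_of and copy the substring in front of it.
// split_by_position is a hypothetical name; quotes and escapes are
// not handled.
std::vector<std::string> split_by_position(const std::string& str,
                                           const std::string& delimiters)
{
    std::vector<std::string> result;
    std::string::size_type start = 0;

    while (start < str.length()) {
        // Position of the next delimiter, or the end of the string.
        std::string::size_type end = str.find_first_of(delimiters, start);
        if (end == std::string::npos) {
            end = str.length();
        }
        // Copy the substring between start and end, skipping empty ones.
        if (end > start) {
            result.push_back(str.substr(start, end - start));
        }
        start = end + 1;
    }
    return result;
}
```

For plain delimiters this is compact, but a quote or an escape character changes the meaning of the delimiters that follow it, so every position found would need extra checks before the substring could be copied.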

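Conclusion

I would like to thank you, the reader, for your interest and feedback, and the kind people in the Lounge who told me how to write a good introduction. Whether it turned out to be a good one, I do not know.

I don't know whether the words "tokenizer" and "tokenizing" actually exist. For "tokenized", the dictionary says "translated into tokens", but the meaning should be clear anyway ;)

Any questions left unanswered? Feel free to ask. :)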

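License

zlib/libpng License.

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
This notice may not be removed or altered from any source distribution.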

History

2006-02-10 - Initial version.
2006-03-05 - Bug fix and some minor changes to the article. Thanks to Elias for pointing it out.
