一个多子字符串搜索类：CIVStringSet

Scot Brennecke

4.69/5 (6投票s)

2002年6月16日

CPOL

4分钟阅读

159379

2193

一个使用 MFC 或 STL 的字符串数组类，可执行非常快速的多字符串搜索。

引言

程序员可能有很多原因希望同时搜索一个文本字符串中是否存在多个子字符串。很多时候，程序员会采用一种非常低效的暴力方法，即遍历令牌列表，然后为每个令牌从头到尾依次搜索 `string`。毫无疑问，这种方法效率非常低下，如果令牌列表很长和/或要搜索的 `string` 很长，效率甚至会变得非常糟糕。我创建了一个类来简化这项任务并高效地完成它。该类目前提供两个版本：一个使用 MFC 的 `CString` 和 `CStringArray`，另一个使用 STL 的 `std::string` 和 `std::vector`。

致谢

在 2002 年 6 月的《C/C++ Users Journal》杂志上，Moishe Halibard 和 Moshe Rubin 发表了一篇题为“A Multiple Substring Search Algorithm”的文章，其中他们详细介绍了他们用于高效有限状态机 (FSM) 的算法，以协助单次遍历搜索多个子字符串。我的类就是基于这个算法的。

更新详情

对于那些看过我 2002 年发布的这些类版本的人来说，以下是我所做主要更改的摘要：

支持 `string` 类型的分隔符字符。
加速 `FindNext` 方法，以跳过搜索开头的分隔符，并在单词被拒绝时跳到下一个分隔符。
将 FSM 实现替换为使用新的 `SEntry struct` 类型的 `std::vector`，而不是 `DWORD` 的动态数组。
更多地使用更合适的 `size_t` 类型而不是 `int`。
示例项目使用 VC++ 9.0 (VS2008)，而不是过时且有 bug 的 VC++ 6.0。

类接口

`CIVStringSet` 类派生自 `CStringArray` 或 `std::vector`，以便轻松继承 `CString` 或 `std::string` 动态数组的功能。

但是，我的类隐藏了许多基类方法，以防止在数组末尾以外的位置插入或删除 `string`。在添加到 FSM 后，更改给定单词的数组索引是**至关重要的**。

类的 `public` 接口如下：

struct SEntry
{
    SEntry( DWORD dwState = 0UL, DWORD dwIndex = 0UL )
        : m_dwState( dwState ), m_dwIndex( dwIndex ) {}

    DWORD m_dwState ;
    DWORD m_dwIndex ;
} ;

class CIVStringSet : public CStringArray
{
public:
    CIVStringSet( DWORD dwInitialWidth = 64 ) ;  		// Initial width of FSM
    virtual ~CIVStringSet() ;

    bool     Add( LPCTSTR pszWord ) ;                     	// a single word
    bool     Add( const CString & rstrWord ) ;            	// a single word
    size_t   Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ; 	// multiple words, 
					// delimited with chars from pszDelims
    size_t   Add( LPCTSTR pszzWords, size_t nWords ) ;    	// nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t   Add( const std::set<CString< & sstrWords ) ; 	// all the elements of a 
							// set of CStrings
    size_t   Add( const CStringArray & astrWords ) ;      	// all the elements 
							// of a CStringArray
    size_t   Add( const CStringList & lstrWords ) ;       	// all the elements of a 
							// CStringList

    void     SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } // delimiters 
							// to use in word search
    UINT     FindFirstIn( CString strText, size_t & rnFirst ) ; // Begin iteration
    UINT     FindNext( size_t & rnNext ) ;                      // Continue iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT< < CWordPosPairList ; // list of pairs to 
						// be returned by FindAllIn
    size_t   FindAllIn( CString strText, CWordPosPairList & rlstrWords ) ; // Iterate 
						// all at once and make list
} ;

#ifdef _UNICODE
typedef std::wstring _tstring ;
#else
typedef std::string _tstring ;
#endif

class CIVStringSet : public std::vector<_tstring>

{
public:
    CIVStringSet( unsigned long dwInitialWidth = 64 ) ;  // Initial width of FSM
    virtual ~CIVStringSet() ;

    bool        Add( LPCTSTR pszWord ) ;                         // a single word
    bool        Add( const _tstring & rstrWord ) ;               // a single word
    size_t      Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ;     // multiple words, 
					// delimited with chars from pszDelims
    size_t      Add( LPCTSTR pszzWords, size_t nWords ) ;        // nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t      Add( const std::set<_tstring> & sstrWords ) ;    // all the elements of 
							 // a set of strings
    size_t      Add( const std::vector<_tstring> & astrWords ) ; // all the elements 
						// of an array of strings
    size_t      Add( const std::list<_tstring> & lstrWords ) ;   // all the elements 
							 // of a list of strings

    void        SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } //delimiters
							// to use in word search
    UINT        FindFirstIn( const _tstring & strText, size_t & rnFirst ) ;  // Begin 
								// iteration
    UINT        FindNext( size_t & rnNext ) ;                                // Continue 
								// iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT> > CWordPosPairList ; // list of pairs 
						// to be returned by FindAllIn
    size_t      FindAllIn( _tstring strText, CWordPosPairList & rlstrWords ) ; // Iterate
						// all at once and make list
} ;

构造函数接受一个可选的单个参数，该参数决定 FSM 的初始宽度。宽度至少应等于数组中所有 `string` 的总长度。当向集合添加 `string` 时，FSM 会根据需要进行扩充。在构造时对总长度进行较好的猜测可以避免稍后进行微不足道的重新分配。

有七种 `Add` 方法的重载，允许您一次添加一个单词，或一次添加一组、列表或数组的单词。请参阅演示项目，其中包含使用一种 `Add` 形式以及 `FindFirstIn` 和 `FindNext` 的示例。

当 `string` 被添加到集合中时，FSM 会被更新。一旦添加了所有 `string`，就不需要进行额外的准备就可以进行搜索。有三种搜索方法：`FindFirstIn`、`FindNext` 和 `FindAllIn`。前两种用于一次遍历找到的子字符串。第三种方法使用前两种为您构建一个 `CWordPosPairList`。只需将要搜索的文本 `string` 作为 `FindFirstIn` 或 `FindAllIn` 的第一个参数传入，该类就会找到数组中所有 `string` 的所有实例。

`FindFirstIn` 和 `FindNext` 都返回找到标记的被搜索文本中的偏移量。返回时，`size_t &` 参数将被填充，其中包含找到的 `string` 标记在字符串数组中的索引。 `FindAllIn` 返回找到的标记数量，并接受 `CWordPosPairList &` 作为第二个参数。 `CWordPosPairList &` 是一个 STL 列表对。返回的每个对都包含找到的标记的索引和偏移量。请参阅演示程序以了解此用法的示例。

演示程序

应要求，我还提供了一个简单的基于对话框的演示程序，该程序展示了 `Add` 和 `Find` 方法的用法。它使用了接收分隔符 `string` 的 `Add` 形式，将其解析以创建要查找的令牌列表，然后将这些找到的令牌填充到列表框中。

源代码亮点

对于那些好奇但不想下载源代码的人来说，这里有一些最有趣的功能。如果代码和注释不是令人满意的文档，请告诉我。

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// InsertWord
//          put the new word into the FSM
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bool CIVStringSet::InsertWord( LPCTSTR pszWord, unsigned long dwIndex )
{
    bool bRetVal = false ;

    size_t nLen = _tcslen( pszWord ) ;
    if ( m_dwMaxUsedState < MAXDWORD )    	// if we've got over 4 billion states, 
					// something's wrong
    {
        unsigned long dwCurState = 0UL ;
        for ( UINT nChar = 0U ; nChar < nLen ; nChar++ )
        {
            size_t          nIdxChar = 
		min(static_cast<size_t>(pszWord[nChar]), c_nMaxChar) ;
            SEntry        & rEntry   = m_apnFSM[dwCurState][nIdxChar] ;  // the state, 
						// and possibly also an index
            unsigned long   dwState  = rEntry.m_dwState ;
            bool            bNew     = (dwState == 0UL) ;   	// no previous words 
							// have been here

            if ( bNew )
                dwState = ++m_dwMaxUsedState ;  // use a new state number

            // special case - last character of substring
            if ( nChar == ( nLen-1 ) )
            {
                // Put both the state number and the word index in the entry
                rEntry.m_dwState = dwState ;
                rEntry.m_dwIndex = dwIndex ;
                break ;
            }
            else if ( bNew ) // if current entry for this char value and state 
				// is still zero, add to FSM
                rEntry.m_dwState = dwState ;

            dwCurState = dwState ;  // prepare for next iteration
            if ( m_apnFSM.size() <= dwCurState )
                m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFC0, 
			vector( c_nMaxChar ) ) ;
        }

        bRetVal = true ;
    }

    return bRetVal ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindFirstIn
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindFirstIn( const _tstring & strText, size_t & rnFirst )
{
    // The real work is done by FindNext, but we must initialize the starting
    //    character index and string buffer first
    m_nCurTextChar = 0U ;
    m_strSearch = strText ;

    return FindNext( rnFirst ) ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindNext
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindNext( size_t & rnNext )
{
    unsigned long dwCurState = 0UL ; // start in state 0
    UINT nIdx ;
    UINT nStartChar = MAXDWORD ;
    size_t nLen = m_strSearch.length() ;

    // Start with the character following the last one examined
    for ( nIdx = m_nCurTextChar ; nIdx < nLen ; nIdx++ )
    {
        size_t nNextIdx = m_strSearch.find_first_not_of( m_strDelims, nIdx ) ;
        if ( nNextIdx != nIdx )
        {
            // Skip over delimiters
            nStartChar = MAXDWORD ;
            if ( ( nNextIdx != _tstring::npos )
              && ( nNextIdx < nLen )
               )
            {
                dwCurState = 0UL ;
                nIdx = nNextIdx ;
            }
            else
                break ;
        }

        // if we are in state 0, save the index of the starting character,
        //   in case we must backtrack or for the return value
        if ( dwCurState == 0UL )
            nStartChar = nIdx ;

        // follow up the table states
        size_t nIdxChar = min(static_cast<size_t>(m_strSearch[nIdx]), c_nMaxChar) ;
        const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;

        // if we reach an entry with a an index, we may have found a full match
        //    if so, return the word we found
        if ( ( rEntry.m_dwIndex != 0UL )
          && ( ( nIdx == nLen-1 )
            || ( m_strDelims.find_first_of( m_strSearch[nIdx+1] ) != _tstring::npos )
             )
           )
        {
            size_t nIndex = rEntry.m_dwIndex - 1 ;
            ASSERT(nIndex < size());
            rnNext = nIndex ;
            m_nCurTextChar = nIdx + 1 ;  // set the index for the next time
            break ;
        }
        else
            dwCurState = rEntry.m_dwState ;

        if ( dwCurState == 0 )
        {
            nStartChar = MAXDWORD ;  // in case we're about to terminate the loop
            nIdx = m_strSearch.find_first_of( m_strDelims, nIdx ) ; // advance to 
							// the next delimiter
        }
    }

    // return the index of the start of the word, or -1 if no word found
    return (( nIdx < nLen ) ? nStartChar : MAXDWORD) ;
}

更新

2002 年 7 月 1 日 - 添加了 STL 版本
2010 年 4 月 24 日 - 重新修订了 MFC 和 STL 版本