用于 Knuth-Moris-Pratt (KMP) 算法的 .NET 实现

Nairooz Nilafdeen

4.67/5 (5投票s)

2009年4月4日

CPOL

3分钟阅读

49011

1308

用于 Knuth-Moris-Pratt (KMP) 算法的 .NET 实现。

下载源代码 - 4.79 KB

引言

本文关于 Knuth-Moris-Pratt 算法 (KMP)。 KMP 是一种字符串匹配算法，允许您在 O(n) 时间和 O(m) 预处理时间内搜索字符串中的模式，其中 n 是文本长度，m 是模式长度。

背景

KMP 算法首先计算一个转换数组，该数组指示发生不匹配时模式移动的位数。

实现

PrefixArray 类采用一个 string 参数（即模式），并负责计算前缀函数并返回一个包含转换索引的数组。

在计算前缀数组时，已注意最大程度地提高性能；因此，已在多个位置调整了一般实现。

for(int i = 1 ; i < pattern.Length ; i++) 
{ 
    temp        = SubCharArray(i, patternArray);
    hArray[i]   = GetPrefixLegth(temp, firstChar);
}

上面的代码显示了计算前缀函数；代码中的循环遍历模式并计算每个索引的前缀函数。（注意：迭代从 1 开始，因为我们知道索引 0 处的转换是 0。）

temp 数组表示从模式的第 0^th 个索引到当前循环索引的所有字符。此字符数组被传递到 GetPrefixLength() 函数，该函数实际计算前缀函数。

接下来，我们有 GetPrefixLength() 函数的代码

private static int GetPrefixLegth(char[] array, char charToMatch)
{ 
    for(int i = 2; i < array.Length; i++)
    {
     //if it is a match
       if(array[i] == charToMatch)
       { 
              if(IsSuffixExist(i, array))
              {
              //Return on the first prefix which is the largest
              return array.Length - i;
              }
       }
    }
    return 0;
}

此函数接受我们讨论的数组作为参数，以及模式的第一个字符 (charToMatch)。

迭代并搜索该数组，以查找与模式的第一个字符的匹配项（需要匹配第一个字符，前缀以第一个字符开头）。如果该数组中存在第一个字符，那么我们计算作为模式前缀的最长后缀。

显然，第一次匹配会给我们提供作为模式前缀的最长后缀。

下面的代码描述了函数 IsSufficExists()

private static bool IsSuffixExist(int index, char[] array)
{                    
  //Keep track of the prefix index
  int k = 0;
  for(int i = index; i < array.Length ; i++)
  {
      //A mismatch so return
       if(array[i] != array[k]){ return false;}
       k++;
  }
  return true;
}

接下来，我们将研究实际使用模式的转换数组在字符串中查找模式出现的代码段。

for(int i = 0; i < charArray.Length; i++)
{
  //If there is a match
  if(charArray[i] == patternArray[k])
  { 
       //Move the pattern index by one
       k++; 
  }
  else
  {
       //There is a mismatch..so move the pattern 

       //The amount to move through the pattern 
       int prefix = transitionArray[k];
       //if the current char does not match
       //the char in the pattern that concides
       //when moved then shift the pattern entirley, so 
       //we dont make a unnecssary comparision
       if(prefix + 1 > patternArray.Length && 
          charArray[i] != patternArray[prefix + 1])
       {
              k = 0;
       }
       else
       {
              k = prefix;
       }
   }

循环遍历字符串以搜索模式。如果当前字符与模式索引中的字符匹配（迭代中应匹配的字符索引），则存在匹配项，我们将模式的索引递增并继续。

如果存在不匹配，那么我们获取特定索引的转换索引，并查看转换索引 + 1 处的字符是否与正在匹配的字符匹配（这样做是为了我们不需要不必要地再次匹配字符）。如果存在匹配项，则我们将模式索引与转换索引一起移动；否则，我们将模式索引与 0 一起移动。

接下来，我们查看模式索引是否等于模式长度；如果是，那么我们有一个匹配项。此代码段如下所示

//A complet match, if kis
//equal to pattern length
if(k == patternArray.Length)
{
    //Add it to our result
    result.Add(i - (patternArray.Length - 1));
    //Set k as if the next character is a mismatch
    //therefore we don’t miss out any other containing pattern
    k  = transitionArray[k - 1];
}

使用代码

为了使用该代码，您只需构建提供的源代码中的文件，并使用 KMPUtil 类的静态方法 GetAllOccurences(string pattern, string text)，如下所示

KMPUtil.GetAllOccurences("ab", "abhsdsabsbabaa");

此方法将返回一个 ArrayList，其中包含模式在文本中出现的索引（从零开始）。