朴素字符串比较器

Pete O'Hanlon

4.63/5 (14投票s)

2008年4月17日

CPOL

2分钟阅读

42894

240

一个用于执行两个文本块“朴素”比较的类，以查看它们是否看起来相同。

下载源代码 - 4.35 KB

引言

不久前，我们为一位客户编写了一个应用程序，该客户正在举办一个比赛，其中包括一个决胜问题。决胜问题必须是唯一的，因此需要对参赛作品进行去重，以删除重复的匹配项（由于一项古怪的法律，如果您根据决胜问题向某人授予奖品，而另一个人可以证明他们使用了相同的决胜问题，那么您也必须向他们授予奖品，前提是他们满足所有其他获奖标准）。剩余的参赛作品将由一个小组进行评判，以确定哪些作品是最好的。

正如您所能想象的，由于数据捕获错误或参赛者拼写错误的可能性，决胜问题是手动输入的，这可能会带来潜在的风险。为了解决这个问题，客户要求一个朴素的比较器，该比较器使用一些简单的规则来识别潜在的重复条目。本文是该比较器的版本。

基础知识

首先需要做的是加载要进行比较的 string ，并设置在执行比较时是否要删除 HTML 标签。这是我添加到原始设计中的一个小扩展，以便它可以用于检测 CodeProject 上的文章是否只是基本模板的副本。在原始版本中，HTML 标签没有被删除。

基本算法是使用以下正则表达式删除 HTML 标签

</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>

这个表达式只是获取尖括号内的所有文本，然后应用程序使用 Regex.Replace 将它们替换为空字符串。

使用类似的技术，所有元音、空格和特殊字符都会从文本中删除。然后可以比较文本。

用于实际删除“违规”标签的方法如下所示

/// <summary>
/// Break the relevant text item down into a string 
/// that is stripped of all HTML tags, vowels and 
/// whitespace characters.
/// </summary>
/// <param name="source">The item that needs to be "cleansed" 
/// using the relevant regular expressions.</param>
/// <returns>The "cleansed" string.</returns>
/// <remarks>This implementation starts off by removing all HTML tags, 
/// before removing any vowels and
/// whitespace characters.</remarks>
private string BreakdownText(string source)
{
  if (string.IsNullOrEmpty(source))
    throw new ArgumentNullException("source");
  // First of all, strip the tags out of the source.
  if (_tagType == StripTagType.StripTags)
  {
    source = ReplaceCharacters(source, _tagRemover, string.Empty);
  }
  source = ReplaceCharacters(source, _chars, string.Empty);
  return ReplaceCharacters(source, _vowels, string.Empty).ToLowerInvariant();
}
/// <summary>
/// Method to replace one set of characters with another.
/// </summary>
/// <param name="source">The source text to replace the characters in.</param>
/// <param name="regexText">The regular expression defining the characters to replace.
/// </param>
/// <param name="replaceText"></param>
/// <returns></returns>
private string ReplaceCharacters(string source, string regexText, string replaceText)
{
  Regex regex = new Regex(regexText,
  RegexOptions.IgnoreCase | RegexOptions.Multiline | 
		RegexOptions.IgnorePatternWhitespace);
  return regex.Replace(source, replaceText);
}

删除项目的顺序很重要。您需要在删除非字母字符和元音字符之前删除 HTML 标签。顺便说一下，_vowels 和 _char 正则表达式分别定义为 [eiouy] 和 [^b-z]。

如果想要比较的两个清理后的文本片段匹配，那么您就找到了一个重复项。但是，如我们所说，始终存在拼写错误的可能，因此比较器会进一步检查两个文本片段，以识别不匹配的字符。如果字符数超出特定阈值，那么我们知道这两个文本片段足够不同，可以将其视为唯一。

/// <summary>
/// This method calculates the number of unmatched characters 
/// in the source string and determines
/// whether or not it falls within the target threshold.
/// </summary>
/// <param name="source">The source to check.</param>
/// <param name="target">The target string to check against.</param>
/// <returns>True if the string matches (or matches to within the tolerance), 
/// false otherwise.</returns>
private bool Compare(string source, string target)
{
  int targetLength = target.Length; 
  StringBuilder unmatched = new StringBuilder();
  while (target.Length > 0 && source.Length > 0 && unmatched.Length <= _diffCharacters)
  {
    if (target[0] == source[0])
    {
      target = target.Remove(0, 1);
      source = source.Remove(0, 1);
    }
    else
    {
      // Now, look for the next instance of this letter in source.
      int index = source.IndexOf(target[0]);
      if (index > 0)
      {
        unmatched.Append(source.Substring(0, index));
        source = source.Remove(0, index);
      }
    }
  }
  return unmatched.Length <= _diffCharacters;
}

就这样了 - 一个朴素的比较器。要运行它，您只需要执行以下操作

static void Main(string[] args)
{
  string s = "<H1>This is a sample.</H1>";
  NaiveComparer c = new NaiveComparer(s, StripTagType.StripTags, 10);
  if (c.Compare("Hello" + s + "aeiou123"))
  {
    Console.WriteLine("It matches");
  }
  if (!c.Compare("Hellot there from this completely arbitrary " + 
    "and totally different way of matching <<<< <<<<<< " + s + 
    ">>>>>>>>>>>>>>-----"))
    Console.WriteLine("No match");
  Console.ReadLine();
}

我希望这对您有所帮助，在您需要比较项目以查看它们是否足够不同以允许通过的情况下。

编辑历史

2008年5月7日修复了 Chris Maunder 发现的错误。感谢 Chris。