模糊搜索

Günther M. FOIDL

4.95/5 (39投票s)

2009 年 6 月 1 日

CPOL

1分钟阅读

152712

6134

一个简单的模糊字符串搜索实现。

引言

本文描述了一个简单的模糊（字符串）搜索实现。它可以用于近似字符串匹配（有关更多信息，请参阅 http://en.wikipedia.org/wiki/Fuzzy_string_searching）。

存在其他近似字符串搜索算法（例如 Soundex），但这些算法的实现起来并不那么容易。本文中的算法易于实现，可用于以简单的方式使用近似字符串搜索的任务。

使用 List<string> 进行搜索，因此搜索数据库非常容易。

背景

该算法使用 Levenshtein 距离来确定词列表中字符串与要查找的单词的匹配程度。有关 Levenshtein 距离的信息可以在 http://en.wikipedia.org/wiki/Levenshtein_distance 上找到。

使用代码

以下示例将展示如何简单地使用该类。

static void Main(string[] args)
{
    string word = "Code Project";
    List<string> wordList = new List<string>
    {
        "Code Project",
        "Code project",
        "codeproject",
        "Code Projekt",
        "Kode Project",
        "Other Project"
    };

    List<string> foundWords = FuzzySearch.Search(
        word,
        wordList,
        0.70);

    foundWords.ForEach(i => Console.WriteLine(i));
    Console.ReadKey();
}

输出

Code Project
Code project
codeproject
Code Projekt
Kode Project

实现

展示了一个基本的方法。可以使用更优化的算法代替 Levenshtein 距离 - 但这里，为了清晰起见，提供了一个相当简单的实现。

Levenshtein 距离

为了计算 Levenshtein 距离，我使用以下算法

public static int LevenshteinDistance(string src, string dest)
{
    int[,] d = new int[src.Length + 1, dest.Length + 1];
    int i, j, cost;
    char[] str1 = src.ToCharArray();
    char[] str2 = dest.ToCharArray();

    for (i = 0; i <= str1.Length; i++)
    {
        d[i, 0] = i;
    }
    for (j = 0; j <= str2.Length; j++)
    {
        d[0, j] = j;
    }
    for (i = 1; i <= str1.Length; i++)
    {
        for (j = 1; j <= str2.Length; j++)
        {

            if (str1[i - 1] == str2[j - 1])
                cost = 0;
            else
                cost = 1;

            d[i, j] =
                Math.Min(
                    d[i - 1, j] + 1,              // Deletion
                    Math.Min(
                        d[i, j - 1] + 1,          // Insertion
                        d[i - 1, j - 1] + cost)); // Substitution

            if ((i > 1) && (j > 1) && (str1[i - 1] == 
                str2[j - 2]) && (str1[i - 2] == str2[j - 1]))
            {
                d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }
    }

    return d[str1.Length, str2.Length];
}

搜索过程

在搜索过程中，对于词列表中的每个单词，都计算 Levenshtein 距离，并以此距离计算一个分数。该分数代表字符串匹配的程度。输入参数 fuzzyness 确定字符串可以差异的大小。

public static List<string> Search(
    string word,
    List<string> wordList,
    double fuzzyness)
{
    List<string> foundWords = new List<string>();

    foreach (string s in wordList)
    {
        // Calculate the Levenshtein-distance:
        int levenshteinDistance =
            LevenshteinDistance(word, s);

        // Length of the longer string:
        int length = Math.Max(word.Length, s.Length);

        // Calculate the score:
        double score = 1.0 - (double)levenshteinDistance / length;

        // Match?
        if (score > fuzzyness)
            foundWords.Add(s);
    }
    return foundWords;
}

LINQ 变体

这段代码也可以用 LINQ 编写。

public static List<string> Search(
    string word,
    List<string> wordList,
    double fuzzyness)
{
    // Tests have prove that the !LINQ-variant is about 3 times
    // faster!
    List<string> foundWords =
        (
            from s in wordList
            let levenshteinDistance = LevenshteinDistance(word, s)
            let length = Math.Max(s.Length, word.Length)
            let score = 1.0 - (double)levenshteinDistance / length
            where score > fuzzyness
            select s
        ).ToList();

    return foundWords;
}

历史

2009 年 6 月 1^日：初始发布。