模型中级开发 Visual Studio .NET C#

布尔检索模型

Samir Kunwar

4.50/5 (3投票s)

2012 年 5 月 1 日

CPOL

4分钟阅读

87723

3543

使用布尔检索模型进行信息检索

下载源代码 - 4.7 KB

引言

模型是对实际过程的理想化或抽象。信息检索模型可以描述计算过程。

例如

如何对文档进行排序
请注意，如何存储文档或索引已实现。

信息检索（IR）的目标是向用户提供那些能够满足其信息需求的文档。检索模型可以尝试描述人类过程，例如信息需求、交互。

检索模型可以分为

布尔检索模型
向量空间模型
概率模型
基于信念网络的模型

信息检索的布尔模型是一个经典的（IR）模型，也是第一个被广泛采用的模型。它目前被几乎所有商业 IR 系统使用。

精确匹配 vs 最佳匹配

在精确匹配中，查询指定精确的标准。每个文档要么匹配，要么不匹配查询。在精确匹配中检索到的结果是一组文档（没有排名）。

在最佳匹配中，查询描述了匹配度好或最佳的文档。在这种情况下，结果是一个文档的排名列表。这里我将要处理的布尔模型是最常见的精确匹配模型。

布尔模型的基本假设

索引词要么存在 (1) 要么不存在 (0) 于文档中
所有索引词都对信息需求提供同等的证据。
查询是索引词的布尔组合。

X AND Y：表示同时包含 X 和 Y 的文档
X OR Y：表示包含 X 或 Y 的文档
NOT X：表示不包含 X 的文档

布尔查询示例

用户的信息需求：有兴趣了解珠穆朗玛峰和尼泊尔

用户布尔查询：珠穆朗玛峰 AND 尼泊尔

实施部分

输入集合示例

Doc1= 英语教程和快车道

Doc2 = 学习潜在语义索引

Doc3 = 关于语义索引的书

Doc4 = 结构和语义索引的进展

Doc5 = 潜在结构的分析

查询问题：advance 和 structure AND NOT analysis

布尔模型索引构建

首先，我们构建词-文档关联矩阵，该矩阵表示所有不同词及其在每个文档中的存在的列表（关联向量）。如果文档包含该词，则关联向量为 1，否则为 0。

词/文档	Doc1	Doc2	Doc3	Doc4	Doc5
English	1	0	0	0	0
教程	1	0	0	0	0
快速	1	0	0	0	0
跟踪	1	0	0	0	0
书籍	0	1	0	0	0
Semantic	0	1	1	1	0
分析	0	1	0	0	1
学习	0	0	1	0	0
Latent	0	0	1	0	1
索引	0	0	1	1	0
高级	0	0	0	1	0
结构体	0	0	0	1	1

现在我们有了每个词的 0/1 向量。为了回答查询，我们取advance、structure 和 analysis的向量，对最后一个向量取补，然后进行按位 AND 操作。

Doc1	Doc2	Doc3	Doc4	Doc5
0	0	0	1	0
0	0	0	1	1	(AND)
0	0	0	1	0
1	0	1	1	0	(NOT analysis)
0	0	0	1	0
			Doc4

因此，这里检索到 Doc4。

深入代码

文档分词和索引

分词是将输入字符序列分成称为标记的部分。在这里，我使用termsCollection 来存储从输入文档中检索到的标记，并使用 HasSet 集合类型distinctTerm 来创建标记的索引，这确保了集合类型仅包含不同的标记。

//read all the text document on the specified directory
foreach (string file in Directory.EnumerateFiles("F:\\IR\\", "*.txt"))
{
    string contents = File.ReadAllText(file);
    String[] termsCollection = RemoveStopsWords(contents.ToUpper().Split(' '));
    foreach (string term in termsCollection)
    {
        //prepeare distinct terms collection
        //remove stop words
        if (!stopWords.Contains(term)) 
        {
            distinctTerm.Add(term);
        }
    }

    //add document and their terms collection
    documentCollection.Add(file, termsCollection.ToList());
    //add document and its content for displaying the search result
    documentContentList.Add(count, contents);
    count++;
}

删除停用词

大多数搜索引擎为了节省磁盘空间或加快搜索结果，不考虑非常常见的词。这些被过滤掉的词被称为“停用词”，这里我使用了通用集合 stopWords，它包含一个停用词列表，用于从文档中删除它们。

//removes all stop words
public static string[] RemoveStopsWords(string[] str)
{
    List<string> terms = new List<string>();
    foreach (string term in str)
    {
        if (!stopWords.Contains(term))
        {
            terms.Add(term);
        }
    }
    return terms.ToArray();
}

词 - 文档关联矩阵创建

如果一个词出现在文档中，则其词关联向量为 1，否则为 0。

函数 GetTermDocumentIncidenceMatrix 返回所有不同词的词文档关联矩阵，这些词存储在 HashSet 集合中。我使用了 Dictionary 集合类型 termDocumentIncidenceMatrix来存储词及其在每个文档中的关联向量。

//prepares Term Document Incidence Matrix
public static Dictionary<string, List<int>> GetTermDocumentIncidenceMatrix(HashSet<string> 
       distinctTerms,Dictionary<string,List<string>> documentCollection)
{
    Dictionary<string, List<int>> termDocumentIncidenceMatrix = 
                new Dictionary<string, List<int>>();
    List<int> incidenceVector = new List<int>();
    foreach (string term in distinctTerms)
    {
        //incidence vector for each terms
        incidenceVector = new List<int>();
        foreach(KeyValuePair<string ,List<string>> p in documentCollection)
        {
            
            if (p.Value.Contains(term))
            {
                //document contains the term
                incidenceVector.Add(1); 

            }
            else
            {
                //document do not contains the term
                incidenceVector.Add(0); 
            }
        }
        termDocumentIncidenceMatrix.Add(term, incidenceVector);

    }
    return termDocumentIncidenceMatrix;
}

检索单个词的关联向量

//returns term incidence vector
public static List<int> GetTermIncidenceVector(string term)
{
    return termDocumentIncidenceMatrix[term.ToUpper()];
}

查询处理器

字符串变量 bitWiseOp 保存了查询中存在的布尔运算符。通常，词 but 被视为在语义上等同于布尔 AND 运算符。集合 previousTermIncidenceV 和 nextTermIncidenceV 用于保存出现在查询中布尔运算符前后每个词的词关联向量。

//process the boolean query
public static List<int> ProcessQuery(string query)
{
    //query boolean operator
    string bitWiseOp = string.Empty; 
    string[] queryTerm = RemoveStopsWords(query.ToUpper().Split(''));
    //remove query term that doesnot appears on document collection
    FilterQueryTerm(ref queryTerm);
    List<int> previousTermIncidenceV = null;
    List<int> nextTermsIncidenceV = null;
    //holds the bitwise operation result
    List<int> resultSet = null;
    //On query X AND Y, X is previousTerm term and Y is nextTerm
    Boolean hasPreviousTerm = false; 
    Boolean hasNotOperation = false;
    foreach (string term in queryTerm)
    {
        //is a term
        if (!booleanOperator.Contains(term)&&!term.Equals("BUT"))
        {
            //query case: structure AND NOT analysis
            if (hasNotOperation) 
            {
                
                if (hasPreviousTerm)
                {
                    nextTermsIncidenceV = ProcessBooleanOperator("NOT", 
                      GetTermIncidenc eVector(term), nextTermsIncidenceV);
                }
                //query case: eg.NOT analysis
                else 
                {
                    previousTermIncidenceV = ProcessBooleanOperator("NOT", 
                      GetTermIncid enceVector(term), nextTermsIncidenceV);
                    resultSet = previousTermIncidenceV; 
                }
                hasNotOperation = false;
            }
            else if (!hasPreviousTerm)
            {
                previousTermIncidenceV = GetTermIncidenceVector(term);
                resultSet = previousTermIncidenceV;
                hasPreviousTerm = true; //
            }
            else
            {
                
                nextTermsIncidenceV = GetTermIncidenceVector(term);
            }
        }
        else if (term.Equals("NOT"))
        {
            //indicates that the  term in the next iteration should be complemented.
            hasNotOperation = true;
        }
        else
        {
            //'BUT' also should be evaluated as AND eg. structure BUT
            //NOT semantic should be evaluated as structure AND NOT semantic
            if (term.Equals("BUT")) 
            {
                bitWiseOp = "AND";
            }
            else
            bitWiseOp = term;
        }

        if (nextTermsIncidenceV != null && !hasNotOperation)
        {
            resultSet = ProcessBooleanOperator(bitWiseOp, 
                             previousTermIncidenceV, nextT ermsIncidenceV);
            previousTermIncidenceV = resultSet;
            hasPreviousTerm = true;
            nextTermsIncidenceV = null;
        }
    }

    return resultSet;
}

过滤查询词

private static void FilterQueryTerm(ref string[] str)
{
     List<string> _queryTerm = new List<string>();            
     foreach (string queryTerm in str) 
     {
         if (queryTerm.ToUpper().Equals("BUT") ||termDocumentIncidenceMatrix.ContainsKey(queryTerm.ToUpper()) ||booleanOperator.Contains(queryTerm))
            {
                    _queryTerm.Add(queryTerm);
                   
            }
      }
      str = _queryTerm.ToArray();
}

处理布尔运算符

public static List<int> ProcessBooleanOperator(string op, 
          List<int> previousTermV,List<int> nextTermV)
{
    List<int> resultSet = new List<int>();
    if(op.Equals("NOT"))
    {
        foreach(int a in previousTermV)
        {
            if (a == 1)
            {
                resultSet.Add(0);
            }
            else
            {
                resultSet.Add(1);
            }
        }
    }
    else if (op.ToUpper().Equals("AND")) //bitwise AND operation
    {
        for (int a = 0; a < previousTermV.Count; a++)
        {
            if (previousTermV[a] == 1 && nextTermV[a] == 1)
            {
                resultSet.Add(1);
            }
            else
            {
                resultSet.Add(0);
            }
        }
    }
    else if (op.ToUpper().Equals("OR")) //bitwise OR operation
    {
        for (int a = 0; a < previousTermV.Count; a++)
        {
            if (previousTermV[a] == 0 && nextTermV[a] == 0)
            {
                resultSet.Add(0);
            }
            else
            {
                resultSet.Add(1);
            }
        }
    }
    return resultSet;
}