
Ranking Tokens Using TF-IDF in C#

Jesús Utrera


16 Dec 2015

CPOL


Rank the tokens of text documents using the TF-IDF text retrieval technique

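Introduction

Let's see how Wikipedia defines TF-IDF.

«tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.»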
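
The simple ranking function mentioned at the end of the quote can be sketched as follows. This is only an illustration: the QueryRanking class, its Rank method, and the nested-dictionary layout of the weights are assumptions made for this example and are not part of the article's code.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class QueryRanking
{
    // tfidf[document][term] holds a precomputed tf-idf weight (assumed layout).
    public static List<KeyValuePair<string, double>> Rank(
        Dictionary<string, Dictionary<string, double>> tfidf,
        IEnumerable<string> queryTerms)
    {
        List<string> terms = queryTerms.ToList();
        var scores = new List<KeyValuePair<string, double>>();

        foreach (var doc in tfidf)
        {
            double score = 0.0;
            foreach (string term in terms)
            {
                double weight;
                // Terms missing from a document contribute nothing to its score.
                if (doc.Value.TryGetValue(term, out weight))
                {
                    score += weight;
                }
            }
            scores.Add(new KeyValuePair<string, double>(doc.Key, score));
        }

        // Highest score first.
        return scores.OrderByDescending(p => p.Value).ToList();
    }
}
```

Each document's score is simply the sum of the weights of the query terms it contains, so a document that matches more of the query, or matches it with rarer terms, ranks higher.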

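Background

In this tip, we will use TF-IDF to rank the most important words (that is, tokens) in a collection of documents. The code is deliberately simple; for a large number of documents, you should use a more efficient implementation.

Let's Look at the Data Source

To explain this code, I use a collection of documents about diseases (what the disease is, its causes, and so on). The data is stored in a MongoDB database, in a collection named Documents. Each document stores the disease name (Title) and a list of sections, and each section stores a section title (title) and its main text (text). In this example, I use all the text of every section. The collection has about 120 documents, which is not many but is enough to test the example.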

Example of one document
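
The shape of one document can be sketched in plain C# as below. The Document and Section class names and the Title, Sections, and Text properties are the ones the parser code in this tip relies on; everything else, including the SampleData helper and the sample text, is invented for illustration, and the MongoDB mapping attributes of the real classes are left out.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory model of one stored document.
public class Section
{
    public string Title { get; set; }   // section heading
    public string Text { get; set; }    // main text of the section
}

public class Document
{
    public string Title { get; set; }   // disease name
    public List<Section> Sections { get; set; } = new List<Section>();
}

public static class SampleData
{
    // Invented sample content, for illustration only.
    public static Document Build()
    {
        return new Document
        {
            Title = "Influenza",
            Sections = new List<Section>
            {
                new Section { Title = "What is it?", Text = "Influenza is a viral infection." },
                new Section { Title = "Causes", Text = "It is caused by influenza viruses." }
            }
        };
    }
}
```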

Using the Code

Each token has the following structure in our code:

using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

/// <summary>
/// Represents a token in a document
/// </summary>
public class Token
{
    /// <summary>
    /// Document in which the token appears
    /// </summary>
    [BsonRepresentation(BsonType.String)]
    [BsonRequired]
    [BsonElement("Document")]
    public string Document { get; set; }

    /// <summary>
    /// Token of a document
    /// </summary>
    [BsonRepresentation(BsonType.String)]
    [BsonRequired]
    [BsonElement("Token")]
    public string Word { get; set; }

    /// <summary>
    /// Number of times the token appears in the document
    /// </summary>
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("Count")]
    public int Count { get; set; }

    /// <summary>
    /// Number of times the most frequent token of the document appears
    /// </summary>
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("Max")]
    public int Max { get; set; }

    /// <summary>
    /// Normalized frequency of the token in document (Count / Max)
    /// </summary>
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("TF")]
    public double TF { get; set; }

    /// <summary>
    /// Total number of documents in the collection
    /// </summary>
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("DN")]
    public int DN { get; set; }

    /// <summary>
    /// Number of documents where the token is
    /// </summary>
    [BsonRepresentation(BsonType.Int32)]
    [BsonRequired]
    [BsonElement("CN")]
    public int CN { get; set; }

    /// <summary>
    /// IDF [Log10(DN / CN)]
    /// </summary>
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("IDF")]
    public double IDF { get; set; }

    /// <summary>
    /// TF-IDF value of the token in document [TF*IDF]
    /// </summary>
    [BsonRepresentation(BsonType.Double)]
    [BsonRequired]
    [BsonElement("TFIDF")]
    public double TFIDF { get; set; }
}

The class that makes the magic happen is shown below:

using System;
using System.Collections.Generic;
using System.Linq;

public class Parser
{
    public List<Document> Documents { get; set; }

    public Parser()
    {
        this.Initialize();
    }

    /// <summary>
    /// read all the documents from database
    /// </summary>
    private void Initialize()
    {
        DocumentProvider documentProvider = new DocumentProvider();
        this.Documents = documentProvider.GetAll();
    }

    internal void Execute()
    {
        List<Token> tokenList = new List<Token>();

        //Insert the tokens of the documents and count them
        foreach (Document doc in this.Documents)
        {
            foreach (Section secc in doc.Sections)
            {
                List<string> tokens = Generics.ExtractTokens(secc.Text);

                foreach (string word in tokens)
                {
                    if (tokenList.Any(x => x.Document == doc.Title && x.Word == word))
                    {
                        tokenList.First(x => x.Document == doc.Title && x.Word == word).Count++;
                    }
                    else
                    {
                        tokenList.Add(new Token()
                        {
                            Document = doc.Title,
                            Word = word,
                            Count = 1
                        });
                    }
                }
            }
        }

        //Save the other properties
        foreach (Token item in tokenList)
        {
            if (item.Max == 0)
            {
                item.Max = (from elto in tokenList
                               where elto.Document == item.Document
                               select elto.Count).Max();
                item.TF = Convert.ToDouble(item.Count) / Convert.ToDouble(item.Max);
            }
            item.DN = tokenList.GroupBy(x => x.Document).Count();
            item.CN = tokenList.Where(x => x.Word == item.Word).Count();
            // Cast to double so that DN / CN is not truncated by integer division
            item.IDF = Math.Log10((double)item.DN / item.CN);
            item.TFIDF = item.TF * item.IDF;
        }

        //Save in database
        TokenProvider tokenProvider = new TokenProvider();
        tokenList.ForEach(item => tokenProvider.Insert(item));
    }
}

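First, the constructor reads all the documents and stores them in a list. The Execute method then runs the ranking process.

The first step is to create and fill the token list. To do this, we extract all the tokens from every section of every document, and either add a new token to the list or increment the count of a token that is already there.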

foreach (Document doc in this.Documents)
{
    foreach (Section secc in doc.Sections)
    {
        List<string> tokens = Generics.ExtractTokens(secc.Text);

        foreach (string word in tokens)
        {
            if (tokenList.Any(x => x.Document == doc.Title && x.Word == word))
            {
                tokenList.First(x => x.Document == doc.Title && x.Word == word).Count++;
            }
            else
            {
                tokenList.Add(new Token()
                {
                    Document = doc.Title,
                    Word = word,
                    Count = 1
                });
            }
        }
    }
}
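
The article does not show Generics.ExtractTokens, and the Any/First lookups above rescan the whole list for every word. As a rough sketch of both pieces, the block below uses a hypothetical tokenizer (lower-casing and splitting on non-alphanumeric characters, which is an assumption and not necessarily what the original helper does) and counts (document, word) pairs with a dictionary. The TokenCounting class and its method signatures are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenCounting
{
    // Hypothetical stand-in for Generics.ExtractTokens: lower-cases the text
    // and splits it on anything that is not a letter or a decimal digit.
    public static List<string> ExtractTokens(string text)
    {
        var tokens = new List<string>();
        foreach (string part in Regex.Split(text.ToLowerInvariant(), @"[^\p{L}\p{Nd}]+"))
        {
            if (part.Length > 0)
            {
                tokens.Add(part);
            }
        }
        return tokens;
    }

    // Counts occurrences per (document title, word) pair with a dictionary
    // lookup instead of scanning a list for every word.
    public static Dictionary<Tuple<string, string>, int> Count(
        IEnumerable<KeyValuePair<string, string>> titleAndText)
    {
        var counts = new Dictionary<Tuple<string, string>, int>();
        foreach (var pair in titleAndText)
        {
            foreach (string word in ExtractTokens(pair.Value))
            {
                var key = Tuple.Create(pair.Key, word);
                int current;
                counts.TryGetValue(key, out current); // current stays 0 when the key is new
                counts[key] = current + 1;
            }
        }
        return counts;
    }
}
```

With a dictionary, each word costs one hash lookup, which is likely to matter once the collection grows well beyond the 120 documents used here.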

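The second step is to process each token and fill in its remaining properties. We fill in all of them so that the data can be reused in other studies.

To fill in the term frequency, we normalize the token count within the document. This avoids a bias toward long documents. To do so, we divide the token's count by the count of the most frequent token in the same document: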

        Count
TF  =  -------
         Max
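
As a small worked sketch of this formula (the TermFrequency class is a hypothetical helper, not part of the article's code):

```csharp
using System;

public static class TermFrequency
{
    // TF = Count / Max, computed in floating point.
    public static double Compute(int count, int max)
    {
        if (max <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(max));
        }
        return (double)count / max;
    }
}
```

For example, a token that appears 3 times in a document whose most frequent token appears 12 times gets TF = 3 / 12 = 0.25, and the most frequent token itself always gets TF = 1.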

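We obtain the inverse document frequency of a token by dividing the total number of documents by the number of documents in which the token appears, and then taking the logarithm (base 10 in this code) of that quotient: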

             DN
IDF  =  log ----
             CN
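
A minimal sketch of this formula follows; the InverseDocumentFrequency class is a hypothetical helper introduced for illustration. Because DN and CN are integers, the division is done in floating point; an integer division would truncate the quotient before the logarithm is taken.

```csharp
using System;

public static class InverseDocumentFrequency
{
    // IDF = log10(DN / CN), with the division done in floating point.
    public static double Compute(int dn, int cn)
    {
        if (cn <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(cn));
        }
        return Math.Log10((double)dn / cn);
    }
}
```

With DN = 120 and CN = 12, the quotient is 10 and the IDF is 1. A token that appears in every document has a quotient of 1 and therefore an IDF of 0, so its TF-IDF is 0 regardless of how often it occurs.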

All of this is done in the second pass.

//Save the other properties
foreach (Token item in tokenList)
{
    if (item.Max == 0)
    {
        item.Max = (from elto in tokenList
                       where elto.Document == item.Document
                       select elto.Count).Max();
        item.TF = Convert.ToDouble(item.Count) / Convert.ToDouble(item.Max);
    }
    item.DN = tokenList.GroupBy(x => x.Document).Count();
    item.CN = tokenList.Where(x => x.Word == item.Word).Count();
    // Cast to double so that DN / CN is not truncated by integer division
    item.IDF = Math.Log10((double)item.DN / item.CN);
    item.TFIDF = item.TF * item.IDF;
}
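
The loop above recomputes the per-document maximum and the per-word document count for every token. One possible variation, sketched below, computes those aggregates once and then fills in each token. SimpleToken and SecondPass are hypothetical names used only for this example; SimpleToken mirrors the article's Token class without the MongoDB attributes, and the sketch assumes the list holds one entry per (document, word) pair, as the first step produces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the article's Token class (no MongoDB attributes).
public class SimpleToken
{
    public string Document { get; set; }
    public string Word { get; set; }
    public int Count { get; set; }
    public int Max { get; set; }
    public double TF { get; set; }
    public int DN { get; set; }
    public int CN { get; set; }
    public double IDF { get; set; }
    public double TFIDF { get; set; }
}

public static class SecondPass
{
    public static void Fill(List<SimpleToken> tokens)
    {
        // Highest token count in each document.
        Dictionary<string, int> maxByDocument = tokens
            .GroupBy(t => t.Document)
            .ToDictionary(g => g.Key, g => g.Max(t => t.Count));

        // Number of documents each word appears in
        // (valid because there is one entry per document and word).
        Dictionary<string, int> documentsByWord = tokens
            .GroupBy(t => t.Word)
            .ToDictionary(g => g.Key, g => g.Count());

        int documentCount = maxByDocument.Count;

        foreach (SimpleToken item in tokens)
        {
            item.Max = maxByDocument[item.Document];
            item.TF = (double)item.Count / item.Max;
            item.DN = documentCount;
            item.CN = documentsByWord[item.Word];
            item.IDF = Math.Log10((double)item.DN / item.CN);
            item.TFIDF = item.TF * item.IDF;
        }
    }
}
```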

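Finally, we store these results in the database.

Points of Interest

When the process finishes, we have a token ranking for each document. We can see which tokens are most descriptive of a document and which are less so. As a next step, we could use a stop-word list to remove common words, or apply stemming and work only with word stems.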
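
Stop-word removal could look roughly like the sketch below. The StopWords class and its tiny English word list are assumptions made purely for illustration; a real list would be much longer and would need to match the language of the documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWords
{
    // Tiny illustrative list, compared without regard to case.
    private static readonly HashSet<string> Words = new HashSet<string>(
        new[] { "the", "a", "an", "of", "and", "is", "in", "to" },
        StringComparer.OrdinalIgnoreCase);

    // Returns the tokens that are not stop words, in their original order.
    public static List<string> Remove(IEnumerable<string> tokens)
    {
        return tokens.Where(t => !Words.Contains(t)).ToList();
    }
}
```

Applying such a filter to the extracted tokens before counting keeps very common words out of the token list altogether.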

Original Article (in Spanish)

http://jesusutrera.com/articles/article03.html