Parsing Sentences and Building Text Statistics in C#

salysle

June 4, 2008

CPOL

Introduction

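This article describes three approaches to parsing sentences from a body of text; presenting all three illustrates the trade-offs between them. The demonstration application also shows one way to generate sentence count, word count, and character count statistics for a body of text.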

Figure 1: The test application running.

The three approaches to parsing sentences from a body of text are:

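Parse Reasonable: splits the text on typical sentence terminators and retains each sentence's terminator.
Parse Best: splits the text with a regular expression and retains each sentence's terminator.
Parse Without Endings: splits the text on typical sentence terminators and does not retain the terminator as part of the sentence.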
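
The difference between dropping and retaining terminators can be seen in a few lines. The following is a minimal sketch, not the article's code: the class and method names are hypothetical, and the regular expression is a deliberately simplified stand-in for the one used later in the article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the article's download.
public static class SplitComparisonDemo
{
    // Splitting ON the terminator characters discards them.
    public static string[] SplitDroppingTerminators(string text)
    {
        char[] terminators = { '.', '?', '!' };
        string[] parts = text.Split(terminators,
            StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < parts.Length; i++)
            parts[i] = parts[i].Trim();
        return parts;
    }

    // Splitting on the whitespace AFTER a terminator keeps it, because
    // the lookbehind checks for the terminator without consuming it.
    // (Simplified pattern, assumed here for illustration only.)
    public static string[] SplitKeepingTerminators(string text)
    {
        return Regex.Split(text.Trim(), @"(?<=[.!?])\s+");
    }

    public static void Main()
    {
        string sample = "It works. Does it? Yes!";
        // prints: It works | Does it | Yes
        Console.WriteLine(string.Join(" | ", SplitDroppingTerminators(sample)));
        // prints: It works. | Does it? | Yes!
        Console.WriteLine(string.Join(" | ", SplitKeepingTerminators(sample)));
    }
}
```

Once `String.Split` has removed the terminators, they can only be recovered by going back to the original text, which is what the Parse Reasonable approach attempts.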

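The demonstration application contains some default text in a `textbox` control, three buttons that parse the text using one of the three approaches above, and three label controls that display summary statistics generated on the body of text. Once the application is running, clicking any of the buttons displays each parsed sentence in the `listbox` control at the bottom of the form and shows the summary statistics in the three labels in the upper right corner of the form.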

Getting Started

To get started, unzip the included project and open the solution in the Visual Studio 2008 environment. In the Solution Explorer, you should see these files (Figure 2).

Figure 2: Solution Explorer.

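As Figure 2 shows, there is a single WinForms project containing a single form. All of the code required by this application is contained in that form's code.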

Main Form (Form1.cs)

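The application's main form, `Form1`, contains all of the necessary code. The form holds default text in a `textbox` control and three buttons, one for each of the three functions used to parse the body of text into a collection of `string` values, one entry per sentence. You can replace, remove, or add to the text in the `textbox` control to run the methods against your own text. Three label controls display the summary statistics (sentence, word, and character counts) for the text contained in the `textbox` control. The summary statistics are updated each time the text is parsed into sentences.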

If you open the code view in the IDE, you will see that the code file begins with the following library imports:

using System;
using System.Collections;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;

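Note that the defaults have been altered to include a reference to the regular expressions library (`System.Text.RegularExpressions`). `System.Collections` is also imported, since the code uses `ArrayList`.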

Following the imports, the namespace, class, and constructor are defined.

namespace SentenceParser
{
    /// <summary>
    /// Demonstrate three approaches to parsing
    /// a body of text into sentences and also
    /// demonstrates an approach to building
    /// statistics on the text to include the
    /// number of sentences, the number of
    /// words and the number of characters
    /// used in the text.
    /// </summary>
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

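Next up is a region entitled "Best Sentence Parser"; it contains a function named `SplitSentences` that accepts a `string` as an argument. This approach generally yields the best results when parsing sentences, but it can return inaccurate values if the text contains errors (for example, a sentence that begins with a lowercase letter is not split from the sentence before it). The region also contains a button click event handler that calls the `SplitSentences` function.

The code is annotated; reading the comments explains what is going on within the function.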

#region Best Sentence Parser

        /// <summary>
        /// This is generally the most accurate approach to
        /// parsing a body of text into sentences to include
        /// the sentence's termination (e.g., the period,
        /// question mark, etc).  This approach will handle
        /// duplicate sentences with different terminations.
        ///
        /// </summary>
        /// <param name="sSourceText"></param>
        /// <returns></returns>
        private ArrayList SplitSentences(string sSourceText)
        {
            // create a local string variable
            // set to contain the string passed it
            string sTemp = sSourceText;

            // create the array list that will
            // be used to hold the sentences
            ArrayList al = new ArrayList();

            // split the sentences with a regular expression
            string[] splitSentences =
                Regex.Split(sTemp,
                    @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

            // loop the sentences
            for (int i = 0; i < splitSentences.Length; i++)
            {
                // clean up the sentence one more time, trim it,
                // and add it to the array list
                string sSingleSentence =
                    splitSentences[i].Replace(Environment.NewLine,
                    string.Empty);
                al.Add(sSingleSentence.Trim());
            }

            // update the statistics displayed on the text
            // characters
            lblCharCount.Text = "Character Count: " +
                GenerateCharacterCount(sTemp).ToString();
            // sentences
            lblSentenceCount.Text = "Sentence Count: " +
                GenerateSentenceCount(splitSentences).ToString();
            // words
            lblWordCount.Text = "Word Count: " +
                GenerateWordCount(al).ToString();

            // return the arraylist with
            // all sentences added
            return al;
        }

        /// <summary>
        /// Calls the SplitSentences (best approach) method
        /// to split the text into sentences and displays
        /// the results in a list box
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void btnParseBest_Click(object sender, EventArgs e)
        {
            lstSentences.Items.Clear();

            ArrayList al = SplitSentences(txtParagraphs.Text);
            for (int i = 0; i < al.Count; i++)
                //populate a list box
                lstSentences.Items.Add(al[i].ToString());
        }

#endregion
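
To see what the regular expression does outside of the form, the following console sketch applies the same pattern to a short sample. The class name, the use of `List<string>` in place of `ArrayList`, and the sample text are assumptions made for illustration; only the pattern itself comes from the listing above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern is taken from SplitSentences.
public static class RegexSplitDemo
{
    // Split on whitespace that (a) follows a letter, digit, or quote plus
    // '.', '!' or '?', and (b) precedes an uppercase letter. Both checks
    // are lookarounds, so nothing but the whitespace is consumed and each
    // sentence keeps its own terminator.
    private const string Pattern =
        @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])";

    public static List<string> Split(string text)
    {
        List<string> sentences = new List<string>();
        foreach (string piece in Regex.Split(text, Pattern))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
                sentences.Add(trimmed);
        }
        return sentences;
    }

    public static void Main()
    {
        string sample =
            "The cost is 3.5 dollars. Is that fair? Yes! it is. Done.";
        foreach (string s in Split(sample))
            Console.WriteLine(s);
    }
}
```

For this sample the split yields four entries. The decimal point in "3.5" is not followed by whitespace, so it is left alone, and "Yes! it is." stays in one piece because the next word starts with a lowercase letter, which is the kind of input error that makes this approach return inaccurate results.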

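Next up is a region entitled "Reasonable Sentence Parser"; it contains a function named `ReasonableParser` that accepts a `string` as an argument. This approach generally yields reasonable results, but it does not apply the correct sentence terminator if the input `string` contains duplicate sentences with different terminators, because it looks up only the first occurrence of each sentence. This could be resolved by using a recursive function to walk through each instance of the duplicate sentence, but the approach shown in the previous code region is less work. The region also contains a button click event handler that calls the `ReasonableParser` function.

The code is annotated; reading the comments explains what is going on within the function.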

#region Reasonable Sentence Parser

        /// <summary>
        /// This does a fair job of parsing the sentences
        /// unless there are duplicate sentences;
        /// you'd have to resort to recursion in order
        /// to get through the issue of multiple duplicate sentences.
        /// </summary>
        /// <param name="sTextToParse"></param>
        /// <returns></returns>
        private ArrayList ReasonableParser(string sTextToParse)
        {
            ArrayList al = new ArrayList();

            // get a string from the contents of a textbox
            string sTemp = sTextToParse;
            sTemp = sTemp.Replace(Environment.NewLine, " ");

            // split the string using sentence terminations
            // things that end a sentence
            char[] arrSplitChars = { '.', '?', '!' };

            //do the split
            string[] splitSentences = sTemp.Split(arrSplitChars,
            StringSplitOptions.RemoveEmptyEntries);

            // loop the array of splitSentences
            for (int i = 0; i < splitSentences.Length; i++)
            {
                // find the position of each sentence in the
                // original paragraph and get its termination ('.', '?', '!');
                // index into the same string that was searched, and append
                // nothing if the text ends without a terminator
                int pos = sTemp.IndexOf(splitSentences[i]);
                int endPos = pos + splitSentences[i].Length;
                string c = endPos < sTemp.Length
                    ? sTemp[endPos].ToString()
                    : string.Empty;

                // since this approach looks only for the first instance
                // of the string, it does not handle duplicate sentences
                // with different terminations.  You could expand this
                // approach to search for later instances of the same
                // string to get the proper termination but the previous
                // method of using the regular expression to split the
                // string is reliable and less bothersome.

                // add the sentences termination to the end of the sentence
                al.Add(splitSentences[i].ToString().Trim() + c.ToString());
            }

            // Update the show of statistics
            lblCharCount.Text = "Character Count: " +
                GenerateCharacterCount(sTemp).ToString();

            lblSentenceCount.Text = "Sentence Count: " +
                GenerateSentenceCount(splitSentences).ToString();

            lblWordCount.Text = "Word Count: " +
                GenerateWordCount(al).ToString();

            return al;
        }

        /// <summary>
        /// Calls the ReasonableParser method and
        /// displays the results
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void btnParseReasonable_Click(object sender, EventArgs e)
        {
            lstSentences.Items.Clear();
            ArrayList al = ReasonableParser(txtParagraphs.Text);

            for (int i = 0; i < al.Count; i++)
            {
                lstSentences.Items.Add(al[i].ToString());
            }
        }

#endregion
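
The duplicate-sentence problem is easy to reproduce in isolation. The sketch below is an illustration under stated assumptions, not the article's code: `ParseFirstMatch` mimics the first-occurrence lookup used by `ReasonableParser`, and `ParseWithOffset` is one possible alternative to the recursion mentioned above, a single scan that remembers where the previous sentence ended. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class illustrating the duplicate-sentence issue.
public static class DuplicateSentenceDemo
{
    private static readonly char[] Terminators = { '.', '?', '!' };

    // Mimics ReasonableParser: split, then look up the FIRST occurrence
    // of each fragment to find the character that followed it.
    public static List<string> ParseFirstMatch(string text)
    {
        List<string> sentences = new List<string>();
        string[] parts = text.Split(Terminators,
            StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            int end = text.IndexOf(part, StringComparison.Ordinal) + part.Length;
            string terminator = end < text.Length
                ? text[end].ToString()
                : string.Empty;
            sentences.Add(part.Trim() + terminator);
        }
        return sentences;
    }

    // Alternative: scan forward from a running offset, so every sentence
    // is paired with the terminator that actually follows it.
    public static List<string> ParseWithOffset(string text)
    {
        List<string> sentences = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            int end = text.IndexOfAny(Terminators, start);
            // no terminator left: take the rest of the text
            int length = (end < 0 ? text.Length : end + 1) - start;
            string fragment = text.Substring(start, length).Trim();
            start += length;

            // skip fragments that contain nothing but terminators/spaces
            if (fragment.Trim(Terminators).Trim().Length > 0)
                sentences.Add(fragment);
        }
        return sentences;
    }

    public static void Main()
    {
        string sample = "Really? Really! Really.";
        // prints: Really? Really! Really!   (last terminator is wrong)
        Console.WriteLine(string.Join(" ", ParseFirstMatch(sample).ToArray()));
        // prints: Really? Really! Really.
        Console.WriteLine(string.Join(" ", ParseWithOffset(sample).ToArray()));
    }
}
```

In the sample, the second and third fragments are both " Really", so the first-match lookup finds the same position twice and assigns "!" to both.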

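Next up is a region entitled "Parse Without Sentence Terminations"; it contains a function named `IDontCareHowItEndsParser` that accepts a `string` as an argument. This approach generally yields good results when parsing sentences, but it does not add the terminator to the parsed sentences; it is a good choice if you do not care which terminator ends each sentence. The region also contains a button click event handler that calls the `IDontCareHowItEndsParser` function.

The code is annotated; reading the comments explains what is going on within the function.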

#region Parse Without Sentence Terminations

        /// <summary>
        /// If you don't care about retaining the sentence
        /// terminations, this approach works fine.  This
        /// will return an array list containing all of the
        /// sentences contained in the input string but
        /// each sentence will be stripped of its termination.
        /// </summary>
        /// <param name="sTextToParse"></param>
        /// <returns></returns>
        private ArrayList IDontCareHowItEndsParser(string sTextToParse)
        {
            string sTemp = sTextToParse;
            sTemp = sTemp.Replace(Environment.NewLine, " ");

            // split the string using sentence terminations
            // things that end a sentence
            char[] arrSplitChars = { '.', '?', '!' };

            //do the split
            string[] splitSentences = sTemp.Split(arrSplitChars,
            StringSplitOptions.RemoveEmptyEntries);

            ArrayList al = new ArrayList();
            for (int i = 0; i < splitSentences.Length; i++)
            {
                splitSentences[i] = splitSentences[i].ToString().Trim();
                al.Add(splitSentences[i].ToString());
            }

            // show statistics
            lblCharCount.Text = "Character Count: " +
            GenerateCharacterCount(sTemp).ToString();
            lblSentenceCount.Text = "Sentence Count: " +
            GenerateSentenceCount(splitSentences).ToString();
            lblWordCount.Text = "Word Count: " +
            GenerateWordCount(al).ToString();

            return al;
        }

        /// <summary>
        /// Calls the IDontCareHowItEndsParser and displays
        /// the results
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void btnParseNoEnding_Click(object sender, EventArgs e)
        {
            lstSentences.Items.Clear();

            ArrayList al = IDontCareHowItEndsParser(txtParagraphs.Text);

            for (int i = 0; i < al.Count; i++)
            {
                lstSentences.Items.Add(al[i].ToString());
            }
        }

#endregion
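
The same idea can be written more compactly with LINQ, which ships with the .NET 3.5 framework targeted by Visual Studio 2008. This is a hedged sketch rather than a replacement for the listing above: the class name is hypothetical, it returns a `List<string>` instead of an `ArrayList`, and unlike the original it also discards fragments that are empty after trimming (such as the whitespace left behind after the final terminator).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class; a LINQ take on the terminator-free parser.
public static class NoEndingParserDemo
{
    public static List<string> Parse(string text)
    {
        char[] terminators = { '.', '?', '!' };
        return text
            .Replace('\r', ' ')          // treat line breaks as spaces
            .Replace('\n', ' ')
            .Split(terminators, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0)    // drop whitespace-only fragments
            .ToList();
    }

    public static void Main()
    {
        foreach (string s in Parse("One. Two!  \nWait... what?"))
            Console.WriteLine(s);
    }
}
```

Filtering after trimming matters for the statistics: a whitespace-only fragment would otherwise be counted as a sentence.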

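The last region is entitled "Generate Statistics". It contains three functions that return the character count, word count, and sentence count for a body of text. This section is annotated as well; read the comments to see how each function works.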

#region Generate Statistics

        /// <summary>
        /// Generate the total character count for
        /// the entire body of text as converted to
        /// one string
        /// </summary>
        /// <param name="allText"></param>
        /// <returns>int count of all characters</returns>
        public int GenerateCharacterCount(string allText)
        {
            int rtn = 0;

            // clean up the string by
            // removing newlines and by trimming
            // both ends
            string sTemp = allText;
            sTemp = sTemp.Replace(Environment.NewLine, string.Empty);
            sTemp = sTemp.Trim();

            // split the string into sentences
            // using a regular expression
            string[] splitSentences =
                Regex.Split(sTemp,
                    @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

            // loop through the sentences to get character counts
            for(int cnt=0; cnt<splitSentences.Length; cnt++)
            {
                // get the current sentence
                string sSentence = splitSentences[cnt].ToString();

                // trim it
                sSentence = sSentence.Trim();

                // convert it to a character array
                char[] sentence = sSentence.ToCharArray();

                // test each character and
                // add it to the return value
                // if it passes
                for (int i = 0; i < sentence.Length; i++)
                {
                    // make sure it is a letter, number,
                    // punctuation or whitespace before
                    // adding it to the tally
                    if (char.IsLetterOrDigit(sentence[i]) ||
                        char.IsPunctuation(sentence[i]) ||
                        char.IsWhiteSpace(sentence[i]))
                        rtn += 1;
                }
            }

            // return the final tally
            return rtn;
        }

        /// <summary>
        /// Generate a count of all words contained in the text
        /// passed in. This function expects an array list as
        /// an argument; the array list contains one entry for
        /// each sentence contained in the text of interest.
        /// </summary>
        /// <param name="allSentences"></param>
        /// <returns>int count of all words</returns>
        public int GenerateWordCount(ArrayList allSentences)
        {
            // declare a return value
            int rtn = 0;

            // iterate through the entire list
            // of sentences
            foreach (string sSentence in allSentences)
            {
                // define an empty space as the split
                // character
                char[] arrSplitChars = {' '};

                // create a string array and populate
                // it with a split on the current sentence;
                // use the string split option to remove
                // empty entries so that empty sentences do not
                // make it into the word count.
                string[] arrWords = sSentence.Split(arrSplitChars,
                StringSplitOptions.RemoveEmptyEntries);
                rtn += arrWords.Length;
            }

            // return the final word count
            return rtn;
        }

        /// <summary>
        /// Return a count of all of the sentences contained in the
        /// text examined; this method is looking for a string
        /// array containing all of the sentences; it just
        /// returns a count for the string array.
        /// </summary>
        /// <param name="allSentences"></param>
        /// <returns></returns>
        public int GenerateSentenceCount(string[] allSentences)
        {
            // create a return value
            int rtn = 0;

            // set the return value to
            // the length of the sentences array
            rtn = allSentences.Length;

            // return the count
            return rtn;
        }

#endregion
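
The counting logic does not depend on the form, so it can be exercised on its own. The sketch below is an assumption-laden variant, not the article's implementation: the names are hypothetical, `CountWords` splits on any whitespace (tabs included) rather than only on the space character, and `CountCharacters` applies the same letter/digit/punctuation/whitespace test directly to the text it is given without first splitting it into sentences.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class; UI-independent versions of the counters.
public static class TextStatsDemo
{
    // Count words across a collection of sentences.
    public static int CountWords(IEnumerable<string> sentences)
    {
        int total = 0;
        foreach (string sentence in sentences)
        {
            // Passing a null separator array makes Split break on
            // any whitespace character.
            total += sentence.Split((char[])null,
                StringSplitOptions.RemoveEmptyEntries).Length;
        }
        return total;
    }

    // Count letters, digits, punctuation, and whitespace. Symbols such
    // as '+', '=' and '$' fall outside those categories and are skipped,
    // just as in GenerateCharacterCount.
    public static int CountCharacters(string text)
    {
        return text.Count(c => char.IsLetterOrDigit(c)
                            || char.IsPunctuation(c)
                            || char.IsWhiteSpace(c));
    }

    public static void Main()
    {
        string[] sentences = { "one\ttwo  three", "four" };
        Console.WriteLine(CountWords(sentences));          // prints 4
        Console.WriteLine(CountCharacters("a+b = $5"));    // prints 5
    }
}
```

Note that the character test excludes symbol characters, so the reported count can be lower than the `Length` of the string.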

Summary

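This article described several approaches to parsing sentences from a body of text, along with three functions that can be used to generate summary statistics on that text. There are, of course, other ways to do both. Overall, the best approach to parsing sentences appears to be the regular expression. Modifying the regular expression may produce different results that better suit the type of text you are working with; however, I have found that this approach works well even with complex bodies of text, so long as the text is properly formatted into proper sentences.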
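
As one example of tailoring the expression, text containing honorifics such as "Dr." trips up the original pattern, because the abbreviation's period is followed by whitespace and a capital letter. The sketch below adds a negative lookbehind for a short, hard-coded list of abbreviations. The list, the class, and the method names are assumptions chosen for illustration; real text would likely need a longer list or a different strategy.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class showing one possible pattern adjustment.
public static class AbbreviationDemo
{
    // The pattern used in the article.
    private const string Original =
        @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])";

    // Same pattern, plus a negative lookbehind that refuses to split
    // right after a few known abbreviations (assumed list).
    private const string AbbreviationAware =
        @"(?<=['""A-Za-z0-9][\.\!\?])(?<!\b(?:Dr|Mr|Mrs|Ms)\.)\s+(?=[A-Z])";

    public static string[] SplitOriginal(string text)
    {
        return Regex.Split(text, Original);
    }

    public static string[] SplitAbbreviationAware(string text)
    {
        return Regex.Split(text, AbbreviationAware);
    }

    public static void Main()
    {
        string sample = "Dr. Smith arrived. He sat.";
        // prints: Dr. | Smith arrived. | He sat.
        Console.WriteLine(string.Join(" | ", SplitOriginal(sample)));
        // prints: Dr. Smith arrived. | He sat.
        Console.WriteLine(string.Join(" | ", SplitAbbreviationAware(sample)));
    }
}
```

The trade-off is that a sentence that genuinely ends with one of the listed abbreviations will no longer be split from the sentence that follows it.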

History

June 3, 2008: Initial version