Parsing Sentences and Building Text Statistics in C#

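Introduction
This article describes three approaches to parsing sentences from a body of text and compares the advantages and drawbacks of each. The demonstration application also shows one way to generate sentence count, word count, and character count statistics for the text.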

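The three approaches to parsing sentences from a body of text are:
- Parse Reasonable: splits the text on typical sentence terminations and retains the termination.
- Parse Best: splits the text with a regular expression and retains the termination.
- Parse Without Endings: splits the text on typical sentence terminations and does not retain the termination as part of the sentence.
The demonstration application contains some default text in a `textbox` control, three buttons that parse the text with each of the three approaches, and three label controls that display summary statistics generated for the text. Once the application is running, clicking any of the buttons displays each parsed sentence in the `listbox` control at the bottom of the form and updates the summary statistics in the three labels in the upper right corner of the form.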
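To make the difference between retaining and dropping terminations concrete, here is a minimal console sketch. It is not the demo application's code: the class and method names are hypothetical, and only the regular expression is taken from the article's listing. The Parse Reasonable approach starts from the same character split as `SplitOnTerminators` and then looks each termination up again in the original text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class ApproachComparison
{
    // Regex-based split: the whitespace between sentences is consumed,
    // so each termination stays attached to its sentence.
    public static string[] SplitWithRegex(string text)
    {
        return Regex.Split(text, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
    }

    // Character-based split: the terminations themselves are consumed.
    public static string[] SplitOnTerminators(string text)
    {
        char[] terminators = { '.', '?', '!' };
        string[] parts = text.Split(terminators, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < parts.Length; i++)
            parts[i] = parts[i].Trim();
        return parts;
    }

    public static void Main()
    {
        string sample = "Is it late? Yes. Go home!";
        // prints: Is it late? | Yes. | Go home!
        Console.WriteLine(string.Join(" | ", SplitWithRegex(sample)));
        // prints: Is it late | Yes | Go home
        Console.WriteLine(string.Join(" | ", SplitOnTerminators(sample)));
    }
}
```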
Getting Started
To get started, unzip the included project and open the solution in Visual Studio 2008. In the Solution Explorer, you should see these files (Figure 2).

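As Figure 2 shows, there is a single WinForms project containing a single form. All of the code required by the application is contained in that form's code.
The Main Form (Form1.cs)
The application's main form, `Form1`, contains all of the necessary code. The form holds default text in a `textbox` control and three buttons that execute the three functions used to parse the body of text into a collection of `string` values, one per sentence. You can replace, remove, or add to the text in the `textbox` control to run the methods against your own text. Three label controls display the summary statistics (sentence, word, and character counts) for the text in the `textbox` control. These summary statistics are updated each time the text is parsed into sentences.
If you open the code view in the IDE, you will see that the code file begins with the following library imports: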
using System;
using System.Collections;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
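Note that the default imports have been altered and now include a reference to the regular expressions library.
Following the imports, the namespace, class, and constructor are defined.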
namespace SentenceParser
{
/// <summary>
/// Demonstrate three approaches to parsing
/// a body of text into sentences and also
/// demonstrates an approach to building
/// statistics on the text to include the
/// number of sentences, the number of
/// words and the number of characters
/// used in the text.
/// </summary>
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
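Next is a region entitled "Best Sentence Parser"; it contains a function named `SplitSentences` that accepts a `string` argument. This approach generally yields the best results when parsing sentences, but it may produce inaccurate results if the text contains errors. The region also contains a button click event handler that calls `SplitSentences`.
The code is annotated; reading the comments explains what happens inside the function.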
#region Best Sentence Parser
/// <summary>
/// This is generally the most accurate approach to
/// parsing a body of text into sentences to include
/// the sentence's termination (e.g., the period,
/// question mark, etc). This approach will handle
/// duplicate sentences with different terminations.
///
/// </summary>
/// <param name="sSourceText"></param>
/// <returns></returns>
private ArrayList SplitSentences(string sSourceText)
{
// create a local string variable
// set to contain the string passed it
string sTemp = sSourceText;
// create the array list that will
// be used to hold the sentences
ArrayList al = new ArrayList();
// split the sentences with a regular expression
string[] splitSentences =
Regex.Split(sTemp,
@"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
// loop the sentences
for (int i = 0; i < splitSentences.Length; i++)
{
// clean up the sentence one more time, trim it,
// and add it to the array list
string sSingleSentence =
splitSentences[i].Replace(Environment.NewLine,
string.Empty);
al.Add(sSingleSentence.Trim());
}
// update the statistics displayed on the text
// characters
lblCharCount.Text = "Character Count: " +
GenerateCharacterCount(sTemp).ToString();
// sentences
lblSentenceCount.Text = "Sentence Count: " +
GenerateSentenceCount(splitSentences).ToString();
// words
lblWordCount.Text = "Word Count: " +
GenerateWordCount(al).ToString();
// return the arraylist with
// all sentences added
return al;
}
/// <summary>
/// Calls the SplitSentences (best approach) method
/// to split the text into sentences and displays
/// the results in a list box
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void btnParseBest_Click(object sender, EventArgs e)
{
lstSentences.Items.Clear();
ArrayList al = SplitSentences(txtParagraphs.Text);
for (int i = 0; i < al.Count; i++)
//populate a list box
lstSentences.Items.Add(al[i].ToString());
}
#endregion
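The pattern used above splits on whitespace that is preceded by a letter, digit, or quote plus a `.`, `!`, or `?`, and followed by an upper-case ASCII letter. The following sketch, with a hypothetical class name and sample strings of my own, probes where that rule holds up and where it does not: a decimal number survives because it contains no whitespace, an abbreviation such as "Dr." is treated as a sentence end, and a sentence that starts in lower case is not split at all.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class RegexSplitLimits
{
    // Same pattern as the SplitSentences listing.
    private const string Pattern = @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])";

    public static string[] Split(string text)
    {
        return Regex.Split(text, Pattern);
    }

    public static void Main()
    {
        string[] samples =
        {
            "It costs 3.50 today. Buy it!",   // 2 pieces: the decimal point is not followed by whitespace
            "Dr. Smith arrived. He left.",    // 3 pieces: "Dr." looks like a sentence end
            "see you. bye."                   // 1 piece: the next sentence starts in lower case
        };
        foreach (string s in samples)
            Console.WriteLine("{0} -> {1} piece(s)", s, Split(s).Length);
    }
}
```

Cases like the second and third are the kind of input where the results become inaccurate, so it is worth testing the pattern against text that resembles your own.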
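Next is a region entitled "Reasonable Sentence Parser"; it contains a function named `ReasonableParser` that accepts a `string` argument. This approach generally yields reasonable results, but it does not apply the correct termination when the input `string` contains duplicate sentences with different terminations, because it looks each sentence up by its first occurrence in the text. The problem could be addressed with a recursive function that continues through each instance of a duplicate sentence, but the approach shown in the previous code region takes less effort. The region also contains a button click event handler that calls `ReasonableParser`.
The code is annotated; reading the comments explains what happens inside the function.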
#region Reasonable Sentence Parser
/// <summary>
/// This does a fair job of parsing the sentences
/// unless there are duplicate sentences;
/// you'd have to resort to recursion in order
/// to get through the issue of multiple duplicate sentences.
/// </summary>
/// <param name="sTextToParse"></param>
/// <returns></returns>
private ArrayList ReasonableParser(string sTextToParse)
{
ArrayList al = new ArrayList();
// get a string from the contents of a textbox
string sTemp = sTextToParse;
sTemp = sTemp.Replace(Environment.NewLine, " ");
// split the string using sentence terminations
// things that end a sentence
char[] arrSplitChars = { '.', '?', '!' };
//do the split
string[] splitSentences = sTemp.Split(arrSplitChars,
StringSplitOptions.RemoveEmptyEntries);
// loop the array of splitSentences
for (int i = 0; i < splitSentences.Length; i++)
{
// skip fragments that contain nothing but whitespace
if (splitSentences[i].Trim().Length == 0)
continue;
// find the position of each sentence in the
// original paragraph and get its termination ('.', '?', '!')
int pos = sTemp.IndexOf(splitSentences[i]);
int endPos = pos + splitSentences[i].Length;
// the final fragment has no termination when the text does
// not end with one, so check the index before reading it
string sTermination = endPos < sTemp.Length ?
sTemp[endPos].ToString() : string.Empty;
// since this approach looks only for the first instance
// of the string, it does not handle duplicate sentences
// with different terminations. You could expand this
// approach to search for later instances of the same
// string to get the proper termination but the previous
// method of using the regular expression to split the
// string is reliable and less bothersome.
// add the sentence's termination to the end of the sentence
al.Add(splitSentences[i].Trim() + sTermination);
}
// update the displayed statistics
lblCharCount.Text = "Character Count: " +
GenerateCharacterCount(sTemp).ToString();
lblSentenceCount.Text = "Sentence Count: " +
GenerateSentenceCount(splitSentences).ToString();
lblWordCount.Text = "Word Count: " +
GenerateWordCount(al).ToString();
return al;
}
/// <summary>
/// Calls the ReasonableParser method and
/// displays the results
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void btnParseReasonable_Click(object sender, EventArgs e)
{
lstSentences.Items.Clear();
ArrayList al = ReasonableParser(txtParagraphs.Text);
for (int i = 0; i < al.Count; i++)
{
lstSentences.Items.Add(al[i].ToString());
}
}
#endregion
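The duplicate-sentence limitation can also be handled without recursion by remembering where the previous sentence ended and searching forward from that point. The sketch below is one possible way to do that, not the article's implementation; `OffsetParser`, `Parse`, and the sample text are hypothetical names and data introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class OffsetParser
{
    public static List<string> Parse(string text)
    {
        char[] terminators = { '.', '?', '!' };
        List<string> sentences = new List<string>();
        int searchFrom = 0; // where the previous fragment ended

        foreach (string fragment in
                 text.Split(terminators, StringSplitOptions.RemoveEmptyEntries))
        {
            // ignore fragments that are only whitespace
            if (fragment.Trim().Length == 0)
                continue;

            // search forward from the previous match so that a repeated
            // sentence is matched at its own position, not its first one
            int pos = text.IndexOf(fragment, searchFrom, StringComparison.Ordinal);
            int end = pos + fragment.Length;

            // the last fragment may have no termination at all
            string terminator = end < text.Length ? text[end].ToString() : string.Empty;
            sentences.Add(fragment.Trim() + terminator);
            searchFrom = end;
        }
        return sentences;
    }

    public static void Main()
    {
        // prints: Really? | Really! | Really.
        Console.WriteLine(string.Join(" | ",
            Parse("Really? Really! Really.").ToArray()));
    }
}
```

With a first-occurrence lookup, the third sentence in this sample would pick up the `!` of the second one, because both fragments are the identical string " Really".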
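Next is a region entitled "Parse Without Sentence Terminations"; it contains a function named `IDontCareHowItEndsParser` that accepts a `string` argument. This approach generally yields good results when parsing sentences, but it does not add the termination to the parsed sentence; it is a good choice if you do not care which termination ends each sentence. The region also contains a button click event handler that calls `IDontCareHowItEndsParser`.
The code is annotated; reading the comments explains what happens inside the function.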
#region Parse Without Sentence Terminations
/// <summary>
/// If you don't care about retaining the sentence
/// terminations, this approach works fine. This
/// will return an array list containing all of the
/// sentences contained in the input string but
/// each sentence will be stripped of its termination.
/// </summary>
/// <param name="sTextToParse"></param>
/// <returns></returns>
private ArrayList IDontCareHowItEndsParser(string sTextToParse)
{
string sTemp = sTextToParse;
sTemp = sTemp.Replace(Environment.NewLine, " ");
// split the string using sentence terminations
// things that end a sentence
char[] arrSplitChars = { '.', '?', '!' };
//do the split
string[] splitSentences = sTemp.Split(arrSplitChars,
StringSplitOptions.RemoveEmptyEntries);
ArrayList al = new ArrayList();
for (int i = 0; i < splitSentences.Length; i++)
{
splitSentences[i] = splitSentences[i].ToString().Trim();
al.Add(splitSentences[i].ToString());
}
// show statistics
lblCharCount.Text = "Character Count: " +
GenerateCharacterCount(sTemp).ToString();
lblSentenceCount.Text = "Sentence Count: " +
GenerateSentenceCount(splitSentences).ToString();
lblWordCount.Text = "Word Count: " +
GenerateWordCount(al).ToString();
return al;
}
/// <summary>
/// Calls the IDontCareHowItEndsParser and displays
/// the results
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void btnParseNoEnding_Click(object sender, EventArgs e)
{
lstSentences.Items.Clear();
ArrayList al = IDontCareHowItEndsParser(txtParagraphs.Text);
for (int i = 0; i < al.Count; i++)
{
lstSentences.Items.Add(al[i].ToString());
}
}
#endregion
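One detail worth knowing about this approach: `StringSplitOptions.RemoveEmptyEntries` removes zero-length entries only, so a fragment consisting of a single space (for example after a trailing newline has been replaced by a space) is still returned and counted. The sketch below, using hypothetical names, trims each fragment and drops the ones that end up empty.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class TerminatorlessParser
{
    public static List<string> Parse(string text)
    {
        char[] terminators = { '.', '?', '!' };
        List<string> sentences = new List<string>();

        foreach (string fragment in
                 text.Split(terminators, StringSplitOptions.RemoveEmptyEntries))
        {
            // RemoveEmptyEntries keeps whitespace-only fragments,
            // so trim first and test the length afterwards
            string trimmed = fragment.Trim();
            if (trimmed.Length > 0)
                sentences.Add(trimmed);
        }
        return sentences;
    }

    public static void Main()
    {
        // the raw split of this text yields "Wait", " What" and " "
        List<string> result = Parse("Wait... What? ");
        // prints: 2
        Console.WriteLine(result.Count);
    }
}
```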
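The last region is "Generate Statistics". It contains three functions that return the character count, word count, and sentence count for a body of text. This section is annotated as well; read the comments to see how each function works.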
#region Generate Statistics
/// <summary>
/// Generate the total character count for
/// the entire body of text as converted to
/// one string
/// </summary>
/// <param name="allText"></param>
/// <returns>int count of all characters</returns>
public int GenerateCharacterCount(string allText)
{
int rtn = 0;
// clean up the string by
// removing newlines and by trimming
// both ends
string sTemp = allText;
sTemp = sTemp.Replace(Environment.NewLine, string.Empty);
sTemp = sTemp.Trim();
// split the string into sentences
// using a regular expression
string[] splitSentences =
Regex.Split(sTemp,
@"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
// loop through the sentences to get character counts
for(int cnt=0; cnt<splitSentences.Length; cnt++)
{
// get the current sentence
string sSentence = splitSentences[cnt].ToString();
// trim it
sSentence = sSentence.Trim();
// convert it to a character array
char[] sentence = sSentence.ToCharArray();
// test each character and
// add it to the return value
// if it passes
for (int i = 0; i < sentence.Length; i++)
{
// make sure it is a letter, number,
// punctuation or whitespace before
// adding it to the tally
if (char.IsLetterOrDigit(sentence[i]) ||
char.IsPunctuation(sentence[i]) ||
char.IsWhiteSpace(sentence[i]))
rtn += 1;
}
}
// return the final tally
return rtn;
}
/// <summary>
/// Generate a count of all words contained in the text.
/// This function expects an array list as its argument;
/// the array list contains one entry for each sentence
/// contained in the text of interest.
/// </summary>
/// <param name="allSentences"></param>
/// <returns>int count of all words</returns>
public int GenerateWordCount(ArrayList allSentences)
{
// declare a return value
int rtn = 0;
// iterate through the entire list
// of sentences
foreach (string sSentence in allSentences)
{
// define an empty space as the split
// character
char[] arrSplitChars = {' '};
// create a string array and populate
// it with a split on the current sentence;
// use the string split option to remove
// empty entries so that empty sentences do not
// make it into the word count.
string[] arrWords = sSentence.Split(arrSplitChars,
StringSplitOptions.RemoveEmptyEntries);
rtn += arrWords.Length;
}
// return the final word count
return rtn;
}
/// <summary>
/// Return a count of all of the sentences contained in the
/// text examined; this method is looking for a string
/// array containing all of the sentences; it just
/// returns a count for the string array.
/// </summary>
/// <param name="allSentences"></param>
/// <returns></returns>
public int GenerateSentenceCount(string[] allSentences)
{
// create a return value
int rtn = 0;
// set the return value to
// the length of the sentences array
rtn = allSentences.Length;
// return the count
return rtn;
}
#endregion
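`GenerateWordCount` splits on the space character only, so two words separated by a tab count as one. If that matters for your text, one option is to pass a null separator array to `string.Split`, which makes it split on any white-space character. The sketch below is an assumption-labelled variant rather than the article's code; `WordCounter` and `CountWords` are hypothetical names, and it uses a generic `IEnumerable<string>` instead of `ArrayList`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class WordCounter
{
    public static int CountWords(IEnumerable<string> sentences)
    {
        int total = 0;
        foreach (string sentence in sentences)
        {
            // a null separator array tells Split to treat every
            // white-space character (space, tab, newline) as a delimiter
            string[] words = sentence.Split((char[])null,
                StringSplitOptions.RemoveEmptyEntries);
            total += words.Length;
        }
        return total;
    }

    public static void Main()
    {
        string[] sentences = { "one\ttwo  three", "four" };
        // prints: 4
        Console.WriteLine(CountWords(sentences));
    }
}
```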
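Summary
This article described several approaches to parsing sentences from a body of text, along with three functions that can be used to generate summary statistics for that text. There are, of course, other ways to perform these tasks. On the whole, the best approach to parsing sentences appears to be the regular expression. Modifying the regular expression may produce different results that better suit the kind of text you are working with; however, I have found that this approach works even on complex bodies of text, as long as the text is properly formatted into proper sentences.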
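As an illustration of modifying the regular expression, the sketch below compares the article's pattern with a variant that allows a closing quote or parenthesis after the termination and an opening quote or parenthesis before the next sentence. The variant pattern, the class name, and the sample text are my own assumptions, not something taken from the demo project, and the variant has its own blind spots.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PatternVariants
{
    // Pattern from the article's listing.
    public const string Original = @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])";

    // Assumed variant: optional closing quote or parenthesis after the
    // termination, optional opening quote or parenthesis before the capital.
    public const string QuoteAware = @"(?<=[.!?][""')]?)\s+(?=[""'(]?[A-Z])";

    public static string[] Split(string text, string pattern)
    {
        return Regex.Split(text, pattern);
    }

    public static void Main()
    {
        string sample = "He said \"Stop!\" Then he left. (Really.) \"Why?\"";
        // prints: 1
        Console.WriteLine(Split(sample, Original).Length);
        // prints: 4
        Console.WriteLine(Split(sample, QuoteAware).Length);
    }
}
```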
History
- June 3, 2008: Initial version