Parsing Sentences and Building Text Statistics (Visual Basic)


June 4, 2008

CPOL


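Introduction

This article describes three approaches to parsing sentences out of a body of text. All three are shown so that the advantages and disadvantages of each can be compared. The demonstration application also shows one way to generate sentence, word, and character counts for the text.

Figure 1: The test application running.

The three approaches to parsing sentences out of a body of text are:

  • Parse Reasonable: splits the text on the typical sentence terminators and keeps each sentence's terminator.
  • Parse Best: splits the text with a regular expression and keeps each sentence's terminator.
  • Parse Without Endings: splits the text on the typical sentence terminators and discards the terminators.

The demonstration application contains some default text in a textbox control, three buttons that each parse the text with one of the three approaches above, and three label controls that display summary statistics for the body of text. With the application running, clicking one of the buttons lists each parsed sentence in the listbox control at the bottom of the form and shows the summary statistics in the three labels at the upper right of the form.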

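Getting Started

To get started, unzip the included project and open the solution in the Visual Studio 2008 environment. In the Solution Explorer, you should see these files (Figure 2).

Figure 2: Solution Explorer.

As Figure 2 shows, there is a single WinForms project containing a single form. All of the code required by the application is contained in that form's code.

Main Form (Form1.vb)

The main form of the application, Form1, contains all of the necessary code. The form holds default text in a textbox control and three buttons, each of which runs one of the functions that parse the body of text into a collection of strings, one per sentence. You can replace, remove, or add to the text in the textbox control to run the methods against your own text. Three label controls display summary statistics (sentence, word, and character counts) for the text in the textbox control; these statistics are updated each time the text is parsed into sentences.

If you open the code view in the IDE, you will see that the code file begins with the following library imports.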

Imports System
Imports System.Collections
Imports System.ComponentModel
Imports System.Data
Imports System.Drawing
Imports System.Text
Imports System.Windows.Forms
Imports System.Text.RegularExpressions

Note that System.Text.RegularExpressions has been added to the default imports.

Following the imports, the class and its constructor are defined.

Public Class Form1

    Public Sub New()

        ' This call is required by the Windows Form Designer.
        InitializeComponent()

        ' Add any initialization after the InitializeComponent() call.

    End Sub

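Next is a region entitled "Best Sentence Parser"; it contains a function named SplitSentences, which accepts a string as an argument. This approach generally yields the best results of the three, but it can produce inaccurate results if the text contains errors, because the regular expression expects each sentence to end with a terminator followed by whitespace and a capital letter. The region also contains a button click event handler that calls SplitSentences.

The code is annotated; the comments explain what is going on within the function.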

#Region "Best Sentence Parser"

    ''' <summary>
    ''' This is generally the most accurate approach to
    ''' parsing a body of text into sentences to include
    ''' the sentence's termination (e.g., the period,
    ''' question mark, etc).  This approach will handle
    ''' duplicate sentences with different terminations.
    ''' </summary>
    ''' <param name="sSourceText"></param>
    ''' <returns></returns>
    ''' <remarks></remarks>
    Private Function SplitSentences(ByVal sSourceText As String) As ArrayList

        ' create a local string variable
        ' set to contain the string passed it
        Dim sTemp As String = sSourceText

        ' create the array list that will
        ' be used to hold the sentences
        Dim al As New ArrayList()

        ' split the sentences with a regular expression
        Dim RegexSentenceParse As String() = _
            Regex.Split(sTemp, "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])")

        ' loop the sentences
        Dim i As Integer = 0
        For i = 0 To RegexSentenceParse.Length - 1
            ' clean up the sentence one more time, trim it,
            ' and add it to the array list
            ' replace line breaks with a space so that words on
            ' adjacent lines are not run together
            Dim sSingleSentence As String = _
            RegexSentenceParse(i).Replace(Environment.NewLine, " ")
            al.Add(sSingleSentence.Trim())
        Next

        ' update the statistics displayed on the text
        ' characters
        lblCharCount.Text = "Character Count: " & _
        GenerateCharacterCount(sTemp).ToString()

        ' sentences
        lblSentenceCount.Text = "Sentence Count: " & _
        GenerateSentenceCount(RegexSentenceParse).ToString()

        ' words
        lblWordCount.Text = "Word Count: " & _
        GenerateWordCount(al).ToString()

        ' return the arraylist with
        ' all sentences added
        Return al

    End Function

    ''' <summary>
    ''' Calls the SplitSentences (best approach) method
    ''' to split the text into sentences and displays
    ''' the results in a list box
    ''' </summary>
    ''' <param name="sender"></param>
    ''' <param name="e"></param>
    ''' <remarks></remarks>
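    ' Note: this handler is wired to btnParseNoEnding but, as the summary
    ' above says, it runs SplitSentences (the regular expression parser).
    Private Sub btnParseNoEnding_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnParseNoEnding.Click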

        lstSentences.Items.Clear()

        Dim al As New ArrayList()
        al = SplitSentences(txtParagraphs.Text)
        Dim i As Integer
        For i = 0 To al.Count - 1
            'populate a list box
            lstSentences.Items.Add(al(i).ToString())
        Next
    End Sub

#End Region
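
To see how the pattern behaves outside the form, the following is a minimal console sketch. The module name RegexSplitDemo, the function SplitWithRegex, and the sample text are assumptions made for illustration; only the regular expression is taken from SplitSentences.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexSplitDemo

    ' Same pattern as SplitSentences: split on whitespace that follows a
    ' letter, digit or quote plus a terminator, and precedes a capital letter.
    Public Const SentencePattern As String = _
        "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])"

    ' Hypothetical helper: returns the pieces produced by the pattern.
    Function SplitWithRegex(ByVal text As String) As String()
        Return Regex.Split(text.Trim(), SentencePattern)
    End Function

    Sub Main()
        Dim sample As String = _
            "Is it raining? It is raining! Dr. Smith left at 5 p.m. today."

        ' print each piece in brackets so stray whitespace is visible
        For Each sentence As String In SplitWithRegex(sample)
            Console.WriteLine("[" & sentence & "]")
        Next
    End Sub

End Module
```

With this sample, the abbreviation "Dr." is followed by whitespace and a capital letter, so the pattern treats it as the end of a sentence, while "p.m." followed by the lowercase "today" is left alone. Text with many abbreviations may therefore need a different pattern.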

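Next is a region entitled "Reasonable Sentence Parser"; it contains a function named ReasonableParser, which accepts a string as an argument. This approach generally parses sentences reasonably well, but if the input string contains duplicate sentences with different terminators, it does not apply the correct terminator to the later copies, because it looks up only the first occurrence of each sentence. The problem could be fixed with a recursive function that continues through each instance of the duplicate sentence, but the approach in the previous code region takes less work. The region also contains a button click event handler that calls ReasonableParser.

The code is annotated; the comments explain what is going on within the function.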

#Region "Reasonable Sentence Parser"

    ''' <summary>
    ''' This does a fair job of parsing the sentences
    ''' unless there are duplicate sentences; you would
    ''' have to resort to recursion to get past the
    ''' issue of multiple duplicate sentences.
    ''' </summary>
    ''' <param name="sTextToParse"></param>
    ''' <returns></returns>
    ''' <remarks></remarks>
    Private Function ReasonableParser(ByVal sTextToParse As String) As ArrayList

        Dim al As New ArrayList()

        ' get a string from the contents of a textbox
        Dim sTemp As String = sTextToParse
        sTemp = sTemp.Replace(Environment.NewLine, " ")

        ' split the string using sentence terminations
        Dim arrSplitChars As Char() = {"."c, "?"c, "!"c}  ' things that end a sentence

        'do the split
        Dim splitSentences As String() = sTemp.Split(arrSplitChars, _
            StringSplitOptions.RemoveEmptyEntries)

        ' loop the array of splitSentences
        Dim i As Integer
        For i = 0 To splitSentences.Length - 1

            ' find the position of each sentence in the
            ' original paragraph and get its termination ('.', '?', '!') 
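            Dim pos As Integer = sTemp.IndexOf(splitSentences(i))
            Dim endPos As Integer = pos + splitSentences(i).Length

            ' index into sTemp itself (not a trimmed copy) so the positions
            ' line up, and guard against text without a final terminator
            Dim c As String = String.Empty
            If endPos < sTemp.Length Then c = sTemp(endPos).ToString()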

            ' since this approach looks only for the first instance
            ' of the string, it does not handle duplicate sentences
            ' with different terminations.  You could expand this
            ' approach to search for later instances of the same
            ' string to get the proper termination but the previous
            ' method of using the regular expression to split the
            ' string is reliable and less bothersome.

            ' add the sentences termination to the end of the sentence
            al.Add(splitSentences(i).ToString().Trim() & c.ToString())
        Next

        ' Update the show of statistics
        lblCharCount.Text = "Character Count: " & _
        GenerateCharacterCount(sTemp).ToString()

        lblSentenceCount.Text = "Sentence Count: " & _
        GenerateSentenceCount(splitSentences).ToString()

        lblWordCount.Text = "Word Count: " & _
        GenerateWordCount(al).ToString()

        Return al

    End Function

    ''' <summary>
    ''' Calls the ReasonableParser method and
    ''' displays the results
    ''' </summary>
    ''' <param name="sender"></param>
    ''' <param name="e"></param>
    ''' <remarks></remarks>
    Private Sub btnParseReasonable_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnParseReasonable.Click

        lstSentences.Items.Clear()

        Dim al = ReasonableParser(txtParagraphs.Text)
        Dim i As Integer
        For i = 0 To al.Count - 1
            lstSentences.Items.Add(al(i).ToString())
        Next

    End Sub

#End Region
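
The duplicate-sentence limitation can also be avoided without recursion by remembering where the previous sentence ended. The following is a minimal sketch of that idea, not the article's method: the module TerminatorTrackingDemo and the function ParseKeepingTerminators are hypothetical names, and the sketch uses a simple loop with IndexOfAny instead of the Split and IndexOf combination used by ReasonableParser.

```vb
Imports System
Imports System.Collections.Generic

Module TerminatorTrackingDemo

    ' Hypothetical helper (not part of the original project): walks the
    ' text once, remembering where the previous sentence ended, so that
    ' repeated sentences pick up their own terminators.
    Function ParseKeepingTerminators(ByVal text As String) As List(Of String)
        Dim terminators As Char() = {"."c, "?"c, "!"c}
        Dim result As New List(Of String)()
        Dim start As Integer = 0

        While start < text.Length
            ' find the next terminator at or after the current position
            Dim pos As Integer = text.IndexOfAny(terminators, start)
            Dim piece As String

            If pos < 0 Then
                ' no terminator left: keep the remainder as it is
                piece = text.Substring(start)
                start = text.Length
            Else
                ' include the terminator in the sentence
                piece = text.Substring(start, pos - start + 1)
                start = pos + 1
            End If

            piece = piece.Trim()

            ' skip fragments that hold nothing but terminators or whitespace
            If piece.TrimEnd(terminators).Length > 0 Then result.Add(piece)
        End While

        Return result
    End Function

    Sub Main()
        For Each s As String In ParseKeepingTerminators("Really? Really! Really.")
            Console.WriteLine(s)
        Next
    End Sub

End Module
```

Like ReasonableParser, this sketch treats every period, question mark, and exclamation mark as a sentence boundary, so abbreviations and decimal numbers are still split.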

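Next is a region entitled "Parse Without Sentence Terminations"; it contains a function named IDontCareHowItEndsParser, which accepts a string as an argument. This approach generally parses sentences well but does not add the terminator to the parsed sentences; it is a good approach if you do not care which terminator ends each sentence. The region also contains a button click event handler that calls IDontCareHowItEndsParser.

The code is annotated; the comments explain what is going on within the function.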

#Region "Parse Without Sentence Terminations"

    ''' <summary>
    ''' If you don't care about retaining the sentence
    ''' terminations, this approach works fine.  This
    ''' will return an array list containing all of the
    ''' sentences contained in the input string but
    ''' each sentence will be stripped of its termination.
    ''' </summary>
    ''' <param name="sTextToParse"></param>
    ''' <returns></returns>
    Private Function IDontCareHowItEndsParser(ByVal sTextToParse As String) _
        As ArrayList

        Dim sTemp As String = sTextToParse
        sTemp = sTemp.Replace(Environment.NewLine, " ")

        ' split the string using sentence terminations
        Dim arrSplitChars As Char() = {"."c, "?"c, "!"c}  ' things that end a sentence

        'do the split
        Dim splitSentences As String() = sTemp.Split(arrSplitChars, _
            StringSplitOptions.RemoveEmptyEntries)

        Dim al As New ArrayList()
        Dim i As Integer
        For i = 0 To splitSentences.Length - 1
            splitSentences(i) = splitSentences(i).ToString().Trim()
            al.Add(splitSentences(i).ToString())
        Next

        ' show statistics
        lblCharCount.Text = "Character Count: " & _
        GenerateCharacterCount(sTemp).ToString()

        lblSentenceCount.Text = "Sentence Count: " & _
        GenerateSentenceCount(splitSentences).ToString()

        lblWordCount.Text = "Word Count: " & GenerateWordCount(al).ToString()

        Return al

    End Function

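    ''' <summary>
    ''' Calls the IDontCareHowItEndsParser and displays
    ''' the results in a list box
    ''' </summary>
    ''' <param name="sender"></param>
    ''' <param name="e"></param>
    ''' <remarks></remarks>
    ' Note: this handler is wired to btnParseBest but runs
    ' IDontCareHowItEndsParser (the parser that drops terminators).
    Private Sub btnParseBest_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnParseBest.Click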

        lstSentences.Items.Clear()
        Dim al = IDontCareHowItEndsParser(txtParagraphs.Text)
        Dim i As Integer

        For i = 0 To al.Count - 1
            lstSentences.Items.Add(al(i).ToString())
        Next

    End Sub

#End Region
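
One detail of the Split-based approaches is worth a small illustration: StringSplitOptions.RemoveEmptyEntries drops empty strings but keeps fragments that contain only whitespace, for example when the text ends with a period followed by a space. The sketch below filters those fragments out after trimming. The module NoTerminatorDemo and the function SplitWithoutTerminators are hypothetical names, and the sketch returns a List(Of String) rather than the ArrayList used in the article.

```vb
Imports System
Imports System.Collections.Generic

Module NoTerminatorDemo

    ' Hypothetical helper: splits on the typical terminators, trims each
    ' fragment, and keeps only fragments that still contain text.
    Function SplitWithoutTerminators(ByVal text As String) As List(Of String)
        Dim terminators As Char() = {"."c, "?"c, "!"c}
        Dim result As New List(Of String)()

        Dim fragments As String() = _
            text.Split(terminators, StringSplitOptions.RemoveEmptyEntries)

        For Each fragment As String In fragments
            Dim trimmed As String = fragment.Trim()

            ' a fragment such as " " survives RemoveEmptyEntries,
            ' so check the length again after trimming
            If trimmed.Length > 0 Then result.Add(trimmed)
        Next

        Return result
    End Function

    Sub Main()
        For Each s As String In SplitWithoutTerminators("Wait... What? Done. ")
            Console.WriteLine("[" & s & "]")
        Next
    End Sub

End Module
```

Without the length check, the trailing space after "Done." would be counted as an extra, empty sentence.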

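The final region is "Generate Statistics". It contains three functions that return the character count, word count, and sentence count for a body of text. This section is annotated as well; read the comments for a description of how each function works.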

#Region "Generate Statistics"

    ''' <summary>
    ''' Generate the total character count for
    ''' the entire body of text as converted to
    ''' one string
    ''' </summary>
    ''' <param name="allText"></param>
    ''' <returns></returns>
    ''' <remarks></remarks>
    Public Function GenerateCharacterCount(ByVal allText As String) As Integer

        Dim rtn As Integer = 0

        ' clean up the string by
        ' removing newlines and by trimming
        ' both ends
        Dim sTemp As String = allText
        sTemp = sTemp.Replace(Environment.NewLine, String.Empty)
        sTemp = sTemp.Trim()

        ' split the string into sentences 
        ' using a regular expression
        Dim splitSentences As String() = _
            Regex.Split(sTemp, _
                "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])")

        ' loop through the sentences to get character counts
        Dim cnt As Integer

        For cnt = 0 To splitSentences.Length - 1

            ' get the current sentence
            Dim sSentence As String = splitSentences(cnt).ToString()

            ' trim it
            sSentence = sSentence.Trim()

            ' convert it to a character array
            Dim sentence As Char() = sSentence.ToCharArray()

            ' test each character and
            ' add it to the return value
            ' if it passes
            Dim i As Integer

            For i = 0 To sentence.Length - 1

                ' make sure it is a letter, number,
                ' punctuation or whitespace before
                ' adding it to the tally
                If Char.IsLetterOrDigit(sentence(i)) OrElse _
                    Char.IsPunctuation(sentence(i)) OrElse _
                    Char.IsWhiteSpace(sentence(i)) Then

                    rtn += 1

                End If
            Next

        Next

        ' return the final tally
        Return rtn
    End Function

    ''' <summary>
    ''' Generate a count of all words contained in the text.
    ''' This function expects an array list as its argument;
    ''' the array list contains one entry for each sentence
    ''' in the text of interest.
    ''' </summary>
    ''' <param name="allSentences"></param>
    ''' <returns></returns>
    ''' <remarks></remarks>
    Public Function GenerateWordCount(ByVal allSentences As ArrayList) As Integer

        ' declare a return value
        Dim rtn As Integer = 0

        ' iterate through the entire list
        ' of sentences
        Dim sSentence As String

        For Each sSentence In allSentences

            ' define an empty space as the split
            ' character
            Dim arrSplitChars As Char() = New Char() {" "}

            ' create a string array and populate
            ' it with a split on the current sentence
            ' use the string split option to remove
            ' empty entries so that empty sentences do not
            ' make it into the word count.
            Dim arrWords As String() = sSentence.Split(arrSplitChars, _
                StringSplitOptions.RemoveEmptyEntries)
            
            rtn += arrWords.Length

        Next

        ' return the final word count
        Return rtn

    End Function

    ''' <summary>
    ''' Return a count of all of the sentences contained in the
    ''' text examined. This method expects a string array
    ''' containing all of the sentences; it simply returns
    ''' the length of that array.
    ''' </summary>
    ''' <param name="allSentences"></param>
    ''' <returns></returns>
    ''' <remarks></remarks>
    Public Function GenerateSentenceCount(ByVal allSentences As String()) _
        As Integer

        ' create a return value
        Dim rtn As Integer = 0

        ' set the return value to
        ' the length of the sentences array
        rtn = allSentences.Length

        ' return the count
        Return rtn

    End Function

#End Region
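
The counting logic can be tried without the form. The sketch below is an illustration under stated assumptions, not the article's code: TextStatsDemo, CountWords, and CountCharacters are hypothetical names, the word count uses a regular expression for runs of non-whitespace instead of splitting each sentence on spaces, and the character count applies the same character filter as GenerateCharacterCount directly to the text that is passed in, without the newline removal and per-sentence trimming done there.

```vb
Imports System
Imports System.Text.RegularExpressions

Module TextStatsDemo

    ' Counts each run of non-whitespace characters as one word.
    Function CountWords(ByVal text As String) As Integer
        Return Regex.Matches(text, "\S+").Count
    End Function

    ' Applies the same filter as GenerateCharacterCount: letters, digits,
    ' punctuation and whitespace are counted; symbols such as "$" or "+"
    ' fall outside those categories and are skipped.
    Function CountCharacters(ByVal text As String) As Integer
        Dim total As Integer = 0

        For Each ch As Char In text
            If Char.IsLetterOrDigit(ch) OrElse Char.IsPunctuation(ch) _
                OrElse Char.IsWhiteSpace(ch) Then
                total += 1
            End If
        Next

        Return total
    End Function

    Sub Main()
        Dim sample As String = "It cost $5.  Really?"
        Console.WriteLine("Words: " & CountWords(sample).ToString())
        Console.WriteLine("Characters: " & CountCharacters(sample).ToString())
    End Sub

End Module
```

The sample makes the effect of the filter visible: the dollar sign is a symbol rather than punctuation, so it is left out of the character tally even though it appears in the text.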

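Summary

This article described several approaches to parsing sentences out of a body of text, along with three functions that can be used to generate summary statistics for that text. There are, of course, other ways to do the same things. Overall, the best approach to parsing sentences appears to be the regular expression. Modifying the regular expression may yield different results that better suit the type of text you are working with; however, I have found that this approach works even with complex bodies of text, so long as the text is properly formatted.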
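
As one illustration of modifying the regular expression, the sketch below compares the article's pattern with a relaxed variant that also accepts a digit or an opening quote at the start of the next sentence. The relaxed pattern, the module RegexVariantDemo, and the function SplitWith are assumptions made for this example; whether the variant suits a given text depends on that text.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexVariantDemo

    ' Pattern used in the article: the next sentence must start
    ' with a capital letter.
    Public Const OriginalPattern As String = _
        "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])"

    ' Assumed variant (not from the article): the next sentence may
    ' also start with a digit or an opening quote.
    Public Const RelaxedPattern As String = _
        "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[""'A-Z0-9])"

    ' Hypothetical helper: splits the text with the supplied pattern.
    Function SplitWith(ByVal pattern As String, ByVal text As String) As String()
        Return Regex.Split(text.Trim(), pattern)
    End Function

    Sub Main()
        Dim sample As String = "It cost $5. 20 people came."

        Console.WriteLine("Original: " & _
            SplitWith(OriginalPattern, sample).Length.ToString() & " piece(s)")
        Console.WriteLine("Relaxed:  " & _
            SplitWith(RelaxedPattern, sample).Length.ToString() & " piece(s)")
    End Sub

End Module
```

The trade-off is that the relaxed variant also splits after abbreviations followed by a number, such as "No. 5", so it is a starting point to adjust rather than a drop-in improvement.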

History

  • June 3, 2008: Initial version