Redacting Sensitive Information in Scanned Documents Using OCR and Pattern Recognition

Greg Freeland

December 1, 2008

CPOL

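Learn how to find, redact, or replace text patterns that you define after converting scanned images into searchable documents. Hide sensitive personal information, such as Social Security numbers and credit card numbers, to protect privacy. Pegasus Imaging's SDK and this sample project show you how.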

Introduction

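OCR combined with a powerful approximate regular expression engine can capture and search data held in the text of images, data that would otherwise be lost. Even in today's digital age, many companies still rely on paper documents. To bridge the gap, optical character recognition (OCR) captures the data on those paper documents and brings it into the digital workspace. OCR is useful in many different scenarios, and adding regular expression search with approximate matching lets you build even more capable solutions. Creating searchable documents, capturing amounts from bank checks, extracting totals from invoices, redacting sensitive data, and indexing documents for later search are just a few typical uses of OCR and regular expression search.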

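In this article, we review some existing problems that this technology can solve. We also outline the techniques used to build solutions to those problems. Finally, we demonstrate the power of the combined technology by implementing one use case. The accompanying sample code and a trial download of Pegasus Imaging's full-page OCR SDK are available from Pegasus Imaging.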

The following use cases are common applications of OCR.

Searchable Document Creation

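When documents exist only as images, whether digital faxes or scanned pages, they are not in an easily searchable format. OCR converts the image of the text into actual, searchable text. You can combine this text with the original image in a PDF or XPS file. This is useful when you must retain the original image for legal reasons (for example, when the image carries a signature) but also need to search the text. Google Desktop and Windows Desktop Search will index these OCR-created PDF and XPS files, so you can find the document you need with an ordinary text search. Full-page OCR solutions, such as OCR Xpress, are best suited to this use.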

Forms Processing

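Insurance forms, entrance exams, tax returns, invoices, and checks are documents that many businesses handle every day. Some businesses receive thousands, or even millions, of these documents daily. Forms processing is the automated way of handling them. Most forms processing solutions use OCR to collect machine-printed data, ICR to collect handwritten data, and OMR to detect filled-in check boxes or bubbles. Structured forms processing typically uses zonal OCR and ICR, such as SmartZone v2, to collect data from form fields. Semi-structured and unstructured forms processing use either zonal or full-page OCR, depending on the implementation.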

Redacting Sensitive Data

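Redacting sensitive data from images is another important use of OCR. With ongoing concerns about privacy, requirements to redact Social Security numbers, dates of birth, and other sensitive data from images are increasingly common. Businesses and government organizations often publish images of customer-submitted documents on their websites. The organizations that collect these documents must remove or redact any sensitive data in them before publication. Recent privacy legislation has made this requirement apply to many types of document images. In the example below, we develop a simple search-and-redact program to demonstrate the combined capabilities of an accurate OCR engine and an approximate regular expression engine.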

Limitations of the Technology

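Many OCR engines today achieve accuracy of 99% or better. In many use cases, that is accurate enough for the problem at hand. Some applications, however, need higher accuracy. There are several ways to improve the recognition accuracy of an OCR engine. Starting with a clean image is one. Using the best available technique when converting a color or grayscale image to black and white (bitonal) also improves recognition. Starting with a higher-resolution image (300 DPI or more) helps the recognition process. Running multiple OCR engines and comparing their results can also reduce recognition errors.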

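Unfortunately, not all of these options are always available. Images may originate outside the organization's control, arriving in poor condition: torn, speckled, dark, or low in resolution. Some image cleanup helps, but it may not be enough for an OCR engine to reach 100% accuracy, despite advertising claims to the contrary.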

Overcoming the Limitations

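When a solution involves searching for text patterns and 100% OCR accuracy cannot be guaranteed, a second technology is needed to improve the search results. An approximate regular expression engine fills that role.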

What Is an Approximate Regular Expression Search Engine?

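Regular expressions let users define patterns for finding specific text within a string. If you have ever typed "dir *.c" at the command line, you have used a variation of regular expressions. The best way to understand regular expressions is by example. You can use the pattern "\d\d\d" to search for any three consecutive digits. Applied to the string "abc 123", this regular expression matches "123". The regular expression engine returns the index in the input string at which it found the match. In this example, the index is 4 (zero-based).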
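The "\d\d\d" example can be reproduced with the standard .NET Regex class. This is a minimal sketch for illustration only; it uses System.Text.RegularExpressions rather than the OCR Xpress pattern matcher, and the class and method names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThreeDigitSearch
{
    // Returns the zero-based index of the first run of three digits,
    // or -1 when the input contains no such run.
    public static int IndexOfThreeDigits(string input)
    {
        Match match = Regex.Match(input, @"\d\d\d");
        return match.Success ? match.Index : -1;
    }

    public static void Main()
    {
        Match match = Regex.Match("abc 123", @"\d\d\d");
        // Prints: 123 at index 4
        Console.WriteLine($"{match.Value} at index {match.Index}");
    }
}
```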

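Approximate regular expressions extend regular expressions by tolerating errors in the string (inserted, deleted, and substituted characters) while still matching the pattern.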

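If the OCR engine misreads the string and returns "abc 1Z3", where the 2 has been replaced by the letter Z, an approximate regular expression that allows substitutions can still match the "\d\d\d" pattern. Substituting "2", or any other digit, for the "Z" lets the pattern match "1Z3". If the OCR engine inserts text into the string, as in "abc 1i23", the pattern still matches "1i23" when insertions are allowed. And if the OCR engine drops a character, producing "abc 12", the pattern matches "12" when deletions are allowed.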
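To make the idea concrete, here is a minimal sketch of approximate matching for a fixed-length pattern. It is not how OCR Xpress implements its engine. It assumes the pattern is a simple sequence of character tests (three "is a digit" tests standing in for "\d\d\d") and uses a standard edit-distance search over substrings to find the smallest number of insertions, deletions, and substitutions needed for any part of the text to match. It counts errors in total and does not limit each kind separately.

```csharp
using System;

public static class ApproximateMatch
{
    // Smallest number of edits (insertions, deletions, substitutions) needed
    // for some substring of 'text' to match the sequence of character tests.
    public static int MinErrors(Func<char, bool>[] pattern, string text)
    {
        int m = pattern.Length;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];

        // Against an empty stretch of text, each pattern element is one deletion.
        for (int i = 0; i <= m; i++) prev[i] = i;
        int best = prev[m];

        foreach (char c in text)
        {
            curr[0] = 0; // a match may start at any position in the text
            for (int i = 1; i <= m; i++)
            {
                int substitution = prev[i - 1] + (pattern[i - 1](c) ? 0 : 1);
                int insertion = prev[i] + 1;     // extra character in the text
                int deletion = curr[i - 1] + 1;  // pattern element missing from the text
                curr[i] = Math.Min(substitution, Math.Min(insertion, deletion));
            }
            best = Math.Min(best, curr[m]);
            (prev, curr) = (curr, prev); // reuse the two buffers
        }
        return best;
    }

    public static void Main()
    {
        var threeDigits = new Func<char, bool>[] { char.IsDigit, char.IsDigit, char.IsDigit };
        foreach (string s in new[] { "abc 123", "abc 1Z3", "abc 1i23", "abc 12" })
        {
            Console.WriteLine($"{s}: {MinErrors(threeDigits, s)} error(s)");
        }
    }
}
```

A caller would accept a match when the returned count is within its error allowance, for example one error for the three misread strings above.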

Sample Implementation: Searching for Information in an Image

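For this example, we first need to download the OCR Xpress v2 SDK, which includes ImagXpress v9. Next, we use ImagXpress to load an image containing text as input. Using the OCR Xpress engine, we recognize the text on the image. We then use the approximate regular expression engine built into OCR Xpress v2 to search the recognized text for a regular expression pattern. After that, we use NotateXpress v9 (also included in the OCR Xpress SDK) to highlight the text on screen. Finally, we export a searchable, image-over-text PDF in which the matched text has been redacted on the image and removed from the searchable text.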

Text Recognition

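The first step in creating this application is to load an image and run recognition on it. The code below also shows a few housekeeping steps. The toolkits keep the whole process simple.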

// Open the selected file with ImagXpress
//
ImageX documentImage = ImageX.FromStream(m_imagXpress,
                                         openFileDialog.OpenFile());

// OcrXpress will not assume ownership of the System.Drawing.Bitmap
// created by ImagXpress. The calling application will need to dispose
// of the Bitmap instead. The using statement will do this efficiently.
//
using (System.Drawing.Bitmap
       theImage = documentImage.ToBitmap(false))
{
    m_ocrXpressPage = m_ocrXpress.Document.AddPage(theImage);
}

// Process image with OcrXpress, so we have results to search
//
m_ocrXpress.Document.AutoRotate(m_ocrXpressPage);
m_ocrXpress.Document.Deskew(m_ocrXpressPage);
m_ocrXpress.Document.Recognize(m_ocrXpressPage);

// Give image to ImagXpress viewer to display
//
if (m_ocrXpressPage.BitonalBitmap == null)
    m_imageXView.Image = ImageX.FromBitmap(m_imagXpress,
                                           m_ocrXpressPage.Bitmap);
else
    m_imageXView.Image = ImageX.FromBitmap(m_imagXpress,
                                           m_ocrXpressPage.BitonalBitmap);

Setting the Search Pattern

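Once the image is loaded and recognized, the user can type in a search string or choose from predefined search patterns, such as a phone number. The code below shows how to set the pattern in OCR Xpress and then perform the match. If the application user selects approximate matching, we configure the matcher to allow at most two errors in total, in any combination of up to one substitution, up to two deletions, and up to two insertions.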

using (PatternMatcher search = new PatternMatcher(m_ocrXpress))
{
    List<MatchResult> searchResults;
    search.Pattern = txtSearchPattern.Text;
    if (chkMatchApproximate.CheckState == CheckState.Checked)
    {
        search.MaximumInsertions = 2;
        search.MaximumDeletions = 2;
        search.MaximumSubstitutions = 1;
        search.MaximumErrors = 2;
    }
    search.CaseSensitive = chkCaseSensitive.Checked;
    searchResults = search.PerformMatching(m_ocrXpressPage);
}
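The way the four limits interact can be spelled out by enumeration. The sketch below assumes, as the description above suggests, that each per-kind maximum caps its own error type and that MaximumErrors caps the total. That reading, and the helper names, are assumptions for illustration and are not taken from the OCR Xpress documentation.

```csharp
using System;
using System.Collections.Generic;

public static class ErrorBudget
{
    // Same values as the sample code above.
    public const int MaxInsertions = 2;
    public const int MaxDeletions = 2;
    public const int MaxSubstitutions = 1;
    public const int MaxErrors = 2;

    // True when a mix of errors stays within every per-kind cap and the total cap.
    public static bool IsAllowed(int insertions, int deletions, int substitutions)
    {
        return insertions >= 0 && insertions <= MaxInsertions
            && deletions >= 0 && deletions <= MaxDeletions
            && substitutions >= 0 && substitutions <= MaxSubstitutions
            && insertions + deletions + substitutions <= MaxErrors;
    }

    // Lists every (insertions, deletions, substitutions) mix that is allowed.
    public static List<(int Ins, int Del, int Sub)> AllowedCombinations()
    {
        var allowed = new List<(int Ins, int Del, int Sub)>();
        for (int ins = 0; ins <= MaxInsertions; ins++)
            for (int del = 0; del <= MaxDeletions; del++)
                for (int sub = 0; sub <= MaxSubstitutions; sub++)
                    if (IsAllowed(ins, del, sub))
                        allowed.Add((ins, del, sub));
        return allowed;
    }

    public static void Main()
    {
        foreach (var (ins, del, sub) in AllowedCombinations())
            Console.WriteLine($"insertions={ins} deletions={del} substitutions={sub}");
    }
}
```

Under this reading, two insertions are accepted, but one insertion plus one deletion plus one substitution is rejected because the total would be three.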

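Note that in the sample image, one instance of the word "OCR" has been damaged by an ink blot (added with a paint program), making the "O" look like a "Q". A standard regular expression engine searching for "OCR Xpress" would not match this instance, but with approximate matching turned on, the search finds it, along with several other instances where the space between "OCR" and "Xpress" was dropped.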
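For the redaction scenario, the predefined patterns mentioned earlier would typically cover items such as Social Security numbers, phone numbers, and credit card numbers. The sketch below shows what such patterns might look like. The patterns are written in .NET regular expression syntax and run with System.Text.RegularExpressions; whether OCR Xpress accepts exactly this syntax is an assumption, and the pattern set itself is hypothetical rather than taken from the sample project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PredefinedPatterns
{
    // Illustrative US-style formats only; real documents vary more than this.
    public const string SocialSecurityNumber = @"\b\d{3}-\d{2}-\d{4}\b";
    public const string PhoneNumber = @"\b\d{3}-\d{3}-\d{4}\b";
    public const string CreditCardNumber = @"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b";

    // Returns the start index and text of every exact match of the pattern.
    public static List<(int Index, string Value)> FindAll(string text, string pattern)
    {
        var found = new List<(int Index, string Value)>();
        foreach (Match match in Regex.Matches(text, pattern))
            found.Add((match.Index, match.Value));
        return found;
    }

    public static void Main()
    {
        string text = "SSN: 123-45-6789, phone 555-123-4567";
        foreach (var (index, value) in FindAll(text, SocialSecurityNumber))
            Console.WriteLine($"SSN candidate '{value}' at index {index}");
    }
}
```

These are exact matches; combining such patterns with approximate matching is what lets a search survive OCR misreads like the "Q" for "O" case above.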

Displaying the Results in a List Box

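To display the results, we build a list of match results in a System.Collections.ArrayList and bind it to a Windows.Forms.ListBox control. The ArrayList, named listBoxItems, is populated with fragments of the text lines that contain the matches.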

// Get the OCR results from the global page variable
//
PageResult page = m_ocrXpressPage.GetResult();

// Loop through and report each match
//
foreach (MatchResult result in searchResults)
{
    TextBlockResult block =
        page.GetTextBlockResult(result.TextBlockIndex);

    // If the entire search result is contained within a
    // single text line result, then the process of collecting
    // the text of the search result is a bit easier
    //
    if (result.TextLineStartIndex == result.TextLineEndIndex)
    {
        // Get the text line result where the match begins. In this
        // case, it is also the text line result where the match ends.
        //
        TextLineResult line =
            block.GetTextLineResult(result.TextLineStartIndex);

        // Get one word before the match to show context
        //
        int wordsBeforeIndex =
            GetStartIndexOfWordsBefore(line.Text,
                result.CharStartIndex, 1);
        string itemString =
            line.Text.Substring(wordsBeforeIndex,
                result.CharStartIndex - wordsBeforeIndex);
        itemString += "[";
        itemString += line.Text.Substring(
            result.CharStartIndex,
            result.CharEndIndex - result.CharStartIndex);
        itemString += "]";

        // Get eight words after the match to show context
        //
        int wordsAfterIndex =
            GetEndIndexOfWordsAfter(line.Text,
                result.CharEndIndex, 8);
        itemString += line.Text.Substring(result.CharEndIndex,
            wordsAfterIndex - result.CharEndIndex);
    }
    else
    {
        //
        // Download the sample to see the code for this case
        //
    }
}
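The listing above calls two helpers, GetStartIndexOfWordsBefore and GetEndIndexOfWordsAfter, whose bodies are not shown in the article. The sketch below is a guess at what they might do, based only on how they are called: walk a given number of whitespace-separated words backward or forward from a character index. The real implementations in the downloadable sample may differ.

```csharp
using System;

public static class ContextHelpers
{
    // Hypothetical: start index of the text covering 'wordCount' words before 'index'.
    public static int GetStartIndexOfWordsBefore(string text, int index, int wordCount)
    {
        int i = index;
        for (int w = 0; w < wordCount; w++)
        {
            while (i > 0 && char.IsWhiteSpace(text[i - 1])) i--;   // skip the gap
            while (i > 0 && !char.IsWhiteSpace(text[i - 1])) i--;  // skip the word
        }
        return i;
    }

    // Hypothetical: end index (exclusive) of the text covering 'wordCount' words after 'index'.
    public static int GetEndIndexOfWordsAfter(string text, int index, int wordCount)
    {
        int i = index;
        for (int w = 0; w < wordCount; w++)
        {
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;   // skip the gap
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;  // skip the word
        }
        return i;
    }

    // Builds the "context [match] context" string in the same way as the listing.
    public static string FormatMatch(string line, int start, int end, int before, int after)
    {
        int from = GetStartIndexOfWordsBefore(line, start, before);
        int to = GetEndIndexOfWordsAfter(line, end, after);
        return line.Substring(from, start - from)
            + "[" + line.Substring(start, end - start) + "]"
            + line.Substring(end, to - end);
    }

    public static void Main()
    {
        // Prints: call [555-1234] now or
        Console.WriteLine(FormatMatch("call 555-1234 now or later", 5, 13, 1, 2));
    }
}
```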

Highlighting on Screen

An event on the ListBox calls a function that uses NotateXpress to highlight the text on the image. OCR Xpress provides the coordinates of the characters in the image.

// If the entire search result is contained within a
// single text line result, then the process of highlighting
// the result is a bit easier
//
if (result.TextLineStartIndex == result.TextLineEndIndex)
{
    if (result.CharStartIndex == result.CharEndIndex)
    {
        // A match result that contains no characters can result
        // from the regular expression "^" or "$".
        //
        return;
    }
    rectAnnotation = new RectangleTool();
    rectAnnotation.BackStyle = BackStyle.Translucent;
    rectAnnotation.FillColor = fillColor;
    rectAnnotation.Moveable = rectAnnotation.Sizeable = false;

    // Get the text line result where the match begins. In this
    // case, it is also the text line result where the match ends.
    //
    textLine = textBlock.GetTextLineResult(result.TextLineStartIndex);

    // Ensures that we don't get a first character result that
    // is a space character. Since space characters do not
    // provide area information, this would throw off the
    // highlight bounding area
    //
    int i1, i2;
    for (i1 = result.CharStartIndex; i1 < result.CharEndIndex; i1++)
    {
        firstCharacterResult = textLine.GetCharacter(i1);
        if (firstCharacterResult.Text != " ")
            break;
    }

    // Ensures that we don't get a last character result that
    // is a space character. Since space characters do not
    // provide area information, this would throw off the
    // highlight bounding area
    //
    for (i2 = result.CharEndIndex - 1; i2 >= i1; i2--)
    {
        lastCharacterResult = textLine.GetCharacter(i2);
        if (lastCharacterResult.Text != " ")
            break;
    }

    // Construct the area of the highlight for the search result
    // based on the text line Y and Height values. Then use
    // the first and last characters of the search result to
    // create the X and Width values.
    //
    System.Drawing.Rectangle boundingRectangle =
        new System.Drawing.Rectangle();
    boundingRectangle.Y = textLine.Area.Y;
    boundingRectangle.Height = textLine.Area.Height;
    boundingRectangle.X = firstCharacterResult.Area.X;
    boundingRectangle.Width = (lastCharacterResult.Area.Width +
        lastCharacterResult.Area.X) - firstCharacterResult.Area.X;
    rectAnnotation.BoundingRectangle = boundingRectangle;
    layer.Elements.Add(rectAnnotation);
}
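The geometry at the end of the listing can be isolated into a small pure function, which makes it easier to check by hand. This sketch uses only System.Drawing.Rectangle; the function name and the sample coordinates are invented for illustration.

```csharp
using System;
using System.Drawing;

public static class HighlightGeometry
{
    // Vertical extent comes from the text line; horizontal extent runs from the
    // left edge of the first character to the right edge of the last character.
    public static Rectangle HighlightBounds(Rectangle lineArea,
                                            Rectangle firstCharArea,
                                            Rectangle lastCharArea)
    {
        int left = firstCharArea.X;
        int right = lastCharArea.X + lastCharArea.Width;
        return new Rectangle(left, lineArea.Y, right - left, lineArea.Height);
    }

    public static void Main()
    {
        Rectangle line = new Rectangle(10, 100, 500, 20);
        Rectangle first = new Rectangle(50, 102, 8, 16);
        Rectangle last = new Rectangle(120, 101, 10, 17);

        Rectangle bounds = HighlightBounds(line, first, last);
        // Prints: X=50 Y=100 Width=80 Height=20
        Console.WriteLine($"X={bounds.X} Y={bounds.Y} Width={bounds.Width} Height={bounds.Height}");
    }
}
```

Using the line's Y and Height rather than each character's keeps the highlight a uniform strip even when individual characters differ in height.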

Once the image is highlighted, we adjust the scroll position so that the highlighted text is visible on screen.

// Adjust ImagXpress scroll position, so highlight
// will be visible.
//
int xVis = rectAnnotation.BoundingRectangle.X +
           rectAnnotation.BoundingRectangle.Width / 2;
int yVis = rectAnnotation.BoundingRectangle.Y +
           rectAnnotation.BoundingRectangle.Height / 2;
double xOffset = xVis * m_imageXView.ZoomFactor
                 - m_imageXView.Width / 2;
double yOffset = yVis * m_imageXView.ZoomFactor
                 - m_imageXView.Height / 2;
m_imageXView.ScrollPosition = new Point((int)xOffset,
                                        (int)yOffset);
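The scroll calculation centers the highlight in the viewer: take the center of the rectangle in image coordinates, scale it by the zoom factor, then subtract half the viewer size. The sketch below restates that arithmetic as a standalone function. Clamping negative offsets to zero is an addition made here for the sketch; the original listing does not clamp, and how the ImagXpress viewer treats negative scroll positions is not stated in the article.

```csharp
using System;
using System.Drawing;

public static class ScrollMath
{
    // Scroll offset that puts the center of 'highlight' at the center of the viewer.
    public static Point CenterOn(Rectangle highlight, double zoomFactor,
                                 int viewWidth, int viewHeight)
    {
        int xCenter = highlight.X + highlight.Width / 2;
        int yCenter = highlight.Y + highlight.Height / 2;
        double xOffset = xCenter * zoomFactor - viewWidth / 2;
        double yOffset = yCenter * zoomFactor - viewHeight / 2;

        // Assumption for this sketch: never scroll past the top-left corner.
        return new Point(Math.Max(0, (int)xOffset), Math.Max(0, (int)yOffset));
    }

    public static void Main()
    {
        Point p = CenterOn(new Rectangle(50, 100, 80, 20), 2.0, 100, 60);
        // Prints: 130, 190
        Console.WriteLine($"{p.X}, {p.Y}");
    }
}
```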

Redacting and Exporting

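Finally, if the user is satisfied with the search results, they can redact the matched text and export the redacted document as a searchable PDF. We use NotateXpress to burn the redaction marks into the image, then put the redacted image back into OCR Xpress before exporting. The underlying text is redacted as well, by replacing the affected characters with "X", while the rest of the text remains searchable.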

// Accumulate all MatchResult objects in the result
// list box, so we can use them for redaction
//
List<MatchResult> results = new List<MatchResult>();
foreach (ResultListBoxItem item in listBoxItems)
{
    results.Add((MatchResult)item.ValueObject);
}

// Redact and replace search results
//

using (ImageXView
       imageXView = new ImageXView(m_imagXpress))
{
    // Make a copy of the displayed image, so the
    // redactions do not affect it
    //
    imageXView.Image = m_imageXView.Image.Copy();

    RedactSearchResultsOnImage(results, imageXView);

    // Set the redacted image back into the primary
    // image of the OcrXpress page, so it will be
    // used during export
    //
    using (System.Drawing.Bitmap
        redactedImage = imageXView.Image.ToBitmap(false))
    {
        m_ocrXpressPage.Bitmap = redactedImage;
    }

    RedactSearchResultsInCurrentPage(results);

    m_ocrXpress.Document.Export(ExportFormat.PdfImageWithText,
        saveFileDialog.FileName);
}
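The text side of the redaction, replacing matched characters with "X" while leaving everything else searchable, can be illustrated on a plain string. This sketch does not show what RedactSearchResultsInCurrentPage does internally, which the article leaves to the downloadable sample. It assumes matches are given as start (inclusive) and end (exclusive) character indexes, mirroring CharStartIndex and CharEndIndex as used in the listings above.

```csharp
using System;
using System.Collections.Generic;

public static class TextRedaction
{
    // Replaces every character inside the given [Start, End) ranges with 'X'.
    public static string RedactRanges(string text, IEnumerable<(int Start, int End)> ranges)
    {
        char[] chars = text.ToCharArray();
        foreach (var (start, end) in ranges)
        {
            int from = Math.Max(0, start);
            int to = Math.Min(chars.Length, end);
            for (int i = from; i < to; i++)
                chars[i] = 'X';
        }
        return new string(chars);
    }

    public static void Main()
    {
        string redacted = RedactRanges("SSN: 123-45-6789 on file", new[] { (5, 16) });
        // Prints: SSN: XXXXXXXXXXX on file
        Console.WriteLine(redacted);
    }
}
```

Replacing characters one for one keeps the redacted text the same length as the original, so the positions of the remaining words are unchanged.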

Conclusion

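Developers can build powerful image search solutions by combining accurate OCR with an approximate regular expression engine. This technology addresses many common problems found across industries. The simple OCR-and-search sample we created here shows how the OCR Xpress and ImagXpress SDKs make it easy to create such business solutions.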

You can find Pegasus Imaging product downloads and feature lists at www.pegasusimaging.com. For more information, contact sales@jpg.com or support@jpg.com.