OfficeQuery Using XLINQ

Danny Hauptman

January 17, 2010

CPOL

Searching compressed Word 2007 (.docx) documents using XLINQ.

Download source code - 2.38 KB

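Introduction

Let me start by saying that the original idea and code came from Microsoft, on its "Virtual Labs" website. It is a great service: not every lab works, but the ones that do are usually very well done. It is a lifesaver when your former employer "accidentally" formats your computer and you lose all of your projects and your only copy of Visual Studio.

You basically connect remotely to a virtual computer, and different labs let you work with different versions of Windows, SQL Server, and Visual Studio, which makes the service very useful. Unlike "normal" labs, which force you through every step being taught, these let you try out and test your own code, or take a different approach from the one the lab suggests.

The only drawback I found is that you cannot save or copy the code you used to your own computer, or email yourself a copy of the solution. I could go on and on, but you should go check it out.

This is a simple application that offers a pretty cool way to search MS Word 2007 .docx files. If you would rather dive into the code than follow every detail of this article, here is a quick list of what the UI needs.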

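Shortcut to the UI: here are the basic UI requirements

BackgroundWorker
FolderBrowserDialog
LinkLabel (used to open the folder dialog)
Button (used to start the search)
TextBox (holds the query string)
Label (shows how many times the search term was found)
RichTextBox (holds the results)

Skim the paragraphs under "Quick Background"; they explain what we are searching and how. Also note the `using` statements this project needs.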

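Quick Background

With the release of Office 2007 came a new file format for Word, Excel, and PowerPoint documents, called OpenXML.

First, find a Word 2007 document with the .docx extension that contains a fair amount of text. Then make a copy of it and rename the copy so that it has a .zip extension.

Now things get interesting. Double-click the zip file and you should see a number of folders and files rather than a single document. One of the folders is named word; inside it is an XML file named document.xml, which is what we will search later. You can double-click document.xml to view the XML in Internet Explorer.

When you look through document.xml, you will notice t elements, which hold the document's text. They are written with a w prefix, which stands for the Word (WordprocessingML) namespace, so they should look something like this: <w:t>Your text here bla bla bla...</w:t>. These are the elements we will parse and search.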
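To make the structure concrete, here is a small sketch of what the text elements look like and how LINQ to XML can pull them out. The XML is a hand-written, heavily trimmed stand-in for a real document.xml, and the class and method names (DocumentXmlSketch, TextRuns) are made up for this illustration; only the namespace URI and the element names come from the file format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DocumentXmlSketch
{
    // The WordprocessingML namespace that the "w" prefix is bound to.
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: a real document.xml carries far more markup.
    public const string SampleXml =
        "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:body>" +
        "<w:p><w:r><w:t>Hello from the first paragraph</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Second paragraph, </w:t></w:r>" +
        "<w:r><w:t>split into two runs</w:t></w:r></w:p>" +
        "</w:body></w:document>";

    // Returns the text of every <w:t> element, in document order.
    public static List<string> TextRuns(string xml)
    {
        XElement root = XElement.Parse(xml);
        return root.Descendants(W + "t").Select(t => t.Value).ToList();
    }

    public static void Main()
    {
        foreach (string run in TextRuns(SampleXml))
            Console.WriteLine(run);
    }
}
```

Note that one visible sentence can be split across several t elements, as in the second paragraph of the sample, so a search that looks at one element at a time can miss a phrase that spans two of them.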

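Getting Started - Creating the UI

Create a new C# Windows Forms project and make sure it includes the using directives below (the project template normally adds the Windows Forms, Drawing, ComponentModel, and Collections.Generic ones already). System.IO.Packaging lives in the WindowsBase assembly, so add a reference to it if the project does not have one.

using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Xml;
using System.Xml.Linq;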

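Before diving into the code, we should build the UI. First, find a FolderBrowserDialog in the "Dialogs" section of the Toolbox and add it. Next, add a BackgroundWorker to the form (it is in the "Components" section of the Toolbox).

At the top of the form, add a LinkLabel control, set its Text property to "Click to Select Folder", and set its Name property to linkFolderSelect; this will open the folder browser.

Below the link label, add a Button control; set its Text property to "Search" and its Name property to btnSearch. Also make sure its Enabled property is set to false in the Properties window. We disable this button because we do not want the user to click Search before choosing a folder to search, which helps prevent unexpected exceptions.

Next, add a Label on the left side of the form and set its Text to something like "Query String". Right next to the Label, add a TextBox and set its Name property to tbSearchParam. This is where the user types the string to search for in the loaded documents.

Good, almost done. Place another Label below the Label and TextBox we just added. Set its Text property to something like "Results" and its Name property to lblResults. Finally, add a RichTextBox control below the "Results" label and stretch it to the bottom of the form; since it displays the search results, it should be fairly large. Set the RichTextBox's Name property to tbResults.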

Now, on to the Code

Add two private strings that will be used throughout the project. One holds the search parameter from the text box we named tbSearchParam; the other holds the path of the selected folder. Finally, add a private List<string> that will contain the results of our query. The code is shown below.

namespace officeQuery
{
  public partial class Form1: Form
  {
    //here are a few variables we'll be using throughout the project
     private string _searchParam;
     private string _selectedFolder;
     private List<string> _results;

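In the Form constructor, we initialize the FolderBrowserDialog so that it opens at the folder we want the first time it is shown. I simply set it to the "C:\" directory, but you could just as easily set it to "C:\Documents" and so on.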

public Form1()
{
   InitializeComponent();
  
   //Here we are initializing where the fileDialog will open by default
   folderBrowserDialog1.SelectedPath = @"C:\";
}

Now go back to the designer and double-click the link label to create the empty method that fires when it is clicked. This is where you add the code that opens the folder browser. It is generic code for opening a folder browser and can be reused in other applications. We open and display the FolderBrowserDialog by calling ShowDialog(this) and assigning its return value to a DialogResult variable. When the user selects a folder and clicks OK, we assign its path to our private string variable _selectedFolder. Now that we have a folder to query, we can enable the Search button so that the user can search the documents in it.

/*
method: linkFolderSelect_LinkClicked
accepts: object, LinkLabelLinkClickedEventArgs
returns: nothing
Desc: Opens a folder dialog box and assigns the selected folder
      to our private variable _selectedFolder.
      It also enables the search button.
*/
private void linkFolderSelect_LinkClicked(object sender, 
             LinkLabelLinkClickedEventArgs e)
{
   //open the dialog by calling the ShowDialog method
   DialogResult res = folderBrowserDialog1.ShowDialog(this);

   //ShowDialog returns only once the user is done selecting
   if (res == DialogResult.OK)
   {
      //get the path of the folder and assign it to our variable
      _selectedFolder = folderBrowserDialog1.SelectedPath;

      //show the folder name on the link; the path is a folder,
      //so use DirectoryInfo rather than FileInfo
      DirectoryInfo dInfo = new DirectoryInfo(_selectedFolder);
      linkFolderSelect.Text = dInfo.Name;

      //enable the search btn, since we have a folder to search
      btnSearch.Enabled = true;
   }
}

Go back to the designer and double-click the search button we named btnSearch; this takes us to the button's Click event handler.

This handler does a fair amount of work. First, we create the BackgroundWorker and subscribe our QueryComplete method to its RunWorkerCompleted event by wrapping it in a RunWorkerCompletedEventHandler delegate. QueryComplete formats and displays the results once the query has finished. Next, we subscribe our Query method to the BackgroundWorker's DoWork event through a DoWorkEventHandler delegate. Query does most of the processing, as we will see later.

Here is the code for our button Click handler so far.

/*
method: btnSearch_Click
accepts: object sender, EventArgs  e
returns: nothing
Desc: Sets up the BackgroundWorker object and begins 
      to query the documents for the query string.
*/
   private void btnSearch_Click(object sender, EventArgs e)
   {

     //create a fresh BackgroundWorker for this search
     backgroundWorker1 = new BackgroundWorker();
    
     //subscribe QueryComplete to RunWorkerCompleted so it can
     //handle the formatting when the query is complete
     backgroundWorker1.RunWorkerCompleted += 
         new RunWorkerCompletedEventHandler(QueryComplete);

     //subscribe Query to DoWork; it runs the LINQ query on a worker thread
     backgroundWorker1.DoWork += new DoWorkEventHandler(Query);

Let's finish the button Click handler. Once the BackgroundWorker is set up, we clear the RichTextBox that will hold the results (so that no old search results remain). We set the Enabled property of the Search button and the link label to false, read the search parameter, and assign it to our variable _searchParam. Finally, we call RunWorkerAsync() on the BackgroundWorker to start the work.

    //Now we should make sure the RichTextBox result window is clear
    tbResults.Clear();

    //here we display 'searching'  on the results label
    lblResults.Text = "Searching...";

    //disable the search and link buttons while the query is running
    btnSearch.Enabled = false;
    linkFolderSelect.Enabled = false;

    //copy the text from the search box into the _searchParam variable
    _searchParam = tbSearchParam.Text.Trim();

    //finally begin the backgroundWorker
    backgroundWorker1.RunWorkerAsync();
}
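The DoWork / RunWorkerCompleted handshake is easier to see without any controls in the way, so here is a rough console sketch of it. It is not the article's code: the names WorkerSketch and CountOnWorker are made up, the "search" is just a character count, and it passes data through e.Argument and e.Result instead of the private fields the form uses. The blocking wait is only there so a console program does not exit early; never block the UI thread like this in a Forms application.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

public static class WorkerSketch
{
    // Counts occurrences of 'wanted' in 'text' on a BackgroundWorker and
    // returns the value that the completed handler received.
    public static int CountOnWorker(string text, char wanted)
    {
        int received = -1;
        using (var done = new ManualResetEvent(false))
        using (var worker = new BackgroundWorker())
        {
            // DoWork runs on a thread-pool thread.
            worker.DoWork += (sender, e) =>
            {
                string input = (string)e.Argument;
                int count = 0;
                foreach (char c in input)
                    if (c == wanted) count++;
                e.Result = count; // handed to RunWorkerCompleted
            };

            // RunWorkerCompleted fires after DoWork returns or throws.
            worker.RunWorkerCompleted += (sender, e) =>
            {
                // e.Error holds any exception thrown inside DoWork.
                if (e.Error == null) received = (int)e.Result;
                done.Set();
            };

            worker.RunWorkerAsync(text); // 'text' arrives as e.Argument
            done.WaitOne();              // console demo only
        }
        return received;
    }

    public static void Main()
    {
        Console.WriteLine(CountOnWorker("banana", 'a'));
    }
}
```

In a Forms application, the completed handler is raised back on the UI thread, which is why QueryComplete can touch the RichTextBox directly.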

I will describe the QueryComplete method only briefly, because it is quite simple and does exactly what its name suggests: it handles the completed query and displays whatever was found. This is the method we subscribed to the BackgroundWorker's RunWorkerCompleted event.

First, we show the number of matching entries on our lblResults label. Then we loop over our _results variable with a foreach. Inside the loop, we find and highlight every spot that matches the user's search parameter.

The complete QueryComplete method is shown below.

/*
Method: QueryComplete
Accepts: object sender, RunWorkerCompletedEventArgs e
Returns: nothing
Desc: This method deals with the completed query and displays 
      any of the found items in the RichTextBox as well 
      as highlighting the query string.
*/
void QueryComplete(object sender, RunWorkerCompletedEventArgs e)
{

    //display how many items we found within the query
    lblResults.Text = string.Format(
      "Results [{0} result(s) found]", _results.Count);

    //loop thru the results and split them
    foreach (string s in _results)
    {
        //each entry is "text|file name"; split at the LAST '|'
        //because the text itself may contain that character
        int sep = s.LastIndexOf('|');
        string t = s.Substring(0, sep);
        string fileName = s.Substring(sep + 1);
        int i = t.IndexOf(_searchParam, StringComparison.Ordinal);

        //highlight every match; >= 0 so a match at the very start counts,
        //and the length check keeps an empty search term from looping forever
        while (_searchParam.Length > 0 && i >= 0)
        {
           tbResults.AppendText(t.Substring(0, i));
           tbResults.SelectionColor = Color.Red;
           tbResults.AppendText(_searchParam);
           tbResults.SelectionColor = Color.Black;
           t = t.Substring(i + _searchParam.Length);
           i = t.IndexOf(_searchParam, StringComparison.Ordinal);
        
        } //end while

        //append the remaining text to the RichTextBox
        tbResults.AppendText(t);
        //append the file name in a different color
        tbResults.SelectionColor = Color.DarkGreen;
        tbResults.AppendText(string.Format(" [{0}] ", fileName));
        tbResults.AppendText(Environment.NewLine);
        tbResults.SelectionColor = Color.Black;

    } //end foreach
      
    //re-enable the Button and LinkLabel
    btnSearch.Enabled = true;
    linkFolderSelect.Enabled = true;
}
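The highlighting loop is the fiddliest part of this method, so here is a sketch of the same idea pulled out into a plain function that can be tried without a RichTextBox. HighlightSketch, Segment, and Split are names invented for this example; the function only reports which pieces of a line are matches, and the caller would pick the colors.

```csharp
using System;
using System.Collections.Generic;

public static class HighlightSketch
{
    // One piece of a line: its text, and whether it should be highlighted.
    public struct Segment
    {
        public string Text;
        public bool IsMatch;
        public Segment(string text, bool isMatch) { Text = text; IsMatch = isMatch; }
    }

    // Splits 'line' into alternating plain and matching segments.
    public static List<Segment> Split(string line, string term)
    {
        var segments = new List<Segment>();

        // An empty term matches everywhere; treat the whole line as plain text.
        if (string.IsNullOrEmpty(term))
        {
            segments.Add(new Segment(line, false));
            return segments;
        }

        int start = 0;
        int hit = line.IndexOf(term, start, StringComparison.Ordinal);
        while (hit >= 0) // >= 0 so a match at index 0 is not skipped
        {
            if (hit > start)
                segments.Add(new Segment(line.Substring(start, hit - start), false));
            segments.Add(new Segment(term, true));
            start = hit + term.Length;
            hit = line.IndexOf(term, start, StringComparison.Ordinal);
        }

        // Whatever follows the last match is plain text.
        if (start < line.Length)
            segments.Add(new Segment(line.Substring(start), false));
        return segments;
    }

    public static void Main()
    {
        // Brackets stand in for the red highlight.
        foreach (Segment seg in Split("cat and cat", "cat"))
            Console.Write(seg.IsMatch ? "[" + seg.Text + "]" : seg.Text);
        Console.WriteLine(); // prints "[cat] and [cat]"
    }
}
```

Keeping the string logic separate from the control makes the edge cases (a match at the start of the line, an empty search term) easy to check on their own.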

The Query method is the one we subscribed to the BackgroundWorker's DoWork event.

This method creates a new List<string> object to hold our results. Then we loop over the .docx files found by a DirectoryInfo object for the folder we want to search. Finally, we pass each .docx file to the WordDocumentQuery method, which is where the LINQ magic happens.

/*
Method: Query
Accepts: object sender, DoWorkEventArgs e
Returns: nothing
Desc: This method initializes _results, gets the selected folder 
  and calls the WordDocumentQuery method, which will perform the LINQ query
*/
void Query(object sender, DoWorkEventArgs e)
{
    //create the _results object
    _results = new List<string>();
       
    //create a DirectoryInfo object for the folder to be searched
    DirectoryInfo dir = new DirectoryInfo(_selectedFolder);
 
    //now we will search the folder for '.docx' extensions
    foreach (FileInfo f in dir.GetFiles("*.docx"))
    {
        WordDocumentQuery(f);
    }  
}

We have finally reached the last method in this project: WordDocumentQuery. This method accepts a FileInfo object. First, we add an XNamespace object. Next, we add an instance of the Package class, which gives us access to the entire contents of the file. We create a Uri object that points to the XML file we want to search, namely the document.xml file inside the zip. The PackagePart object represents the content at that URI; as the name "docPart" suggests, it is just one part of the whole package. Next, we create an XmlReader object based on the PackagePart we are interested in.

So now we have a way to read the XML content of the document.xml part inside the Word document's zip file.

/*
Method: WordDocumentQuery
Accepts: FileInfo wordDocPath
         this is where we want to search
Returns: nothing
Desc: This method creates a Package of what we are going to search 
      through then a PackagePart, defining the part of the document 
      we want to search. Then we will perform a LINQ query and loop through 
      the results while assigning the query results to the _results variable.*/
void WordDocumentQuery( FileInfo wordDocPath )
{
     XNamespace wordNamespace = 
       "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
  
     //open the Package read-only; we never modify the document
     Package package = Package.Open(wordDocPath.FullName, 
                                    FileMode.Open, FileAccess.Read);
     Uri uri = new Uri("/word/document.xml", UriKind.Relative);
   
     //create a PackagePart of the document.xml
     PackagePart docPart = package.GetPart(uri);
     XmlReader reader = XmlReader.Create(
         docPart.GetStream(FileMode.Open, FileAccess.Read));

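This is where we use LINQ to XML. First, we create an XElement object by calling its Load method with the XmlReader we created earlier. This is the point where we move from the traditional System.Xml API to the LINQ API.

Now for the LINQ query. It is simple, but its syntax differs a little from a traditional SQL query. In this example, we look for all of the t elements (the elements that contain text). We also filter them so that only the elements containing our search parameter are selected.

Once the query is created, we put it in a foreach statement and run it by converting the results to an array. Inside the foreach, we keep appending results to our _results variable. Finally, do not forget to close the Package object.

            //create XElement and load it with our reader object
            XElement wordDoc = XElement.Load(reader);

            //create the XML query; C# query keywords are lowercase
            var query =
                   from c in wordDoc.Descendants(wordNamespace + "t")
                   where c.Value.Contains(_searchParam)
                   select c.Value;

            //loop through the LINQ to XML result set
            foreach (string s in query.ToArray())
            {
                //store "text|file name"; QueryComplete splits it again
                string res = string.Format("{0}|{1}", 
                                    s, wordDocPath.Name);
                _results.Add(res);
            }

            //close the package when you are done
            package.Close();

        }
    } //end of class
} //end of namespace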
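If you want to try the read-and-query path without a real Word file, the sketch below may help. It is an illustration under stated assumptions, not the project's code: it uses ZipArchive from System.IO.Compression instead of the Package class, it builds a tiny zip in memory that contains only word/document.xml (so it is not a valid .docx and Word would not open it), and the names ZipSearchSketch, BuildFakeDocx, and Search are made up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class ZipSearchSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds an in-memory zip with one paragraph per string.
    // Hypothetical helper: a real .docx has more parts than this.
    public static byte[] BuildFakeDocx(params string[] paragraphs)
    {
        var doc = new XElement(W + "document",
            new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
            new XElement(W + "body",
                paragraphs.Select(p =>
                    new XElement(W + "p",
                        new XElement(W + "r",
                            new XElement(W + "t", p))))));

        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so the buffer survives disposing the archive.
            using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            using (Stream entry = zip.CreateEntry("word/document.xml").Open())
            {
                doc.Save(entry);
            }
            return buffer.ToArray();
        }
    }

    // Returns the text of every <w:t> element that contains 'term'.
    public static List<string> Search(byte[] zipBytes, string term)
    {
        using (var buffer = new MemoryStream(zipBytes))
        using (var zip = new ZipArchive(buffer, ZipArchiveMode.Read))
        using (Stream part = zip.GetEntry("word/document.xml").Open())
        {
            XElement root = XElement.Load(part);
            return (from t in root.Descendants(W + "t")
                    where t.Value.Contains(term)
                    select t.Value).ToList();
        }
    }

    public static void Main()
    {
        byte[] fake = BuildFakeDocx("alpha beta", "gamma", "beta again");
        foreach (string hit in Search(fake, "beta"))
            Console.WriteLine(hit);
    }
}
```

The using blocks close the streams even when an exception is thrown, which is worth copying into WordDocumentQuery as well, since a Package that is never closed keeps the file locked.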

Using the Code

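I hope this is an article that anyone can easily follow and code along with. I have attached a .cs file, but I was not able to debug it, so if anyone finds an error, please let me know and I will correct it.

Points of Interest

Once again, the basic idea for this program came from Microsoft's Virtual Labs, and I strongly recommend that you search for them on Google and give them a try.

History

This is the first version, but if anyone runs into errors, I will be happy to update and correct the article promptly.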