Fast Screen Scraping with XPath and a Modified XmlTextReader and SgmlReader

Scott Holodak

May 10, 2007

CPOL

Lightweight, high-performance screen scraping by adding XPath-like position tracking to XmlTextReader and SgmlReader.

Screenshot - FastXPathReaderScreenshot.png

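Introduction

This is a simple, lightweight modification of XmlTextReader and Microsoft's SgmlReader that adds position tracking in the form of an XPath expression. The goal is to extract data from websites without resorting to regular expressions or building an XmlDocument in memory.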

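Background

I've done a fair amount of screen scraping over the years. I noticed that every time I started a new project, I would throw out my existing code and decide to rewrite it. Over time I've tried various tag balancers, DOM versus pull parsers, and so on. To keep track of my position in the document, I've used everything from nested if statements and Boolean variables to stacks. The resulting code was hard to debug and painful to maintain (for example, when a site's template changed).

The solution I'm presenting here is based on XML pull parsing (that is, Readers). It uses a stack (sort of) to track the position and exposes that position as an XPath expression. The XPath expression is correct; however, it is not namespace-aware, is not the minimal expression, does not match attributes (once you've found the element, the attributes are easy to get), and may match more than one part of the document. The first limitation is there because I'm lazy; the others are for speed and simplicity. For example: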

While you might like to get this:
id('featured1ct')/span/h3/a

...or this
//html/body/div[1]/div[3]/div[2]/div[1]/div[1]/div[1]/div[2]/span[1]/span/h3/a

You get one or more of these instead
//html/body/div/div/div/div/div/div/div/div/span/span/h3/a

...quickly

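If you only care about one particular instance of the matching document fragment, you'll need to count for yourself how many times the expression has matched.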
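The counting described above can be sketched as follows. This is a minimal illustration rather than code from the article's download: it uses a plain XmlReader over an in-memory string so that it runs without the FastXPathReader class or a network connection, and the names NthMatchDemo and FindNthMatch, as well as the sample XML, are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NthMatchDemo {
  // Returns the text of the n-th element (1-based) whose simplified path
  // equals targetPath, or null if the document has fewer than n matches.
  public static string FindNthMatch(string xml, string targetPath, int n) {
    List<string> position = new List<string>();
    int seen = 0;
    using (XmlReader reader = XmlReader.Create(new StringReader(xml))) {
      while (reader.Read()) {
        if (reader.NodeType != XmlNodeType.Element) continue;
        // XmlReader depths are zero-based: drop every entry at or below
        // this depth, then record the current element name.
        if (position.Count > reader.Depth) {
          position.RemoveRange(reader.Depth, position.Count - reader.Depth);
        }
        position.Add(reader.Name);
        string path = "//" + string.Join("/", position.ToArray());
        // Count only elements whose path matches; stop at the n-th one.
        if (path == targetPath && ++seen == n) {
          return reader.ReadElementContentAsString();
        }
      }
    }
    return null;
  }

  public static void Main() {
    // Made-up sample data in the shape of the Last.fm response.
    string xml = "<topalbums artist=\"Example\">" +
                 "<album><name>First</name></album>" +
                 "<album><name>Second</name></album>" +
                 "<album><name>Third</name></album>" +
                 "</topalbums>";
    Console.WriteLine(FindNthMatch(xml, "//topalbums/album/name", 2)); // prints "Second"
  }
}
```

The same counter works unchanged inside a 'switch' on the XPath property of either reader; only the path bookkeeping would be replaced by the reader's own tracking.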

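I'm aware of the XPathReader project, and I've played with it. It offers a powerful solution for extracting data, but even without looking at the code I can tell you that it does more work behind the scenes, and has a larger memory footprint, than most simple screen-scraping jobs need. Given the forward-only nature of Readers, you still have to be careful to order your expressions according to the document's structure. Finally, unless I'm missing something, there is no easy way to chain two readers together. In other words, I couldn't find a way to feed an SgmlReader directly into an XPathReader without first dumping the output to a MemoryStream. If you really do need to match one of the first two forms of expression shown above, go with XPathReader. Otherwise, you've come to the right place.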

The Code

FastSgmlXPathReader and FastXPathReader are for HTML and XML documents, respectively. The implementations differ slightly, but the end result is the same.

FastSgmlXPathReader (for HTML)

using System;
using System.Collections.Generic;
using System.Text;
using Sgml;
using System.Xml;

namespace FastXPathReader {
  public class FastSgmlXPathReader : SgmlReader {
    // Not a stack b/c we need to view the entire list
    // to create the string representation

    private List<string> PositionTracker = new List<string>();

    // Used to build the string representation

    private StringBuilder XPathBuilder = new StringBuilder();

    // Override the Read() function to track changes to the XPath

    public override bool Read() {      
      bool Value = base.Read();
      if (Value && base.NodeType == XmlNodeType.Element) {
        while (PositionTracker.Count > this.Depth) {
          // Remove any elements beyond this depth

          PositionTracker.RemoveAt(PositionTracker.Count - 1);
        }
        if (this.Depth != PositionTracker.Count) {
          // Add a new element at this depth

          PositionTracker.Add(this.Name);
        } else {
          // Change the element at this depth

          PositionTracker[PositionTracker.Count - 1] = this.Name;
        }
      }
      return Value;
    }

    // Build an XPath expression from the current location.

    public string XPath {
      get {
        XPathBuilder.Length = 0;
        XPathBuilder.Append("/");
        for (int i = 0; i < PositionTracker.Count; i++) {
          XPathBuilder.Append("/" + PositionTracker[i]);
        }
        return XPathBuilder.ToString();
      }
    }

    // Call the base constructors

    public FastSgmlXPathReader() : base() { }
  }
}

FastXPathReader (for XML)

using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.IO;

namespace FastXPathReader {
  public class FastXPathReader : XmlTextReader {
    // Not a stack b/c we need to view
    // the entire list to create the string representation

    private List<string> PositionTracker = new List<string>();
    
    // Used to build the string representation

    private StringBuilder XPathBuilder = new StringBuilder();

    // Override the Read() function to track changes to the XPath

    public override bool Read() {
      bool Value = base.Read();
      if (Value) {
        if (base.NodeType == XmlNodeType.Document || 
                   base.NodeType == XmlNodeType.Element) {
          if (PositionTracker.Count < this.Depth || 
                   this.Depth == 0 || PositionTracker.Count == 0) {
            // Add the item

            PositionTracker.Add(this.Name);
          } else {
            if (PositionTracker.Count == 0) { 
              // Don't change the root node.

            } else if (PositionTracker.Count > this.Depth) {
              // Change the item at this depth

              PositionTracker[PositionTracker.Count - 1] = this.Name;
            } else {
              // Add a new item for this depth

              PositionTracker.Add(this.Name);
            }
          }
        } else if (base.NodeType == XmlNodeType.EndElement) {
          // Strange bug fix/workaround, but don't remove the root element

          if (PositionTracker.Count > 1) {
            PositionTracker.RemoveAt(PositionTracker.Count - 1);
          }
        }   
      }
      return Value;
    }

    // Build an XPath expression from the current location.

    public string XPath {
      get {
        XPathBuilder.Length = 0;
        XPathBuilder.Append("/");
        for (int i = 0; i < PositionTracker.Count; i++) {
          XPathBuilder.Append("/" + PositionTracker[i]);
        }
        return XPathBuilder.ToString();
      }
    }

    // Call the base constructors

    public FastXPathReader(Stream input) : base(input) {}
    public FastXPathReader(string url) : base(url) {}
    public FastXPathReader(TextReader input) : base(input) { }
    protected FastXPathReader(XmlNameTable nt) : base(nt) { }
    public FastXPathReader(Stream input, XmlNameTable nt) : base(input, nt) { }
    public FastXPathReader(string url, Stream input) : base(url, input) { }
    public FastXPathReader(string url, TextReader input) : base(url, input) { }
    public FastXPathReader(string url, XmlNameTable nt) : base(url, nt) { }
    public FastXPathReader(TextReader input, XmlNameTable nt) : base(input, nt) { }
    public FastXPathReader(Stream xmlFragment, XmlNodeType fragType, 
                           XmlParserContext context) : 
                           base(xmlFragment, fragType, context) { }
    public FastXPathReader(string url, Stream input, 
                           XmlNameTable nt) : base(url, input, nt) { }
    public FastXPathReader(string url, TextReader input, 
                           XmlNameTable nt) : base(url, input, nt) { }
    public FastXPathReader(string xmlFragment, XmlNodeType fragType, 
                           XmlParserContext context) : 
                           base(xmlFragment, fragType, context) { }
  }
}

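Using the Code

I've provided two examples. The first scrapes the top headline and its description from yahoo.com; the second pulls a band's top three albums (Alice in Chains in the example) from Last.fm's Audioscrobbler web service. The former demonstrates scraping an unbalanced HTML document, while the latter shows a quick and easy way to get at the data in an XML document.

In both cases, you simply loop over the Reader's Read() calls and 'switch' on the XPath property added to the Reader to find the XPath you're after.

Yahoo Example (HTML)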

// Create a request for the Yahoo! homepage

HttpWebRequest Request = (HttpWebRequest)
   HttpWebRequest.Create("http://www.yahoo.com/");

// Pretend we're Firefox so we know what Yahoo! is serving up.

Request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0;" + 
                    " en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3";

// Get the response from the server

using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse()) {
  // Create a FastSgmlXPathReader SGML Reader

  using (FastSgmlXPathReader SgmlReader = new FastSgmlXPathReader()) {
    // Wrap the response stream in a StreamReader

    using (StreamReader InputStreamReader = 
           new StreamReader(Response.GetResponseStream())) {
      // Initialize the SgmlReader

      SgmlReader.InputStream = InputStreamReader;
      SgmlReader.DocType = "HTML";
      bool AllDone = false;
      while (!AllDone && SgmlReader.Read()) {
        if (SgmlReader.NodeType == XmlNodeType.Element) {
          switch (SgmlReader.XPath) {
            case "//html/body/div/div/div/div/div/div/div/div/span/span/h3/a":
              string Url = "http://www.yahoo.com/" + SgmlReader["href"];
              lnkHeadline.Text = SgmlReader.ReadInnerXml();
              lnkHeadline.Links.Add(0, lnkHeadline.Text.Length, Url);
              break;
            case "//html/body/div/div/div/div/div/div/div/div/span/span/p":
              string Details = SgmlReader.ReadInnerXml();
              lblDetails.Text = Details.Substring(0, Details.IndexOf('<'));
              AllDone = true;
              break;
          }
        }
      }
    }
  }
}

And the Last.fm Example (XML)

Request = (HttpWebRequest)HttpWebRequest.Create("http://ws.audioscrobbler.com" + 
                         "/1.0/artist/Alice+In+Chains/topalbums.xml");
Request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; " + 
                    "en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3";

// Get the response from the server

int i = 0;
using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse()) {
  using (FastXPathReader XPathReader = 
         new FastXPathReader(Response.GetResponseStream())) {
    bool AllDone = false;
    while (!AllDone && XPathReader.Read()) {
      if (XPathReader.NodeType == XmlNodeType.Element) {
        switch (XPathReader.XPath) {
          case "//topalbums":
            lblArtist.Text = XPathReader["artist"];
            break;
          case "//topalbums/album/name":
            i++;
            if (i == 1) {
              lbl1.Text = "1. " + XPathReader.ReadInnerXml();
            } else if (i == 2) {
              lbl2.Text = "2. " + XPathReader.ReadInnerXml();
            } else if (i == 3) {
              lbl3.Text = "3. " + XPathReader.ReadInnerXml();
            } else {
              AllDone = true;
            }
            break;
        }
      }
    }
  }
}

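As you can see, extracting data is fairly easy whether you're dealing with HTML or XML. Most of the work is just setting up the WebRequests and getting the response stream. Note that you are reading XML nodes on the fly, straight from the HttpWebResponse, with no intermediate storage.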

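Points of Interest

Although FastSgmlXPathReader should work fine on XML documents, you should always use FastXPathReader when you know the document is well-formed, for better performance.

Interestingly, the code for FastSgmlXPathReader and FastXPathReader turned out to be different, even though the two work in basically the same way. I spent a few hours trying to figure out why they behave differently and eventually settled for making them both produce the same results. I suspect it has something to do with SgmlReader creating elements on the fly and how that affects subsequent Read() operations. If anyone has any insight, I'm all ears.

If you run into any problems with the SGML parser itself (entity issues, for example), you'll need to do some research on your own. I'm not affiliated with that project.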

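Possible Performance Enhancements

I realize the code for the XPath property could be more efficient. I've considered dropping the stack (the List<string>) entirely and instead truncating and appending to the StringBuilder by scanning for the last position of '/'. I'm open to your suggestions.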
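One way the StringBuilder-only idea might look is sketched below. It is an untested-for-performance illustration, not part of the article's download: the class name XPathStringTracker and its EnterElement method are hypothetical, and the depth argument is assumed to be zero-based, as XmlTextReader reports it (a reader with a different depth convention would need an offset).

```csharp
using System;
using System.Text;

// Hypothetical helper: keeps the current path directly in a StringBuilder
// instead of a List<string>, trimming segments by scanning back to '/'.
public sealed class XPathStringTracker {
  private readonly StringBuilder path = new StringBuilder("/");
  private int segments = 0;

  // Call once per Element node; depth is assumed to be zero-based.
  public void EnterElement(int depth, string name) {
    // Drop segments left over from deeper or sibling elements.
    while (segments > depth) {
      RemoveLastSegment();
    }
    path.Append('/').Append(name);
    segments++;
  }

  private void RemoveLastSegment() {
    // Scan backwards for the '/' that starts the last segment.
    int i = path.Length - 1;
    while (i > 0 && path[i] != '/') {
      i--;
    }
    path.Length = i;
    segments--;
  }

  public override string ToString() {
    return path.ToString();
  }
}

public static class XPathStringTrackerDemo {
  public static void Main() {
    XPathStringTracker tracker = new XPathStringTracker();
    tracker.EnterElement(0, "html");
    tracker.EnterElement(1, "head");
    tracker.EnterElement(2, "title");
    tracker.EnterElement(1, "body");
    Console.WriteLine(tracker); // prints "//html/body"
  }
}
```

ToString() still allocates a new string on every call, so whether this beats the List<string> version in practice would need to be measured; the saving is in avoiding the rebuild loop each time the property is read.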

Resources

I'd like to link to the GotDotNet workspace, but it looks like GotDotNet is being phased out.

History

May 10, 2007: Initial release.