C#/VB - Automated WebSpider / WebRobot

David Cruwys

16 March 2004

Build a flexible WebRobot and process an entire site using a WebSpider

Download C#/VB.NET source code & demo - 52.9 Kb

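Introduction

What is a WebSpider

A WebSpider, or crawler, is an automated program that follows the links on a website and calls a WebRobot to process the content of each link.

What is a WebRobot

A WebRobot is a program that processes the content found through a link. A WebRobot can be used to index a page or to extract useful information based on a predefined query. Common examples are link checkers, e-mail address extractors, multimedia extractors and update watchers.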

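Background

I recently had a contract to build a web page link checker. The component had to check links stored in a database as well as links on a website, both over the local file system and over the Internet.

This article explains the WebRobot, the WebSpider, and how to enhance the WebRobot with specialized content handlers. Some redundant code, such as try blocks, variable initialization and minor methods, has been removed from the listings.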

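Class overview

The classes that make up the WebRobot are WebPageState, which represents a URI and its current state in the processing chain, and an implementation of IWebPageProcessor, which performs the actual reading of the URI, calls the content handlers and deals with page errors.

The WebSpider consists of a single class, WebSpider. It maintains a list of pending and processed URIs, held as WebPageState objects, and runs the WebPageProcessor against each WebPageState to extract links to other pages and to test whether the URI is valid.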

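Using the code - WebRobot

Web page handling is done by an object that implements IWebPageProcessor. The Process method expects a WebPageState, which is updated during page processing, and returns true if everything succeeds. Any number of content handlers can also be called after the page has been read, by assigning WebPageContentDelegate delegates to the processor.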

public delegate void WebPageContentDelegate( WebPageState state );

public interface IWebPageProcessor
{
   bool Process( WebPageState state );

   WebPageContentDelegate ContentHandler { get; set; }
}
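Because ContentHandler is a delegate, several handlers can be combined with += and the processor then invokes all of them with a single call. The following is a minimal sketch of that mechanism only. DemoPageState, DemoContentDelegate and both handlers are hypothetical stand-ins invented for this example, not the article's classes.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down stand-in for WebPageState (assumption: only what the sketch needs).
public class DemoPageState
{
    public string Content = "";
    public List<string> Notes = new List<string>();
}

// Same shape as the article's WebPageContentDelegate.
public delegate void DemoContentDelegate(DemoPageState state);

public static class ContentHandlerDemo
{
    // Hypothetical handler: notes whether the page contains a mailto: link.
    public static void FindEmailLinks(DemoPageState state)
    {
        if (state.Content.IndexOf("mailto:", StringComparison.OrdinalIgnoreCase) > -1)
        {
            state.Notes.Add("has email link");
        }
    }

    // Hypothetical handler: notes the length of the content.
    public static void MeasureLength(DemoPageState state)
    {
        state.Notes.Add("length=" + state.Content.Length);
    }

    // Chains both handlers and invokes them the way a processor
    // would invoke its ContentHandler after reading a page.
    public static DemoPageState Run(string content)
    {
        DemoContentDelegate handler = null;
        handler += FindEmailLinks;
        handler += MeasureLength;

        DemoPageState state = new DemoPageState();
        state.Content = content;

        if (handler != null)
        {
            handler(state); // handlers run in the order they were added
        }
        return state;
    }
}
```

Handlers run in the order they were added, and an exception thrown by one handler stops the ones after it, so each handler is best kept small and defensive.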

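The WebPageState object holds state and content information for the URI being processed. All properties of this object are read/write except the URI, which must be passed in through the constructor.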

public class WebPageState
{
   private WebPageState( ) {}

   public WebPageState( Uri uri )
   {
      m_uri = uri;
   }

   public WebPageState( string uri )
      : this( new Uri( uri ) ) { }

   Uri      m_uri;                           // URI to be processed
   string   m_content;                       // Content of the web page
   string   m_processInstructions   = "";    // User-defined instructions for content handlers
   bool     m_processStarted        = false; // Becomes true when processing starts
   bool     m_processSuccessfull    = false; // Becomes true if processing was successful
   string   m_statusCode;                    // HTTP status code
   string   m_statusDescription;             // HTTP status description, or exception message

   // Standard getters/setters ...
}

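WebPageProcessor is an implementation of IWebPageProcessor that does the actual work of reading the content, handling error codes and exceptions, and calling the content handlers. WebPageProcessor can be replaced or extended to provide additional functionality, although adding a content handler is usually the better option.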

   // Requires: using System; using System.IO; using System.Net;
   public class WebPageProcessor : IWebPageProcessor
   {
      // Stores any content handlers
      private WebPageContentDelegate m_contentHandler = null;

      public WebPageContentDelegate ContentHandler
      {
         get { return m_contentHandler; }
         set { m_contentHandler = value; }
      }

      public bool Process( WebPageState state )
      {
         state.ProcessStarted       = true;
         state.ProcessSuccessfull   = false;

         // WebRequest.Create handles URIs for the
         // following schemes: file, http & https
         WebRequest  req = WebRequest.Create( state.Uri );
         WebResponse res = null;

         try
         {
            // Issue the request and wait for the response.
            // If anything is going to go wrong, it is most likely
            // to happen here, in the form of an exception.
            res = req.GetResponse( );

            // If we reach here then everything is likely to be OK.
            if ( res is HttpWebResponse )
            {
               state.StatusCode        = 
                ((HttpWebResponse)res).StatusCode.ToString( );
               state.StatusDescription = 
                ((HttpWebResponse)res).StatusDescription;
            }
            if ( res is FileWebResponse )
            {
               state.StatusCode        = "OK";
               state.StatusDescription = "OK";
            }

            // Comparing from the literal avoids a NullReferenceException
            // when the response is neither an HTTP nor a file response.
            if ( "OK".Equals( state.StatusCode ) )
            {
               // Read the contents into our state 
               // object and fire the content handlers
               using ( StreamReader sr = new StreamReader( 
                 res.GetResponseStream( ) ) )
               {
                  state.Content     = sr.ReadToEnd( );
               }

               if ( ContentHandler != null )
               {
                  ContentHandler( state );
               }
            }

            state.ProcessSuccessfull = true;
         }
         catch( Exception ex )
         {
            // HandleException is one of the private methods
            // omitted from this listing.
            HandleException( ex, state );
         }
         finally
         {
            if ( res != null )
            {
               res.Close( );
            }
         }

         return state.ProcessSuccessfull;
      }
   }

WebPageProcessor has other private methods that handle HTTP error codes, file-not-found errors when dealing with the "file://" scheme, and more severe exceptions.
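Those private methods are not shown in the article. As a rough sketch of what such error handling might look like, the hypothetical helper below maps an exception to a status code and description pair. The class name, the method name, the "FileNotFound" and "FatalError" labels, and the assumption that a missing file:// target surfaces as an inner FileNotFoundException are all illustrative guesses rather than the author's code.

```csharp
using System;
using System.IO;
using System.Net;

public static class ErrorMappingSketch
{
    // Returns { statusCode, statusDescription } for a failed request.
    public static string[] Describe(Exception ex)
    {
        WebException webEx = ex as WebException;
        if (webEx != null)
        {
            // HTTP errors such as 404 or 500 still carry the server's response.
            HttpWebResponse http = webEx.Response as HttpWebResponse;
            if (http != null)
            {
                return new string[] { http.StatusCode.ToString(), http.StatusDescription };
            }

            // Assumption: a missing file:// target arrives as an inner FileNotFoundException.
            if (webEx.InnerException is FileNotFoundException)
            {
                return new string[] { "FileNotFound", webEx.InnerException.Message };
            }

            // No response at all: DNS failure, timeout, refused connection and so on.
            return new string[] { webEx.Status.ToString(), webEx.Message };
        }

        // Anything else is treated as a more severe, unexpected failure.
        return new string[] { "FatalError", ex.Message };
    }
}
```

A HandleException method could copy the two returned values into state.StatusCode and state.StatusDescription and leave ProcessSuccessfull set to false.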

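Using the code - WebSpider

The WebSpider class is really just a harness for calling the WebRobot in a particular way. It provides the robot with a specialized content handler for crawling through web links, and maintains a list of pending pages and of pages already visited. The current WebSpider is designed to start from a given URI and to limit full page processing to a base path.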

// CONSTRUCTORS
//
// Process a URI, until all links are checked, 
// only add new links for processing if they
// point to the same host as specified in the startUri.
public WebSpider(
   string            startUri
   ) : this ( startUri, -1 ) { }

// As above only limit the links to uriProcessedCountMax.
public WebSpider(
   string            startUri,
   int               uriProcessedCountMax
   ) : this ( startUri, "", uriProcessedCountMax, 
     false, new WebPageProcessor( ) ) { }

// As above, except new links are only added if
// they are on the path specified by baseUri.
public WebSpider(
   string            startUri,
   string            baseUri,
   int               uriProcessedCountMax
   ) : this ( startUri, baseUri, uriProcessedCountMax, 
     false, new WebPageProcessor( ) ) { }

// As above, you can specify whether the web page
// content is kept after it is processed, by
// default this would be false to conserve memory
// when used on large sites.
public WebSpider(
   string            startUri,
   string            baseUri,
   int               uriProcessedCountMax,
   bool              keepWebContent,
   IWebPageProcessor webPageProcessor )
{
   // Initialize web spider ...
}

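Why is there a base path limit?

With trillions of pages on the Internet, a spider that followed every link would never finish. This spider therefore checks every link it finds to see whether it is valid, but only adds new links to the pending queue when they belong to the initial website or to a sub path of that website.

So if we start from www.myhost.com/index.html, and this page links to www.myhost.com/pageWithSomeLinks.html and www.google.com/pageWithManyLinks.html, the WebRobot is called against both links to check that they are valid, but only the new links found within www.myhost.com/pageWithSomeLinks.html are added to the queue.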
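The rule can be expressed as a prefix test on the absolute URI, which is what the AddWebPage listing further down does with StartsWith. The sketch below isolates that test. BasePathSketch and ShouldFollowLinks are names made up for this example, and the explicit ordinal comparison is an added assumption rather than something taken from the article.

```csharp
using System;

public static class BasePathSketch
{
    // True when links found on 'page' should be queued for crawling,
    // i.e. when the page lives on or below the base path.
    public static bool ShouldFollowLinks(Uri baseUri, Uri page)
    {
        return page.AbsoluteUri.StartsWith(baseUri.AbsoluteUri, StringComparison.Ordinal);
    }
}
```

With a base of http://www.myhost.com/, the page http://www.myhost.com/pageWithSomeLinks.html passes the test, while http://www.google.com/pageWithManyLinks.html is still checked for validity but its links are not followed.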

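Call the Execute method to start the spider. This method adds the startUri to a Queue of pending pages and then calls the IWebPageProcessor until there are no pages left to process or, if a limit was set, until uriProcessedCountMax pages have been processed.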

public void Execute( )
{
   AddWebPage( StartUri, StartUri.AbsoluteUri );

   while ( WebPagesPending.Count > 0 &&
      ( UriProcessedCountMax == -1 || UriProcessedCount 
        < UriProcessedCountMax ) )
   {
      WebPageState state = (WebPageState)m_webPagesPending.Dequeue( );

      m_webPageProcessor.Process( state );

      if ( ! KeepWebContent )
      {
         state.Content = null;
      }

      UriProcessedCount++;
   }
}

A page is only added to the queue if the URI (excluding the anchor) points to a path or a valid page (e.g. .html, .aspx, .jsp etc.) and has not been seen before.

private bool AddWebPage( Uri baseUri, string newUri )
{
   Uri      uri      = new Uri( baseUri, 
     StrUtil.LeftIndexOf( newUri, "#" ) );

   if ( ! ValidPage( uri.LocalPath ) || m_webPages.Contains( uri ) )
   {
      return false;
   }
   WebPageState state = new WebPageState( uri );

   if ( uri.AbsoluteUri.StartsWith( BaseUri.AbsoluteUri ) )
   {
      state.ProcessInstructions += "Handle Links";
   }

   m_webPagesPending.Enqueue  ( state );
   m_webPages.Add             ( uri, state );

   return true;
}
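AddWebPage relies on two helpers that the article does not list, StrUtil.LeftIndexOf and ValidPage. The sketch below shows one plausible way to write them, purely as an illustration. LeftOf, LooksLikePage, Resolve and the list of extensions are assumptions made for this example and may differ from the code in the download.

```csharp
using System;

public static class LinkFilterSketch
{
    // Assumed page extensions; the real ValidPage may accept a different set.
    static readonly string[] PageExtensions =
        { ".html", ".htm", ".aspx", ".asp", ".jsp", ".php" };

    // Stand-in for StrUtil.LeftIndexOf: the text before the first marker,
    // or the whole string when the marker is absent.
    public static string LeftOf(string text, string marker)
    {
        int index = text.IndexOf(marker, StringComparison.Ordinal);
        return index < 0 ? text : text.Substring(0, index);
    }

    // Stand-in for ValidPage: accepts a directory-style path (no extension)
    // or a file with a known page extension.
    public static bool LooksLikePage(string localPath)
    {
        int slash = localPath.LastIndexOf('/');
        int dot   = localPath.LastIndexOf('.');
        if (dot <= slash)
        {
            return true; // "/", "/products/" or "/products"
        }
        string extension = localPath.Substring(dot).ToLowerInvariant();
        return Array.IndexOf(PageExtensions, extension) >= 0;
    }

    // Resolves a link against the page it was found on, dropping the anchor.
    public static Uri Resolve(Uri baseUri, string link)
    {
        return new Uri(baseUri, LeftOf(link, "#"));
    }
}
```

Dropping the anchor before the lookup means page.html#top and page.html#bottom count as the same page, so each document is fetched only once.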

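Examples of running the spider

The following code shows three examples of calling the WebSpider. The paths shown are examples only and do not represent the real structure of this website. Note: the Bondi Beer website in the examples is a site I built with my own SiteGenerator. This easy-to-use program produces static websites from dynamic content such as proprietary data files, XML/XSLT files, databases, RSS feeds and more.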

/*
* Check for broken links found on this website, limit the spider to 100 pages.
*/
WebSpider spider = new WebSpider( "http://www.bondibeer.com.au/", 100 );
spider.Execute( );

/*
* Check for broken links found on this website, 
* there is no limit on the number
* of pages, but it will not look for new links on
* pages that are not within the
* path http://www.bondibeer.com.au/products/.  This
* means that the home page found
* at http://www.bondibeer.com.au/home.html may be
* checked for existence if it was
* called from the somepub/index.html but any
* links within that page will not be
* added to the pending list, as that page is outside the base path.
*/
spider = new WebSpider(
      "http://www.bondibeer.com.au/products/somepub/index.html",
      "http://www.bondibeer.com.au/products/", -1 );
spider.Execute( );

/*
* Check for pages on the website that have funny 
* jokes or pictures of sexy women.
*/
spider = new WebSpider( "http://www.bondibeer.com.au/" );
spider.WebPageProcessor.ContentHandler += 
  new WebPageContentDelegate( FunnyJokes );
spider.WebPageProcessor.ContentHandler += 
  new WebPageContentDelegate( SexyWomen );
spider.Execute( );

private void FunnyJokes( WebPageState state )
{
   if( state.Content.IndexOf( "Funny Joke" ) > -1 )
   {
      // Do something
   }
}
private void SexyWomen( WebPageState state )
{
   Match       m     = RegExUtil.GetMatchRegEx( 
     RegularExpression.SrcExtractor, state.Content );
   string      image;

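   while( m.Success )
   {
      // Read the current match first, then advance at the end of the loop,
      // so the first match is not skipped and a failed match is never read.
      image = m.Groups[1].ToString( );
      string lower = image.ToLower( );

      if ( lower.IndexOf( "sexy" ) > -1 || 
        lower.IndexOf( "women" ) > -1 )
      {
         // Pass the original casing, since paths can be case-sensitive.
         DownloadImage( image );
      }

      m     = m.NextMatch( );
   }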
}

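Conclusion

The WebSpider is flexible enough to be used in a variety of useful scenarios, and could be a powerful tool for data mining websites on the Internet and on intranets. I would like to hear how people have used this code.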

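Outstanding issues

These issues are minor, but if anyone has any ideas, please share them.

state.ProcessInstructions - This is really just a quick hack to give content handlers instructions they can use as they see fit. I am looking for a more elegant solution to this problem.
Multithreaded spider - This project started out as a multithreaded spider, but that was soon abandoned because I found that performance was much slower when a thread was used to process each URI. The bottleneck appeared to be in GetResponse, which did not seem to run concurrently across multiple threads.
Valid URI, but the query data returns a bad page - The current processor does not handle the scenario where the URI points to a valid page but the page returned by the web server is considered to be an error, e.g. http://www.validhost.com/validpage.html?opensubpage=invalidid. One way to solve this is to read the content of the returned page and look for key pieces of information, but this technique is a little unreliable.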
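The content-inspection idea from the last issue could be tried as an ordinary content handler. The sketch below shows only the detection step. The marker phrases are hypothetical, every site words its error pages differently, and this is exactly why the approach remains unreliable.

```csharp
using System;

public static class SoftErrorSketch
{
    // Hypothetical phrases; a real site would need its own list.
    static readonly string[] ErrorMarkers =
        { "page not found", "invalid id", "an error has occurred" };

    // True when the page body contains one of the known error phrases,
    // even though the server answered with a success status.
    public static bool LooksLikeErrorPage(string content)
    {
        if (string.IsNullOrEmpty(content))
        {
            return false;
        }
        foreach (string marker in ErrorMarkers)
        {
            if (content.IndexOf(marker, StringComparison.OrdinalIgnoreCase) > -1)
            {
                return true;
            }
        }
        return false;
    }
}
```

A handler built on this check could overwrite state.StatusDescription when it fires, so that the page shows up in a broken-link report despite its OK status.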