
HTTP Data Client - Web Scraping


July 20, 2011

CPOL


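A library built on HttpWebRequest that abstracts how data is retrieved from web sources.

Introduction

The purpose of this article is to describe how to implement a generic "HTTP Data Client" in C# (forgive me if the name sounds a bit pompous) that lets you query any web-based resource you like in an elegant way. Let me state up front that this is not "the perfect solution"; there is certainly a lot of room for improvement, so feel free to improve it. The whole concept is built on the HttpWebRequest class that .NET provides in the System.Net namespace.

Prerequisites

Before digging into the architecture and the code, note that the "HTTP Data Client" project uses and requires a few additional libraries. Here is the list:

  • Db4Object (an object-oriented database; I use it mostly for embedded applications; it references two assembly files: Db4objects.Db4o.dll and Db4objects.Db4o.NativeQueries.dll; you can get DB4Object from http://www.db4o.com/DownloadNow.aspx).
  • HTML Agility Pack (a library that lets you process HTML content using various techniques, which is very handy when you want to convert the HTML DOM to XML; it references one assembly file: HtmlAgilityPack.dll; you can get the library from http://htmlagilitypack.codeplex.com).
  • Microsoft MSHTML (its purpose is to render and parse HTML and JavaScript content).

If you are wondering why I use two different libraries to parse HTML content, the answer is simple. HTML Agility Pack does a good job most of the time; the results are usually what you expect, but not always. So when one library fails to deliver the expected result, I can switch to the other. In my opinion, the biggest drawback of the MSHTML library is that processing is slow when it is integrated into non-desktop applications (for example, web sites and web services). DB4Object's role in this project is to store configuration settings and cached content. One thing worth pointing out about DB4Object is that the non-server edition does not support multithreading (you can easily replace it with any other storage you see fit).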

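Architecture

My solution contains four projects:

  • HtmlAgilityPack (the actual HTML Agility Pack project, including its source code)
  • HttpData.Client (the main project, which implements the HTML processing rules)
  • HttpData.Client.MsHtmlToXml (a wrapper project around MSHTML, plus some extensions)
  • HttpData.Client.Pdf (a project that implements some PDF processing using IFilter; not important for this article)

There is no point in discussing HTML Agility Pack, since you can find all the details and documentation at http://htmlagilitypack.codeplex.com. I will focus mainly on HttpData.Client and try to give you as much detail and explanation as I can. The HTTP Data Client is designed along the same lines as the .NET SQL client (System.Data.SqlClient), and you will notice that the classes in the project and their logic are very similar (hopefully that is not just my imagination). I will go through the interfaces and classes and give details about their logic and purpose.

IHDPAdapter and HDPAdapter

The purpose of the HDPAdapter class is to allow XML data to be integrated with other data objects such as DataTable and DataSet. The IHDPAdapter interface exposes two methods that convert XML data into a DataTable or a DataSet. Currently, only the DataTable conversion is implemented. Here are the code snippets for the interface and the class:

IHDPAdapter code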
using System.Data;
using System.Xml;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for integration with data objects
    /// (DataTable, DataSet, etc). Is implemented by HDPAdapter.
    /// </summary>
    public interface IHDPAdapter
    {
        #region PROPERTIES
        /// <summary>
        /// Get or set the select HDPCommand object.
        /// </summary>
        IHDPCommand SelectCommand{ get; set; }
        #endregion

        #region METHODS
        /// <summary>
        /// Fill a data table with the content from a specified xml document object.
        /// </summary>
        /// <param name="table">Data table to be filled.</param>
        /// <param name="source">Xml document object
        /// of which content will fill the data table.</param>
        /// <param name="useNodes">True if nodes names
        /// should be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        int Fill(DataTable table, XmlDocument source, bool useNodes);

        /// <summary>
        /// (NOT IMPLEMENTED) Fill a data set with the content
        /// from a specified xml document object.
        /// </summary>
        /// <param name="dataset">Data set to be filled.</param>
        /// <param name="source">Xml document object
        /// of which content will fill the data set.</param>
        /// <param name="useNodes">True if nodes names should
        /// be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        int Fill(DataSet dataset, XmlDocument source, bool useNodes);
        #endregion
    }
}
HDPAdapter code
using System;
using System.Xml;
using System.Xml.XPath;
using System.Data;
using System.Text;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for integration
    /// with data objects (DataTable, DataSet, etc).
    /// </summary>
    public class HDPAdapter : IHDPAdapter 
    {
        #region PRIVATE VARIABLES
        private IHDPCommand _selectCommand;

        #endregion

        #region Properties
        /// <summary>
        /// Get or set the select IHDPCommand object.
        /// </summary>
        IHDPCommand IHDPAdapter.SelectCommand
        {
            get{ return _selectCommand; }
            set{ _selectCommand = value; }
        }

        /// <summary>
        /// Get or set the select HDPCommand object.
        /// </summary>
        public HDPCommand SelectCommand
        {
            get{ return (HDPCommand)_selectCommand; }
            set{ _selectCommand = value; }
        }

        /// <summary>
        /// Get or set the connection string.
        /// </summary>
        public string ConnectionString { get; set; }

        #endregion

        #region .ctor
        /// <summary>
        /// Create a new instance of HDPAdapter.
        /// </summary>
        public HDPAdapter()
        {
        }

        /// <summary>
        /// Create a new instance of HDPAdapter.
        /// </summary>
        /// <param name="connectionString">Connection string
        /// associated with HDPAdapter object.</param>
        public HDPAdapter(string connectionString)
        {
            this.ConnectionString = connectionString;
        }
        #endregion

        #region Public Methods
        /// <summary>
        /// Fill a data table with the content from a specified xml document object.
        /// </summary>
        /// <param name="table">Data table to be filled.</param>
        /// <param name="source">Xml document
        /// object of which content will fill the data table.</param>
        /// <param name="useNodes">True if nodes names
        /// should be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        public int Fill(DataTable table, XmlDocument source, bool useNodes)
        {
            bool columnsCreated = false;
            bool resetRow = false;

            if(table == null || source == null)
                return 0;

            if (table.TableName.Length == 0)
                return 0;

            StringBuilder sbExpression = new StringBuilder("//");
            sbExpression.Append(table.TableName);

            XPathNavigator xpNav = source.CreateNavigator();
            if (xpNav != null)
            {
                XPathNodeIterator xniNode = xpNav.Select(sbExpression.ToString());

                while(xniNode.MoveNext())
                {
                    XPathNodeIterator xniRowNode = 
                       xniNode.Current.SelectChildren(XPathNodeType.Element);
                    while (xniRowNode.MoveNext())
                    {
                        if(resetRow)
                        {
                            xniRowNode.Current.MoveToFirst();
                            resetRow = false;
                        }

                        DataRow row = null;
                        if (columnsCreated)
                            row = table.NewRow();
                    
                        if(useNodes)
                        {
                            XPathNodeIterator xniColumnNode = 
                               xniRowNode.Current.SelectChildren(XPathNodeType.Element);
                            while (xniColumnNode.MoveNext())
                            {
                                if (!columnsCreated)
                                {
                                    DataColumn column = 
                                      new DataColumn(xniColumnNode.Current.Name);
                                    table.Columns.Add(column);
                                }
                                else
                                    row[xniColumnNode.Current.Name] = 
                                      xniColumnNode.Current.Value;
                            }
                        }
                        else
                        {
                            XPathNodeIterator xniColumnNode = xniRowNode.Clone();
                            bool onAttribute = xniColumnNode.Current.MoveToFirstAttribute();
                            while (onAttribute)
                            {
                                if (!columnsCreated)
                                {
                                    DataColumn column = 
                                      new DataColumn(xniColumnNode.Current.Name);
                                    table.Columns.Add(column);
                                }
                                else
                                    row[xniColumnNode.Current.Name] = 
                                      xniColumnNode.Current.Value;

                                onAttribute = xniColumnNode.Current.MoveToNextAttribute();
                            }
                        }

                        if (!columnsCreated)
                        {
                            columnsCreated = true;
                            resetRow = true;
                        }

                        if (row != null)
                            table.Rows.Add(row);
                    }
                }
            }

            return table.Rows.Count;
        }

        /// <summary>
        /// (NOT IMPLEMENTED) Fill a data set with the
        /// content from a specified xml document object.
        /// </summary>
        /// <param name="dataset">Data set to be filled.</param>
        /// <param name="source">Xml document object
        /// of which content will fill the data set.</param>
        /// <param name="useNodes">True if nodes names
        /// should be used for columns names, otherwise attributes will be used.</param>
        /// <returns>Number of filled rows.</returns>
        public int Fill(DataSet dataset, XmlDocument source, bool useNodes)
        {
            throw new NotImplementedException();
        }
        #endregion

        #region Private Methods
        #endregion
    }
}
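
To make the XML shape that Fill expects easier to see, here is a minimal, self-contained sketch of the same idea using only System.Xml and System.Data. It is not the HDPAdapter implementation; the class name XmlTableSketch and the sample element names in the usage note are assumptions made for illustration. Every element named after the table is treated as a container, each child element is one row, and the columns come either from the row's child elements or from its attributes.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Xml;

// Hypothetical helper (not part of HttpData.Client) that mirrors the idea
// behind HDPAdapter.Fill with plain XmlDocument navigation.
public static class XmlTableSketch
{
    public static int Fill(DataTable table, XmlDocument source, bool useNodes)
    {
        if (table == null || source == null || table.TableName.Length == 0)
            return 0;

        // Same lookup as the original: every element named like the table.
        foreach (XmlNode tableNode in source.SelectNodes("//" + table.TableName))
        {
            foreach (XmlNode rowNode in tableNode.ChildNodes)
            {
                if (rowNode.NodeType != XmlNodeType.Element)
                    continue;

                // Collect column name/value pairs for this row first.
                var pairs = new List<KeyValuePair<string, string>>();
                if (useNodes)
                {
                    foreach (XmlNode child in rowNode.ChildNodes)
                        if (child.NodeType == XmlNodeType.Element)
                            pairs.Add(new KeyValuePair<string, string>(child.Name, child.InnerText));
                }
                else
                {
                    foreach (XmlAttribute attribute in rowNode.Attributes)
                        pairs.Add(new KeyValuePair<string, string>(attribute.Name, attribute.Value));
                }

                // Create any missing columns before the row is created.
                foreach (var pair in pairs)
                    if (!table.Columns.Contains(pair.Key))
                        table.Columns.Add(pair.Key);

                DataRow row = table.NewRow();
                foreach (var pair in pairs)
                    row[pair.Key] = pair.Value;
                table.Rows.Add(row);
            }
        }

        return table.Rows.Count;
    }
}
```

For example, given a document such as `<products><product id="1"><name>Tea</name></product></products>` and a DataTable named "products", calling the sketch with useNodes set to true yields a "name" column, while false yields an "id" column.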

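IHDPConnection and HDPConnection

As the name suggests, this is the connection class, which manages the behavior of a connection in an abstract way. The interface exposes a set of related methods and properties. Only three methods are exposed and implemented:

  • Open (changes the connection state to open; it has an overload that takes as a parameter the URL of the web resource to be opened)
  • Close (changes the connection state to closed, and closes the cache storage if one is in use)
  • CreateCommand (creates a new HDPCommand object and assigns the current connection to it)

Now let's look at the properties exposed by IHDPConnection and implemented by HDPConnection:

  • ConnectionURL (the URL of the web resource that will be opened with the current connection)
  • KeepAlive (defines whether the connection should stay open after the query completes)
  • AutoRedirect (defines whether the connection is allowed to perform automatic redirects)
  • MaxAutoRedirects (defines how many automatic redirects may be performed)
  • UserAgent (defines the user agent associated with the connection, e.g., Internet Explorer, Chrome, Opera)
  • ConnectionState (read-only property that reports the state of the connection: open or closed)
  • Proxy (defines the proxy used when executing queries)
  • Cookies (the cookies associated with the connection, either set up front or collected during queries)
  • ContentType (defines the content type expected when querying, e.g., application/x-www-form-urlencoded, application/json)
  • Headers (the headers associated with the connection, either set up front or collected during queries)
  • Referer (the referer used when querying the connection URL)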
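
Since the whole library sits on top of HttpWebRequest, most of these settings have a direct counterpart on that class. The sketch below shows one plausible mapping; it is an assumption about how such settings could be applied, not the code HDPCommand actually runs, and the RequestSketch class name is hypothetical. Proxy and Headers are left out because the HDPProxy and HDPConnectionHeader types are not shown in this article. Note that newer .NET versions mark WebRequest.Create as obsolete, which produces a compiler warning.

```csharp
using System;
using System.Net;

// Hypothetical helper (not part of HttpData.Client): applies connection-style
// settings to an HttpWebRequest. No network traffic occurs until a response
// is requested.
public static class RequestSketch
{
    public static HttpWebRequest Build(
        string url, string userAgent, bool keepAlive, bool autoRedirect,
        int maxAutoRedirects, string contentType, string referer,
        CookieCollection cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        request.KeepAlive = keepAlive;                           // KeepAlive
        request.AllowAutoRedirect = autoRedirect;                // AutoRedirect
        request.MaximumAutomaticRedirections = maxAutoRedirects; // MaxAutoRedirects (must be > 0)
        request.UserAgent = userAgent;                           // UserAgent
        request.ContentType = contentType;                       // ContentType
        request.Referer = referer;                               // Referer

        // Cookies: HttpWebRequest only tracks cookies when a container is set.
        request.CookieContainer = new CookieContainer();
        if (cookies != null)
            request.CookieContainer.Add(new Uri(url), cookies);

        return request;
    }
}
```

A value of 1 for maxAutoRedirects matches the default that the HDPConnection constructors assign to MaxAutoRedirects.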
IHDPConnection code
using System.Collections.Generic;
using System.Net;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for connection management of different
    /// web sources. Is implemented by HDPConnection.
    /// </summary>
    public interface IHDPConnection
    {
        #region MEMBERS
        #region METHODS
        /// <summary>
        /// Open connection.
        /// </summary>
        void Open();

        /// <summary>
        /// Close connection.
        /// </summary>
        void Close();

        /// <summary>
        /// Create a new HDPCommand object associated with this connection.
        /// </summary>
        /// <returns>HDPCommand object associated with this connection.</returns>
        IHDPCommand CreateCommand();
        #endregion

        #region PROPERTIES
        /// <summary>
        /// Get or set connection url.
        /// </summary>
        string ConnectionURL { get; set; }

        /// <summary>
        /// Get or set the value which specifies
        /// if the connection should be maintained open.
        /// </summary>
        bool KeepAlive { get; set; }

        /// <summary>
        /// Get or set the value which specifies if auto redirection is allowed.
        /// </summary>
        bool AutoRedirect { get; set; }

        /// <summary>
        /// Get or set the maximum number of auto redirections.
        /// </summary>
        int MaxAutoRedirects { get; set; }

        /// <summary>
        /// Get or set the value which specifies the user agent to be used.
        /// </summary>
        string UserAgent { get; set; }

        /// <summary>
        /// Get the value which specifies the state of the connection.
        /// </summary>
        HDPConnectionState ConnectionState { get; }

        /// <summary>
        /// Get or set the value which specifies the connection proxy.
        /// </summary>
        HDPProxy Proxy { get; set; }

        /// <summary>
        /// Get or set the value which specifies the cookies used by connection.
        /// </summary>
        CookieCollection Cookies { get; set; }

        /// <summary>
        /// Get or set the value which specifies the content type.
        /// </summary>
        string ContentType { get; set; }

        /// <summary>
        /// Get or set headers details used in HttpWebRequest operations.
        /// </summary>
        List<HDPConnectionHeader> Headers { get; set; }

        /// <summary>
        /// Get or set Http referer.
        /// </summary>
        string Referer { get; set; }
        #endregion
        #endregion
    }
}
HDPConnection code
using System.Collections.Generic;
using System.Net;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for connection management of different web sources.
    /// </summary>
    public class HDPConnection : IHDPConnection
    {
        #region Private Variables
        private HDPConnectionState _connectionState;
        private string _connectionURL;
        private HDPCache cache;
        private bool useCache;
        #endregion

        #region Properties
        /// <summary>
        /// Get the value which specifies if caching will be used.
        /// </summary>
        public bool UseCahe
        {
            get { return useCache; }
        }

        /// <summary>
        /// Get HDPCache object.
        /// </summary>
        public HDPCache Cache
        {
            get { return cache; }
        }
        #endregion

        #region .ctor
        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        public HDPConnection()
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = "";
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
        }

        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        /// <param name="connectionURL">Url of the web source.</param>
        public HDPConnection(string connectionURL)
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = connectionURL;
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
        }

        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        /// <param name="cacheDefinitions">HDPCacheDefinition
        /// object used by caching mechanism.</param>
        public HDPConnection(HDPCacheDefinition cacheDefinitions)
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = "";
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
            cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
            useCache = true;
        }

        /// <summary>
        /// Instantiate a new HDPConnection object.
        /// </summary>
        /// <param name="connectionURL">Url of the web source.</param>
        /// <param name="cacheDefinitions">HDPCacheDefinition
        /// object used by caching mechanism.</param>
        public HDPConnection(string connectionURL, HDPCacheDefinition cacheDefinitions)
        {
            _connectionState = HDPConnectionState.Closed;
            _connectionURL = connectionURL;
            Cookies = new CookieCollection();
            MaxAutoRedirects = 1;
            cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
            useCache = true;
        }
        #endregion

        #region Public Methods
        #endregion

        #region IHDPConnection Members
        #region Methods
        /// <summary>
        /// Open connection.
        /// </summary>
        public void Open()
        {
            _connectionState = HDPConnectionState.Open;
        }

        /// <summary>
        /// Open connection using a specific url.
        /// </summary>
        /// <param name="connectionURL">Url of the web source.</param>
        public void Open(string connectionURL)
        {
            _connectionURL = connectionURL;
            _connectionState = HDPConnectionState.Open;
        }

        /// <summary>
        /// Close connection.
        /// </summary>
        public void Close()
        {
            _connectionState = HDPConnectionState.Closed;

            if (cache != null)
                cache.CloseStorageConnection();
        }

        /// <summary>
        /// Create a new IHDPCommand object associated with this connection.
        /// </summary>
        /// <returns>IHDPCommand object associated with this connection.</returns>
        IHDPCommand IHDPConnection.CreateCommand()
        {
            HDPCommand command = new HDPCommand { Connection = this };
            return command;
        }

        /// <summary>
        /// Create a new HDPCommand object associated with this connection.
        /// </summary>
        /// <returns>HDPCommand object associated with this connection.</returns>
        public HDPCommand CreateCommand()
        {
            HDPCommand command = new HDPCommand { Connection = this };
            return command;
        }
        #endregion

        #region Properties
        /// <summary>
        /// Get or set connection url.
        /// </summary>
        public string ConnectionURL
        {
            get { return _connectionURL; }
            set { _connectionURL = value; }
        }

        /// <summary>
        /// Get or set the value which specifies if auto redirection is allowed.
        /// </summary>
        public bool AutoRedirect { get; set; }

        /// <summary>
        /// Get or set the maximum number of auto redirections.
        /// </summary>
        public int MaxAutoRedirects { get; set; }

        /// <summary>
        /// Get or set the value which specifies if the
        /// connection should be maintained open.
        /// </summary>
        public bool KeepAlive { get; set; }

        /// <summary>
        /// Get or set the value which specifies the user agent to be used.
        /// </summary>
        public string UserAgent { get; set; }

        /// <summary>
        /// Get or set the value which specifies the content type.
        /// </summary>
        public string ContentType { get; set; }

        /// <summary>
        /// Get or set the value which specifies the cookies used by connection.
        /// </summary>
        public CookieCollection Cookies { get; set; }

        /// <summary>
        /// Get the value which specifies the state of the connection.
        /// </summary>
        public HDPConnectionState ConnectionState
        {
            get { return _connectionState; }
        }

        /// <summary>
        /// Get or set the value which specifies the connection proxy.
        /// </summary>
        public HDPProxy Proxy { get; set; }

        /// <summary>
        /// Get or set headers details used in HttpWebRequest operations.
        /// </summary>
        public List<HDPConnectionHeader> Headers { get; set; }

        /// <summary>
        /// Get or set Http referer.
        /// </summary>
        public string Referer { get; set; }
        #endregion
        #endregion

        #region IDisposable Members
        ///<summary>
        /// Dispose current object.
        ///</summary>
        public void Dispose()
        {
            this.dispose();
            System.GC.SuppressFinalize(this);
        }

        private void dispose()
        {
            if (_connectionState == HDPConnectionState.Open)
                this.Close();
        }
        #endregion
    }
}

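IHDPCommand and HDPCommand

This is the engine that provides the functionality for querying a web resource and processing the result (the response). It offers several ways to process the content of a query response, for example XPath, RegEx, XSLT, and reflection. I will discuss only the main methods in detail; the rest build on them, and I assume the comments that accompany each method are enough to point you in the right direction. Before getting to the methods, though, let's go through the properties. I won't paste the contents of the HDPCommand class here because it has a great many lines of code; you can study it in detail in the provided source code.

  • Connection (defines the connection object associated with this command)
  • Parameters (defines the parameters used during the query)
  • CommandType (defines the type of command used during the query; it is either GET or POST)
  • CommandText (defines the content of the command to be executed; for a GET command it stores the URL with its query parameters; for a POST command it stores the body of the POST operation)
  • CommandTimeout (defines the period of time to wait for a response from the web resource)
  • Response (contains the response string received from the web resource as a result of the query)
  • Uri (contains the URI of the queried web resource)
  • Path (contains the path of the queried web resource)
  • LastError (contains the last error message encountered during processing)
  • ContentLength (contains the length of the content received from the web resource as a result of the query)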
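
The difference between the GET and POST forms of CommandText can be illustrated with a small sketch. How HDPCommand actually merges its Parameters collection into the command text is not shown in this article, so the code below is only an assumed approach with hypothetical names (CommandTextSketch, ForGet, ForPost): the same name/value pairs end up in the URL's query string for GET and in the request body for POST.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of HttpData.Client).
public static class CommandTextSketch
{
    // Encodes pairs as name=value&name=value, percent-encoding both sides.
    // Spaces become %20 here; form posts often use '+' instead, and servers
    // generally accept either.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        return string.Join("&", parameters.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
    }

    // GET: the parameters travel in the URL.
    public static string ForGet(string url, IEnumerable<KeyValuePair<string, string>> parameters)
    {
        string query = Encode(parameters);
        if (query.Length == 0)
            return url;
        return url + (url.Contains("?") ? "&" : "?") + query;
    }

    // POST: the URL stays untouched and the parameters become the body,
    // typically sent as application/x-www-form-urlencoded.
    public static string ForPost(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        return Encode(parameters);
    }
}
```

With the pairs q = "web scraping" and page = "2", ForGet on http://example.com/search produces a URL ending in ?q=web%20scraping&page=2, while ForPost returns only the encoded pairs.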

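Now we can look at the methods that are exposed and implemented.

  • GetParametersCount (gets the number of parameters used in the query)
  • CreateParameter (creates a new parameter to be used in the query)
  • ExecuteNonQuery (executes a query against the web resource using GET or POST and returns the number of results received; it takes a parameter that specifies whether the parameters collection should be cleared once the query finishes)
  • Execute (executes a query against the web resource using GET or POST and returns a Boolean: true if the query executed successfully, false if it failed)
  • ExecuteStream (executes a query against the web resource using GET or POST and returns the underlying HTTP response stream)
  • CloseResponse (closes the HTTP response stream opened by the ExecuteStream method)
  • ExecuteNavigator (executes a query against the web resource using GET or POST and returns an XPathNavigator object used to navigate the response converted to XML; it takes a parameter that specifies whether the parameters collection should be cleared once the query finishes)
  • ExecuteDocument (has an overload; executes a query against the web resource using GET or POST and returns an IXPathNavigable object used to navigate the response converted to XML; the "expression" parameter is the XPath expression applied when processing the result)
  • ExecuteBinary (executes a query against the web resource using GET or POST and returns the result as a byte array; it is mainly used to query binary content from a web resource, e.g., PDF files or images; one overload takes a parameter that limits the size of the output buffer)
  • ExecuteBinaryConversion (executes a query against the web resource using GET or POST and returns the result as a string; it is used to query binary content such as PDF files and convert that binary content to a string)
  • ExecuteString (executes a query against the web resource using GET or POST and returns the result as a plain string)
  • ExecuteValue (executes a query against the web resource using GET or POST and returns, as a string, the result of applying an XPath expression to the response; a RegEx can be used instead of an XPath expression)
  • ExecuteCollection (executes a query against the web resource using GET or POST and returns the result as a generic collection of strings; the result is what an XPath or RegEx expression yields when applied to the response)
  • ExecuteArray (executes a query against the web resource using GET or POST and returns the result as a string array; the result is what an XPath or RegEx expression yields when applied to the response)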
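
The ExecuteValue and ExecuteCollection descriptions boil down to "apply an XPath or RegEx expression to the response and return the matches". The sketch below illustrates that step in isolation on a string that is already well-formed XML (in the library, the HTML response is first converted to XML). It is an illustration under assumptions, not the HDPCommand implementation; ExtractSketch and its method names are hypothetical, and returning an empty string when nothing matches is a choice made here.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.XPath;

// Hypothetical helper (not part of HttpData.Client).
public static class ExtractSketch
{
    // Returns the first match of the expression, or an empty string.
    public static string Value(string xml, string expression, bool isRegEx)
    {
        if (isRegEx)
        {
            Match match = Regex.Match(xml, expression);
            return match.Success ? match.Value : string.Empty;
        }

        XPathNavigator navigator = new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathNavigator node = navigator.SelectSingleNode(expression);
        return node != null ? node.Value : string.Empty;
    }

    // Returns every match of the expression, in document order.
    public static List<string> Collection(string xml, string expression, bool isRegEx)
    {
        var results = new List<string>();
        if (isRegEx)
        {
            foreach (Match match in Regex.Matches(xml, expression))
                results.Add(match.Value);
            return results;
        }

        XPathNodeIterator nodes =
            new XPathDocument(new StringReader(xml)).CreateNavigator().Select(expression);
        while (nodes.MoveNext())
            results.Add(nodes.Current.Value);
        return results;
    }
}
```

For instance, with a response containing two anchor elements, the XPath expression //a/@href passed to Collection returns both link targets, and //a passed to Value returns the text of the first anchor.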
IHDPCommand code
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

namespace HttpData.Client
{
    /// <summary>
    /// Provides functionality for querying and processing data from
    /// different web sources. Is implemented by HDPCommand.
    /// </summary>
    public interface IHDPCommand
    {
        #region Members
        #region Properties
        /// <summary>
        /// Get or set the command connection object.
        /// </summary>
        IHDPConnection Connection { get; set; }

        /// <summary>
        /// Get or set the command parameters collection.
        /// </summary>
        IHDPParameterCollection Parameters { get; }

        /// <summary>
        /// Get or set the command type.
        /// </summary>
        HDPCommandType CommandType { get; set; }

        /// <summary>
        /// Get or set the command text. 
        /// </summary>
        string CommandText { get; set; }

        /// <summary>
        /// Get or set the command timeout.
        /// </summary>
        int CommandTimeout { get; set; }

        /// <summary>
        /// Get the response retrieved from the server.
        /// </summary>
        string Response { get; }

        /// <summary>
        /// Get web resource URI.
        /// </summary>
        string Uri { get; }

        /// <summary>
        /// Get web resource absolute path.
        /// </summary>
        string Path { get; }

        /// <summary>
        /// Get the last error that occurred.
        /// </summary>
        string LastError { get; }

        /// <summary>
        /// Get the content length of response.
        /// </summary>
        long ContentLength { get; }
        #endregion

        #region Methods
        /// <summary>
        /// Get the parameters number.
        /// </summary>
        /// <returns>Number of parameters.</returns>
        int GetParametersCount();

        /// <summary>
        /// Create a new IHDPParameter object.
        /// </summary>
        /// <returns>IHDPParameter parameter object.</returns>
        IHDPParameter CreateParameter();

        /// <summary>
        /// Execute an expression against the web server and return the number of results.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Number of results determined by the expression.</returns>
        int ExecuteNonQuery(bool clearParams);

        /// <summary>
        /// Execute a query against the web server without reading the response stream.
        /// </summary>
        /// <returns>True if the command executed successfully, otherwise false.</returns>
        bool Execute();

        /// <summary>
        /// Execute a query against the web server.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>True if the command executed successfully, otherwise false.</returns>
        bool Execute(bool clearParams);

        /// <summary>
        /// Execute a query against the web server.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Returns the underlying http response stream.</returns>
        Stream ExecuteStream(bool clearParams);

        /// <summary>
        /// Closes the http response object.
        /// Usable only with ExecuteStream method.
        /// </summary>
        void CloseResponse();

        /// <summary>
        /// Execute a query against the web server and return a XPathNavigator
        /// object used to navigate thru the query result.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>XPathNavigator object used
        /// to navigate thru the query result.</returns>
        XPathNavigator ExecuteNavigator(bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return
        /// a IXPathNavigable object used to navigate thru the query result.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>IXPathNavigable object used to navigate thru the query result.</returns>
        IXPathNavigable ExecuteDocument(bool clearParams);

        /// <summary>
        /// Execute a query against the web server, on query result it will apply
        /// a xpath expression and return a IXPathNavigable
        /// object used to navigate thru query result.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>IXPathNavigable object used to navigate thru query result.</returns>
        IXPathNavigable ExecuteDocument(string expression, bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a byte[] object which
        /// contains the binary query result. Used when querying
        /// binary content from web server (E.g: PDF files).
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Byte array object which contains the binary query result.</returns>
        byte[] ExecuteBinary(bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a byte[] object which
        /// contains the binary query result. Used when querying
        /// binary content from web server (E.g: PDF files).
        /// </summary>
        /// <param name="boundaryLimit">Specify the limit
        /// of the buffer which must be read.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>Byte array object which contains the binary query result.</returns>
        byte[] ExecuteBinary(int boundaryLimit, bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a string object
        /// which contains the representation of the binary query result.
        /// Used when querying binary content from web server (E.g: PDF files).
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String object which contains the representation
        /// of the binary query result.</returns>
        string ExecuteBinaryConversion(bool clearParams);

        /// <summary>
        /// Execute a query against the web server and return a string
        /// object which contains the representation of the query result.
        /// </summary>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String object which contains
        /// the representation of the query result.</returns>
        string ExecuteString(bool clearParams);

        /// <summary>
        /// Execute a query against the web server, on query result it will apply
        /// a xpath expression and return a string object which contains
        /// the representation of the query result value.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String object which contains
        /// the representation of the query result value.</returns>
        string ExecuteValue(string expression, bool clearParams);

        /// <summary>
        /// Execute a query against the web server, on query result it will apply
        /// a regular expression and return a string object which
        /// contains the representation of the query result value.
        /// </summary>
        /// <param name="expression">Regular expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <param name="isRegEx">Specifies that a regular expression
        /// is used; it must always be set to true.</param>
        /// <returns>String object which contains
        /// the representation of the query result value.</returns>
        string ExecuteValue(string expression, bool clearParams, bool isRegEx);

        /// <summary>
        /// Execute a query against the web server, on query result it will apply
        /// a regular expression and return a List object which
        /// contains the representation of the query result.
        /// </summary>
        /// <param name="expression">Regular expression.</param>
        /// <param name="clearParams">Specify if the parameters collection
        /// should be cleared after the query is executed.</param>
        /// <param name="isRegEx">Specifies that a regular expression
        /// is used; it must always be set to true.</param>
        /// <returns>List object which contains
        /// the representation of the query result.</returns>
        List<string> ExecuteCollection(string expression, bool clearParams, bool isRegEx);

        /// <summary>
        /// Execute a query against the web server, on query result it will apply
        /// a xpath expression and return a List object which
        /// contains the representation of the query result.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>List object which contains
        /// the representation of the query result.</returns>
        List<string> ExecuteCollection(string expression, bool clearParams);

        /// <summary>
        /// Execute a query against the web server, apply a regular expression
        /// to the query result, and return a string array object which
        /// contains the representation of the query result.
        /// </summary>
        /// <param name="expression">Regular expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <param name="isRegEx">Specify that a regular expression
        /// is used; it must always be set to true.</param>
        /// <returns>String array object which contains
        /// the representation of the query result.</returns>
        string[] ExecuteArray(string expression, bool clearParams, bool isRegEx);

        /// <summary>
        /// Execute a query against the web server, apply an XPath expression
        /// to the query result, and return a string array object
        /// which contains the representation of the query result.
        /// </summary>
        /// <param name="expression">XPath expression.</param>
        /// <param name="clearParams">Specify if the parameters
        /// collection should be cleared after the query is executed.</param>
        /// <returns>String array object which contains
        /// the representation of the query result.</returns>
        string[] ExecuteArray(string expression, bool clearParams);
        #endregion
        #endregion
    }
}
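
The regular-expression overloads above return either a single value or a collection of matches. How the library applies the expression internally is not shown in this listing, so the following is only a minimal sketch of the idea, assuming the response content is already available as a string. RegexQuerySketch, ExtractAll, and ExtractFirst are hypothetical names and are not part of HttpData.Client.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; it only illustrates regex-based extraction.
public static class RegexQuerySketch
{
    // Returns every match of the pattern found in the content.
    public static List<string> ExtractAll(string content, string pattern)
    {
        List<string> results = new List<string>();
        foreach (Match match in Regex.Matches(content, pattern))
        {
            // Prefer the first capture group when the pattern defines one.
            results.Add(match.Groups.Count > 1 ? match.Groups[1].Value : match.Value);
        }
        return results;
    }

    // Returns the first match, or null when nothing matches.
    public static string ExtractFirst(string content, string pattern)
    {
        List<string> results = ExtractAll(content, pattern);
        return results.Count > 0 ? results[0] : null;
    }

    public static void Main()
    {
        // Hand-written stand-in for a response body.
        string html = "<li><b>Miami</b></li><li><b>Tampa</b></li>";
        foreach (string city in ExtractAll(html, "<b>(.*?)</b>"))
            Console.WriteLine(city);
    }
}
```

Returning null for "no match" in ExtractFirst is a design choice of this sketch; callers would need to check for it before using the value.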

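HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage

HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage are the classes that handle caching. I will not dwell on this topic, because it is not that important in this context. If you wish, you can study these classes in more detail yourself; I think the code comments will help you quickly understand their purpose and functionality. The HDPCacheDefinition class is simple: it contains a set of properties that define the cache behavior. These are its properties: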

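  • StorageActiveUntil (defines the date until which the cache is considered valid)
  • MemorySizeLimit (limits the memory size of the cache)
  • ObjectsNumberLimit (limits the number of objects in the cache)
  • UseStorage (defines whether the cache is persisted to disk)
  • RetrieveFromStorage (defines whether a specific value should be looked up in the cache persisted on disk)
  • RealtimePersistance (defines whether a new value is persisted to disk in real time after it is added to the cache)
  • StorageName (defines the name of the cache file on disk)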

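The HDPCacheObject class is just a key/value pair of properties plus a timestamp field that identifies the age of the cached object. The way the cache works is straightforward. When a Web resource is queried, its URL is the key of the cache object and the query result is the value. When a Web resource is queried with an HDPCommand object and caching is enabled, every query URL and its response content are stored in the memory cache. If the same Web resource is queried again with the same URL, no HTTP request is made; the response content is retrieved from the memory cache instead. HDPCacheDefinition has additional options that let you control the cache behavior. For example, if you set a cache memory limit of 1024 KB, the memory footprint is computed every time a new value is added to the cache. If the limit is exceeded, the cache content is either persisted to disk or deleted, depending on the other behavior settings. Note that MemorySizeLimit and ObjectsNumberLimit are mutually exclusive: if you set MemorySizeLimit to a value greater than 0, there is no point in setting ObjectsNumberLimit because it will not be taken into account, and vice versa.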
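
To make the caching behavior described above more concrete, here is a minimal sketch of a URL-keyed memory cache in which a memory limit greater than 0 takes precedence over the object-count limit. It is not the library's HDPCache: UrlCacheSketch and its members are hypothetical, the size estimate (two bytes per character) is an assumption, and the sketch simply drops the oldest entries when a limit is exceeded instead of persisting anything to disk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory cache keyed by URL; not the library's HDPCache.
public class UrlCacheSketch
{
    private readonly Dictionary<string, string> entries = new Dictionary<string, string>();
    private readonly Queue<string> insertionOrder = new Queue<string>();
    private readonly long memorySizeLimit;   // in bytes; 0 disables the memory check
    private readonly int objectsNumberLimit; // considered only when memorySizeLimit is 0
    private long currentSize;

    public UrlCacheSketch(long memorySizeLimit, int objectsNumberLimit)
    {
        this.memorySizeLimit = memorySizeLimit;
        this.objectsNumberLimit = objectsNumberLimit;
    }

    public int Count { get { return entries.Count; } }

    // Returns the cached content for the URL, calling fetch only on a miss.
    public string GetOrAdd(string url, Func<string, string> fetch)
    {
        string content;
        if (entries.TryGetValue(url, out content))
            return content; // cache hit: no request is made

        content = fetch(url);
        entries[url] = content;
        insertionOrder.Enqueue(url);
        currentSize += SizeOf(content);
        Trim();
        return content;
    }

    // Rough size estimate (assumption): two bytes per UTF-16 character.
    private static long SizeOf(string content)
    {
        return (long)content.Length * 2;
    }

    // Drops the oldest entries until the active limit is respected,
    // always keeping the most recently added entry.
    private void Trim()
    {
        while (insertionOrder.Count > 1 && IsOverLimit())
        {
            string oldest = insertionOrder.Dequeue();
            currentSize -= SizeOf(entries[oldest]);
            entries.Remove(oldest);
        }
    }

    private bool IsOverLimit()
    {
        if (memorySizeLimit > 0)
            return currentSize > memorySizeLimit; // the memory limit wins
        return objectsNumberLimit > 0 && entries.Count > objectsNumberLimit;
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        int requests = 0;
        UrlCacheSketch cache = new UrlCacheSketch(0, 2);
        Func<string, string> fetch = url => { requests++; return "content of " + url; };

        cache.GetOrAdd("http://example.com/a", fetch);
        cache.GetOrAdd("http://example.com/a", fetch); // served from memory
        Console.WriteLine(requests);
    }
}
```

The fetch delegate stands in for the actual HTTP request, which keeps the sketch independent of any network access.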

HDPCacheDefinition code
using System;

namespace HttpData.Client
{
    ///<summary>
    /// Defines the cache options.
    ///</summary>
    public class HDPCacheDefinition
    {
        #region Public Variables
        /// <summary>
        /// Specifies the date until which the cache is valid.
        /// </summary>
        public DateTime StorageActiveUntil = DateTime.Now.AddDays(1);

        /// <summary>
        /// Specifies the limit size of the cache memory.
        /// </summary>
        public long MemorySizeLimit;

        /// <summary>
        /// Specifies the limit number of objects which can be stored in the cache.
        /// </summary>
        public int ObjectsNumberLimit = 10000;

        /// <summary>
        /// Specifies if disk storage will be used.
        /// </summary>
        public bool UseStorage = true;
        
        ///<summary>
        /// Specifies if the data should be retrieved from the disk storage.
        ///</summary>
        public bool RetrieveFromStorage;

        /// <summary>
        /// Specifies if the persistence of the cache on disk will be done in real time.
        /// </summary>
        public bool RealtimePersistance;
        
        /// <summary>
        /// Specifies the name of the file of the disk storage.
        /// </summary>
        public string StorageName = "HttpDataProcessorCahe.che";
        #endregion
    }
}
HDPCacheObject code
using System;

namespace HttpData.Client
{
    /// <summary>
    /// Container for the cached data based on key value pair.
    /// </summary>
    [Serializable]
    public class HDPCacheObject
    {
        #region Private Variables
        private string key;
        private object value;
        private DateTime cacheDate;
        #endregion

        #region Properties
        /// <summary>
        /// Get or set the cache object key.
        /// </summary>
        public string Key
        {
            get { return key; }
            set { key = value; }
        }

        /// <summary>
        /// Get or set the cache object value.
        /// </summary>
        public object Value
        {
            get { return value; }
            set { this.value = value; }
        }

        /// <summary>
        /// Get the cache object date.
        /// </summary>
        public DateTime CacheDate
        {
            get { return cacheDate; }
        }
        #endregion

        #region .ctor
        /// <summary>
        /// Instantiate a new HDPCacheObject object.
        /// </summary>
        public HDPCacheObject()
        {
            cacheDate = DateTime.Now;
        }

        /// <summary>
        /// Instantiate a new HDPCacheObject object.
        /// </summary>
        /// <param name="key">Key for the cache object</param>
        /// <param name="value">Value for the cache object</param>
        public HDPCacheObject(string key, object value)
        {
            this.key = key;
            this.value = value;

            cacheDate = DateTime.Now;
        }
        #endregion

        #region Public Methods
        #endregion

        #region Private Methods
        #endregion
    }
}

Using the Code

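I will provide a few examples so that you can see how it works; I think examples are the best way to understand how things work. For example, suppose we want to retrieve all the cities of the state of Florida from the following page: http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34. Here is the code that accomplishes this task.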

using System;
using System.Collections.Generic;
using HttpData.Client;

namespace CityStates
{
    class Program
    {
        static void Main(string[] args)
        {
            const string connectionUrl = 
               "http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34";

            //Create a new instance of HDPCacheDefinition object.
            HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
            {
                UseStorage = false,
                StorageActiveUntil = DateTime.Now,
                ObjectsNumberLimit = 10000,
                RealtimePersistance = false,
                RetrieveFromStorage = false,
                //We will not use a disk storage
                StorageName = null
            };

            //Create a new instance of HDPConnection object.
            //Pass as parameters the initial connection URL and the cache definition object.
            HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
            {
                //Define the content type we would expect.
                ContentType = HDPContentType.TEXT,
                //We want to allow autoredirects
                AutoRedirect = true,
                //Do not perform more than 10 autoredirects
                MaxAutoRedirects = 10,
                //The user agent is FireFox 3
                UserAgent = HDPAgents.FIREFOX_3,
                //We do not want to use a proxy
                Proxy = null // If you want to use a proxy: Proxy = 
                  // new HDPProxy("http://127.0.0.1:999/"
                  //    /*This is your proxy address and its port*/, 
                  // "PROXY_USER_NAME", "PROXY_PASSWORD")
            };
            //Open the connection
            connection.Open();

            //Create a new instance of HDPCommand object.
            //Pass as parameter the HDPConnection object.
            HDPCommand command = new HDPCommand(connection)
            {
                //Activate the memory cache for fast access
                //to the same web resource multiple times
                ActivatePool = true,
                //We will perform a GET action
                CommandType = HDPCommandType.Get,
                //Set the time out period
                CommandTimeout = 60000,
                //Use MSHTML library instead of HtmlAgilityPack
                //(if the value is false then HtmlAgilityPack would be used)
                UseMsHtml = true
            };

            //Execute the query on the web resource. The received
            //HTTPWebResponse content will be converted to XML
            // and the XPath expression will be executed.
            //The method will return the list of Florida state cities.
            List<string> cities = 
              command.ExecuteCollection("//ul/li/b//text()[normalize-space()]", true);
            
            foreach (string city in cities)
                Console.WriteLine(city);

            connection.Close();
        }
    }
}
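
To see what the XPath expression in the example selects without calling the live page, the following sketch runs the same expression against a small hand-written XML fragment using the standard System.Xml classes. The fragment is only a stand-in for the XML that the library produces from the page, and XPathSketch and SelectValues are hypothetical names. Trimming the values is a choice made in this sketch, not something the library is known to do.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper; it only demonstrates the XPath expression itself.
public static class XPathSketch
{
    // Returns the trimmed text of every node selected by the XPath expression.
    public static List<string> SelectValues(string xml, string xpath)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xml);

        List<string> values = new List<string>();
        foreach (XmlNode node in document.SelectNodes(xpath))
            values.Add(node.Value.Trim());
        return values;
    }

    public static void Main()
    {
        // Hand-written stand-in for the converted page content.
        string xml = "<html><body><ul>" +
                     "<li><b>Miami</b></li>" +
                     "<li><b> Orlando </b></li>" +
                     "<li><b> </b></li>" +
                     "</ul></body></html>";

        // text()[normalize-space()] keeps only text nodes that are not
        // empty or whitespace-only, so the third item is skipped.
        List<string> cities = SelectValues(xml, "//ul/li/b//text()[normalize-space()]");
        foreach (string city in cities)
            Console.WriteLine(city);
    }
}
```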

Now for another example. Suppose we want to log in to LinkedIn with a username and password. Here is the code that does this:

using System;
using System.Collections.Generic;
using HttpData.Client;

namespace CityStates
{
    class Program
    {
        static void Main(string[] args)
        {
            const string connectionUrl = 
              "https://www.linkedin.com/secure/login?trk=hb_signin";

            //Create a new instance of HDPCacheDefinition object.
            HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
            {
                UseStorage = false,
                StorageActiveUntil = DateTime.Now,
                ObjectsNumberLimit = 10000,
                RealtimePersistance = false,
                RetrieveFromStorage = false,
                //We will not use a disk storage
                StorageName = null
            };

            //Create a new instance of HDPConnection object.
            //Pass as parameters the initial connection URL and the cache definition object.
            HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
            {
                //Define the content type we would expect.
                ContentType = HDPContentType.TEXT,
                //We want to allow autoredirects
                AutoRedirect = true,
                //Do not perform more than 10 autoredirects
                MaxAutoRedirects = 10,
                //The user agent is FireFox 3
                UserAgent = HDPAgents.FIREFOX_3,
                //We do not want to use a proxy
                Proxy = null // If you want to use a proxy: Proxy =
                // new HDPProxy("http://127.0.0.1:999/"
                //   /*This is your proxy address and its port*/,
                // "PROXY_USER_NAME", "PROXY_PASSWORD")
            };
            //Open the connection
            connection.Open();

            //Create a new instance of HDPCommand object.
            //Pass as parameter the HDPConnection object.
            HDPCommand command = new HDPCommand(connection)
            {
                //Activate the memory cache for fast access
                //to the same web resource multiple times
                ActivatePool = true,
                //We will perform a GET action
                CommandType = HDPCommandType.Get,
                //Set the time out period
                CommandTimeout = 60000,
                //Use HtmlAgilityPack (if the value is true then MSHTML would be used)
                UseMsHtml = false
            };

            //Define the query parameters used in the POST action.
            //The actual parameter name used by a browser
            //to authenticate you on Linkedin is without '@' sign.
            //Use a HTTP request analyzer and you will notice the difference.
            //This is how the actual POST body will look like:
            //   csrfToken="ajax:-3801133150663455891"&session_key
            //      ="YOUR_EMAIL@gmail.com"&session_password="YOUR_PASSWORD"
            //       &session_login="Sign+In"&session_login=""&session_rikey=""
            HDPParameterCollection parameters = new HDPParameterCollection();
            HDPParameter pToken = 
              new HDPParameter("@csrfToken", "ajax:-3801133150663455891");
            HDPParameter pSessionKey = 
              new HDPParameter("@session_key", "YOUR_EMAIL@gmail.com");
            HDPParameter pSessionPass = 
              new HDPParameter("@session_password", "YOUR_PASSWORD");
            HDPParameter pSessionLogin = 
              new HDPParameter("@session_login", "Sign+In");
            HDPParameter pSessionLogin_ = new HDPParameter("@session_login", "");
            HDPParameter pSessionRiKey = new HDPParameter("@session_rikey", "");

            parameters.Add(pToken);
            parameters.Add(pSessionKey);
            parameters.Add(pSessionPass);
            parameters.Add(pSessionLogin);
            parameters.Add(pSessionLogin_);
            parameters.Add(pSessionRiKey);

            //If everything went ok then LinkedIn will ask us to redirect
            //(unfortunately autoredirect doesn't work in this case).
            //Get the manual redirect URL value.
            string value = command.ExecuteValue(
                    "//a[@id='manual_redirect_link']/@href", true);
            if (value != null && String.Compare(value, 
                      "http://www.linkedin.com/home") == 0)
            {
                command.Connection.ConnectionURL = value;
                command.CommandType = HDPCommandType.Get;

                //Using the manual redirect URL, check if the opened
                //web page contains the welcome message.
                //If it does contain the message, then we are in.
                string content = 
                  command.ExecuteString("//title[contains(.,'Welcome,')]", true);

                if (!String.IsNullOrEmpty(content))
                    Console.WriteLine(content);
                else
                    Console.WriteLine("Login failed!");
            }
            
            connection.Close();
        }
    }
}
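
The comments in the example note that the parameter names are declared with an '@' prefix but are sent without it in the request body. How HDPCommand serializes the parameters is not shown here, so the following is only a rough sketch of building a form-encoded body from name/value pairs under that assumption. FormBodySketch and Build are hypothetical names; Uri.EscapeDataString is the standard .NET percent-encoding call and encodes a space as %20 rather than '+'.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; it does not reproduce the library's own serialization.
public static class FormBodySketch
{
    // Builds a name=value&name=value body, dropping a leading '@'
    // from each name and percent-encoding names and values.
    public static string Build(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> parameter in parameters)
        {
            if (body.Length > 0)
                body.Append('&');

            string name = parameter.Key.TrimStart('@');
            body.Append(Uri.EscapeDataString(name));
            body.Append('=');
            body.Append(Uri.EscapeDataString(parameter.Value));
        }
        return body.ToString();
    }

    public static void Main()
    {
        // Placeholder credentials, for illustration only.
        List<KeyValuePair<string, string>> parameters = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("@session_key", "user@example.com"),
            new KeyValuePair<string, string>("@session_password", "p&ss word")
        };

        Console.WriteLine(Build(parameters));
    }
}
```

A list of pairs is used instead of a dictionary because the example posts the session_login name twice, which a dictionary would not allow.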

If you use MSHTML in your sample project, add the following app.config content.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="LogFilePath" value="..\Log\My-Log.txt"/>
    <add key="HtmlTagsPath" value="HtmlTags.txt"/>
    <add key="AttributesTagsPath" value="HtmlAttributes.txt"/>
  </appSettings>
</configuration>

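Notes

  • HttpData.Client.Pdf - Not all of the content is mine. I do not remember where I got some of it from.
  • HDPUtils.cs - I am not proud of its content; I think it is quite messy, so please ignore it for now.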

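Issues

HtmlAgilityPack - Sometimes the content it produces does not match the actual HTML DOM structure, especially when form elements are involved.

MSHTML - It strips everything between the html tag and the body tag (including the html tag itself). It also validates the input HTML against a list of valid elements and attributes, so anything that does not match is removed. Note that JavaScript content is removed by default. You can change this behavior in the HtmlLoader.cs class in the HttpData.Client.MsHtmlToXml project.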

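Points of Interest

It is fairly obvious what kinds of applications the library above can be used in.

History

No updates yet, but I am sure there will be some in the future.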
