HTTP Data Client - Web Scraping






A library based on HttpWebRequest that abstracts the way data is retrieved from web sources.
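Introduction
This article describes how to implement a generic "HTTP Data Client" in C# (forgive me if the name sounds a bit wordy) that lets you query any web-based resource you want in an elegant way. Let me say up front that this is not "the perfect solution"; there is certainly plenty of room for improvement, so feel free to improve it. The whole concept is built on the HttpWebRequest object that .NET provides in the System.Net namespace.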
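Prerequisites
Before diving into the architecture and the code, note that the "HTTP Data Client" project uses and requires a few additional libraries. Here is the list:
- Db4Object (an object-oriented database that I mostly use for embedded applications; the project references two assembly files: Db4objects.Db4o.dll and Db4objects.Db4o.NativeQueries.dll; you can get DB4Object from http://www.db4o.com/DownloadNow.aspx).
- HTML Agility Pack (a library for processing HTML content with various techniques, which comes in handy when you want to convert the HTML DOM to XML; the project references one assembly file: HtmlAgilityPack.dll; you can get the library from http://htmlagilitypack.codeplex.com).
- Microsoft MSHTML (used to render and parse HTML and JavaScript content).
If you are wondering why I use two different libraries to parse HTML content, the answer is simple. HTML Agility Pack behaves well most of the time; the results are usually what you expect, but not always. So if one library fails to produce the expected result, I can switch to the other. In my opinion, the biggest drawback of MSHTML is that it is slow when integrated into non-desktop applications (for example websites and web services). DB4Object's role in this project is to store configuration settings and cached content. One thing worth pointing out about DB4Object is that the non-server version does not support multithreading (you can easily replace it with any other storage you see fit).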
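Architecture
My solution contains four projects:
- HtmlAgilityPack (the actual HTML Agility Pack project, including source code)
- HttpData.Client (the main project, which implements the HTML processing rules)
- HttpData.Client.MsHtmlToXml (a wrapper project for MSHTML plus some extensions to it)
- HttpData.Client.Pdf (a project that implements some PDF processing using IFilter; not important for this article)
There is no point in discussing HTML Agility Pack, since you can find all the details and documentation at http://htmlagilitypack.codeplex.com. I will focus mainly on HttpData.Client and try to give you as many details and explanations as I can. The HTTP Data Client is designed along the same lines as the .NET SQL client (System.Data.SqlClient), and you will notice that the classes in the project and their logic are very similar (hopefully this is not just my imagination). I will go through the interfaces and classes and give details about their logic and purpose.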
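IHDPAdapter and HDPAdapter
The purpose of the HDPAdapter class is to allow XML data to be integrated with other data objects such as DataTable and DataSet. The IHDPAdapter interface exposes two methods that convert XML data into a DataTable or a DataSet. At the moment, only the DataTable conversion method is implemented. Here are the code snippets for the interface and the class: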
using System.Data;
using System.Xml;
namespace HttpData.Client
{
/// <summary>
/// Provides functionality for integration with data objects
/// (DataTable, DataSet, etc). Is implemented by HDPAdapter.
/// </summary>
public interface IHDPAdapter
{
#region
/// <summary>
/// Get or set the select HDPCommand object.
/// </summary>
IHDPCommand SelectCommand{ get; set; }
#endregion
#region METHODS
/// <summary>
/// Fill a data table with the content from a specified xml document object.
/// </summary>
/// <param name="table">Data table to be filled.</param>
/// <param name="source">Xml document object
/// of which content will fill the data table.</param>
/// <param name="useNodes">True if nodes names
/// should be used for columns names, otherwise attributes will be used.</param>
/// <returns>Number of filled rows.</returns>
int Fill(DataTable table, XmlDocument source, bool useNodes);
/// <summary>
/// (NOT IMPLEMENTED) Fill a data set with the content
/// from a specified xml document object.
/// </summary>
/// <param name="dataset">Data set to be filled.</param>
/// <param name="source">Xml document object
/// of which content will fill the data table.</param>
/// <param name="useNodes">True if nodes names should
/// be used for columns names, otherwise attributes will be used.</param>
/// <returns>Number of filled rows.</returns>
int Fill(DataSet dataset, XmlDocument source, bool useNodes);
#endregion
}
}
using System;
using System.Xml;
using System.Xml.XPath;
using System.Data;
using System.Text;
namespace HttpData.Client
{
/// <summary>
/// Provides functionality for integration
/// with data objects (DataTable, DataSet, etc).
/// </summary>
public class HDPAdapter : IHDPAdapter
{
#region PRIVATE VARIABLES
private IHDPCommand _selectCommand;
#endregion
#region Properties
/// <summary>
/// Get or set the select IHDPCommand object.
/// </summary>
IHDPCommand IHDPAdapter.SelectCommand
{
get{ return _selectCommand; }
set{ _selectCommand = value; }
}
/// <summary>
/// Get or set the select HDPCommand object.
/// </summary>
public HDPCommand SelectCommand
{
get{ return (HDPCommand)_selectCommand; }
set{ _selectCommand = value; }
}
/// <summary>
/// Get or set the connection string.
/// </summary>
public string ConnectionString { get; set; }
#endregion
#region .ctor
/// <summary>
/// Create a new instance of HDPAdapter.
/// </summary>
public HDPAdapter()
{
}
/// <summary>
/// Create a new instance of HDPAdapter.
/// </summary>
/// <param name="connectionString">Connection string
/// associated with HDPAdapter object.</param>
public HDPAdapter(string connectionString)
{
this.ConnectionString = connectionString;
}
#endregion
#region Public Methods
/// <summary>
/// Fill a data table with the content from a specified xml document object.
/// </summary>
/// <param name="table">Data table to be filled.</param>
/// <param name="source">Xml document
/// object of which content will fill the data table.</param>
/// <param name="useNodes">True if nodes names
/// should be used for columns names, otherwise attributes will be used.</param>
/// <returns>Number of filled rows.</returns>
public int Fill(DataTable table, XmlDocument source, bool useNodes)
{
bool columnsCreated = false;
bool resetRow = false;
if(table == null || source == null)
return 0;
if (table.TableName.Length == 0)
return 0;
StringBuilder sbExpression = new StringBuilder("//");
sbExpression.Append(table.TableName);
XPathNavigator xpNav = source.CreateNavigator();
if (xpNav != null)
{
XPathNodeIterator xniNode = xpNav.Select(sbExpression.ToString());
while(xniNode.MoveNext())
{
XPathNodeIterator xniRowNode =
xniNode.Current.SelectChildren(XPathNodeType.Element);
while (xniRowNode.MoveNext())
{
if(resetRow)
{
xniRowNode.Current.MoveToFirst();
resetRow = false;
}
DataRow row = null;
if (columnsCreated)
row = table.NewRow();
if(useNodes)
{
XPathNodeIterator xniColumnNode =
xniRowNode.Current.SelectChildren(XPathNodeType.Element);
while (xniColumnNode.MoveNext())
{
if (!columnsCreated)
{
DataColumn column =
new DataColumn(xniColumnNode.Current.Name);
table.Columns.Add(column);
}
else
row[xniColumnNode.Current.Name] =
xniColumnNode.Current.Value;
}
}
else
{
XPathNodeIterator xniColumnNode = xniRowNode.Clone();
bool onAttribute = xniColumnNode.Current.MoveToFirstAttribute();
while (onAttribute)
{
if (!columnsCreated)
{
DataColumn column =
new DataColumn(xniColumnNode.Current.Name);
table.Columns.Add(column);
}
else
row[xniColumnNode.Current.Name] =
xniColumnNode.Current.Value;
onAttribute = xniColumnNode.Current.MoveToNextAttribute();
}
}
if (!columnsCreated)
{
columnsCreated = true;
resetRow = true;
}
if (row != null)
table.Rows.Add(row);
}
}
}
return table.Rows.Count;
}
/// <summary>
/// (NOT IMPLEMENTED) Fill a data set with the
/// content from a specified xml document object.
/// </summary>
/// <param name="dataset">Data set to be filled.</param>
/// <param name="source">Xml document object
/// of which content will fill the data table.</param>
/// <param name="useNodes">True if nodes names
/// should be used for columns names, otherwise attributes will be used.</param>
/// <returns>Number of filled rows.</returns>
public int Fill(DataSet dataset, XmlDocument source, bool useNodes)
{
throw new NotImplementedException();
}
#endregion
#region Private Methods
#endregion
}
}
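To make the node-based mapping easier to see in isolation, here is a minimal, self-contained sketch that fills a DataTable from a small XML document using only System.Data and System.Xml. It is not the HDPAdapter implementation: the class name FillSketch, the method FillFromNodes, and the sample XML are assumptions made for illustration, and the column handling is simplified (columns are added on demand instead of being taken from the first row).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Xml;
using System.Xml.XPath;

public static class FillSketch
{
    // Hypothetical helper: every child element of <TableName> becomes a row,
    // and every child element of a row becomes a string column.
    public static int FillFromNodes(DataTable table, XmlDocument source)
    {
        if (table == null || source == null || table.TableName.Length == 0)
            return 0;

        XPathNavigator navigator = source.CreateNavigator();
        XPathNodeIterator rows = navigator.Select("//" + table.TableName + "/*");
        while (rows.MoveNext())
        {
            // Read the cells first so the columns exist before the row is created.
            var cells = new List<KeyValuePair<string, string>>();
            XPathNodeIterator columns = rows.Current.SelectChildren(XPathNodeType.Element);
            while (columns.MoveNext())
                cells.Add(new KeyValuePair<string, string>(columns.Current.Name, columns.Current.Value));

            foreach (KeyValuePair<string, string> cell in cells)
                if (!table.Columns.Contains(cell.Key))
                    table.Columns.Add(cell.Key, typeof(string));

            DataRow row = table.NewRow();
            foreach (KeyValuePair<string, string> cell in cells)
                row[cell.Key] = cell.Value;
            table.Rows.Add(row);
        }
        return table.Rows.Count;
    }

    public static void Main()
    {
        var document = new XmlDocument();
        document.LoadXml(
            "<cities>" +
            "<city><name>Miami</name><county>Miami-Dade</county></city>" +
            "<city><name>Tampa</name><county>Hillsborough</county></city>" +
            "</cities>");

        // The table name doubles as the XPath root, as in HDPAdapter.Fill.
        var table = new DataTable("cities");
        Console.WriteLine(FillFromNodes(table, document)); // prints 2
        foreach (DataRow row in table.Rows)
            Console.WriteLine(row["name"] + " / " + row["county"]);
    }
}
```

As in the adapter above, the table name is what ties the DataTable to the XML: a table named "cities" is filled from the children of the cities element.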
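IHDPConnection and HDPConnection
As the name suggests, this is the connection class, which manages the behavior of a connection in an abstract way. The interface exposes a set of connection-related methods and properties. Only three methods are exposed and implemented:
- Open (changes the connection state to open; this method has an overload that takes the URL of the web resource to be opened as a parameter)
- Close (changes the connection state to closed and, if cache storage is in use, closes it)
- CreateCommand (creates a new HDPCommand object and assigns the current connection to it)
Now let's look at the properties exposed by IHDPConnection and implemented by HDPConnection:
- ConnectionURL (the URL of the web resource that will be opened with the current connection)
- KeepAlive (whether the connection should stay open after the query completes)
- AutoRedirect (whether the connection is allowed to perform automatic redirects)
- MaxAutoRedirects (how many automatic redirects may be performed)
- UserAgent (the user agent associated with the connection, for example Internet Explorer, Chrome, Opera)
- ConnectionState (read-only; reports whether the connection is open or closed)
- Proxy (the proxy used when executing queries)
- Cookies (the cookies associated with the connection, currently or during the query)
- ContentType (the content type expected when querying, for example application/x-www-form-urlencoded or application/json)
- Headers (the headers associated with the connection, currently or during the query)
- Referer (the referer used when querying the connection URL)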
using System.Collections.Generic;
using System.Net;
namespace HttpData.Client
{
/// <summary>
/// Provides functionality for connection management of different
/// web sources. Is implemented by HDPConnection.
/// </summary>
public interface IHDPConnection
{
#region MEMBERS
#region METHODS
/// <summary>
/// Open connection.
/// </summary>
void Open();
/// <summary>
/// Close connection.
/// </summary>
void Close();
/// <summary>
/// Create a new HDPCommand object associated with this connection.
/// </summary>
/// <returns>HDPCommand object associated with this connection.</returns>
IHDPCommand CreateCommand();
#endregion
#region PROPERTIES
/// <summary>
/// Get or set connection url.
/// </summary>
string ConnectionURL { get; set; }
/// <summary>
/// Get or set the value which specifies
/// if the connection should be kept open.
/// </summary>
bool KeepAlive { get; set; }
/// <summary>
/// Get or set the value which specifies if auto redirection is allowed.
/// </summary>
bool AutoRedirect { get; set; }
/// <summary>
/// Get or set the maximum number of auto redirections.
/// </summary>
int MaxAutoRedirects { get; set; }
/// <summary>
/// Get or set the value which specifies the user agent to be used.
/// </summary>
string UserAgent { get; set; }
/// <summary>
/// Get the value which specifies the state of the connection.
/// </summary>
HDPConnectionState ConnectionState { get; }
/// <summary>
/// Get or set the value which specifies the connection proxy.
/// </summary>
HDPProxy Proxy { get; set; }
/// <summary>
/// Get or set the value which specifies the cookies used by connection.
/// </summary>
CookieCollection Cookies { get; set; }
/// <summary>
/// Get or set the value which specifies the content type.
/// </summary>
string ContentType { get; set; }
/// <summary>
/// Get or set headers details used in HttpWebRequest operations.
/// </summary>
List<HDPConnectionHeader> Headers { get; set; }
/// <summary>
/// Get or set Http referer.
/// </summary>
string Referer { get; set; }
#endregion
#endregion
}
}
using System.Collections.Generic;
using System.Net;
namespace HttpData.Client
{
/// <summary>
/// Provides functionality for connection management of different web sources.
/// </summary>
public class HDPConnection : IHDPConnection
{
#region Private Variables
private HDPConnectionState _connectionState;
private string _connectionURL;
private HDPCache cache;
private bool useCache;
#endregion
#region Properties
/// <summary>
/// Get the value which specifies if caching will be used.
/// </summary>
public bool UseCahe
{
get { return useCache; }
}
/// <summary>
/// Get HDPCache object.
/// </summary>
public HDPCache Cache
{
get { return cache; }
}
#endregion
#region .ctor
/// <summary>
/// Instantiate a new HDPConnection object.
/// </summary>
public HDPConnection()
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = "";
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
}
/// <summary>
/// Instantiate a new HDPConnection object.
/// </summary>
/// <param name="connectionURL">Url of the web source.</param>
public HDPConnection(string connectionURL)
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = connectionURL;
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
}
/// <summary>
/// Instantiate a new HDPConnection object.
/// </summary>
/// <param name="cacheDefinitions">HDPCacheDefinition
/// object used by caching mechanism.</param>
public HDPConnection(HDPCacheDefinition cacheDefinitions)
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = "";
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
useCache = true;
}
/// <summary>
/// Instantiate a new HDPConnection object.
/// </summary>
/// <param name="connectionURL">Url of the web source.</param>
/// <param name="cacheDefinitions">HDPCacheDefinition
/// object used by caching mechanism.</param>
public HDPConnection(string connectionURL, HDPCacheDefinition cacheDefinitions)
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = connectionURL;
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
useCache = true;
}
#endregion
#region Public Methods
#endregion
#region IHDPConnection Members
#region Methods
/// <summary>
/// Open connection.
/// </summary>
public void Open()
{
_connectionState = HDPConnectionState.Open;
}
/// <summary>
/// Open connection using a specific url.
/// </summary>
/// <param name="connectionURL">Url of the web source.</param>
public void Open(string connectionURL)
{
_connectionURL = connectionURL;
_connectionState = HDPConnectionState.Open;
}
/// <summary>
/// Close connection.
/// </summary>
public void Close()
{
_connectionState = HDPConnectionState.Closed;
if (cache != null)
cache.CloseStorageConnection();
}
/// <summary>
/// Create a new IHDPCommand object associated with this connection.
/// </summary>
/// <returns>IHDPCommand object associated with this connection.</returns>
IHDPCommand IHDPConnection.CreateCommand()
{
HDPCommand command = new HDPCommand { Connection = this };
return command;
}
/// <summary>
/// Create a new HDPCommand object associated with this connection.
/// </summary>
/// <returns>HDPCommand object associated with this connection.</returns>
public HDPCommand CreateCommand()
{
HDPCommand command = new HDPCommand { Connection = this };
return command;
}
#endregion
#region Properties
/// <summary>
/// Get or set connection url.
/// </summary>
public string ConnectionURL
{
get { return _connectionURL; }
set { _connectionURL = value; }
}
/// <summary>
/// Get or set the value which specifies if auto redirection is allowed.
/// </summary>
public bool AutoRedirect { get; set; }
/// <summary>
/// Get or set the maximum number of auto redirections.
/// </summary>
public int MaxAutoRedirects { get; set; }
/// <summary>
/// Get or set the value which specifies if the
/// connection should be kept open.
/// </summary>
public bool KeepAlive { get; set; }
/// <summary>
/// Get or set the value which specifies the user agent to be used.
/// </summary>
public string UserAgent { get; set; }
/// <summary>
/// Get or set the value which specifies the content type.
/// </summary>
public string ContentType { get; set; }
/// <summary>
/// Get or set the value which specifies the cookies used by connection.
/// </summary>
public CookieCollection Cookies { get; set; }
/// <summary>
/// Get the value which specifies the state of the connection.
/// </summary>
public HDPConnectionState ConnectionState
{
get { return _connectionState; }
}
/// <summary>
/// Get or set the value which specifies the connection proxy.
/// </summary>
public HDPProxy Proxy { get; set; }
/// <summary>
/// Get or set headers details used in HttpWebRequest operations.
/// </summary>
public List<HDPConnectionHeader> Headers { get; set; }
/// <summary>
/// Get or set Http referer.
/// </summary>
public string Referer { get; set; }
#endregion
#endregion
#region IDisposable Members
///<summary>
/// Dispose current object.
///</summary>
public void Dispose()
{
this.dispose();
System.GC.SuppressFinalize(this);
}
private void dispose()
{
if (_connectionState == HDPConnectionState.Open)
this.Close();
}
#endregion
}
}
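Judging from the listing above, Open and Close only change the connection state, so the settings have to be applied to an HTTP request somewhere else. As a rough sketch of how such settings could be applied to an HttpWebRequest, the hypothetical helper below copies them onto a request without sending it. RequestSketch.Build is not part of the library, and the URL, user agent, and cookie are placeholders. On .NET 6 and later, WebRequest.Create still works but produces an obsolescence warning.

```csharp
using System;
using System.Net;

public static class RequestSketch
{
    // Hypothetical helper: applies connection-style settings to a new request.
    // Nothing is sent over the network here.
    public static HttpWebRequest Build(string url, bool keepAlive, bool autoRedirect,
        int maxAutoRedirects, string userAgent, string referer, CookieCollection cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.KeepAlive = keepAlive;
        request.AllowAutoRedirect = autoRedirect;
        request.MaximumAutomaticRedirections = maxAutoRedirects; // must be greater than 0
        request.UserAgent = userAgent;
        request.Referer = referer;

        // Cookies travel in a CookieContainer that is scoped by URI.
        request.CookieContainer = new CookieContainer();
        if (cookies != null)
            request.CookieContainer.Add(new Uri(url), cookies);
        return request;
    }

    public static void Main()
    {
        var cookies = new CookieCollection();
        cookies.Add(new Cookie("session", "abc"));

        HttpWebRequest request = Build("http://example.com/", true, true, 10,
            "sketch-agent", "http://example.com/start", cookies);

        Console.WriteLine(request.Method); // prints GET, the default method
        Console.WriteLine(request.MaximumAutomaticRedirections);
        Console.WriteLine(request.UserAgent);
    }
}
```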
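IHDPCommand and HDPCommand
This is the engine that provides the functionality for querying web resources and processing the results (responses). It offers several ways of processing the content of a query response, for example XPath, RegEx, XSLT, and reflection. I will only discuss the main methods in detail; the others build on them, and I assume the comments that accompany the methods are enough to point you in the right direction. Before getting to the methods, though, let me introduce the properties. I will not paste the content of the HDPCommand class here, because it has a lot of lines of code. You can study it in detail in the provided source code.
- Connection (the connection object associated with this command)
- Parameters (the parameters used in the query)
- CommandType (the type of command used in the query; either GET or POST)
- CommandText (the content of the command to execute; for a GET command it holds the URL with the query parameters; for a POST command it holds the body of the POST operation)
- CommandTimeout (how long to wait for a response from the web resource)
- Response (the response string received from the web resource for the query)
- Uri (the URI of the queried web resource)
- Path (the path of the queried web resource)
- LastError (the last error message encountered during processing)
- ContentLength (the length of the content received from the web resource for the query)
Now we can look at the exposed/implemented methods. Each Execute* method runs the query against the web resource using GET or POST; they differ in what they return.
- GetParametersCount (gets the number of parameters used in the query)
- CreateParameter (creates a new parameter to use in the query)
- ExecuteNonQuery (returns the number of results received; its parameter specifies whether the parameter collection used by the query is cleared at the end)
- Execute (returns a Boolean: true if the query executed successfully, false if it failed)
- ExecuteStream (returns the underlying HTTP response stream)
- CloseResponse (closes the HTTP response opened by the ExecuteStream method)
- ExecuteNavigator (returns an XPathNavigator object for navigating the response converted to XML; its parameter specifies whether the parameter collection used by the query is cleared at the end)
- ExecuteDocument (returns an IXPathNavigable object for navigating the response converted to XML; it has an overload whose "expression" parameter is the XPath expression applied when processing the result)
- ExecuteBinary (returns the result as a byte array; mainly used to query binary content such as PDF files and images; one overload takes a parameter that limits the output buffer)
- ExecuteBinaryConversion (returns the result as a string; used to query binary content such as PDF files and convert the PDF content from binary to string)
- ExecuteString (returns the result as a plain string)
- ExecuteValue (returns, as a string, the result of applying an XPath expression to the response; a RegEx can be used instead of the XPath expression)
- ExecuteCollection (returns the result as a generic collection of strings, obtained by applying an XPath or RegEx expression to the response)
- ExecuteArray (returns the result as a string array, obtained by applying an XPath or RegEx expression to the response)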
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;
namespace HttpData.Client
{
/// <summary>
/// Provides functionality for querying and processing data from
/// different web sources. Is implemented by HDPCommand.
/// </summary>
public interface IHDPCommand
{
#region Members
#region Properties
/// <summary>
/// Get or set the command connection object.
/// </summary>
IHDPConnection Connection { get; set; }
/// <summary>
/// Get or set the command parameters collection.
/// </summary>
IHDPParameterCollection Parameters { get; }
/// <summary>
/// Get or set the command type.
/// </summary>
HDPCommandType CommandType { get; set; }
/// <summary>
/// Get or set the command text.
/// </summary>
string CommandText { get; set; }
/// <summary>
/// Get or set the command timeout.
/// </summary>
int CommandTimeout { get; set; }
/// <summary>
/// Get the response retrieved from the server.
/// </summary>
string Response { get; }
/// <summary>
/// Get web resource URI.
/// </summary>
string Uri { get; }
/// <summary>
/// Get web resource absolute path.
/// </summary>
string Path { get; }
/// <summary>
/// Get the last error that occurred.
/// </summary>
string LastError { get; }
/// <summary>
/// Get the content length of response.
/// </summary>
long ContentLength { get; }
#endregion
#region Methods
/// <summary>
/// Get the parameters number.
/// </summary>
/// <returns>Number of parameters.</returns>
int GetParametersCount();
/// <summary>
/// Create a new IHDPParameter object.
/// </summary>
/// <returns>IHDPParameter parameter object.</returns>
IHDPParameter CreateParameter();
/// <summary>
/// Execute an expression against the web server and return the number of results.
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>Number of results determined by the expression.</returns>
int ExecuteNonQuery(bool clearParams);
/// <summary>
/// Execute a query against the web server and does not read the response stream.
/// </summary>
/// <returns>True if the command executed successfully, otherwise false.</returns>
bool Execute();
/// <summary>
/// Execute a query against the web server.
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>True if the command executed successfully, otherwise false.</returns>
bool Execute(bool clearParams);
/// <summary>
/// Execute a query against the web server.
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>Returns the underlying http response stream.</returns>
Stream ExecuteStream(bool clearParams);
/// <summary>
/// Closes the http response object.
/// Usable only with ExecuteStream method.
/// </summary>
void CloseResponse();
/// <summary>
/// Execute a query against the web server and return a XPathNavigator
/// object used to navigate thru the query result.
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>XPathNavigator object used
/// to navigate thru the query result.</returns>
XPathNavigator ExecuteNavigator(bool clearParams);
/// <summary>
/// Execute a query against the web server and return
/// a IXPathNavigable object used to navigate thru the query result.
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>IXPathNavigable object used to navigate thru the query result.</returns>
IXPathNavigable ExecuteDocument(bool clearParams);
/// <summary>
/// Execute a query against the web server, on query result it will apply
/// a xpath expression and return a IXPathNavigable
/// object used to navigate thru query result.
/// </summary>
/// <param name="expression">XPath expression.</param>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>IXPathNavigable object used to navigate thru query result.</returns>
IXPathNavigable ExecuteDocument(string expression, bool clearParams);
/// <summary>
/// Execute a query against the web server and return a byte[] object which
/// contains the binary query result. Used when querying
/// binary content from web server (E.g: PDF files).
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>Byte array object which contains the binary query result.</returns>
byte[] ExecuteBinary(bool clearParams);
/// <summary>
/// Execute a query against the web server and return a byte[] object which
/// contains the binary query result. Used when querying
/// binary content from web server (E.g: PDF files).
/// </summary>
/// <param name="boundaryLimit">Specify the limit
/// of the buffer which must be read.</param>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>Byte array object which contains the binary query result.</returns>
byte[] ExecuteBinary(int boundaryLimit, bool clearParams);
/// <summary>
/// Execute a query against the web server and return a string object
/// which contains the representation of the binary query result.
/// Used when querying binary content from web server (E.g: PDF files).
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>String object which contains the representation
/// of the binary query result.</returns>
string ExecuteBinaryConversion(bool clearParams);
/// <summary>
/// Execute a query against the web server and return a string
/// object which contains the representation of the query result.
/// </summary>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>String object which contains
/// the representation of the query result.</returns>
string ExecuteString(bool clearParams);
/// <summary>
/// Execute a query against the web server, on query result it will apply
/// a xpath expression and return a string object which contains
/// the representation of the query result value.
/// </summary>
/// <param name="expression">XPath expression.</param>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>String object which contains
/// the representation of the query result value.</returns>
string ExecuteValue(string expression, bool clearParams);
/// <summary>
/// Execute a query against the web server, on query result it will apply
/// a regular expression and return a string object which
/// contains the representation of the query result value.
/// </summary>
/// <param name="expression">Regular expression.</param>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <param name="isRegEx">Specifies that a regular expression
/// is used; it must always be true.</param>
/// <returns>String object which contains
/// the representation of the query result value.</returns>
string ExecuteValue(string expression, bool clearParams, bool isRegEx);
/// <summary>
/// Execute a query against the web server, on query result it will apply
/// a regular expression and return a List object which
/// contains the representation of the query result.
/// </summary>
/// <param name="expression">Regular expression.</param>
/// <param name="clearParams">Specify if the parameters collection
/// should be cleared after the query is executed.</param>
/// <param name="isRegEx">Specifies that a regular expression
/// is used; it must always be true.</param>
/// <returns>List object which contains
/// the representation of the query result.</returns>
List<string> ExecuteCollection(string expression, bool clearParams, bool isRegEx);
/// <summary>
/// Execute a query against the web server, on query result it will apply
/// a xpath expression and return a List object which
/// contains the representation of the query result.
/// </summary>
/// <param name="expression">XPath expression.</param>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>List object which contains
/// the representation of the query result.</returns>
List<string> ExecuteCollection(string expression, bool clearParams);
/// <summary>
/// Execute a query against the web server, on query result it will apply
/// a regular expression and return a string array object which
/// contains the representation of the query result.
/// </summary>
/// <param name="expression">Regular expression.</param>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <param name="isRegEx">Specifies that a regular expression
/// is used; it must always be true.</param>
/// <returns>String array object which contains
/// the representation of the query result.</returns>
string[] ExecuteArray(string expression, bool clearParams, bool isRegEx);
/// <summary>
/// Execute a query against the web server, on query result it will apply
/// a xpath expression and return a string array object
/// which contains the representation of the query result.
/// </summary>
/// <param name="expression">XPath expression.</param>
/// <param name="clearParams">Specify if the parameters
/// collection should be cleared after the query is executed.</param>
/// <returns>String array object which contains
/// the representation of the query result.</returns>
string[] ExecuteArray(string expression, bool clearParams);
#endregion
#endregion
}
}
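Several of the methods above accept either an XPath expression or a regular expression. The sketch below shows the two extraction styles side by side on a hard-coded XML string, using only framework classes. It says nothing about how HDPCommand is implemented internally; the class ExtractSketch, its two methods, and the sample markup are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.XPath;

public static class ExtractSketch
{
    // Returns the string value of every node selected by the XPath expression.
    public static List<string> ByXPath(string xml, string expression)
    {
        var results = new List<string>();
        var document = new XPathDocument(new StringReader(xml));
        XPathNodeIterator nodes = document.CreateNavigator().Select(expression);
        while (nodes.MoveNext())
            results.Add(nodes.Current.Value);
        return results;
    }

    // Returns every match of the regular expression in the raw text.
    public static List<string> ByRegex(string text, string pattern)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(text, pattern))
            results.Add(match.Value);
        return results;
    }

    public static void Main()
    {
        const string xml =
            "<html><body><a href=\"/a\">One</a><a href=\"/b\">Two</a></body></html>";

        // XPath works on the parsed tree; the regular expression works on the text.
        Console.WriteLine(string.Join(", ", ByXPath(xml, "//a/@href")));            // prints /a, /b
        Console.WriteLine(string.Join(", ", ByRegex(xml, @"(?<=href="")[^""]+"))); // prints /a, /b
    }
}
```

XPath needs well-formed XML, which is why the response is converted to XML first; a regular expression can be applied to the raw response when that conversion does not give the expected structure.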
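HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage
HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage are the classes that handle caching. I will not dwell on this topic, because it is not that important in this context. You can study these classes in more detail yourself if you like; I think the code comments will help you understand their purpose and functionality quickly. The HDPCacheDefinition class is simple; it contains a set of settings that define the cache behavior. Here they are:
- StorageActiveUntil (the date until which the cache is considered valid)
- MemorySizeLimit (limits the memory size of the cache)
- ObjectsNumberLimit (limits the number of objects in the cache)
- UseStorage (whether the cache is persisted to disk)
- RetrieveFromStorage (whether a value should be looked up in the cache persisted on disk)
- RealtimePersistance (whether a new value is persisted to disk in real time, as soon as it is added to the cache)
- StorageName (the name of the cache file on disk)
The HDPCacheObject class is just a key/value pair plus a timestamp field that identifies the age of the cached object. The way the cache works is straightforward. When a web resource is queried, its URL is the key of the cache object and the query result is the value. When you query a web resource with an HDPCommand object and caching is enabled, each query URL and its response content are stored in the memory cache. If the same web resource is queried again with the same URL, no HTTP request is made; the response content is retrieved from the memory cache instead. HDPCacheDefinition has additional options that let you control the cache behavior. For example, if you set a cache memory limit of 1024 KB, the memory footprint is computed each time a new value is added to the cache. If the limit is exceeded, the cache content is either persisted to disk or removed, depending on the other behavior settings. Note that MemorySizeLimit and ObjectsNumberLimit are mutually exclusive. If you define a value greater than 0 for MemorySizeLimit, there is no point in defining a value for ObjectsNumberLimit, because it will not be taken into account, and vice versa.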
using System;
namespace HttpData.Client
{
///<summary>
/// Defines the cache options.
///</summary>
public class HDPCacheDefinition
{
#region Public Variables
/// <summary>
/// Specifies the date until which the cache is valid.
/// </summary>
public DateTime StorageActiveUntil = DateTime.Now.AddDays(1);
/// <summary>
/// Specifies the limit size of the cache memory.
/// </summary>
public long MemorySizeLimit;
/// <summary>
/// Specifies the limit number of objects which can be stored in the cache.
/// </summary>
public int ObjectsNumberLimit = 10000;
/// <summary>
/// Specifies if disk storage will be used.
/// </summary>
public bool UseStorage = true;
///<summary>
/// Specifies if the data should be retrieved from the disk storage.
///</summary>
public bool RetrieveFromStorage;
/// <summary>
/// Specifies if the persistance of the cache on disk will be done in real time.
/// </summary>
public bool RealtimePersistance;
/// <summary>
/// Specifies the name of the file of the disk storage.
/// </summary>
public string StorageName = "HttpDataProcessorCahe.che";
#endregion
}
}
using System;
namespace HttpData.Client
{
/// <summary>
/// Container for the cached data based on key value pair.
/// </summary>
[Serializable]
public class HDPCacheObject
{
#region Private Variables
private string key;
private object value;
private DateTime cacheDate;
#endregion
#region Properties
/// <summary>
/// Get or set the cache object key.
/// </summary>
public string Key
{
get { return key; }
set { key = value; }
}
/// <summary>
/// Get or set the cache object value.
/// </summary>
public object Value
{
get { return value; }
set { this.value = value; }
}
/// <summary>
/// Get the cache object date.
/// </summary>
public DateTime CacheDate
{
get { return cacheDate; }
}
#endregion
#region .ctor
/// <summary>
/// Instantiate a new HDPCacheObject object.
/// </summary>
public HDPCacheObject()
{
cacheDate = DateTime.Now;
}
/// <summary>
/// Instantiate a new HDPCacheObject object.
/// </summary>
/// <param name="key">Key for the cache object</param>
/// <param name="value">Value for the cache object</param>
public HDPCacheObject(string key, object value)
{
this.key = key;
this.value = value;
cacheDate = DateTime.Now;
}
#endregion
#region Public Methods
#endregion
#region Private Methods
#endregion
}
}
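The caching idea described above (URL as key, response as value, a cap on the number of objects) can be sketched in a few lines. This is not the HDPCache class: UrlCacheSketch, GetOrAdd, the fetch delegate, and the oldest-first eviction policy are all assumptions made for illustration, and disk persistence is left out entirely.

```csharp
using System;
using System.Collections.Generic;

public class UrlCacheSketch
{
    private readonly Dictionary<string, string> _values = new Dictionary<string, string>();
    private readonly Queue<string> _insertionOrder = new Queue<string>();
    private readonly int _objectsNumberLimit;

    public UrlCacheSketch(int objectsNumberLimit)
    {
        if (objectsNumberLimit < 1)
            throw new ArgumentOutOfRangeException("objectsNumberLimit");
        _objectsNumberLimit = objectsNumberLimit;
    }

    public int Count
    {
        get { return _values.Count; }
    }

    // Returns the cached response for the URL, or calls fetch and stores the result.
    public string GetOrAdd(string url, Func<string, string> fetch)
    {
        string cached;
        if (_values.TryGetValue(url, out cached))
            return cached; // cache hit: fetch is not called

        string response = fetch(url);
        if (_values.Count >= _objectsNumberLimit)
            _values.Remove(_insertionOrder.Dequeue()); // drop the oldest entry (assumed policy)
        _values[url] = response;
        _insertionOrder.Enqueue(url);
        return response;
    }

    public static void Main()
    {
        int fetches = 0;
        Func<string, string> fetch = url => { fetches++; return "response for " + url; };

        var cache = new UrlCacheSketch(2);
        cache.GetOrAdd("http://a", fetch);
        cache.GetOrAdd("http://a", fetch); // served from memory
        cache.GetOrAdd("http://b", fetch);
        cache.GetOrAdd("http://c", fetch); // limit reached, http://a is evicted

        Console.WriteLine(fetches);     // prints 3
        Console.WriteLine(cache.Count); // prints 2
    }
}
```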
Using the Code
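I will give a few examples so you can see how it works; I think that is the best way to understand how the earth spins. Suppose we want to retrieve all Florida cities from the following page: http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34. Here is the code that does it.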
using System;
using System.Collections.Generic;
using HttpData.Client;
namespace CityStates
{
class Program
{
static void Main(string[] args)
{
//Local constants cannot carry an access modifier such as private.
const string connectionUrl =
"http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34";
//Create a new instance of HDPCacheDefinition object.
HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
{
UseStorage = false,
StorageActiveUntil = DateTime.Now,
ObjectsNumberLimit = 10000,
RealtimePersistance = false,
RetrieveFromStorage = false,
//We will not use a disk storage
StorageName = null
};
//Create a new instance of HDPConnection object.
//Pass as parameters the initial connection URL and the cache definition object.
HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
{
//Define the content type we would expect.
ContentType = HDPContentType.TEXT,
//We want to allow autoredirects
AutoRedirect = true,
//Do not perform more than 10 autoredirects
MaxAutoRedirects = 10,
//The user agent is FireFox 3
UserAgent = HDPAgents.FIREFOX_3,
//We do not want to use a proxy
Proxy = null // If you want to use a proxy: Proxy =
// new HDPProxy("http://127.0.0.1:999/"
// /*This is your proxy address and its port*/,
// "PROXY_USER_NAME", "PROXY_PASSWORD")
};
//Open the connection
connection.Open();
//Create a new instance of HDPCommand object.
//Pass as parameter the HDPConnection object.
HDPCommand command = new HDPCommand(connection)
{
//Activate the memory cache for fast access
//on same web resource multiple times
ActivatePool = true,
//We will perform a GET action
CommandType = HDPCommandType.Get,
//Set the time out period
CommandTimeout = 60000,
//Use MSHTML library instead of HtmlAgilityPack
//(if the value is false then HtmlAgilityPack would be used)
UseMsHtml = true
};
//Execute the query on the web resource. The received
//HTTPWebResponse content will be converted to XML
// and the XPath expression will be executed.
//The method will return the list of Florida state cities.
List<string> cities =
command.ExecuteCollection("//ul/li/b//text()[normalize-space()]", true);
foreach (string city in cities)
Console.WriteLine(city);
connection.Close();
}
}
}
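To make the XPath expression in the listing above easier to follow, here is a minimal sketch that runs the same query with the standard System.Xml classes instead of the HTTP Data Client. The sample markup, the `XPathDemo` class and the `SelectTexts` helper are hypothetical and exist only for illustration; the real page and the XML produced by the library will differ.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathDemo
{
    // Hypothetical markup standing in for the page after conversion to XML.
    public const string SampleXml =
        "<html><body><ul>" +
        "<li><b>Miami</b> (Miami-Dade County)</li>" +
        "<li><b>\n  Orlando\n</b></li>" +
        "<li><b> </b></li>" +
        "<li>Not bold, so it is not selected</li>" +
        "</ul></body></html>";

    // Runs an XPath query that selects text nodes and returns their trimmed values.
    public static List<string> SelectTexts(string xml, string xpath)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xml);

        List<string> values = new List<string>();
        foreach (XmlNode node in document.SelectNodes(xpath))
            values.Add(node.Value.Trim());
        return values;
    }
}
```

Calling `XPathDemo.SelectTexts(XPathDemo.SampleXml, "//ul/li/b//text()[normalize-space()]")` selects only text found inside a `b` element within a list item. The `[normalize-space()]` predicate keeps a text node only if it still has content after whitespace normalization, so the whitespace-only `b` element contributes nothing.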
Now for another example. Suppose we want to log in to LinkedIn with a user name and a password. The code below accomplishes this.
using System;
using System.Collections.Generic;
using HttpData.Client;
namespace CityStates
{
class Program
{
static void Main(string[] args)
{
//A local constant cannot carry an access modifier in C#.
const string connectionUrl =
"https://www.linkedin.com/secure/login?trk=hb_signin";
//Create a new instance of HDPCacheDefinition object.
HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
{
UseStorage = false,
StorageActiveUntil = DateTime.Now,
ObjectsNumberLimit = 10000,
RealtimePersistance = false,
RetrieveFromStorage = false,
//We will not use a disk storage
StorageName = null
};
//Create a new instance of HDPConnection object.
//Pass as parameters the initial connection URL and the cache definition object.
HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
{
//Define the content type we would expect.
ContentType = HDPContentType.TEXT,
//We want to allow autoredirects
AutoRedirect = true,
//Do not perform more than 10 autoredirects
MaxAutoRedirects = 10,
//The user agent is FireFox 3
UserAgent = HDPAgents.FIREFOX_3,
//We do not want to use a proxy
Proxy = null // If you want to use a proxy: Proxy =
// new HDPProxy("http://127.0.0.1:999/"
// /*This is your proxy address and its port*/,
// "PROXY_USER_NAME", "PROXY_PASSWORD")
};
//Open the connection
connection.Open();
//Create a new instance of HDPCommand object.
//Pass as parameter the HDPConnection object.
HDPCommand command = new HDPCommand(connection)
{
//Activate the memory cache for fast access
//on same web resource multiple times
ActivatePool = true,
//We will perform a GET action
CommandType = HDPCommandType.Get,
//Set the time out period
CommandTimeout = 60000,
//Use HtmlAgilityPack (if the value is true then MSHTML would be used)
UseMsHtml = false
};
//Define the query parameters used in the POST action.
//The actual parameter name used by a browser
//to authenticate you on LinkedIn does not include the '@' sign.
//Use an HTTP request analyzer and you will notice the difference.
//This is what the actual POST body looks like:
// csrfToken="ajax:-3801133150663455891"&session_key
// ="YOUR_EMAIL@gmail.com"&session_password="YOUR_PASSWORD"
// &session_login="Sign+In"&session_login=""&session_rikey=""
HDPParameterCollection parameters = new HDPParameterCollection();
HDPParameter pToken =
new HDPParameter("@csrfToken", "ajax:-3801133150663455891");
HDPParameter pSessionKey =
new HDPParameter("@session_key", "YOUR_EMAIL@gmail.com");
HDPParameter pSessionPass =
new HDPParameter("@session_password", "YOUR_PASSWORD");
HDPParameter pSessionLogin =
new HDPParameter("@session_login", "Sign+In");
HDPParameter pSessionLogin_ = new HDPParameter("@session_login", "");
HDPParameter pSessionRiKey = new HDPParameter("@session_rikey", "");
parameters.Add(pToken);
parameters.Add(pSessionKey);
parameters.Add(pSessionPass);
parameters.Add(pSessionLogin);
parameters.Add(pSessionLogin_);
parameters.Add(pSessionRiKey);
//If everything went OK, LinkedIn will ask us to redirect
//(unfortunately autoredirect doesn't work in this case).
//Get the manual redirect URL value.
string value = command.ExecuteValue(
"//a[@id='manual_redirect_link']/@href", true);
if (value != null && String.Compare(value,
"http://www.linkedin.com/home") == 0)
{
command.Connection.ConnectionURL = value;
command.CommandType = HDPCommandType.Get;
//Using the manual redirect URL, check if the opened
//web page contains the welcome message.
//If it does contain the message, then we are in.
string content =
command.ExecuteString("//title[contains(.,'Welcome,')]", true);
//Guard against a null result as well as an empty one.
if (!String.IsNullOrEmpty(content))
Console.WriteLine(content);
else
Console.WriteLine("Login failed!");
}
connection.Close();
}
}
}
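The comments in the listing above describe the POST body that a browser sends. As a rough illustration of how such a body is put together, the sketch below builds an `application/x-www-form-urlencoded` string from name/value pairs using only the .NET base class library. It is not the library's own implementation: the `FormBodySketch` class is hypothetical, and stripping the leading '@' from parameter names is an assumption based on the comment in the listing. Note also that `Uri.EscapeDataString` encodes a space as `%20`, whereas browsers typically send `+` in form bodies.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FormBodySketch
{
    // Builds a form-encoded body such as "a=1&b=2" from name/value pairs.
    // A list of pairs is used (not a dictionary) because a form may repeat
    // a field name, as the login form does with session_login.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (body.Length > 0)
                body.Append('&');

            // Assumption: the '@' prefix is only a marker and is not sent.
            string name = field.Key.TrimStart('@');

            body.Append(Uri.EscapeDataString(name));
            body.Append('=');
            body.Append(Uri.EscapeDataString(field.Value ?? String.Empty));
        }
        return body.ToString();
    }
}
```

For example, passing the pairs `("@csrfToken", "ajax:-3801133150663455891")` and `("@session_key", "YOUR_EMAIL@gmail.com")` yields a single string in which the two fields are joined by `&` and the reserved characters `:` and `@` in the values are percent-encoded.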
If your sample project uses MSHTML, add the following content to its app.config file.
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<appSettings>
<add key="LogFilePath" value="..\Log\My-Log.txt"/>
<add key="HtmlTagsPath" value="HtmlTags.txt"/>
<add key="AttributesTagsPath" value="HtmlAttributes.txt"/>
</appSettings>
</configuration>
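Notes
- HttpData.Client.Pdf - Not all of its content is mine. I do not remember where I got some parts of it.
- HDPUtils.cs - I am not proud of its content; I think it is quite messy, so please ignore it for now.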
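Issues
HtmlAgilityPack - The content it produces sometimes does not match the actual HTML DOM structure, especially when form elements are involved.
MSHTML - It strips everything between the html tag and the body tag (including the html tag). It also validates the incoming HTML against a list of valid elements and attributes, so anything that does not match is removed. Note that JavaScript content is removed by default. You can change this behavior in the HtmlLoader.cs class of the HttpData.Client.MsHtmlToXml project.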
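Since each parser has its own weak spots, one way to cope is to try HtmlAgilityPack first and fall back to MSHTML only when the query returns nothing. The sketch below shows that idea in a library-independent form. The `ParserFallbackSketch` class is hypothetical, and the caller supplies a delegate that runs the query with a given `UseMsHtml` value; treating an empty result as a failure is an assumption that will not suit every page.

```csharp
using System;
using System.Collections.Generic;

public static class ParserFallbackSketch
{
    // runQuery receives the UseMsHtml flag and returns the query results.
    // The faster parser is tried first; MSHTML is used only if it finds nothing.
    public static List<string> Query(Func<bool, List<string>> runQuery)
    {
        List<string> result = runQuery(false);   // HtmlAgilityPack
        if (result == null || result.Count == 0)
            result = runQuery(true);             // MSHTML

        return result ?? new List<string>();
    }
}
```

With the HTTP Data Client, the delegate could set `command.UseMsHtml` to the flag it receives and then return `command.ExecuteCollection(xpath, true)`, assuming the property can be changed between executions.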
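Points of Interest
The kinds of applications in which this library can be used are fairly obvious.
History
There are no updates yet, but I am sure there will be some in the future.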