CodeProject Article Scraper

#realJSOP

4.72/5 (43 votes)

October 13, 2008

CPOL

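Scrape your "My Articles" page on CodeProject so you can keep track of your articles.

Download CPAM - 365 KB

Note - Because of recent (and somewhat radical) changes to the CodeProject page format, the code in this article no longer works. I have therefore written a completely new article that takes advantage of the new format. That article is here:

CodeProject Article Scraper, Revisited

I'm leaving this article on the site so that everyone has a chance to compare coding styles, structural changes, and even scraping methods.

Introduction

This article describes a method for scraping data from the My Articles page on CodeProject. There is currently no CodeProject API for retrieving this data, so scraping is the only way to get it. Unfortunately, the page format can change at any time and break this code, so you will need to keep an eye on it. That should be fairly easy, because I've already done all the heavy lifting for you; all you have to do is maintain it.

Important: check the History section at the bottom of this article, and make sure you apply the bug fix shown there.

The ArticleData Class

The ArticleData class holds the data for each article scraped from the web page. The most interesting aspect of this class is that it implements IComparable<ArticleData> and exposes a set of static Comparison<ArticleData> delegates, so a generic list of ArticleData objects can be sorted on any of the scraped values. There are several ways to sort a generic list, and I used the one that keeps the calling code the tidiest. My point is that you should pick whichever way you prefer. No one method is more correct than another; it comes down to programmer style and preference more than anything else.

How I Did It

I chose to have the ArticleData class implement IComparable and to put the functions needed for sorting inside the class itself. This keeps clutter out of the calling code, which makes that code easier to read. That's the way I like to do things: in my view, there's no reason to bother the programmer with unnecessary detail. Rather than post the entire class in this article, I'll simply show you two of the sorting functions: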

public class ArticleData : IComparable<ArticleData>
{
	// DATA MEMBERS

	// PROPERTIES

	#region Comparison delegates
	/// <summary>
	/// Title comparison for sort function
	/// </summary>
	public static Comparison<ArticleData> TitleCompare = delegate(ArticleData p1, ArticleData p2)
	{
		return (p1.SortAscending) ? p1.m_title.CompareTo(p2.m_title) : p2.m_title.CompareTo(p1.m_title);
	};
	/// <summary>
	/// Page views comparison for sort function
	/// </summary>
	public static Comparison<ArticleData> PageViewsCompare = delegate(ArticleData p1, ArticleData p2)
	{
		return (p1.SortAscending) ? p1.m_pageViews.CompareTo(p2.m_pageViews) : p2.m_pageViews.CompareTo(p1.m_pageViews);
	};


	// there are more comparison delegates here

	/// <summary>
	/// Default comparison (compares article ID) for sort function
	/// </summary>
	public int CompareTo(ArticleData other)
	{
		return ArticleID.CompareTo(other.ArticleID);
	}
	#endregion Comparison delegates

	// remaining members omitted
}
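
As a rough illustration of how calling code might use delegates like these, the sketch below sorts a generic list by handing a static Comparison<T> to List<T>.Sort. The Item and SortDemo classes and their members are stand-ins invented for this example; they are not the article's actual ArticleData class.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for ArticleData (hypothetical, trimmed to two fields).
public class Item : IComparable<Item>
{
    public int Id;
    public string Title;
    public bool SortAscending = true;

    // Static comparison delegate, in the same style as TitleCompare above.
    public static Comparison<Item> TitleCompare = delegate(Item a, Item b)
    {
        return a.SortAscending
            ? string.CompareOrdinal(a.Title, b.Title)
            : string.CompareOrdinal(b.Title, a.Title);
    };

    // Default ordering, used by List<T>.Sort() when no delegate is given.
    public int CompareTo(Item other)
    {
        return Id.CompareTo(other.Id);
    }
}

public static class SortDemo
{
    // Builds a small unsorted list to experiment with.
    public static List<Item> Sample()
    {
        return new List<Item>
        {
            new Item { Id = 2, Title = "Beta" },
            new Item { Id = 1, Title = "Gamma" },
            new Item { Id = 3, Title = "Alpha" }
        };
    }
}
```

Calling list.Sort(Item.TitleCompare) on the sample orders it by title, and list.Sort() with no argument falls back to the Id ordering in CompareTo. Because the direction flag is read from whichever item the sort happens to pass first, every item in the list should carry the same SortAscending value.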

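The ArticleUpdate Class

This class derives from ArticleData and, at first glance, looks identical to it, but it isn't. For the code to be truly useful, you need a way to identify what has changed since the last time the data was scraped. For the purposes of this demo, that is all this class does. I recognize that you may have different reasons for scraping the My Articles page, so you should be prepared to write your own class to do what your application needs. I'd guess your implementation will be more extensive than mine.

The class has its own sorting delegates. They are similar enough to the ones above that I decided not to show them here, because doing so would be redundant. The really interesting methods in this class are:

ApplyChanges

Each time an article is scraped from the web page, the scraper manager object (described in the next section) calls this method. If the article already exists in the list of articles, the method copies the current "latest" values into the previous-value fields and then stores the newly scraped values as the latest ones. It returns true if anything about the article has changed.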

public bool ApplyChanges(ArticleUpdate item, DateTime timeOfUpdate, bool newArticle)
{
	bool changed = false;

	// make them all the same
	this.m_title			= m_latestTitle;
	this.m_link			= m_latestLink;
	this.m_lastUpdated		= m_latestLastUpdated;
	this.m_description		= m_latestDescription;
	this.m_pageViews		= m_latestPageViews;
	this.m_rating			= m_latestRating;
	this.m_votes			= m_latestVotes;
	this.m_popularity		= m_latestPopularity;
	this.m_bookmarks		= m_latestBookmarks;

	// set new info
	this.m_latestTitle		= item.m_latestTitle;
	this.m_latestLink		= item.m_latestLink;
	// carry the last-updated date across as well; without this line the
	// m_lastUpdated comparison below can never detect a change
	this.m_latestLastUpdated	= item.m_latestLastUpdated;
	this.m_latestDescription	= item.m_latestDescription;
	this.m_latestPageViews		= item.m_latestPageViews;
	this.m_latestRating		= item.m_latestRating;
	this.m_latestVotes		= item.m_latestVotes;
	this.m_latestPopularity		= item.m_latestPopularity;
	this.m_latestBookmarks		= item.m_latestBookmarks;

	// make a note of the last update time stamp
	this.m_timeUpdated		= timeOfUpdate;
	this.m_newArticle		= newArticle;

	// see if anything changed since the last update
	changed = (this.m_title		!= m_latestTitle	||
		   this.m_link		!= m_latestLink		||
		   this.m_lastUpdated	!= m_latestLastUpdated	||
		   this.m_description	!= m_latestDescription	||
		   this.m_pageViews	!= m_latestPageViews	||
		   this.m_rating	!= m_latestRating	||
		   this.m_votes		!= m_latestVotes	||
		   this.m_popularity	!= m_latestPopularity	||
		   this.m_bookmarks	!= m_latestBookmarks	||
		   this.m_newArticle	== true);

	m_changed = changed;

	return changed;
}
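
The shift-then-compare idea behind ApplyChanges may be easier to see with a single field. The sketch below is a cut-down illustration only; the ViewTracker class and its Apply method are hypothetical names, not part of the download.

```csharp
using System;

// Hypothetical, cut-down tracker with a single tracked field.
public class ViewTracker
{
    private int m_pageViews;        // value as of the previous scrape
    private int m_latestPageViews;  // value as of the most recent scrape

    public int PageViews { get { return m_pageViews; } }
    public int LatestPageViews { get { return m_latestPageViews; } }

    // Shifts "latest" into "previous", stores the newly scraped value,
    // and reports whether the two now differ (a new item always counts).
    public bool Apply(int scrapedPageViews, bool isNew)
    {
        m_pageViews = m_latestPageViews;
        m_latestPageViews = scrapedPageViews;
        return isNew || m_pageViews != m_latestPageViews;
    }
}
```

After two scrapes that return the same count, Apply reports no change; a third scrape with a higher count reports a change, with PageViews holding the old figure and LatestPageViews the new one.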

PropertyChanged

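The PropertyChanged method lets you check whether a specific property has changed. Simply pass in the property name and handle the return value (true if the property's value has changed).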

public bool PropertyChanged(string property)
{
	string originalProperty = property;
	property = property.ToLower();
	switch (property)
	{
		case "title"		: return (Title != LatestTitle);
		case "link"		: return (Link != LatestLink);
		case "description"	: return (Description != LatestDescription);
		case "pageviews"	: return (PageViews != LatestPageViews);
		case "rating"		: return (Rating != LatestRating);
		case "votes"		: return (Votes != LatestVotes);
		case "popularity"	: return (Popularity != LatestPopularity);
		case "bookmarks"	: return (Bookmarks != LatestBookmarks);
		case "lastupdated"	: return (LastUpdated != LatestLastUpdated);
	}
	// if we get here, the property is invalid
	throw new Exception(string.Format("Unknown article property - '{0}'", originalProperty));
}

HowChanged

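This method accepts a property name and returns a ChangeType enum value indicating whether the new value is equal to, greater than, or less than the previously scraped value.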

public ChangeType HowChanged(string property)
{
	ChangeType changeType = ChangeType.None;

	string originalProperty = property;
	property = property.ToLower();

	switch (property)
	{
		case "title": 
			break;

		case "link": 
			break;

		case "description": 
			break;

		case "pageviews": 
			{
				if (PageViews != LatestPageViews)
				{
					changeType = ChangeType.Up;
				}
			}
			break;

		case "rating": 
			{
				if (Rating > LatestRating)
				{
					changeType = ChangeType.Down;
				}
				else
				{
					if (Rating < LatestRating)
					{
						changeType = ChangeType.Up;
					}
				}
			}
			break;

		case "votes": 
			{
				if (Votes != LatestVotes)
				{
					changeType = ChangeType.Up;
				}
			}
			break;

		case "popularity": 
			{
				if (Popularity > LatestPopularity)
				{
					changeType = ChangeType.Down;
				}
				else
				{
					if (Popularity < LatestPopularity)
					{
						changeType = ChangeType.Up;
					}
				}
			}
			break;

		case "bookmarks": 
			{
				if (Bookmarks > LatestBookmarks)
				{
					changeType = ChangeType.Down;
				}
				else
				{
					if (Bookmarks < LatestBookmarks)
					{
						changeType = ChangeType.Up;
					}
				}
			}
			break;

		case "lastupdated": 
			break;

		default : throw new Exception(
				string.Format("Unknown article property - '{0}'", 
						originalProperty));
	}

	return changeType;
}
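
For the numeric properties, the same up/down decision can be made by one generic helper instead of a branch per property. This is only a sketch under that assumption; the Direction enum (mirroring ChangeType) and the ChangeHelper class are hypothetical names.

```csharp
using System;

// Mirrors the article's ChangeType; the name here is hypothetical.
public enum Direction { None, Up, Down }

public static class ChangeHelper
{
    // Compares the previously scraped value with the latest one.
    public static Direction Compare<T>(T previous, T latest) where T : IComparable<T>
    {
        int order = latest.CompareTo(previous);
        if (order > 0) return Direction.Up;
        if (order < 0) return Direction.Down;
        return Direction.None;
    }
}
```

A call such as ChangeHelper.Compare(Rating, LatestRating) would then cover rating, popularity, and bookmarks alike. Note that HowChanged above treats any change in page views or votes as Up, so this helper is not a drop-in replacement for those two cases.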

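The ArticleScraper Class

To make things easier on myself, I put all of the scraping code in this class. The web page is requested and then carefully parsed. For the purposes of this article, I didn't bother with determining the category/subcategory an article is posted under.

The RetrieveArticles method is responsible for making the page request and for managing the parsing tasks, which are themselves broken down into manageable chunks. While testing the scraping code, I visited the My Articles page in a web browser and saved the source to a file. This let me test during the initial development of the parsing code without hitting CodeProject over and over. I decided to leave that code in the class so that other programmers could enjoy the same convenience. Here are the important parts (the text file named in the code is included in this article's download):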

	if (this.ArticleSource == ArticleSource.CodeProject)
	{
		// this code actually hits the codeproject website
		string url = string.Format("{0}{1}{2}", 
					   "https://codeproject.org.cn/script/",
					   "Articles/MemberArticles.aspx?amid=", 
					   this.UserID);
		Uri uri = new Uri(url);
		WebClient webClient = new WebClient();
		string response = "";
		try
		{
			// added proxy support for those that need it - many thanks to Pete 
			// O'Hanlon for pointing this out.
			webClient.Proxy = WebRequest.DefaultWebProxy;
			webClient.Proxy.Credentials = CredentialCache.DefaultCredentials;
			// get the web page
			response = webClient.DownloadString(uri);
		}
		catch (Exception)
		{
			// rethrow with "throw;" so the original stack trace is preserved
			throw;
		}
		pageSource = response;
	}
	else
	{
		// this code loads a sample page source from a local text file
		StringBuilder builder = new StringBuilder("");
		string filename = System.IO.Path.Combine(Application.StartupPath, 
							"MemberArticles.txt");
		StreamReader reader = null;
		try
		{
			reader = File.OpenText(filename);
			string input = null;
			while ((input = reader.ReadLine()) != null)
			{
				builder.Append(input);
			}
		}
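		catch (Exception)
		{
			// rethrow with "throw;" so the original stack trace is preserved
			throw;
		}
		finally
		{
			// reader is still null if File.OpenText threw
			if (reader != null)
			{
				reader.Close();
			}
		}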

		pageSource = builder.ToString();
	}

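Note - The line that builds the url string in the code above is formatted that way so that the containing <pre> tag doesn't force this article's page to scroll horizontally.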
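
The two loading paths can also be written with using blocks, which dispose the WebClient and the StreamReader even when an exception is thrown. The sketch below is one possible arrangement, not the code in the download; PageLoader, FromWeb, and FromFile are hypothetical names. WebClient is kept to match the article, although newer .NET versions mark it obsolete in favor of HttpClient.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageLoader
{
    // Downloads the page; the WebClient is disposed even if the request throws.
    public static string FromWeb(Uri uri)
    {
        using (WebClient client = new WebClient())
        {
            client.Proxy = WebRequest.DefaultWebProxy;
            if (client.Proxy != null)
            {
                client.Proxy.Credentials = CredentialCache.DefaultCredentials;
            }
            return client.DownloadString(uri);
        }
    }

    // Reads a saved copy of the page, for offline testing of the parser.
    public static string FromFile(string path)
    {
        using (StreamReader reader = File.OpenText(path))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Unlike the ReadLine loop above, ReadToEnd keeps the line breaks in the returned text; the cleanup step later in the article strips \r and \n anyway.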

Once the web page has been retrieved, the `pageSource` variable should contain something. If it does, we run the following code (still inside the `RetrieveArticles` method):

	int articleNumber = 0;
	bool found = true;

	while (found)
	{
		// establish our trigger points
		string articleStart = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS", 
					string.Format("{0:00}", articleNumber));
		// we use the beginning of the next article as the 
		// end of the current one
		string articleEnd   = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS", 
					string.Format("{0:00}", articleNumber + 1));

		// get the index of the start of the next article
		int startIndex = pageSource.IndexOf(articleStart);

		if (startIndex >= 0)
		{
			// delete everything that came before the starting index
			pageSource = pageSource.Substring(startIndex);
			startIndex = 0;

			// find the end of our articles data
			int endIndex = pageSource.IndexOf(articleEnd);

			// If we don't have an endIndex, then we've arrived 
			// at the final article in our list. 
			if (endIndex == -1)
			{
				endIndex = pageSource.IndexOf("<table");
				if (endIndex == -1)
				{
					endIndex = pageSource.Length - 1;
				}
			}

			// get the substring
			string data = pageSource.Substring(0, endIndex);

			// if we have data, process it
			if (data != "")
			{
				ProcessArticle(data, articleNumber);
			}
			else
			{
				found = false;
			}
			articleNumber++;
		}
		else
		{
			found = false;
		}
	} // while (found)

	CalculateAverages();
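
The marker-to-marker slicing in the loop above can be tried in isolation. The sketch below returns the chunks as a list rather than processing them in place; ArticleSplitter and its members are hypothetical names, and only the span id pattern is taken from the article's code.

```csharp
using System;
using System.Collections.Generic;

public static class ArticleSplitter
{
    // Start marker for article n, using the id pattern from the loop above.
    private static string Marker(int n)
    {
        return string.Format("<span id=\"ctl00_MC_AR_ctl{0:00}_MAS", n);
    }

    // Returns the raw HTML chunk for each article, in page order.
    public static List<string> Split(string page)
    {
        List<string> chunks = new List<string>();
        int start = page.IndexOf(Marker(0), StringComparison.Ordinal);
        for (int n = 0; start >= 0; n++)
        {
            // the next article's marker ends the current chunk
            int end = page.IndexOf(Marker(n + 1), start + 1, StringComparison.Ordinal);
            if (end < 0)
            {
                // last article: stop at the next table, or at the end of the page
                end = page.IndexOf("<table", start, StringComparison.Ordinal);
                if (end < 0) end = page.Length;
                chunks.Add(page.Substring(start, end - start));
                break;
            }
            chunks.Add(page.Substring(start, end - start));
            start = end;
        }
        return chunks;
    }
}
```

Feeding it a saved copy of the page makes it easy to check how many articles the markers actually find before any field parsing is attempted.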

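I suppose I could have used LINQ to search the page as XML, but at the end of the day we can't count on the HTML being valid, so parsing the text this way is more reliable. I know, Chris and the crew work hard to make sure everything is just right, but they're only human, and we know we can't count on humans to get it right every time.

Processing an Article

"Processing" means parsing the HTML and digging the actual data out of the article's div. It's fairly simple, but admittedly tedious. We start by getting the article's URL, which is a straightforward operation.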

private string GetArticleLink(string data)
{
	string result = data;
	// find the beginning of the desired text
	int hrefIndex = result.IndexOf("href=\"") + 6;
	//find the end of the desired text
	int endIndex = result.IndexOf("\">", hrefIndex);
	// snag it
	result = result.Substring(hrefIndex, endIndex - hrefIndex).Trim();
	// return it
	return result;
}
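
GetArticleLink assumes the chunk always contains an href. A slightly more defensive variant, sketched below with hypothetical names (LinkExtractor, FirstHref), returns null when no link is present instead of slicing at a bogus index. It also ends the value at the closing quote rather than at the `">` sequence, which is a small change in behavior.

```csharp
using System;

public static class LinkExtractor
{
    // Returns the first href value in the chunk, or null when none is present.
    public static string FirstHref(string html)
    {
        const string token = "href=\"";
        int tokenIndex = html.IndexOf(token, StringComparison.OrdinalIgnoreCase);
        if (tokenIndex < 0) return null;

        int valueStart = tokenIndex + token.Length;
        int valueEnd = html.IndexOf('"', valueStart);
        if (valueEnd < 0) return null;

        return html.Substring(valueStart, valueEnd - valueStart).Trim();
    }
}
```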

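Next, we clean up the data, starting with removing all of the HTML tags. Changes were made to the source code to make the HTML tag removal smarter. If an article title and/or description contains more than one pointy bracket, this method is almost guaranteed to return only part of the actual text of the item in question. If you like, you can google for (and use) one of the many exhaustive HTML parsers available on the web. IMHO, it isn't worth the effort given the main purpose of this class and the consistently decent HTML we get from CodeProject.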

private string RemoveHtmlTags(string data)
{
	int ltCount = CountChar(data, '<');
	int gtCount = CountChar(data, '>');

	// If the number of left and right pointy brackets are the same, we stand a 
	// reasonable chance that what we think are html tags really ARE html tags.
	if (ltCount == gtCount)
	{
		data = ForwardStrip(data);
	}
	else
	{
		// Otherwise, we have an errant pointy bracket, which we can almost 
		// always take care of depending on the order in which we search for 
		// tags.
		if (gtCount > ltCount)
		{
			data = BackwardStrip(ForwardStrip(data));
		}
		else
		{
			data = ForwardStrip(BackwardStrip(data));
		}
	}
	return data;
}


private int CountChar(string data, char value)
{
	int count = 0;
	for (int i = 0; i < data.Length; i++)
	{
		if (data[i] == value)
		{
			count++;
		}
	}
	return count;
}

private string ForwardStrip(string data)
{
	bool	found	= true;
	do
	{
		int tagStart = data.IndexOf("<");
		int tagEnd = data.IndexOf(">");
		if (tagEnd >= 0)
		{
			tagEnd += 1;
		}
		found = (tagStart >= 0 && tagEnd >= 0 && tagEnd-tagStart > 1);
		if (found)
		{
			string tag = data.Substring(tagStart, tagEnd - tagStart);
			data = data.Replace(tag, "");
		}
	} while (found);
	return data;
}

private string BackwardStrip(string data)
{
	bool	found	= true;
	do
	{
		int tagStart = data.LastIndexOf("<");
		int tagEnd = data.LastIndexOf(">");
		if (tagEnd >= 0)
		{
			tagEnd += 1;
		}
		found = (tagStart >= 0 && tagEnd >= 0 && tagEnd-tagStart > 1);
		if (found)
		{
			string tag = data.Substring(tagStart, tagEnd - tagStart);
			data = data.Replace(tag, "");
		}
	} while (found);
	return data;
}
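
A regular expression is another way to approach tag removal, and it is a different technique from the forward/backward stripping shown above. The sketch below is only an illustration with a hypothetical TagStripper class; it treats anything of the form `<...>` with no nested brackets as a tag, so text such as "a <b and c> d" would still lose its middle.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches '<', then one or more characters that are neither '<' nor '>', then '>'.
    private static readonly Regex Tag = new Regex("<[^<>]+>");

    // Removes every match of the pattern and leaves lone brackets alone.
    public static string Strip(string html)
    {
        return Tag.Replace(html, "");
    }
}
```

Because the pattern refuses to cross another '<', a stray bracket in a title (as in "a < b") is left in place while the real tags around it are removed.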

Then, we remove all of the remaining extras:

private string CleanData(string data)
{
	// get rid of the HTML tags
	data = RemoveHtmlTags(data);

	// get rid of the crap that's left behind
	data = data.Replace("\t", "^").Replace("&nbsp;", "");
	data = data.Replace("\n","").Replace("\r", "");
	data = data.Replace(" / 5", "");
	while (data.IndexOf("  ") >= 0)
	{
		data = data.Replace("  ", " ");
	}
	while (data.IndexOf("^ ^") >= 0)
	{
		data = data.Replace("^ ^", "^");
	}
	while (data.IndexOf("^^") >= 0)
	{
		data = data.Replace("^^", "^");
	}
	data = data.Substring(1);
	data = data.Substring(0, data.Length - 1);
	return data;
}
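
The loops that collapse repeated carets exist so that splitting on '^' yields no empty fields. A sketch of an alternative is to drop the empty pieces at split time instead; FieldSplitter is a hypothetical name, and the helper does not replace the tab and entity cleanup done in CleanData.

```csharp
using System;
using System.Collections.Generic;

public static class FieldSplitter
{
    // Splits on '^', trims each piece, and drops pieces that end up empty.
    public static string[] Split(string data)
    {
        List<string> fields = new List<string>();
        foreach (string piece in data.Split('^'))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0) fields.Add(trimmed);
        }
        return fields.ToArray();
    }
}
```

Field positions may differ from what CleanData produces, so any fixed index used afterwards (such as the one for the description) would need to be rechecked against real page data.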

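After this, we're left with a clean list of data, delimited by caret characters, that describes the article. All that remains is to create an ArticleUpdate item and store it in our generic list.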

private void ProcessArticle(string data, int articleNumber)
{
	string link	= GetArticleLink(data);
	data = CleanData(data);
	string[] parts = data.Split('^');
	string title = parts[0];
	string description = parts[7];
	string lastUpdated = GetDataField("Last Update", parts);
	string pageViews = GetDataField("Page Views", parts).Replace(",", "");
	string rating = GetDataField("Rating", parts);
	string votes = GetDataField("Votes", parts).Replace(",", "");
	string popularity = GetDataField("Popularity", parts);
	string bookmarks = GetDataField("Bookmark Count", parts);

	// create the ArticleUpdate item and add it to the list
	DateTime lastUpdatedDate;
	ArticleUpdate article = new ArticleUpdate();
	article.LatestLink = string.Format("https://codeproject.org.cn{0}", link);
	article.LatestTitle = title;
	article.LatestDescription = description;
	if (DateTime.TryParse(lastUpdated, out lastUpdatedDate))
	{
		article.LatestLastUpdated = lastUpdatedDate;
	}
	else
	{
		article.LatestLastUpdated = new DateTime(1990, 1, 1);
	}
	article.LatestPageViews		= Convert.ToInt32(pageViews);
	article.LatestRating		= Convert.ToDecimal(rating);
	article.LatestVotes		= Convert.ToInt32(votes);
	article.LatestPopularity	= Convert.ToDecimal(popularity);
	article.LatestBookmarks		= Convert.ToInt32(bookmarks);

	AddOrChangeArticle(article);
}
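
Convert.ToInt32 and Convert.ToDecimal use the current culture and throw on bad input, so a page that shows "4.72" can fail on a machine whose decimal separator is a comma. A culture-independent variant might look like the sketch below; NumberParser and its methods are hypothetical names, and the sample values in the comments are illustrative.

```csharp
using System.Globalization;

public static class NumberParser
{
    // Parses counts such as "140,894" without depending on the machine's locale.
    public static bool TryParseCount(string text, out int value)
    {
        NumberStyles styles = NumberStyles.AllowThousands
                            | NumberStyles.AllowLeadingWhite
                            | NumberStyles.AllowTrailingWhite;
        return int.TryParse(text, styles, CultureInfo.InvariantCulture, out value);
    }

    // Parses ratings such as "4.72", always with '.' as the decimal separator.
    public static bool TryParseRating(string text, out decimal value)
    {
        return decimal.TryParse(text, NumberStyles.Number, CultureInfo.InvariantCulture, out value);
    }
}
```

With AllowThousands, the comma-stripping Replace calls in ProcessArticle would no longer be needed, and a false return gives the caller a chance to skip a malformed article instead of aborting the whole scrape.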

private void AddOrChangeArticle(ArticleUpdate article)
{
	bool found = false;
	DateTime now = DateTime.Now;

	// apply changes
	for (int i = 0; i < m_articles.Count; i++)
	{
		ArticleUpdate item = m_articles[i];
		if (item.LatestTitle.ToLower() == article.LatestTitle.ToLower())
		{
			found = true;
			item.ApplyChanges(article, now, false);
			break;
		}
	}

	// if the article was not found, it must be new (or the title has changed), 
	// so we'll add it
	if (!found)
	{
		article.ApplyChanges(article, now, true);
		m_articles.Add(article);
	}

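	// NOTE: stale articles must not be pruned here. This method runs once per
	// scraped article, so pruning on every call would discard entries that the
	// current pass has not reached yet. Prune once, after the whole pass.
}

// Removes every article that the current pass did not touch. This helper was
// added for this write-up: call it after the while loop in RetrieveArticles,
// passing a DateTime captured just before that loop started.
private void RemoveStaleArticles(DateTime passStarted)
{
	// traverse the list in reverse order so removals don't shift the
	// indexes we still have to visit
	for (int i = m_articles.Count - 1; i >= 0; i--)
	{
		if (m_articles[i].TimeUpdated < passStarted)
		{
			m_articles.RemoveAt(i);
		}
	}
}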
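
Removing the entries that a scrape pass did not touch can also be expressed with List<T>.RemoveAll, which avoids index bookkeeping altogether. The sketch below uses hypothetical stand-ins (Entry for ArticleUpdate, StalePruner for the scraper) and assumes the pass start time is captured once, before any article is processed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for ArticleUpdate.
public class Entry
{
    public string Title;
    public DateTime TimeUpdated;
}

public static class StalePruner
{
    // Removes entries last touched before passStarted; returns how many were removed.
    public static int Prune(List<Entry> entries, DateTime passStarted)
    {
        return entries.RemoveAll(delegate(Entry e) { return e.TimeUpdated < passStarted; });
    }
}
```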

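Sample Application

The sample application is admittedly a rough affair and, frankly, is intended only to show one possible way to use the scraping code. I decided to use a WebBrowser control, but about halfway through the application I started to regret that decision. However, I was worried I'd get bored with the whole thing, so I decided to stick with it.

You'll notice that I didn't go out of my way to pretty everything up. For example, I used PNG files for the graphics instead of GIF files. This means that on systems running IE6 or earlier, transparency in the PNG files won't be handled correctly.

The application lets you choose which data to sort on and the sort direction (ascending or descending). The default is to sort by last-updated date in descending order, so that the newest articles appear first.

The WebBrowser control displays the articles in a table, and uses icons to indicate changed data and certain statistics about the articles. Each article title is a hyperlink to the actual article page, which is displayed in the WebBrowser control. To get back to the article list, you have to click the Sort button, because I didn't implement any of the back/forward functionality found in a normal web browser.

The icons used are as follows:

- Indicates a new article. When you first start the application, all articles are shown as new.

- Indicates the best-rated article.

- Indicates the worst-rated article.

- Indicates the article with the most votes.

- Indicates the article with the most page views.

- Indicates the most popular article.

- Indicates the article with the most bookmarks.

- Indicates that the associated field's value has increased.

- Indicates that the associated field's value has decreased.

Other controls on the form include:

Show New Info Only

This check box lets you filter the article list so that only new articles and articles with new data are shown.

Show Icons

This check box lets you turn the display of icons on and off.

Auto-Refresh

This check box lets you turn auto-refresh on and off. When it is on, a BackgroundWorker object refreshes the article data once every hour.

Button - Refresh From CodeProject

This button lets you refresh the article data manually (it remains available even when auto-refresh is turned on).

Finally, you can specify the user ID of the user whose data you want to retrieve. After specifying a new ID, click the refresh button.

Closing

This code is meant only for retrieving your own articles. The scraper class accepts a user ID as a parameter, and that ID is currently set to my own. Make sure you change the ID before you go looking for your own articles.

I've tried to make this as easy to maintain as possible without forcing the programmer through conceptual gymnastics, but I can't accommodate everyone's level of reading comprehension, so what I'm saying is: you're pretty much on your own. I can't promise that I'll keep this article maintained in a timely fashion, but that shouldn't matter. We're all programmers here, and what I've presented isn't rocket science. Besides, the provided classes give you plenty of examples for modifying and/or extending their functionality. Have a ball.

Also remember that the png files and the css file need to be in the same folder as the executable, or they won't be found.

History

February 19, 2010 (IMPORTANT!): There is a bug in the code that causes the program to always tell you that it can't retrieve the article information. After downloading the code, make the following change: in the file ArticleScraper.cs, find the line that looks like this (in the ProcessArticle() method):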

string rating = GetDataField("Rating", parts);

and change it to:

string rating = GetDataField("Rating", parts).Replace("/5", "");

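October 14, 2008: Addressed the following:

Added support for retrieving the web page through a proxy (thanks, Pete O'Hanlon!).
Added code to throw any exceptions encountered during web page retrieval (thanks again, Pete O'Hanlon!).
Added more thorough HTML parsing to handle errant < and > characters in article titles or descriptions (thanks, ChandraRam!).
Embedded the icons as resources in the exe file. They are copied to the application folder the first time the exe is run.
Added a new statistic at the top of the form: "Articles Displayed".
Put everything at the top of the form into group boxes to make it look more organized.

October 13, 2008: Addressed the following:

Added the forgotten mostvotes.png image.
Modified the code to use the mostvotes image.
Added a text box to the form that lets you specify the userID.
Fixed a form resizing problem.
The zip file now includes the debug folder with the images and the css file.

October 13, 2008: Original article posted.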