AJAX web scraping and interacting with digg






September 1, 2006
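Scrape a website whose content is generated by JavaScript to build a simple news viewer.
Introduction
Until now, interacting with websites that generate their content with JavaScript has been either very complicated or nearly impossible. This tutorial demonstrates how to use the WebRobot v1.1 component to interact with digg, a social bookmarking site that relies heavily on JavaScript to generate the content it displays and to interact with it. You can download the complete application here, and you can also download a free trial of the WebRobot v1.1 component, or here for .NET Framework 2.0 users.
First, we create an instance of the WebRobot component and enable AJAX mode: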
Private wrobot As New foxtrot.xray.WebRobot
Private Sub Form1_Load(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles MyBase.Load
wrobot.AJAX = True
End Sub
Private Sub Form1_Closing(ByVal sender As Object, _
ByVal e As System.ComponentModel.CancelEventArgs) _
Handles MyBase.Closing
wrobot.Dispose()
End Sub
We created an instance of WebRobot and enabled AJAX mode; in the form's Closing event, we call the Dispose method to release all resources. Now we log in to digg:
'Load the main digg page
wrobot.LoadPage("http://digg.com")
'Get the login form
Dim loginform As foxtrot.xray.Form = wrobot.GetFormByContainsAction("login")
'Username field
Dim userfield As foxtrot.xray.Input.Text = loginform.Fields(0)
'Password field
Dim pswdfield As foxtrot.xray.Input.Password = loginform.Fields(1)
'Submit button
Dim sbmtfield As foxtrot.xray.Input.Submit = loginform.Fields(3)
userfield.Value = username
pswdfield.Value = password
'Simulate a click on the submit button
sbmtfield.Click()
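After loading the main page and filling in the login form, we click the submit button. We could use WebRobot's SubmitForm method, but because this page may use JavaScript for its form and button events, it is safer to simulate the click directly so that any such code is interpreted. The Click method blocks until all actions have been executed and any required page navigation has completed.
Now we can start parsing the main page content to detect all the news items displayed. The WebRobot v1.1 component has an Element object and a FindElements method for filtering the page. The Element object also exposes a Click method, which allows clicking on elements found by parsing. Let's look for the news items: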
Dim newsitems As New System.Collections.ArrayList
'Get the list of DIV elements on the web page
Dim elements() As foxtrot.xray.Element = wrobot.FindElements("div")
For Each item As foxtrot.xray.Element In elements
'Remove the CR and LF characters at the start of the element that the
'digg html source contains
Dim text As String = item.Text.TrimStart(vbCrLf.ToCharArray()).ToLower
'Look for DIVs of news-summary class
If (text.IndexOf("<div class=news-summary") = 0) Then
newsitems.Add(item)
End If
Next
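We now have the DIVs that contain the news items. Note the use of the element's Text property, which the code above relies on to include the element's opening tag, to search for the DIV's class.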
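The IndexOf check above only matches when class is the first attribute and is written without quotes. As a minimal sketch, assuming the markup is available as a plain string, the class test could instead be done with a regular expression that accepts quoted or unquoted values in any attribute position. HasClass is a hypothetical helper, not part of WebRobot, and it uses only the standard .NET Regex class.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ClassMatchSketch
    ' Hypothetical helper (not a WebRobot API): returns True when markup
    ' starts with an opening <tagName> tag whose class attribute lists className.
    Function HasClass(ByVal markup As String, ByVal tagName As String, _
                      ByVal className As String) As Boolean
        ' The class value may be double-quoted, single-quoted, or unquoted
        Dim pattern As String = "^\s*<" & Regex.Escape(tagName) & _
            "\b[^>]*?\sclass\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))"
        Dim m As Match = Regex.Match(markup, pattern, RegexOptions.IgnoreCase)
        If Not m.Success Then Return False
        ' Only one of the three groups participates; the others are empty
        Dim value As String = m.Groups(1).Value & m.Groups(2).Value & m.Groups(3).Value
        ' A class attribute can hold several space-separated names
        For Each name As String In value.Split(" "c)
            If String.Compare(name, className, True) = 0 Then Return True
        Next
        Return False
    End Function

    Sub Main()
        ' Sample markup invented for illustration only
        Dim samples() As String = { _
            "<div class=news-summary>", _
            "<DIV id=""s1"" class=""news-summary clearfix"">", _
            "<div class=""news-body"">"}
        For Each sample As String In samples
            Console.WriteLine(sample & " -> " & HasClass(sample, "div", "news-summary").ToString())
        Next
    End Sub
End Module
```

The leading `^\s*` skips the CR and LF characters that the original code trims by hand, so the helper can be called on the untrimmed text.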
Now that we have the list of DIVs, we parse their contents:
For Each newsitem As foxtrot.xray.Element In newsitems
'Object to store parsed article info
Dim artinfo As New ArticleInfo
'Get the H3s in the item, to look for the title
Dim titledata() As foxtrot.xray.Element = newsitem.FindElements("H3")
'The first H3 contains the title, now find the A HREF containing
'the news link
Dim urldata() As foxtrot.xray.Element = titledata(0).FindElements("A")
'The first A HREF found contains the news link
Dim ahref As String = urldata(0).Text
'Regular expression to get the URL and the title of the story
Dim parser As New _
System.Text.RegularExpressions.Regex("href=""(.*)"".*>(.*)</", _
System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
System.Text.RegularExpressions.RegexOptions.Singleline)
'Store the URL and title
artinfo.URL = parser.Matches(ahref).Item(0).Groups.Item(1).Value
artinfo.Title = parser.Matches(ahref).Item(0).Groups.Item(2).Value
'More parsing code follows
.
.
.
Next
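One caveat with the pattern above: both groups use the greedy `(.*)`, so if the anchor carries any attribute after href, the first group runs on to the last double quote in the tag rather than stopping at the end of the URL. The sketch below shows one possible variant with lazy quantifiers. TryParseLink is a hypothetical helper name and the sample markup is invented; only the standard .NET Regex class is used.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LinkParsingSketch
    ' Hypothetical helper (not a WebRobot API): extracts the URL and the
    ' link text from the markup of a single anchor element.
    Function TryParseLink(ByVal markup As String, ByRef url As String, _
                          ByRef title As String) As Boolean
        ' (.*?) stops at the first closing quote and at the first "</";
        ' [^>]* skips any remaining attributes of the opening tag
        Dim parser As New Regex("href=""(.*?)""[^>]*>(.*?)</", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        ' Run the match once and reuse the result for both groups
        Dim m As Match = parser.Match(markup)
        If Not m.Success Then Return False
        url = m.Groups(1).Value
        title = m.Groups(2).Value
        Return True
    End Function

    Sub Main()
        Dim url As String = Nothing
        Dim title As String = Nothing
        ' Sample anchor with an extra attribute after href
        Dim sample As String = _
            "<a href=""http://example.com/story"" class=""title"">Example story</a>"
        If TryParseLink(sample, url, title) Then
            Console.WriteLine(url)
            Console.WriteLine(title)
        Else
            Console.WriteLine("No link found")
        End If
    End Sub
End Module
```

Checking Success before reading the groups also avoids an exception when an element does not contain a link in the expected shape. If the link text contains nested tags, the second group stops at the first closing tag, so this remains a sketch rather than a general HTML parser.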
We searched the DIV and found the story's URL and title. Now we find the number of diggs, the digg This! link, and the digg discussion link for each news item:
'The number of diggs is contained in a STRONG element. Find the one
'with an id that starts with diggs-strong-
Dim digginfo() As foxtrot.xray.Element = newsitem.FindElements("strong")
For Each item As foxtrot.xray.Element In digginfo
Dim text As String = item.Text.TrimStart(vbCrLf.ToCharArray()).ToLower
If (text.IndexOf("<strong id=diggs-strong-") = 0) Then
parser = New System.Text.RegularExpressions.Regex(">(.*)</", _
System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
System.Text.RegularExpressions.RegexOptions.Singleline)
'Store the diggs count
artinfo.Diggs = _
Integer.Parse(parser.Matches(text).Item(0).Groups.Item(1).Value)
End If
Next
'The digg this! link and the digg discussion links are stored in A HREFs
urldata = newsitem.FindElements("A")
For Each item As foxtrot.xray.Element In urldata
If (item.Text.IndexOf("digg it") > -1) Then
'If item contains digg it, it's the digg this! link.
'If the user has already dugg the item, this link will
'not be present. If present, we will store the Element
'object to simulate a click
artinfo.DiggLink = item
ElseIf (item.Text.IndexOf("class=more") > -1) Then
'If the A HREF class is more, then this is the digg discussion link
parser = New System.Text.RegularExpressions.Regex("href=""(.*)"".*>(.*)</", _
System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
System.Text.RegularExpressions.RegexOptions.Singleline)
artinfo.DiggMore = parser.Matches(item.Text).Item(0).Groups.Item(1).Value
End If
Next
'Create a new item for the main article ListView
Dim litem As New ListViewItem(artinfo.Title)
'Store the article info in the articlelist HashTable
articlelist(litem) = artinfo
ListView1.Items.Add(litem)
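The listings use an ArticleInfo class and an articlelist Hashtable without showing their declarations. A minimal sketch of what they might look like, inferred only from the members the code reads and writes, is given below. The field types are assumptions; in particular, DiggLink is declared As Object here so the sketch compiles without the component, whereas the article stores a foxtrot.xray.Element in it.

```vbnet
Imports System
Imports System.Collections

' Assumed shape of ArticleInfo, inferred from how the listings use it
Public Class ArticleInfo
    Public URL As String        ' link to the story itself
    Public Title As String      ' story title shown in the ListView
    Public Diggs As Integer     ' current digg count
    ' Holds the "digg it" element, or Nothing if the story was already dugg.
    ' Assumption: typed As Object here; the article assigns a foxtrot.xray.Element.
    Public DiggLink As Object
    Public DiggMore As String   ' link to the digg discussion page
End Class

Module ArticleListSketch
    ' Maps each list item to its ArticleInfo, as articlelist does in the article
    Private articlelist As New Hashtable

    Sub Main()
        ' A plain Object stands in for the ListViewItem used as the key
        Dim key As New Object
        Dim info As New ArticleInfo
        info.Title = "Example story"
        info.URL = "http://example.com/story"
        info.Diggs = 42
        articlelist(key) = info

        ' Hashtable returns Object, so cast back to ArticleInfo explicitly
        Dim found As ArticleInfo = DirectCast(articlelist(key), ArticleInfo)
        Console.WriteLine(found.Title & ": " & found.Diggs.ToString() & " diggs")
    End Sub
End Module
```

The explicit DirectCast keeps the lookup valid under Option Strict On; the article's listings rely on implicit conversion instead.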
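We have populated the form with the article information. Now we add code to launch a web browser instance with the story link when an item is double-clicked: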
Private Sub ListView1_DoubleClick(ByVal sender As Object, _
ByVal e As System.EventArgs) Handles ListView1.DoubleClick
'Are there any selected items?
If (ListView1.SelectedItems.Count > 0) Then
'Get the article info related to the selected item
Dim item As ListViewItem = ListView1.SelectedItems(0)
Dim artinfo As ArticleInfo = articlelist(item)
'Launch a new web browser instance with the URL
System.Diagnostics.Process.Start(artinfo.URL)
End If
End Sub
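Because artinfo.URL comes from scraped markup, it may be worth confirming that it is a web address before handing it to Process.Start, which will open whatever path or document it is given. The sketch below shows one possible check using the standard Uri class; IsWebUrl is a hypothetical helper name and is not part of the original application.

```vbnet
Imports System

Module UrlCheckSketch
    ' Hypothetical helper: True only for absolute http or https URLs
    Function IsWebUrl(ByVal candidate As String) As Boolean
        Dim parsed As Uri = Nothing
        ' TryCreate returns False instead of throwing on malformed input
        If Not Uri.TryCreate(candidate, UriKind.Absolute, parsed) Then Return False
        Return parsed.Scheme = Uri.UriSchemeHttp OrElse _
               parsed.Scheme = Uri.UriSchemeHttps
    End Function

    Sub Main()
        ' Sample inputs invented for illustration
        Console.WriteLine(IsWebUrl("http://digg.com"))
        Console.WriteLine(IsWebUrl("calc.exe"))
    End Sub
End Module
```

In the handler above, the call to Process.Start could then be wrapped in `If IsWebUrl(artinfo.URL) Then ... End If`.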
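Now we create a context menu to display when the user right-clicks an article. This context menu shows the number of diggs (in MenuItem1), lets the user digg the story (in MenuItem2), and launches a browser instance with the digg discussion (in MenuItem3). First, we add code to update the digg count and to reflect whether the news item has already been dugg: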
Private Sub ListView1_Click(ByVal sender As Object, _
ByVal e As System.EventArgs) Handles ListView1.Click
'Is there a selected item?
If (ListView1.SelectedItems.Count > 0) Then
'Enable the context menu
ListView1.ContextMenu = ContextMenu1
'Get the article info related to the selected item
Dim item As ListViewItem = ListView1.SelectedItems(0)
Dim artinfo As ArticleInfo = articlelist(item)
'Update digg count
MenuItem1.Text = artinfo.Diggs.ToString & " Diggs"
'Can we digg this item?
If (artinfo.DiggLink Is Nothing) Then
'Item already dugg
MenuItem2.Text = "Dugg!"
MenuItem2.Enabled = False
Else
'We can digg this item
MenuItem2.Text = "Digg this!"
MenuItem2.Enabled = True
End If
Else
'Disable the context menu
ListView1.ContextMenu = Nothing
End If
End Sub
Now we can add the code to digg a news item:
Private Sub MenuItem2_Click(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles MenuItem2.Click
'Is there a selected item?
If (ListView1.SelectedItems.Count > 0) Then
ListView1.ContextMenu = ContextMenu1
'Get the article info related to the selected item
Dim item As ListViewItem = ListView1.SelectedItems(0)
Dim artinfo As ArticleInfo = articlelist(item)
'Are we sure we can digg this item?
If Not (artinfo.DiggLink Is Nothing) Then
'Simulate a click on the digg this! link, which
'contains JavaScript code, but no valid HREF
artinfo.DiggLink.Click()
'Clear this item so that we cannot try to digg it
'again, update digg count, and update the user
'interface
artinfo.DiggLink = Nothing
artinfo.Diggs += 1
MenuItem2.Text = "Dugg!"
MenuItem2.Enabled = False
MenuItem1.Text = artinfo.Diggs.ToString & " Diggs"
End If
End If
End Sub
Finally, we add code to launch a browser window with the digg discussion link:
Private Sub MenuItem3_Click(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles MenuItem3.Click
'Are there any selected items?
If (ListView1.SelectedItems.Count > 0) Then
'Get the article info related to the selected item
Dim item As ListViewItem = ListView1.SelectedItems(0)
Dim artinfo As ArticleInfo = articlelist(item)
'Launch a new web browser instance with the digg discussion
System.Diagnostics.Process.Start(artinfo.DiggMore)
End If
End Sub
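We have interacted with digg, simulating a real user clicking on links. Short of a CAPTCHA, the web application has no way to tell whether a real user is in control.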