Using Information Retrieval Techniques to Replace Commercial Web Services





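In this article, I walk through an example of using information retrieval techniques to scrape data from the MSN Money page, which gives you a free "Web service" for currency exchange rates and other quotes.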
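Introduction
Tired of searching for a free Web service for currency exchange and stock information? Here is a solution. The idea came to me after I failed to find a good, free Web service for currency exchange rates and other stock information.
Information retrieval helps us get at the information buried in an HTML page and put it into a standard format that is easy to use.
In this example, I present a class named DataGrabber, which retrieves selected information from the MSN Money page. Let's look at the page first and identify the regions of interest in it.
I marked several items on this web page: the exchange rate, the direction arrow, the change, the change ratio, and the currency converter. These are the services that DataGrabber provides by parsing the page's HTML code. In addition, we should be able to read the page title, so that we can name the quote precisely and tell whether the symbol was not found.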
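Getting the Page HTML
First, we have to get the HTML of the page we need. The URL of the MSN page is: http://moneycentral.msn.com/detail/stock_quote?Symbol=SYMBOL&FormatAs=Index.
We only need to specify the symbol we want information about, and then use a WebClient object to send the request to MSN.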
'Requires: Imports System.IO and Imports System.Net
'Assuming the Symbol property is defined
Private Const URL As String = "http://moneycentral.msn.com/detail/stock_quote?" _
    & "Symbol={0}&FormatAs=Index"
Private _page As String 'Cached HTML of the page
Private Function GetPageCode() As String
    'Using blocks dispose the client, stream, and reader
    'even if an exception is thrown
    Using c As New WebClient()
        Using data As Stream = c.OpenRead(String.Format(URL, Symbol))
            Using reader As New StreamReader(data)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Using
End Function
'Use the following property to keep the HTML alive inside the object
Private ReadOnly Property Page() As String
    Get
        If _page Is Nothing Then _page = GetPageCode()
        Return _page
    End Get
End Property
'To refresh the data on the next query, just set _page to Nothing
Public Sub RefreshData()
    _page = Nothing
End Sub
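Now we have the HTML code we need. Let's start parsing.
Simple HTML Parsing Using String Functions
Before getting into the HTML parsing functions, let's review some basic functions of the String class. They are very useful in our case.
IndexOf(s As String): returns the index of the first occurrence of s, or -1 if s is not found.
IndexOf(s As String, startIndex As Integer): like the previous one, but starts searching at startIndex instead of at the beginning of the string.
LastIndexOf(s As String): returns the index of the last occurrence of s, or -1 if s is not found.
Substring(startIndex As Integer, length As Integer): returns the substring that starts at startIndex and is length characters long.
ToLower(): returns the whole string in lowercase.
Trim(): removes whitespace from the beginning and end of the string.
Trim(ParamArray trimChars() As Char): removes all of the specified characters from the beginning and end of the string.
StartsWith(s As String): returns True if the string starts with s.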
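As a quick illustration of the functions listed above, here is a minimal sketch that runs them against a small HTML fragment. The fragment and the helper name TitleOf are made up for this example and are not part of DataGrabber.

```vbnet
Imports System

Module StringFunctionsDemo
    ' Hypothetical helper that combines ToLower, IndexOf, Substring, and Trim
    Public Function TitleOf(ByVal html As String) As String
        Dim lower As String = html.ToLower()
        Dim i1 As Integer = lower.IndexOf("<title>")
        If i1 = -1 Then Return String.Empty
        i1 += "<title>".Length
        ' Start looking for the closing tag after the opening one
        Dim i2 As Integer = lower.IndexOf("</title>", i1)
        If i2 = -1 Then Return String.Empty
        Return html.Substring(i1, i2 - i1).Trim()
    End Function

    Sub Main()
        ' Made-up fragment, used only for illustration
        Dim html As String = "<HTML><Title> USD/EUR Quote </Title></HTML>"
        Console.WriteLine(html.ToLower().IndexOf("<title>"))   ' 6
        Console.WriteLine(html.ToLower().IndexOf("<body>"))    ' -1 (not found)
        Console.WriteLine(html.ToLower().LastIndexOf("</"))    ' 36
        Console.WriteLine(TitleOf(html))                       ' USD/EUR Quote
        Console.WriteLine("1.25%".Trim("%"c))                  ' 1.25
        Console.WriteLine(TitleOf(html).StartsWith("USD"))     ' True
    End Sub
End Module
```

Note that the indexes are found in the lowercased copy but applied to the original string, which works because both have the same length here.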
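Getting the Page Title
This read-only property parses the HTML code by searching for the <title></title> tags, in order to get the page title, which contains the details of the current quote.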
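Public ReadOnly Property Title() As String
    Get
        Dim i1, i2 As Integer
        'Search case-insensitively; IndexOf returns -1 if a tag is missing
        i1 = Page.IndexOf("<title>", StringComparison.OrdinalIgnoreCase)
        If i1 = -1 Then Return String.Empty
        'Remember to add the length of the tag itself
        'in order to get rid of it
        i1 += 7
        i2 = Page.IndexOf("</title>", i1, StringComparison.OrdinalIgnoreCase)
        If i2 = -1 Then Return String.Empty
        Return Page.Substring(i1, i2 - i1)
    End Get
End Property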
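Getting the Exchange Rate
The exchange rate value sits inside a span that has a CSS class named s1. So we search for this span and read the value stored in it, using the same approach as for the title.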
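Private Const S1 As String = "<span class=""s1"">" 'CSS class for rate
Public ReadOnly Property Rate() As Double
    Get
        Dim i1, i2 As Integer
        i1 = Page.IndexOf(S1, StringComparison.OrdinalIgnoreCase)
        If i1 <> -1 Then
            i1 += S1.Length
            i2 = Page.IndexOf("</span>", i1, StringComparison.OrdinalIgnoreCase)
        End If
        If i1 = -1 OrElse i2 = -1 Then
            Throw New InvalidOperationException("Rate span not found")
        End If
        'Parse with the invariant culture so "." is always the decimal separator
        Return Double.Parse(Page.Substring(i1, i2 - i1), _
            System.Globalization.CultureInfo.InvariantCulture)
    End Get
End Property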
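The span lookup above can be tried without hitting the network. The following is a small sketch under stated assumptions: the HTML fragment is invented, and TryReadSpanNumber is a hypothetical helper, not a member of DataGrabber. It returns False instead of throwing when the span is missing or does not hold a number.

```vbnet
Imports System
Imports System.Globalization

Module SpanValueDemo
    ' Hypothetical helper: reads the number inside the first
    ' <span class="..."> that has the given CSS class.
    Public Function TryReadSpanNumber(ByVal html As String, _
            ByVal cssClass As String, ByRef result As Double) As Boolean
        Dim openTag As String = "<span class=""" & cssClass & """>"
        Dim i1 As Integer = html.IndexOf(openTag, StringComparison.OrdinalIgnoreCase)
        If i1 < 0 Then Return False
        i1 += openTag.Length
        Dim i2 As Integer = html.IndexOf("</span>", i1, StringComparison.OrdinalIgnoreCase)
        If i2 < 0 Then Return False
        ' Invariant culture: "." is the decimal separator on every machine
        Return Double.TryParse(html.Substring(i1, i2 - i1).Trim(), _
            NumberStyles.Float Or NumberStyles.AllowThousands, _
            CultureInfo.InvariantCulture, result)
    End Function

    Sub Main()
        ' Made-up fragment, used only for illustration
        Dim html As String = "<td><span class=""s1"">1.3542</span></td>"
        Dim rate As Double
        If TryReadSpanNumber(html, "s1", rate) Then
            Console.WriteLine(rate.ToString(CultureInfo.InvariantCulture))
        Else
            Console.WriteLine("Rate not found")
        End If
    End Sub
End Module
```

Returning a Boolean is one possible design; it lets the caller decide how to treat a symbol whose page has no rate span.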
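Getting the Change and the Change Ratio
The change can be UP, DOWN, or UNCH (unchanged). Before trying to read the value, we have to know what we are looking for. We can find out by searching for images/up.gif or images/down.gif. If neither is present, we return 0 (unchanged).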
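'UP and DOWN hold the image paths mentioned above; S4 and S5 are span
'tags like S1. Their definitions are not shown in this article.
Public ReadOnly Property Change() As Double
    Get
        Dim s As String
        If Page.IndexOf(UP, StringComparison.OrdinalIgnoreCase) <> -1 Then
            'Change is UP
            s = S4
        ElseIf Page.IndexOf(DOWN, StringComparison.OrdinalIgnoreCase) <> -1 Then
            'Change is DOWN
            s = S5
        Else
            'No change
            Return 0
        End If
        Dim i1, i2 As Integer
        'The location of change span is just after rate span, so we
        'start searching from there by using S1 CSS class
        i1 = Page.IndexOf(s, Page.IndexOf(S1, StringComparison.OrdinalIgnoreCase), _
            StringComparison.OrdinalIgnoreCase) + s.Length
        i2 = Page.IndexOf("</span>", i1, StringComparison.OrdinalIgnoreCase)
        'Invariant culture keeps "." as the decimal separator
        Return Double.Parse(Page.Substring(i1, i2 - i1), _
            System.Globalization.CultureInfo.InvariantCulture)
    End Get
End Property
Public ReadOnly Property ChangeRatio() As Double
    Get
        Dim s As String
        If Page.IndexOf(UP, StringComparison.OrdinalIgnoreCase) <> -1 Then
            'Change is UP
            s = S4
        ElseIf Page.IndexOf(DOWN, StringComparison.OrdinalIgnoreCase) <> -1 Then
            'Change is DOWN
            s = S5
        Else
            'No change
            Return 0
        End If
        Dim i1, i2 As Integer
        'Change ratio is the last span of S4 or S5 in the page, so we
        'search for the last index of the tag.
        i1 = Page.LastIndexOf(s, StringComparison.OrdinalIgnoreCase) + s.Length
        i2 = Page.IndexOf("</span>", i1, StringComparison.OrdinalIgnoreCase)
        'Strip the trailing percent sign before parsing
        Return Double.Parse(Page.Substring(i1, i2 - i1).Trim("%"c), _
            System.Globalization.CultureInfo.InvariantCulture)
    End Get
End Property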
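Getting the List of Currencies with Exchange Rates
The last part of the code I will explain is the one that converts between currencies based on their exchange rates against the US dollar. This information is stored in JavaScript code rather than in HTML, which makes our parsing job easier.
First, let's look at the format in which the currency names, symbols, and values (relative to the US dollar) are stored. This is part of the long list you will find in the HTML code of the MSN Money page.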
curUSD2X['AED'] = new currency(0.272279262542725, 'Emirati Dirham');
curUSD2X['ARS'] = new currency(0.330906689167023, 'Argentine Peso');
curUSD2X['AUD'] = new currency(0.870776772499084, 'Australian Dollar');
curUSD2X['BHD'] = new currency(2.65561938285828, 'Bahraini Dinar');
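Clearly, we can split each line as follows. These constant strings can be used together with the IndexOf() and Substring() functions, as we saw earlier, to extract the values we need from each line.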
'Currency exchange rates are found in the following format in MSN's page
'where {0} is the symbol, {1} is the value, and {2} is the name
'curUSD2X['{0}'] = new currency({1}, '{2}');
Private Const CUR0 As String = "curUSD2X['"
Private Const CUR1 As String = "'] = new currency("
Private Const CUR2 As String = ", '"
Private Const CUR3 As String = "')"
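To make the role of these constants concrete, here is a minimal sketch that splits one line of the list shown above. The helper Between is hypothetical (it is not part of DataGrabber); it returns the text found between two markers, or Nothing if either marker is missing.

```vbnet
Imports System
Imports System.Globalization

Module CurrencyLineDemo
    Private Const CUR0 As String = "curUSD2X['"
    Private Const CUR1 As String = "'] = new currency("
    Private Const CUR2 As String = ", '"
    Private Const CUR3 As String = "')"

    ' Hypothetical helper: text between startMarker and the next endMarker
    Public Function Between(ByVal text As String, _
            ByVal startMarker As String, ByVal endMarker As String) As String
        Dim i1 As Integer = text.IndexOf(startMarker, StringComparison.Ordinal)
        If i1 < 0 Then Return Nothing
        i1 += startMarker.Length
        Dim i2 As Integer = text.IndexOf(endMarker, i1, StringComparison.Ordinal)
        If i2 < 0 Then Return Nothing
        Return text.Substring(i1, i2 - i1)
    End Function

    Sub Main()
        Dim line As String = _
            "curUSD2X['AED'] = new currency(0.272279262542725, 'Emirati Dirham');"
        Dim symbol As String = Between(line, CUR0, CUR1)
        Dim value As Double = Double.Parse(Between(line, CUR1, CUR2), _
            CultureInfo.InvariantCulture)
        Dim name As String = Between(line, CUR2, CUR3)
        ' Prints the symbol, the value, and the name separated by " | "
        Console.WriteLine("{0} | {1} | {2}", symbol, _
            value.ToString(CultureInfo.InvariantCulture), name)
    End Sub
End Module
```

Each marker doubles as the end of one field and the start of the next, which is why four constants are enough for three fields.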
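Note that this list does **not** appear in the page's HTML unless the symbol is an exchange rate symbol (e.g., /USDEUR, /ILSUSD); it is absent for other symbols (such as $INDU, MSFT, -CL).
To make the currency information easier to handle, we define the Currency class as follows.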
Public Class Currency
    Private _name As String
    Private _symbol As String
    Private _amount As Double
    Public Sub New(ByVal symbol As String, _
        ByVal amount As Double, ByVal name As String)
        Me.Symbol = symbol
        Me.Amount = amount
        Me.Name = name
    End Sub
    Public Property Name() As String
        Get
            Return _name
        End Get
        Set(ByVal value As String)
            _name = value
        End Set
    End Property
    Public Property Symbol() As String
        Get
            Return _symbol
        End Get
        Set(ByVal value As String)
            _symbol = value
        End Set
    End Property
    'Value of one unit of this currency in US dollars
    Public Property Amount() As Double
        Get
            Return _amount
        End Get
        Set(ByVal value As Double)
            _amount = value
        End Set
    End Property
    'Both amounts are relative to the US dollar, so dividing the
    'source value by the target value gives the cross rate
    Public Shared Function Convert(ByVal fromCur As Currency, _
        ByVal toCur As Currency, ByVal amount As Double) As Double
        Dim result As Double
        result = fromCur.Amount * (1 / toCur.Amount)
        Return result * amount
    End Function
End Class
'This comparer will help us in sorting...
Public Class CurrencyNameComparer
    Implements IComparer(Of Currency)
    Public Function Compare(ByVal x As Currency, ByVal y As Currency) _
        As Integer Implements System.Collections.Generic.IComparer(Of Currency).Compare
        Return String.Compare(x.Name, y.Name)
    End Function
End Class
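The arithmetic behind Convert can be checked with two values from the list shown earlier. This is a standalone sketch: CrossConvert is a hypothetical function that mirrors the formula without needing the Currency class, and both rates are taken to be the value of one unit of the currency in US dollars.

```vbnet
Imports System
Imports System.Globalization

Module CrossRateDemo
    ' Hypothetical standalone version of the conversion arithmetic.
    ' fromUsdValue and toUsdValue are the US dollar values of one unit
    ' of the source and target currencies.
    Public Function CrossConvert(ByVal fromUsdValue As Double, _
            ByVal toUsdValue As Double, ByVal amount As Double) As Double
        Return amount * fromUsdValue / toUsdValue
    End Function

    Sub Main()
        Dim aed As Double = 0.272279262542725   ' Emirati Dirham, from the list
        Dim aud As Double = 0.870776772499084   ' Australian Dollar, from the list
        ' 100 AED expressed in AUD
        Dim result As Double = CrossConvert(aed, aud, 100)
        Console.WriteLine(result.ToString("F2", CultureInfo.InvariantCulture))
    End Sub
End Module
```

With these sample values, 100 dirhams come out at about 31.27 Australian dollars. Converting a currency to itself returns the original amount, which is a quick sanity check for the formula.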
Back to DataGrabber: define the following property, which helps us determine the type of the symbol.
Public ReadOnly Property IsCurrency() As Boolean
    Get
        Return Symbol.StartsWith("/")
    End Get
End Property
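Finally, here is the function that walks through all of the currency rate lines in the page's HTML and returns a populated list of currencies.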
Public Function GetCurrencyList() As List(Of Currency)
    If Not IsCurrency Then
        Throw New ArgumentException("This works only for currency exchange symbols")
    End If
    Dim l As New List(Of Currency)
    Dim startIndex As Integer = Page.IndexOf(CUR0)
    'Return an empty list if the page contains no currency lines
    If startIndex = -1 Then Return l
    'The list ends at the closing brace of the enclosing script block
    Dim endIndex As Integer = Page.IndexOf("}", startIndex)
    If endIndex = -1 Then endIndex = Page.Length
    Dim s As String = Page.Substring(startIndex, endIndex - startIndex)
    Dim lines() As String = s.Split(";"c)
    For Each line As String In lines
        Dim str As String = line.Trim()
        If Not str.StartsWith(CUR0) Then Exit For
        Dim nam, sym As String
        Dim amt As Double
        Dim i1, i2 As Integer
        'Symbol: between CUR0 and CUR1
        i1 = str.IndexOf(CUR0) + CUR0.Length
        i2 = str.IndexOf(CUR1)
        sym = str.Substring(i1, i2 - i1)
        'Value: between CUR1 and CUR2, parsed with "." as decimal separator
        i1 = str.IndexOf(CUR1) + CUR1.Length
        i2 = str.IndexOf(CUR2)
        amt = Double.Parse(str.Substring(i1, i2 - i1), _
            System.Globalization.CultureInfo.InvariantCulture)
        'Name: between CUR2 and CUR3
        i1 = str.IndexOf(CUR2) + CUR2.Length
        i2 = str.IndexOf(CUR3)
        nam = str.Substring(i1, i2 - i1)
        Dim cur As New Currency(sym, amt, nam)
        l.Add(cur)
    Next
    l.Sort(New CurrencyNameComparer)
    Return l
End Function
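Wait! What If Microsoft Changes the HTML Code?
It has been obvious from the start that the whole solution is fragile: it breaks easily if Microsoft changes the HTML code of its MSN Money page. To reduce this risk, I tried to tie the parsing to CSS classes, which are more likely to stay the same; Microsoft may change the classes themselves, but is less likely to change their names, even if the overall design of the page changes. In any case, this is a general solution meant to convey the idea; you can try writing a more sophisticated one that uses regular expressions and parsing rules stored in an external file, so that you can keep up with future changes without rewriting or recompiling the code.
For more articles, visit my blog: http://vbnet4arab.blogspot.com (Arabic only).
Happy coding!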
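As a rough sketch of the regular-expression idea, the currency lines could be matched with a single pattern. Everything here is an assumption made for illustration: the pattern, the names CurrencyRule and ParseRates, and the choice of a dictionary as the result. The pattern is kept in a constant for brevity; in the scheme described above it would be read from an external file.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text.RegularExpressions

Module RegexRulesDemo
    ' Assumed rule for lines such as:
    '   curUSD2X['AED'] = new currency(0.27, 'Emirati Dirham');
    Public Const CurrencyRule As String = _
        "curUSD2X\['(?<sym>[^']+)'\]\s*=\s*new currency\((?<val>[0-9.]+),\s*'(?<name>[^']*)'\)"

    ' Returns a symbol-to-value map for every line the rule matches
    Public Function ParseRates(ByVal script As String, _
            ByVal rule As String) As Dictionary(Of String, Double)
        Dim rates As New Dictionary(Of String, Double)
        For Each m As Match In Regex.Matches(script, rule)
            rates(m.Groups("sym").Value) = Double.Parse( _
                m.Groups("val").Value, CultureInfo.InvariantCulture)
        Next
        Return rates
    End Function

    Sub Main()
        Dim script As String = _
            "curUSD2X['AED'] = new currency(0.272279262542725, 'Emirati Dirham');" & _
            Environment.NewLine & _
            "curUSD2X['BHD'] = new currency(2.65561938285828, 'Bahraini Dinar');"
        For Each pair As KeyValuePair(Of String, Double) In ParseRates(script, CurrencyRule)
            Console.WriteLine(pair.Key & " = " & _
                pair.Value.ToString(CultureInfo.InvariantCulture))
        Next
    End Sub
End Module
```

If the page format changed, only the rule text would need editing, which is the point of keeping it outside the compiled code.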