Adding features to a C# search engine/web spider

4.74/5 (22 votes)

24 May 2006

CPOL

17 min read

269,519 views

3,595 downloads

Adding advanced search engine features (and a persistent catalog) to the Searcharoo project

Sample image of Searcharoo: shows both the 'initial page' (bottom right) and the results view

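Background

This article follows on from the two previous Searcharoo samples:

Searcharoo Version 1 describes how to build a simple search engine that crawls the file system from a specified folder and indexes all HTML (or other known types of) documents. A basic design and object model was developed to support simple, single-word searches, whose results were displayed on a rudimentary query/results page.

Searcharoo Version 2 focused on adding a 'spider' that finds data to index by following web links (rather than only looking at directory listings in the file system). That meant downloading files over HTTP, parsing the HTML to find more links, and making sure we don't get stuck in a recursive loop, because many web pages refer to each other. That article also discussed how the results for multiple search words are combined into a single set of 'matches'.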

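Introduction

This article (Searcharoo Version 3) covers three main areas:

  1. Implementing a 'save to disk' function for the catalog
  2. Feature suggestions, bug fixes and incorporation of code contributed by others on the previous articles (mostly via CodeProject - thanks!)
  3. Improving the code itself (adding comments, moving classes, improving readability, and hopefully making it easier to modify and re-use)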

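New 'features' include:

  • Saving the catalog (which is held in memory for fast searching) to disk
  • Making the spider recognize and follow pages referenced in FRAMESETs and IFRAMEs (suggested by le_mo_mo)
  • Paging the results rather than listing them all on one page (submitted by Jim Harkins)
  • Normalizing words and numbers (removing punctuation, etc.)
  • (Optional) stemming English words to reduce the catalog size (suggested by Chris Taylor and Trickster)
  • (Optional) using stop words to reduce the catalog size
  • (Optional) creating a 'Go word' list, so that domain-specific words such as "C#", which would otherwise be ignored, are catalogued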
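
As a rough illustration of how the optional steps listed above might fit together, the sketch below passes a raw word through a Go word check, white list normalization, a stop word test and a stemmer. `TermPipeline`, `Prepare` and the sample word lists are hypothetical names made up for this example (they are not Searcharoo classes), the stemmer is simply passed in as a delegate, and the code assumes a newer C# compiler and framework than the 2006-era code in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Searcharoo): one possible order for the
// optional Go word, normalization, stop word and stemming steps.
public static class TermPipeline
{
    // Assumed sample lists; a real site would load these from configuration.
    static readonly HashSet<string> GoWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "c#", "asp.net" };
    static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "and", "that" };

    // Returns the term to store in the catalog, or null if the word is dropped.
    public static string Prepare(string raw, Func<string, string> stem)
    {
        string word = raw.Trim().ToLowerInvariant();

        // Go words get a free pass: no normalization, no stemming.
        if (GoWords.Contains(word)) return word;

        // White list: keep only letters and digits (simplified for the sketch).
        word = Regex.Replace(word, "[^a-z0-9]", "");

        // Drop one- and two-letter words and anything on the stop list.
        if (word.Length <= 2 || StopWords.Contains(word)) return null;

        // The stemmer is supplied by the caller (for example a Porter stemmer).
        return stem(word);
    }
}
```

With an identity stemmer, `TermPipeline.Prepare("\"People", w => w)` returns `people`, `Prepare("C#", w => w)` returns `c#`, and `Prepare("the", w => w)` returns `null`. Checking Go words first is what lets terms such as C# survive the white list.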

Bug fixes include:

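  • Correctly parse tags that may carry extra attributes, such as the ID= attribute in an ASP.NET environment. (Submitted by xenomouse)
  • Handle cookies that the server sets to track a 'session'. (Submitted by Simon Jones)
  • Check the 'final' URL after redirects so that the correct page is indexed and linked. (Submitted by Simon Jones)
  • Correctly parse (and obey!) the ROBOTS meta tag. (I found this bug myself)

<h4>Code layout improvements include</h4>
<ul>
<li>Moved the messy spider code out of SearcharooSpider.aspx into a proper C# class (and implemented an EventHandler so that progress can be monitored)</li>
<li>Encapsulated the preferences in a static class</li>
<li>Laid out Searcharoo.cs using #regions (easy to read if you have VS.NET)</li>
<li>Created a user control (Searcharoo.ASCX) for the search box, so re-branding only needs a change in one place</li>
<li>Implemented paging with PagedDataSource, so you can easily change the 'template' for the results (e.g. link size/colour/layout) in Searcharoo3.aspx</li>
</ul>
<h3>Design</h3>
<p>The <em>basic</em> Catalog-File-Word design (from Version 1) is unchanged, but this version implements a number of additional classes.</p>
<p><img alt="Object Model for Searcharoo3" src="https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_2.png" border="1"></p>
<p>To build the catalog, SearcharooSpider.aspx calls Spider.BuildCatalog(), which:</p>
<ol>
<li>Accesses the Preferences static object to read settings</li>
<li>Creates an empty Catalog</li>
<li>Creates the IGoWord, IStopper and IStemming implementations (based on Preferences)</li>
<li>Processes the startPageUri (using WebRequest)</li>
<li>Creates an HtmlDocument and populates its properties, including the Link collections</li>
<li>Parses the page content, creating Word and File objects as required</li>
<li>Recursively applies steps 4 to 6 to each LocalLink</li>
<li>Binary-serializes the Catalog to disk using CatalogBinder</li>
<li>Adds the Catalog to Application.Cache[], ready for Searcharoo3.aspx to search!</li>
</ol>
<h3>Code structure</h3>
<p>These are the files used in this version (included in the <a href="Searcharoo_3/Searcharoo3.zip">download</a>).</p>
<table cellspacing="0" width="600" border="1"> <tbody>
<tr> <th>web.config</th> <td>14 settings that control how the spider <em>and</em> the search page behave. All of them are 'optional' (i.e. the spider and the search page will run even if no config settings are provided), but I recommend providing at least<br /><code lang="xml"><add key="Searcharoo_VirtualRoot" value="https:///content/" /></code> </td> </tr>
<tr> <th>Searcharoo.cs</th> <td>Most of the application's code is in this file. Many classes that lived in the ASPX files in Version 2 (such as Spider and HtmlDocument) have been moved here, because that makes them easier to read and maintain. The new Version 3 features (Stop, Go, Stemming) are all added here.<br /><img alt="Searcharoo.cs contents viewed in VisualStudio.NET with regions collapsed" src="https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_3.gif" border="1"> </td> </tr>
<tr> <th>Searcharoo3.aspx</th> <td>The search page (input and results). Checks the Application Cache 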
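for a Catalog and, if none is found, creates one (either by deserializing it or by running SearcharooSpider.aspx).</td> </tr>
<tr> <th>Searcharoo.ascx</th> <td><strong>NEW</strong> user control containing two asp:Panels:<ul> <li>the 'blank' search box (when the page first loads; yellow background by default)</li> <li>the populated search box (when results are displayed; blue background by default)</li> </ul>(see the screenshot at the top of the article)</td> </tr>
<tr> <th>SearcharooSpider.aspx</th> <td>The main page (Searcharoo3.aspx) does a Server.Transfer to this page to create a new Catalog (if required).<br />Almost all of the code that was in this page in Version 2 has been moved to Searcharoo.cs; OnProgressEvent() lets it still show 'progress' messages while the spider runs.</td> </tr>
</tbody> </table>
<h2>Saving the Catalog to disk</h2>
<p>Saving the catalog to disk has several benefits:</p>
<ul>
<li>It can be built on a different server from the website (for small sites, the code may not have permission to write to disk on the web server)</li>
<li>If the server's Application restarts, the catalog can be reloaded rather than rebuilt from scratch</li>
<li>You can finally 'see' what information is stored in the catalog - useful for debugging!</li>
</ul>
<p>The Framework offers two kinds of serialization (XML and binary), and because XML is 'human readable' it seemed the logical one to try first. The code needed to serialize the Catalog is very simple - the code below comes from the Catalog.Save() method, so the reference to <strong>this</strong> is the Catalog object.</p>
<pre lang="cs">XmlSerializer serializerXml = new XmlSerializer( typeof( Catalog ) );
System.IO.TextWriter writer
    = new System.IO.StreamWriter( Preferences.CatalogFileName+".xml" );
serializerXml.Serialize( writer, this );
writer.Close();
</pre>
<p>The 'test dataset' I mostly use is the <a href="http://www.cia.gov/cia/publications/factbook/" target="_blank">CIA World Factbook</a> (<a href="http://www.cia.gov/cia/download.html">download</a>), which is about <strong>52.6 MB</strong> on disk (HTML only, excluding images and non-searchable data) - so imagine my 'surprise' when the XML-serialized Catalog turned out to be three times that size, at <strong>156 MB</strong> (yes, megabytes!). I could barely <em>open</em> it, other than by 'type'-ing it from the command prompt.</p>
<img height="410" alt="Xml Serialization is VERBOSE: 136 Mb!" 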
src="https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_136mb.gif" width="600">
<p>Ouch - what a waste of space! Worse, this was the first time I noticed that the fields defined in the File class were declared public rather than private (see the fields that start with `_`). First, let's remove the duplication from the serialized output (fields that should be private, plus their public property counterparts): rather than changing their visibility (and possibly breaking code), the `[XmlIgnore]` attribute can be added to their definitions. To cut the amount of repeated text further, the `[XmlElement]` attribute is used to shrink element names to a single letter, and to reduce the number of `<>` characters some properties are marked with `[XmlAttribute]` for serialization.</p>
<pre lang="cs">[Serializable]
public class Word
{
    [XmlElement("t")] public string Text;
    [XmlElement("fs")] public File[] Files
    ...
[Serializable]
public class File
{
    [XmlIgnore] public string _Url;
    ...
    [XmlAttribute("u")] public string Url { ...
    [XmlAttribute("t")] public string Title { ...
    [XmlElement("d")] public string Description { ...
    [XmlAttribute("d")] public DateTime CrawledDate { ...
    [XmlAttribute("s")] public long Size { ...
    ...
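</pre>
<p>The XML file is now a tiny (not!) <strong>49 MB</strong> - still too big to open in Notepad, but easy to view from cmd. As you can see below, 'compressing' the XML really does save space - at least the Catalog is now smaller than the source data!</p>
<img height="195" alt="Xml Serialization is slightly less verbose - 49 Mb" src="https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_49mb.gif" width="600">
<p>Even with the smaller output, 49 MB of XML is still too verbose to be practical (no surprise, XML usually is!), so let's serialize the index to a binary format instead (once again, the Framework classes make this very easy).</p>
<pre lang="cs">System.IO.Stream stream = new System.IO.FileStream
    (Preferences.CatalogFileName+".dat" , System.IO.FileMode.Create );
System.Runtime.Serialization.IFormatter formatter =
    new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
formatter.Serialize (stream, this);
stream.Close();
</pre>
<p>The result of switching to binary serialization is dramatic - the same catalog data is <strong>4.6 MB</strong> rather than roughly 150 MB! That is about 3% of the size of the XML, so this is definitely the way to go.</p>
<p>Now that the Catalog could be saved to disk successfully, it <em>seemed</em> that loading it back into memory and the Application Cache would be simple...</p>
<h2>Loading the Catalog from disk</h2>
<p>Unfortunately, it was not that simple. Whenever the Application restarted (for example after web.config or Searcharoo.cs was edited), the code could not deserialize the file and instead threw this cryptic error:</p>
<p><strong>Cannot find the assembly h4octhiw, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null</strong> </p>
<p><a href="Searcharoo_3/Searcharoo3_4L.png" target="_blank"><img alt="Load Binary Catalog Exception - click to see full screen" src="https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_4.png" border="1"></a></p>
<p>At first I was baffled - I did not have any assembly called <em>h4octhiw</em>, so it was not immediately obvious why it could not be found. There were a few hints, though:</p>
<ul>
<li>The 'not found' assembly appears to have a randomly generated name... and what do we know that uses randomly generated assembly names? The \Temporary ASP.NET Files\ directory, where dynamically compiled assemblies (from src="" and ASPX) are kept.</li>
<li>The error line only refers to the 'object' and 'stream' types - surely they cannot be the cause of the problem?</li>
<li>Reading the stack trace (from the bottom up, as always) (<a href="Searcharoo3_4L.png" target="_blank">click on the image</a>), you can infer that the Deserialize method creates a BinaryParser, which creates an ObjectMap containing a MemberNames array, which in turn asks for ObjectReader.GetType(), which triggers the GetAssembly() method... and that fails! Hmm - it sounds like it is looking for the types that were serialized - so why can't it find them?</li>
</ul>
<p>If your Googling skills are sharp, then rather than the dozens of useless links returned when you <a 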
href="http://www.google.com.au/search?q=ASP.NET+%22cannot+find+the+assembly%22">search for <em>ASP.NET "Cannot find the assembly"</em></a>, you may be lucky enough to stumble across this <a href="https://codeproject.org.cn/soap/Serialization_Samples.asp" target="_blank">CodeProject article on serialization</a>, where you will learn one very interesting fact:</p>
<blockquote style="BACKGROUND-COLOR: #fbedbb">Type information is also serialized when a class is serialized, so that the class can be deserialized using that type information. Type information consists of the namespace, class name, assembly name, culture information, assembly version, and public key token. As long as your deserializing class and the class that was serialized live in the same assembly, there is no problem. But if the serializer is in a separate assembly, .NET cannot find your class' type and so cannot deserialize it.</blockquote>
<p>But what does that mean? Every time the Web/IIS 'Application' restarts, all of your ASPX and src="" code is recompiled into a new, randomly named assembly in \Temporary ASP.NET Files\. So although the Catalog class is based on the same code, its <strong>type information</strong> (namespace, class name, <strong>assembly name</strong>, culture information, assembly version, and public key token) is different!</p>
<p>And, importantly, when a class is binary serialized its type information is stored along with it (aside: this does <em>not</em> happen with XML serialization, so we would probably have been fine if we had stuck with XML).</p>
<p>The upshot: after every recompile (whatever triggered it: a web.config change, a code change, an IIS restart, a machine reboot, etc.), our Catalog class has different type information. When it tries to load the previously saved serialized version, the types do not match, and the Framework cannot find the assembly that defined the earlier Catalog type (because it was only temporary and was deleted when the recompile took place).</p>
<h3>Custom formatter implementation</h3>
<p>Sound complex? It is a bit, but the whole 'temporary assemblies' business happens implicitly, and most developers never need to know or care much about it. Thankfully we don't have to worry too much either, because the <a href="https://codeproject.org.cn/soap/Serialization_Samples.asp" target="_blank">CodeProject article on serialization</a> also provides the solution: a helper class that 'tricks' the binary deserializer into using the 'current' Catalog type.</p>
<pre lang="cs">public class CatalogBinder: System.Runtime.Serialization.SerializationBinder
{
    public override Type BindToType (string assemblyName, string typeName)
    {
        // get the 'fully qualified (ie inc namespace) type name' into an array
        string[] typeInfo = typeName.Split('.');
        // because the last item is the class name, which we're going to
        // 'look for' in *this* namespace/assembly
        string className=typeInfo[typeInfo.Length -1];
        if (className.Equals("Catalog"))
        {
            return typeof (Catalog);
        }
        else if (className.Equals("Word"))
        {
            return typeof (Word);
        }
        else if (className.Equals("File"))
        {
            return typeof (File);
        }
        else
        {   // pass back exactly what was passed in!
            return Type.GetType(string.Format( "{0}, {1}", typeName, assemblyName));
        }
    }
}
</pre>
<p>Voila! Now the Catalog can be saved and loaded, and the search engine is more robust than before. You can save/back up the Catalog, turn on debug mode to view its contents, or even generate it on another machine (say, your local PC) and upload it to your web server!</p>
<p>Using the 'debug' XML-serialized file I could see the contents of the Catalog for the first time, and I found that a lot of 'garbage' was being stored, which both wasted memory/disk space and was useless/unsearchable. With the main task for this version complete, it seemed a good time to do some bug fixing and to add some 'real search engine' features to clean up the Catalog's contents.</p>
<h2>New features and bug fixes</h2>
<h4>FRAME and IFRAME support</h4>
<p>CodeProject member <strong>le_mo_mo</strong> pointed out that the spider was not following (and indexing) framed content. This needed only a small change to the regular expression that finds links - it previously supported `A` and `AREA` tags, so adding `FRAME` and `IFRAME` to the pattern was simple.</p>
<pre lang="cs">foreach (Match match in Regex.Matches(htmlData
    , @"(?<anchor><\s*(a|area|frame|iframe)\s*"
    + @"(?:(?:\b\w+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'"
    + @"|[^""'<> ]+)\s*)?)*)?\s*>)"
    , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture))
{</pre> 
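<p>To make the idea concrete, here is a minimal, self-contained sketch of pulling link targets out of `A`, `AREA`, `FRAME` and `IFRAME` tags. It is not the Searcharoo code: `LinkExtractor`, `Extract` and the simplified pattern are assumptions for illustration, and the pattern only handles quoted `href` or `src` values.</p>

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Searcharoo).
public static class LinkExtractor
{
    // Simplified pattern: the first quoted href or src attribute found
    // inside an a, area, frame or iframe tag.
    static readonly Regex LinkTag = new Regex(
        @"<\s*(?:a|area|frame|iframe)\b[^>]*?\b(?:href|src)\s*=\s*[""'](?<url>[^""']*)[""']",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns the raw attribute values in document order.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkTag.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

<p>Tags outside the list, such as `IMG`, are ignored even though they carry a `src` attribute. A real spider would still need to resolve relative URLs against the page address before following them.</p>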
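<h4>Stop words</h4>
<p>Let's start with <a href="http://www.google.com/support/bin/answer.py?answer=981">Google's definition of stop words</a>:</p>
<blockquote style="BACKGROUND-COLOR: #fbedbb">Google ignores common words and characters such as "where" and "how", as well as certain single digits and single letters. These terms rarely help narrow a search and can slow down search results. We call them "stop words".</blockquote>
<p>The basic premise is that we don't want to waste catalog space storing data that will never be used. The assumption behind 'stop words' is that you will never search for words like 'a in at I', because they appear on almost every page and so do not actually help you find anything!</p>
<p>Here is a <a href="http://libraries.mit.edu/tutorials/general/stopwords.html" target="_blank">basic definition</a> from MIT, along with some <a href="http://www.tbray.org/ongoing/When/200x/2003/07/11/Stopwords" target="_blank">interesting statistics and thoughts on stop words</a>, including the 'classic' stop word conundrum: should users be able to <a href="http://www.google.com/search?q=%22to+be+or+not+to+be%22">search for Shakespeare's soliloquy "to be or not to be"</a>?</p>
<p>The stop word code that ships with Searcharoo is very basic - it strips out ALL one- and two-letter words, plus</p>
<pre lang="text">the, and, that, you, this, for, but, with, are, have, was, out, not</pre>
<p>A more sophisticated implementation is left for others to contribute (or for a future version, whichever comes first).</p>
<h4>Word normalization</h4>
<p>I noticed that words were often being stored <em>together with</em> whatever punctuation was next to them in the source text. For example, the Catalog contained Files with Word instances such as</p>
<table cellspacing="0" width="350" border="1"> <tbody> <tr> <td>"People</td> <td>people</td> <td>people*</td> <td>people</td> </tr> </tbody> </table>
<p>This prevented pages containing those words from being returned in a search unless the user typed exactly the same punctuation - in the example above, a search for <strong>people</strong> would return only one page when you would expect it to return all four.<br />The previous version of Searcharoo did have a 'black list' of punctuation `[,./?;:()-=etc]`, but that was not enough because I could not predict every possible punctuation character. It was also implemented with the Trim() method, which does not remove punctuation from inside a word (aside: the handling of parenthesized words is still unsatisfactory in Version 3). The following 'white list' of characters that are allowed to be indexed ensures that no punctuation, apart from the comma and full stop that the pattern allows, is accidentally stored as part of a word.</p>
<pre lang="cs">key = System.Text.RegularExpressions.Regex.Replace(key, @"[^a-z0-9,.]"
    , ""
    , System.Text.RegularExpressions.RegexOptions.IgnoreCase);
</pre>
<p><strong style="COLOR: red">Culture note:</strong> this 'white list' approach to removing punctuation is very English-centric: it will strip at least some characters from most European languages, and it will strip ALL content in most Asian languages.<br />If you want to use Searcharoo with non-English character sets, find the line of code above and replace it with this 'black list' from Version 2. It allows more characters to be searched, but the results are more likely to be polluted by punctuation, which reduces searchability.</p>
<pre lang="cs">key = word.Trim (' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-').ToLower();</pre>
<h4>Number normalization</h4> 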
<p>Numbers are a special case of word normalization: some punctuation is needed to interpret a number (e.g. the decimal point) before it is converted to a proper number.<br />Although it is not perfect, this means that phone numbers written as 0412-345-678 or (04)123-45678 will both be stored by the Catalog as 0412345678, so <em>searching</em> for either 0412-345-678 or (04)123-45678 will match both source documents.</p>
<pre lang="cs">private bool IsNumber (ref string word)
{
    try
    {
        long number = Convert.ToInt64(word); //;int.Parse(word);
        // note: the round trip through Int64 drops any leading zeros
        word = number.ToString();
        return (word!=String.Empty);//true;
    }
    catch
    {
        return false;
    }
}
</pre>
<h4>Go words</h4>
<p>After reading the <strong>Word normalization</strong> section above, you can see why cataloguing and searching for technical terms/phrases such as C# or C++ is impossible - the non-alphanumeric characters are filtered out before they are catalogued.</p>
<p>To avoid this, Searcharoo allows a list of 'Go words' to be created. A 'Go word' is the opposite of a 'Stop word': instead of being blocked from the catalog, it gets a free pass into the catalog, bypassing the normalization and stemming code.</p>
<p>The weakness of this approach is that you must know in advance all the Go words your users might search for. In future you might want to store every unsuccessful search term for later analysis, and use it to extend your Go word list. The Go word implementation is <em>very</em> simple:</p>
<pre lang="cs">public bool IsGoWord (string word)
{
    switch (word.ToLower())
    {
        case "c#":
        case "vb.net":
        case "asp.net":
            return true;
    }
    return false;
}
</pre>
<h4>Stemming</h4>
<p>The most basic explanation of 'stemming' is that it attempts to identify 'related' words and return them in response to a query. The simplest example is plurals: a search for "field" should also find instances of "fields", and vice versa. More complex examples are "realize" and "realization", or "populate" and "population".</p>
<p>The article <a 
href="http://www.infotoday.com/searcher/may01/liddy.htm" target="_blank">How a Search Engine Works</a> briefly explains stemming and the other techniques described above.</p>
<p>The <a href="http://www.tartarus.org/martin/PorterStemmer/" target="_blank">Porter Stemming Algorithm</a> already exists as a <a href="http://www.tartarus.org/martin/PorterStemmer/csharp2.txt">C# class</a>, so it is used as-is in Searcharoo3 (credit and thanks to Martin Porter!).</p>
<h2>Effect on Catalog size</h2>
<p>The stop word, stemming and normalization steps above were all intended to 'tidy up' the Catalog and, hopefully, reduce its size and increase search speed. The results for our <a href="http://www.cia.gov/cia/publications/factbook/" target="_blank">CIA World Factbook</a> test data are listed below.</p>
<table id="Table1" cellspacing="0" border="1"> <tbody>
<tr> <th width="20%">Source<br />800 files<br />52.6 MB</th> <th width="20%">Raw *</th> <th width="20%">+ Stop words</th> <th width="20%">+ Stemming</th> <th width="20%">+ 'White list'<br />normalization</th> </tr>
<tr> <td>Unique words</td> <td align="center">30,415</td> <td align="center">30,068</td> <td align="center">26,560</td> <td align="center">26,050</td> </tr>
<tr> <td>XML serialized</td> <td align="center">156 MB ^</td> <td align="center">149 MB</td> <td align="center">138 MB</td> <td align="center">136 MB</td> </tr>
<tr> <td>Binary serialized</td> <td align="center">4.6 MB</td> <td align="center">4.5 MB</td> <td align="center">4.1 MB</td> <td align="center">4.0 MB</td> </tr>
<tr> <td>Binary as % of source</td> <td align="center">8.75%</td> <td align="center">8.55%</td> <td align="center">7.79%</td> <td align="center">7.60%</td> </tr>
</tbody> </table>
<p><em>* black list normalization, which is commented out in the code and mentioned in the 'Culture note'</em> <br /><em>^ 49 MB after 'compressing' the XML output with [Attributes]</em></p>
<p>The result is a 14% reduction in the number of words and a 13% reduction in the binary file size (mostly due to the addition of stemming). Because the whole Catalog is held in memory (in the Application Cache), keeping it small matters. Perhaps a future version could persist part of a 'working copy' to disk and so allow <em>really large</em> sites to be crawled, but for now the Catalog appears to take less than 10% of the size of the source data.</p>
<h2>...but what about the UI?</h2>
<p>The search user interface has also had some improvements:</p>
<ul>
<li>Moved the search input into the Searcharoo.ascx user control</li>
<li>Applied the same stemming, stop word and Go word parsing to the search terms as is applied during the crawl</li>
<li>Generated the results list with the new ResultFile class, which is used to build a DataSource that is bound to a Repeater control</li>
<li>Added a PagedDataSource and custom paging links instead of one long list of results (thanks to <a 
href="https://codeproject.org.cn/aspnet/spideroo.asp#xx927327xx">Jim Harkins' feedback/code</a> and <a href="http://www.uberasp.net/ArticlePrint.aspx?id=29">uberasp.net</a>)</li>
</ul>
<h4>ResultFile and SortedList</h4>
<p>In Version 2 the results output was very crude: the code was littered with `Response.Write` calls, which made the output hard to reformat. <a href="https://codeproject.org.cn/script/profile/whos_who.asp?id=1125453">Jim Harkins</a> posted some Visual Basic code, which is converted to C# below.</p>
<pre lang="cs">// build each result row
foreach (object foundInFile in finalResultsArray.Keys)
{
    // Create a ResultFile with its own Rank
    infile = new ResultFile ((File)foundInFile);
    infile.Rank = (int)((DictionaryEntry)finalResultsArray[foundInFile]).Value;
    sortrank = infile.Rank * -1000; // Assume not 'thousands' of results
    if (output.Contains(sortrank) )
    {   // rank exists - drop key index one number until it fits
        for (int i = 1; i < 999; i++)
        {
            sortrank++;
            if (!output.Contains (sortrank))
            {
                output.Add (sortrank, infile);
                break;
            }
        }
    }
    else
    {
        output.Add(sortrank, infile);
    }
    sortrank = 0; // reset for next pass
}
</pre>
<p>Jim's code does a bit of trickery with a new variable called 'sortrank' to try to keep the files in 'Searcharoo rank' order while still giving them unique keys in the output `SortedList`. If <em>thousands</em> of results were returned, you might run into trouble: the loop tries at most 998 alternative keys per rank, so once about a thousand files share a rank, further files are silently left out of the list.</p>
<h4>PagedDataSource</h4>
<p>Once the results are in the SortedList, they are assigned to a `PagedDataSource`, which is then bound to a Repeater control on Searcharoo3.aspx.</p>
<pre lang="cs">SortedList output = new SortedList 
(finalResultsArray.Count); // empty sorted list
...
pg.DataSource = output.GetValueList();
pg.AllowPaging = true;
pg.PageSize = Preferences.ResultsPerPage; // defaults to 10
pg.CurrentPageIndex = Request.QueryString["page"]==null?0:
    Convert.ToInt32(Request.QueryString["page"])-1;
SearchResults.DataSource = pg;
SearchResults.DataBind();
</pre>
<p>This makes it much easier to reformat the results list in any way you like!</p>
<pre lang="html"><asp:Repeater id="SearchResults" runat="server">
<HeaderTemplate>
    <p><%=NumberOfMatches%> results for <%=Matches%> took <%=DisplayTime%></p>
</HeaderTemplate>
<ItemTemplate>
    <a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"><b>
    <%# DataBinder.Eval(Container.DataItem, "Title") %></b></a>
    <a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"
        target="_blank" title="open in new window"
        style="font-size:x-small">↑</a>
    <font color=gray>(<%# DataBinder.Eval(Container.DataItem, "Rank") %>)
    </font>
    <br><%# DataBinder.Eval(Container.DataItem, "Description") %>...
<br><font color=green><%# DataBinder.Eval(Container.DataItem, "Url") %> - <%# DataBinder.Eval(Container.DataItem, "Size") %> bytes</font> <font color=gray>- <%# DataBinder.Eval(Container.DataItem, "CrawledDate") %></font><p> </ItemTemplate> <FooterTemplate> <p><%=CreatePagerLinks(pg, Request.Url.ToString() )%></p> </FooterTemplate> </asp:Repeater> </pre> <p>不幸的是,页面链接是通过 `CreatePagerLinks` 中的嵌入式 `Response.Write` 调用生成的……也许这将在未来的版本中进行模板化……</p> <h2>未来...</h2> <p>如果您查看下面的日期,您会注意到版本 2 和 3 之间几乎间隔了一年半,所以讨论另一个“未来”版本可能听起来很乐观 - 但您永远不知道……</p> <p>不幸的是,上面许多新功能都是针对英语语言的(尽管它们可以被禁用,以确保 Searcharoo 仍可用于其他语言的网站)。但是,在未来的版本中,我想尝试让代码在处理欧洲、亚洲和其他语言时更智能一些。</p> <p>用户还可以输入布尔 OR 搜索,或者用引号“ ”组合术语,就像 Google、Yahoo 等一样,那会很好。</p> <p>最后,为 HTML 以外的文档类型(主要是 PDF 等其他 Web 类型)建立索引对许多网站都会很有用。</p> <a name="aspnet2"></a> <h2>ASP.NET 2.0</h2> <p>Searcharoo3 几乎未经修改地运行在 ASP.NET 2.0 上 - 只需从 @Page 属性中移除 `<code lang="cs">src=<span class="cpp-string">"Searcharoo.cs"</span></code>`,然后将 *Searcharoo.cs* 文件移到 *App_Code* 目录中。</p> <p><a title="Click to enlarge" href="Searcharoo_3/ClassDiagram.png" target="_blank"><img height="566" alt="Visual Studio 2005 Class Diagram - click for larger view" src="https://cloudfront.codeproject.com/ip/searcharoo_3/classdiagram_600x566.png" width="600" border="0"></a></p> <p>Visual Studio.NET 内部 Web 服务器警告:Searcharoo_VirtualRoot 设置(爬虫开始查找要索引的页面)默认为 *https:///*。VS.NET 的内部 Web 服务器选择一个随机端口运行,所以如果您使用它来测试 Searcharoo,您可能需要相应地设置此 web.config 值。</p> <h2>历史</h2> <ul> <li>2004-06-30:<a href="Searcharoo.asp" target="_blank">版本 1</a> 在 CodeProject 上发布</li> <li>2004-07-03:<a href="Spideroo.asp" target="_blank">版本 2</a> 在 CodeProject 上发布</li> <li>2006-05-24:版本 3(本页)在 CodeProject 上发布</li> </ul> </div></div></div></div><div><div id="div-gpt-ad-1738591766860-0" style="min-width:300px;min-height:600px" class="mt-40 sticky top-40"></div></div></div><!--$--><!--/$--><!--$--><!--/$--></main><footer class="custom-container flex flex-col gap-3 px-2"><div class="mt-2 py-4 min-h-36 bg-primary"></div><div 
class="border-t-4 border-primary p-3 flex justify-between"><nav class="flex-1"><ul><li><a href="http://developermedia.com/" rel="nofollow noreferrer">广告</a></li><li><a href="/info/privacy.aspx">隐私</a></li><li><a href="/info/cookie.aspx">Cookie</a></li><li><a href="/info/TermsOfUse.aspx">使用条款</a></li></ul></nav><div class="flex-1"></div><div class="flex-1 text-gray text-sm text-right">版权所有 © <a href="mailto:webmaster@codeproject.com">CodeProject</a>, 1999-2025<br />保留所有权利。</div></div></footer><script src="#webpack-5ed466aeca77ac7a.js" async=""></script><script>(self.__next_f=self.__next_f||[]).push([0])</script><script>self.__next_f.push([1,"1:\"$Sreact.fragment\"\n2:I[387,[\"869\",\"static/chunks/869-095b567be01917e3.js\",\"177\",\"static/chunks/app/layout-1ea82ab3f50fdbe9.js\"],\"default\"]\n3:I[6874,[\"869\",\"static/chunks/869-095b567be01917e3.js\",\"394\",\"static/chunks/app/article/%5BarticleType%5D/%5BarticleId%5D/%5BarticleSlug%5D/page-1251365ce5bc4953.js\"],\"\"]\n4:I[3063,[\"869\",\"static/chunks/869-095b567be01917e3.js\",\"394\",\"static/chunks/app/article/%5BarticleType%5D/%5BarticleId%5D/%5BarticleSlug%5D/page-1251365ce5bc4953.js\"],\"Image\"]\n5:I[6932,[\"869\",\"static/chunks/869-095b567be01917e3.js\",\"394\",\"static/chunks/app/article/%5BarticleType%5D/%5BarticleId%5D/%5BarticleSlug%5D/page-1251365ce5bc4953.js\"],\"default\"]\n6:I[7231,[\"869\",\"static/chunks/869-095b567be01917e3.js\",\"177\",\"static/chunks/app/layout-1ea82ab3f50fdbe9.js\"],\"default\"]\n7:I[7555,[],\"\"]\n8:I[1295,[],\"\"]\na:I[9665,[],\"MetadataBoundary\"]\nc:I[9665,[],\"OutletBoundary\"]\nf:I[4911,[],\"AsyncMetadataOutlet\"]\n11:I[9665,[],\"ViewportBoundary\"]\n13:I[6614,[],\"\"]\n14:\"$Sreact.suspense\"\n15:I[4911,[],\"AsyncMetadata\"]\n:HL[\"/_next/static/media/945b7384c5256ec3-s.p.ttf\",\"font\",{\"crossOrigin\":\"\",\"type\":\"font/ttf\"}]\n:HL[\"/_next/static/media/b55e164a6e7ce445-s.p.ttf\",\"font\",{\"crossOrigin\":\"\",\"type\":\"font/ttf\"}]\n:HL[\"/_next/static/media/e
337cf18f0f81cb9-s.p.ttf\",\"font\",{\"crossOrigin\":\"\",\"type\":\"font/ttf\"}]\n:HL[\"/_next/static/media/f7794ce32483498f-s.p.ttf\",\"font\",{\"crossOrigin\":\"\",\"type\":\"font/ttf\"}]\n:HL[\"/_next/static/css/f00fbee724adea66.css\",\"style\"]\n"])</script><script>self.__next_f.push([1,"0:{\"P\":null,\"b\":\"KmxRTrlUit5clwfM6GRk0\",\"p\":\"\",\"c\":[\"\",\"Articles\",\"14207\",\"Adding-features-to-a-C-search-engine-web-spider\"],\"i\":false,\"f\":[[[\"\",{\"children\":[\"article\",{\"children\":[[\"articleType\",\"Articles\",\"d\"],{\"children\":[[\"articleId\",\"14207\",\"d\"],{\"children\":[[\"articleSlug\",\"Adding-features-to-a-C-search-engine-web-spider\",\"d\"],{\"children\":[\"__PAGE__\",{}]}]}]}]}]},\"$undefined\",\"$undefined\",true],[\"\",[\"$\",\"$1\",\"c\",{\"children\":[[[\"$\",\"link\",\"0\",{\"rel\":\"stylesheet\",\"href\":\"/_next/static/css/f00fbee724adea66.css\",\"precedence\":\"next\",\"crossOrigin\":\"$undefined\",\"nonce\":\"$undefined\"}]],[\"$\",\"html\",null,{\"lang\":\"en\",\"children\":[[\"$\",\"head\",null,{\"children\":[[\"$\",\"link\",null,{\"rel\":\"apple-touch-icon\",\"sizes\":\"144x144\",\"href\":\"/favicon/apple-touch-icon.png\"}],[\"$\",\"link\",null,{\"rel\":\"icon\",\"type\":\"image/png\",\"sizes\":\"32x32\",\"href\":\"/favicon/favicon-32x32.png\"}],[\"$\",\"link\",null,{\"rel\":\"icon\",\"type\":\"image/png\",\"sizes\":\"16x16\",\"href\":\"/favicon/favicon-16x16.png\"}],[\"$\",\"link\",null,{\"rel\":\"manifest\",\"href\":\"/favicon/manifest.json\"}],[\"$\",\"link\",null,{\"rel\":\"mask-icon\",\"href\":\"/favicon/safari-pinned-tab.svg\",\"color\":\"#ff9900\"}],[\"$\",\"meta\",null,{\"property\":\"og:title\",\"content\":\"CodeProject\"}],[\"$\",\"meta\",null,{\"property\":\"og:image\",\"content\":\"https://codeproject.org.cn/favicon/mstile-150x150.png\"}],[\"$\",\"meta\",null,{\"property\":\"og:description\",\"content\":\"For those who 
code\"}],[\"$\",\"meta\",null,{\"property\":\"og:url\",\"content\":\"https://codeproject.org.cn\"}],[\"$\",\"script\",null,{\"async\":true,\"src\":\"https://#/tag/js/gpt.js\"}],[\"$\",\"$L2\",null,{}]]}],[\"$\",\"body\",null,{\"className\":\"__className_2277ee\",\"children\":[[\"$\",\"header\",null,{\"children\":[[\"$\",\"div\",null,{\"className\":\"custom-container gap-2 py-1 text-gray-light flex justify-between whitespace-nowrap z-10 relative px-1 md:px-0\",\"children\":[[\"$\",\"div\",null,{\"className\":\"flex-1 md:ml-20\",\"children\":[[\"$\",\"span\",null,{\"className\":\"hidden md:block\",\"children\":[65.938,\" articles\"]}],[\"$\",\"span\",null,{\"className\":\"block md:hidden\",\"children\":[\"65.9\",\"K\"]}]]}],[\"$\",\"div\",null,{\"className\":\"flex-1\",\"children\":[\"CodeProject is changing.\",\" \",[\"$\",\"$L3\",null,{\"href\":\"/info/Changes.aspx\",\"className\":\"!text-gray-lightest\",\"children\":\"Read more.\"}]]}],[\"$\",\"div\",null,{\"className\":\"flex-1\"}],[\"$\",\"div\",null,{\"className\":\"flex-1\"}]]}],[\"$\",\"div\",null,{\"className\":\"mb-2 mt-2 md:mt-0\",\"children\":[\"$\",\"div\",null,{\"className\":\"h-[94px] bg-primary\",\"children\":[\"$\",\"div\",null,{\"className\":\"custom-container relative h-full\",\"children\":[[\"$\",\"$L3\",null,{\"href\":\"/\",\"children\":[\"$\",\"$L4\",null,{\"className\":\"absolute -top-[31px]\",\"src\":\"/logo250x135.gif\",\"priority\":true,\"alt\":\"Home\",\"width\":250,\"height\":135}]}],[\"$\",\"div\",null,{\"className\":\"w-[calc(100%-300px)] translate-x-[300px] translate-y-[2px] h-[90px] relative overflow-hidden\",\"children\":[\"$\",\"$L5\",null,{\"adUnit\":\"/67541884/CPLeader728\",\"adSlotId\":\"div-gpt-ad-1738591571151-0\",\"adSize\":[728,90],\"className\":\"absolute top-0 left-0\"}]}]]}]}]}],[\"$\",\"$L6\",null,{}]]}],[\"$\",\"main\",null,{\"className\":\"px-2 
py-10\",\"children\":[\"$\",\"$L7\",null,{\"parallelRouterKey\":\"children\",\"error\":\"$undefined\",\"errorStyles\":\"$undefined\",\"errorScripts\":\"$undefined\",\"template\":[\"$\",\"$L8\",null,{}],\"templateStyles\":\"$undefined\",\"templateScripts\":\"$undefined\",\"notFound\":[[[\"$\",\"title\",null,{\"children\":\"404: This page could not be found.\"}],[\"$\",\"div\",null,{\"style\":{\"fontFamily\":\"system-ui,\\\"Segoe UI\\\",Roboto,Helvetica,Arial,sans-serif,\\\"Apple Color Emoji\\\",\\\"Segoe UI Emoji\\\"\",\"height\":\"100vh\",\"textAlign\":\"center\",\"display\":\"flex\",\"flexDirection\":\"column\",\"alignItems\":\"center\",\"justifyContent\":\"center\"},\"children\":[\"$\",\"div\",null,{\"children\":[[\"$\",\"style\",null,{\"dangerouslySetInnerHTML\":{\"__html\":\"body{color:#000;background:#fff;margin:0}.next-error-h1{border-right:1px solid rgba(0,0,0,.3)}@media (prefers-color-scheme:dark){body{color:#fff;background:#000}.next-error-h1{border-right:1px solid rgba(255,255,255,.3)}}\"}}],[\"$\",\"h1\",null,{\"className\":\"next-error-h1\",\"style\":{\"display\":\"inline-block\",\"margin\":\"0 20px 0 0\",\"padding\":\"0 23px 0 0\",\"fontSize\":24,\"fontWeight\":500,\"verticalAlign\":\"top\",\"lineHeight\":\"49px\"},\"children\":404}],[\"$\",\"div\",null,{\"style\":{\"display\":\"inline-block\"},\"children\":[\"$\",\"h2\",null,{\"style\":{\"fontSize\":14,\"fontWeight\":400,\"lineHeight\":\"49px\",\"margin\":0},\"children\":\"This page could not be found.\"}]}]]}]}]],[]],\"forbidden\":\"$undefined\",\"unauthorized\":\"$undefined\"}]}],[\"$\",\"footer\",null,{\"className\":\"custom-container flex flex-col gap-3 px-2\",\"children\":[[\"$\",\"div\",null,{\"className\":\"mt-2 py-4 min-h-36 bg-primary\"}],[\"$\",\"div\",null,{\"className\":\"border-t-4 border-primary p-3 flex 
justify-between\",\"children\":[[\"$\",\"nav\",null,{\"className\":\"flex-1\",\"children\":[\"$\",\"ul\",null,{\"children\":[[\"$\",\"li\",null,{\"children\":[\"$\",\"a\",null,{\"href\":\"http://developermedia.com/\",\"rel\":\"nofollow noreferrer\",\"children\":\"Advertise\"}]}],[\"$\",\"li\",null,{\"children\":[\"$\",\"$L3\",null,{\"href\":\"/info/privacy.aspx\",\"children\":\"Privacy\"}]}],[\"$\",\"li\",null,{\"children\":[\"$\",\"$L3\",null,{\"href\":\"/info/cookie.aspx\",\"children\":\"Cookies\"}]}],[\"$\",\"li\",null,{\"children\":[\"$\",\"$L3\",null,{\"href\":\"/info/TermsOfUse.aspx\",\"children\":\"Terms of Use\"}]}]]}]}],[\"$\",\"div\",null,{\"className\":\"flex-1\"}],[\"$\",\"div\",null,{\"className\":\"flex-1 text-gray text-sm text-right\",\"children\":[\"Copyright © \",[\"$\",\"a\",null,{\"href\":\"mailto:webmaster@codeproject.com\",\"children\":\"CodeProject\"}],\", 1999-2025\",[\"$\",\"br\",null,{}],\"All Rights Reserved.\"]}]]}]]}]]}]]}]]}],{\"children\":[\"article\",[\"$\",\"$1\",\"c\",{\"children\":[null,[\"$\",\"$L7\",null,{\"parallelRouterKey\":\"children\",\"error\":\"$undefined\",\"errorStyles\":\"$undefined\",\"errorScripts\":\"$undefined\",\"template\":[\"$\",\"$L8\",null,{}],\"templateStyles\":\"$undefined\",\"templateScripts\":\"$undefined\",\"notFound\":\"$undefined\",\"forbidden\":\"$undefined\",\"unauthorized\":\"$undefined\"}]]}],{\"children\":[[\"articleType\",\"Articles\",\"d\"],[\"$\",\"$1\",\"c\",{\"children\":[null,[\"$\",\"$L7\",null,{\"parallelRouterKey\":\"children\",\"error\":\"$undefined\",\"errorStyles\":\"$undefined\",\"errorScripts\":\"$undefined\",\"template\":[\"$\",\"$L8\",null,{}],\"templateStyles\":\"$undefined\",\"templateScripts\":\"$undefined\",\"notFound\":\"$undefined\",\"forbidden\":\"$undefined\",\"unauthorized\":\"$undefined\"}]]}],{\"children\":[[\"articleId\",\"14207\",\"d\"],[\"$\",\"$1\",\"c\",{\"children\":[null,[\"$\",\"$L7\",null,{\"parallelRouterKey\":\"children\",\"error\":\"$undefined\",\"errorStyles\":
\"$undefined\",\"errorScripts\":\"$undefined\",\"template\":[\"$\",\"$L8\",null,{}],\"templateStyles\":\"$undefined\",\"templateScripts\":\"$undefined\",\"notFound\":\"$undefined\",\"forbidden\":\"$undefined\",\"unauthorized\":\"$undefined\"}]]}],{\"children\":[[\"articleSlug\",\"Adding-features-to-a-C-search-engine-web-spider\",\"d\"],[\"$\",\"$1\",\"c\",{\"children\":[null,[\"$\",\"$L7\",null,{\"parallelRouterKey\":\"children\",\"error\":\"$undefined\",\"errorStyles\":\"$undefined\",\"errorScripts\":\"$undefined\",\"template\":[\"$\",\"$L8\",null,{}],\"templateStyles\":\"$undefined\",\"templateScripts\":\"$undefined\",\"notFound\":\"$undefined\",\"forbidden\":\"$undefined\",\"unauthorized\":\"$undefined\"}]]}],{\"children\":[\"__PAGE__\",[\"$\",\"$1\",\"c\",{\"children\":[\"$L9\",[\"$\",\"$La\",null,{\"children\":\"$Lb\"}],null,[\"$\",\"$Lc\",null,{\"children\":[\"$Ld\",\"$Le\",[\"$\",\"$Lf\",null,{\"promise\":\"$@10\"}]]}]]}],{},null,false]},null,false]},null,false]},null,false]},null,false]},null,false],[\"$\",\"$1\",\"h\",{\"children\":[null,[\"$\",\"$1\",\"D7xd_uZcc-oE_LcDfBMoF\",{\"children\":[[\"$\",\"$L11\",null,{\"children\":\"$L12\"}],[\"$\",\"meta\",null,{\"name\":\"next-size-adjust\",\"content\":\"\"}]]}],null]}],false]],\"m\":\"$undefined\",\"G\":[\"$13\",\"$undefined\"],\"s\":false,\"S\":false}\n"])</script><script>self.__next_f.push([1,"b:[\"$\",\"$14\",null,{\"fallback\":null,\"children\":[\"$\",\"$L15\",null,{\"promise\":\"$@16\"}]}]\ne:null\n12:[[\"$\",\"meta\",\"0\",{\"charSet\":\"utf-8\"}],[\"$\",\"meta\",\"1\",{\"name\":\"viewport\",\"content\":\"width=device-width, initial-scale=1\"}]]\nd:null\n"])</script><script>self.__next_f.push([1,"16:{\"metadata\":[[\"$\",\"title\",\"0\",{\"children\":\"Adding features to a C# search engine/web spider - 
CodeProject\"}],[\"$\",\"link\",\"1\",{\"rel\":\"manifest\",\"href\":\"/manifest.json\",\"crossOrigin\":\"$undefined\"}]],\"error\":null,\"digest\":\"$undefined\"}\n10:{\"metadata\":\"$16:metadata\",\"error\":null,\"digest\":\"$undefined\"}\n"])</script><script>self.__next_f.push([1,"17:I[9741,[\"869\",\"static/chunks/869-095b567be01917e3.js\",\"394\",\"static/chunks/app/article/%5BarticleType%5D/%5BarticleId%5D/%5BarticleSlug%5D/page-1251365ce5bc4953.js\"],\"default\"]\n19:I[6947,[\"869\",\"static/chunks/869-095b567be01917e3.js\",\"394\",\"static/chunks/app/article/%5BarticleType%5D/%5BarticleId%5D/%5BarticleSlug%5D/page-1251365ce5bc4953.js\"],\"default\"]\n18:T706f,"])</script><script>self.__next_f.push([1,"/*\n This is the legacy article critical styles\n They are added in order to reduce mvp development time\n They should be refactored or completely replaced with nextjs css modules in future\n*/\n\n::selection {\n background-color: #f90;\n color: #fff;\n text-shadow: none;\n}\n.blank-background {\n background-color: white;\n}\nhtml,\ndiv,\nspan,\napplet,\nobject,\niframe,\na,\nabbr,\nacronym,\nbig,\ncite,\ncode,\ndel,\ndfn,\nem,\nfont,\nimg,\nins,\nkbd,\nq,\ns,\nsamp,\nsmall,\nstrike,\nstrong,\nsub,\nsup,\ntt,\nvar,\nfieldset,\nform,\nlabel,\ntable,\ntbody,\ntfoot,\nthead,\ntr,\nth,\ntd,\nli,\nol,\nul {\n margin: 0;\n padding: 0;\n border: 0;\n}\nhtml {\n font-size: 16px;\n -webkit-font-smoothing: antialiased;\n font-smooth: always;\n}\nbody,\np,\nh1,\nh2,\nh3,\nh4,\nh5,\nh6,\nli,\ntr,\ntd,\nth,\ndd,\ndt {\n font-size: 16px;\n line-height: 1.4;\n color: #111111;\n}\nbody {\n margin: 0;\n}\nh1,\nh3,\nh4,\nh5,\nth {\n font-weight: bold;\n}\nh1 {\n color: #333333;\n padding: 0px;\n margin: 0 0 7px;\n text-align: left;\n}\nh2 {\n margin: 20px 0 11px;\n padding: 0;\n padding-bottom: 10px;\n color: #333333;\n}\nh3 {\n color: #ff9900;\n}\nh1 {\n font-size: 38px;\n font-weight: 400;\n}\nh2 {\n font-size: 29px;\n font-weight: 400;\n}\nh3 {\n font-size: 19px;\n 
font-weight: normal;\n}\nh4 {\n font-size: 17px;\n}\npre {\n color: black;\n background-color: #fbedbb;\n padding: 6px;\n font: 14px Consolas, monospace, mono;\n white-space: pre;\n overflow: auto;\n border: solid 1px #fbedbb;\n -moz-tab-size: 4;\n -o-tab-size: 4;\n -webkit-tab-size: 4;\n tab-size: 4;\n}\ncode {\n color: #990000;\n font: 15px Consolas, monospace, mono;\n}\ntable {\n background-color: Transparent;\n}\nimg {\n -ms-interpolation-mode: bicubic;\n}\na {\n text-decoration: none;\n color: #005782;\n}\na:visited {\n color: #800080;\n}\na:hover {\n text-decoration: underline;\n}\na:not([href]) {\n color: inherit;\n text-decoration: none;\n}\ninput[type=\"text\"],\ninput[type=\"url\"],\ninput[type=\"search\"],\ninput[type=\"email\"],\ninput[type=\"number\"],\ninput[type=\"password\"],\nselect,\ntextarea {\n border: 1px solid #d7d7d7;\n font-size: 16px;\n padding: 5px;\n}\na.button,\na.button-large,\n.button,\n.button-large {\n color: white;\n background-color: #e08900;\n border: 1px solid #cccccc;\n text-decoration: none;\n white-space: nowrap;\n font-size: 100%;\n padding: 4px;\n cursor: pointer;\n border-radius: 3px;\n -webkit-border-radius: 3px;\n -moz-border-radius: 3px;\n}\ntable.small-text td,\nul.small-text li,\nol.small-text li,\n.small-text {\n font-size: 14px;\n}\n.invisible {\n display: none;\n}\n.subdue,\n.subdue li,\ntr.subdue td {\n color: #808080;\n}\n.bold {\n font-weight: bold;\n}\n.align-left {\n text-align: left;\n}\n.align-right {\n text-align: right;\n}\n.align-center {\n text-align: center;\n}\n.float-right {\n float: right;\n}\n.float-left {\n float: left;\n}\n.sticky {\n position: sticky;\n top: 0;\n}\n.extended {\n width: 100%;\n box-sizing: border-box;\n}\n.padded-top {\n padding-top: 20px;\n}\n.tight {\n margin: 0px;\n padding: 0px;\n}\n.nowrap {\n white-space: nowrap;\n}\n.fixed-layout {\n table-layout: fixed;\n}\n.clip-text {\n text-overflow: ellipsis;\n overflow: hidden;\n white-space: nowrap;\n}\n.raised {\n background-color: 
#fff8df;\n border: 1px solid #cccccc;\n -moz-box-shadow: 4px 4px 16px 1px rgba(0, 0, 0, 0.25);\n -webkit-box-shadow: 4px 4px 16px 1px rgba(0, 0, 0, 0.25);\n box-shadow: 4px 4px 16px 1px rgba(0, 0, 0, 0.25);\n}\nol,\nul {\n padding-left: 40px;\n margin: 10px 0;\n}\nol.compact li,\nul.compact li,\nol li.compact,\nul li.compact {\n font-size: 14px;\n}\nul.download {\n margin-top: 25px;\n}\nul.download li,\nul li.download,\nul.Download li,\nul li.Download {\n background: url(\"/images/download24.png\") no-repeat scroll left center\n transparent;\n font-weight: bold;\n list-style-type: none;\n margin: 0px 0 6px -40px;\n padding: 0 0 1px 30px;\n vertical-align: middle;\n}\n.callout {\n margin: 20px 0;\n background-color: #ffffe1;\n color: #333333;\n border: 1px solid #cccccc;\n padding: 15px;\n border-radius: 3px;\n -webkit-border-radius: 3px;\n -moz-border-radius: 3px;\n}\n.trace {\n padding: 20px;\n background-color: #eeeeee;\n color: #333333;\n border: 1px solid red;\n font-size: 13px;\n}\ntextarea,\ninput[type=\"text\"],\ninput[type=\"button\"],\ninput[type=\"submit\"] {\n -webkit-appearance: none;\n border-radius: 0;\n}\n.container-content {\n background-color: white;\n position: relative;\n zoom: 1;\n padding: 0 9px;\n cursor: default;\n}\n.container-content-wrap {\n margin: auto;\n max-width: 1270px;\n}\n.container-content pre,\n.container-code pre,\n.answer pre {\n white-space: pre-wrap;\n /* css-3 */\n white-space: -moz-pre-wrap;\n /* Mozilla, since 1999 */\n white-space: -pre-wrap;\n /* Opera 4-6 */\n white-space: -o-pre-wrap;\n /* Opera 7 */\n word-wrap: break-word;\n /* Internet Explorer 5.5+ */\n _white-space: pre;\n /* IE only hack to re-specify in addition to word-wrap */\n word-break: break-word;\n -ms-word-break: break-word;\n}\n.flex-container {\n display: -webkit-box;\n /* OLD - iOS 6-, Safari 3.1-6 */\n display: -moz-box;\n /* OLD - Firefox 19- (buggy but mostly works) */\n display: -ms-flexbox;\n /* TWEENER - IE 10 */\n display: -webkit-flex;\n /* 
NEW - Chrome */\n display: flex;\n /* NEW, Spec - Opera 12.1, Firefox 20+ */\n}\n.flex-extend {\n justify-content: space-between;\n}\n.flex-wrap {\n flex-wrap: wrap;\n}\n.flex-item {\n -webkit-box-flex: 1;\n /* OLD - iOS 6-, Safari 3.1-6 */\n -moz-box-flex: 1;\n /* OLD - Firefox 19- */\n -webkit-flex: 1;\n /* Chrome */\n -ms-flex: 1;\n /* IE 10 */\n flex: 1;\n}\n.flex-item-tight {\n flex: 0 1 auto;\n}\n.hover-container {\n display: block;\n position: relative;\n}\n.clearfix:after,\n.container:after {\n display: block;\n content: \".\";\n visibility: hidden;\n height: 0px;\n clear: both;\n}\n.clearfix,\n.container {\n display: inline-block;\n /* Mac IE5 sees this */\n /* Mac IE5 comment hack \\*/\n display: block;\n /* Mac IE5 doesn't see this, but everyone else does */\n}\n.access-link,\n.access-link img {\n position: absolute;\n top: 0px;\n left: 0px;\n width: 1px;\n height: 1px;\n z-index: 101;\n border-style: none;\n margin-top: -1px;\n overflow: hidden;\n}\n.site-top-menu {\n white-space: nowrap;\n position: absolute;\n z-index: 101;\n width: 100%;\n}\n.site-top-menu .main-content {\n width: 100%;\n}\n.site-top-menu .main-content .memberbar {\n margin-left: 90px;\n margin-right: 10px;\n}\n.site-top-menu.fixed .main-content {\n margin: auto;\n max-width: 1270px;\n}\n.site-header {\n background-image: url(\"/App_Themes/CodeProject/Img/logo135-bg.gif\");\n white-space: nowrap;\n overflow: hidden;\n}\n.site-header .main-content {\n position: relative;\n overflow: hidden;\n white-space: nowrap;\n}\n.site-header .logo {\n display: inline-block;\n}\n.site-header .promo {\n display: inline-block;\n position: absolute;\n top: 33px;\n right: 0;\n}\n.site-header.fixed .main-content {\n margin: auto;\n max-width: 1270px;\n}\n.sub-headerbar {\n padding-right: 9px;\n position: relative;\n margin: auto;\n max-width: 1270px;\n}\n.sub-headerbar-divider {\n margin-left: 10px;\n height: 1px;\n border-bottom: 1px solid #cccccc;\n position: absolute;\n bottom: 2px;\n left: 0px;\n 
right: 9px;\n}\n.memberbar {\n height: 25px;\n padding-top: 10px;\n color: #999999;\n font-size: 14px;\n}\n.memberbar a {\n color: #808080;\n font-size: 14px;\n}\ndiv.navbar {\n white-space: nowrap;\n}\n.navmenu {\n background: white;\n color: #4d4d4d;\n padding: 0px;\n margin: 0px;\n list-style: none;\n height: 25px;\n}\n.navmenu ul,\n.navmenu li {\n margin: 0;\n padding: 0;\n}\n.navmenu .has-submenu {\n position: absolute;\n right: 5px;\n padding-left: 10px;\n}\n.navmenu \u003e li:hover \u003e a,\n.navmenu \u003e li \u003e a:active {\n border: 1px solid #cccccc;\n}\n.navmenu ul,\n.navmenu \u003e li.open:hover \u003e a,\n.navmenu \u003e li.open \u003e a:active {\n border: 1px solid #cccccc;\n border-bottom-color: white;\n}\n.navmenu \u003e li {\n margin: 0 11px 2px 2px;\n}\n.navmenu \u003e li:active,\n.navmenu \u003e li:active \u003e a,\n.navmenu \u003e li:hover,\n.navmenu \u003e li \u003e a:active {\n background: white;\n color: #4d4d4d;\n}\n.navmenu \u003e li.openable \u003e a:active,\n.navmenu \u003e li.openable:hover \u003e a {\n /*border-bottom: 1px solid @ColourBack-Menu; */\n}\n.navmenu \u003e li \u003e a {\n padding: 2px 7px 6px 7px;\n border: 1px solid transparent;\n font-weight: bold;\n}\n.navmenu a {\n display: block;\n float: left;\n color: #666666;\n background: white;\n font-size: 17px;\n padding: 0px 9px;\n text-decoration: none;\n white-space: nowrap;\n}\n.navmenu a.fly {\n white-space: nowrap;\n}\n.navmenu ul {\n background: white;\n position: absolute;\n left: -9999px;\n top: -9999px;\n list-style: none;\n}\n.navmenu li {\n float: left;\n color: #4d4d4d;\n}\n.navmenu li.last {\n height: 9px;\n}\n.navmenu li a:active,\n.navmenu li a:hover {\n color: white;\n background-color: #ff9900;\n}\n.navmenu li \u003e a:active,\n.navmenu li:hover,\n.navmenu li:hover \u003e a,\n.navmenu li:hover.heading,\n.navmenu li a.selected {\n position: relative;\n color: white;\n background-color: #ff9900;\n}\n.navmenu li.openable:hover ul {\n left: 0px;\n top: 30px;\n 
z-index: 500;\n}\n.navmenu li.openable:hover \u003e ul ul {\n position: absolute;\n left: -9999px;\n top: -9999px;\n width: auto;\n}\n.navmenu li ul {\n border-bottom: 5px solid #ff9900;\n}\n.navmenu li li {\n float: none;\n}\n.navmenu li li a {\n float: none;\n font-size: 16px;\n font-weight: normal;\n}\n.navmenu li li a.fly {\n color: #4d4d4d;\n background-color: white;\n padding: 2px 20px;\n}\n.navmenu li li a.break {\n margin-bottom: 15px;\n}\n.navmenu li li a.highlight1,\n.navmenu li li a.highlight1:active,\n.navmenu li li a.highlight1:hover {\n background-color: #009900;\n}\n.navmenu li li a.highlight2,\n.navmenu li li a.highlight2:active,\n.navmenu li li a.highlight2:hover {\n background-color: #ff9900;\n}\n.navmenu li li a.highlight3,\n.navmenu li li a.highlight3:active,\n.navmenu li li a.highlight3:hover {\n background-color: #000000;\n}\n.navmenu li li a.highlight1,\n.navmenu li li a.highlight2,\n.navmenu li li a.highlight3 {\n color: white;\n font-size: 16px;\n margin: 5px 0;\n padding: 9px 20px;\n}\n.site-footer {\n display: -webkit-box;\n /* OLD - iOS 6-, Safari 3.1-6 */\n display: -moz-box;\n /* OLD - Firefox 19- (buggy but mostly works) */\n display: -ms-flexbox;\n /* TWEENER - IE 10 */\n display: -webkit-flex;\n /* NEW - Chrome */\n display: flex;\n /* NEW, Spec - Opera 12.1, Firefox 20+ */\n padding-top: 5px;\n width: 100%;\n font-size: 13px;\n color: #999999;\n}\n.site-footer .align-left,\n.site-footer .align-center,\n.site-footer .align-right {\n -webkit-box-flex: 1;\n /* OLD - iOS 6-, Safari 3.1-6 */\n -moz-box-flex: 1;\n /* OLD - Firefox 19- */\n -webkit-flex: 1;\n /* Chrome */\n -ms-flex: 1;\n /* IE 10 */\n flex: 1;\n}\n.site-footer .align-left {\n flex: 1 0 100px;\n}\n.site-footer .align-center {\n flex: 0 1 0%;\n white-space: nowrap;\n}\n.site-footer .align-right {\n flex: 1 0 100px;\n}\n.site-footer .page-width .active {\n border-bottom: 2px solid #ff9900;\n}\n.searchbar {\n padding: 0;\n}\n.searchbar .search {\n margin-bottom: 4px;\n 
padding: 2px 5px 0px;\n border: 1px solid #cccccc;\n}\n.searchbar .search.subdue {\n color: #cccccc;\n}\n.searchbar input.search {\n width: 190px;\n border: none;\n font-size: 13px;\n padding: 4px 2px;\n}\n.searchbar .search-advanced {\n padding: 8px;\n width: 203px;\n z-index: 1000;\n background-color: white;\n border: solid 1px #cccccc;\n position: absolute;\n top: -4px;\n right: 0px;\n}\n.searchbar .popup {\n display: none;\n}\n.sub-headerbar .searchbar {\n /*\n\t.search-advanced\n\t{\n .transition(width, .1s, linear);\n\n \u0026.open\n {\n \t\twidth: 320px;\n }\n }\n */\n}\n.sub-headerbar .searchbar input.search {\n /*\n \u0026:focus,\u0026:active\n {\n position : absolute;\n top : 3px;\n right : 36px;\n height : 19px;\n border : 1px solid #ccc;\n border-right : none;\n width: 300px;\n\n .transition(width, .1s, linear);\n }\n */\n}\n.search td {\n background-color: white;\n}\n.article-container-parts {\n display: -webkit-box;\n /* OLD - iOS 6-, Safari 3.1-6 */\n display: -moz-box;\n /* OLD - Firefox 19- (buggy but mostly works) */\n display: -ms-flexbox;\n /* TWEENER - IE 10 */\n display: -webkit-flex;\n /* NEW - Chrome */\n display: flex;\n /* NEW, Spec - Opera 12.1, Firefox 20+ */\n}\n.article-container {\n -webkit-box-flex: 1;\n /* OLD - iOS 6-, Safari 3.1-6 */\n -moz-box-flex: 1;\n /* OLD - Firefox 19- */\n -webkit-flex: 1;\n /* Chrome */\n -ms-flex: 1;\n /* IE 10 */\n flex: 1;\n background-color: white;\n color: #111111;\n zoom: 1;\n position: relative;\n max-width: 100%;\n min-height: 675px;\n}\n.article-container .article {\n margin: 0 20px 0 10px;\n line-height: 143%;\n}\n.article-container span.stats {\n margin-left: 30px;\n}\n.article-left-sidebar {\n width: 120px;\n min-width: 120px;\n min-height: 400px;\n font-size: 14px;\n border-right: 1px solid #f2f2f2;\n padding: 0;\n /* Old school style */\n /*\n .license {\n font-weight: bold;\n display: inline-block;\n margin-top: 10px;\n }\n */\n}\n.article-left-sidebar .article-left-sidebar-inner {\n width: 
120px;\n min-width: 120px;\n}\n.article-left-sidebar .tabs \u003e div {\n padding: 5px 0 5px 10px;\n}\n.article-left-sidebar .selected {\n font-weight: bold;\n background: transparent url(\"/images/right-selected.gif\") no-repeat scroll\n right 9px;\n}\n.article-left-sidebar h4 {\n color: #ff9900;\n font-size: 14px;\n font-weight: bold;\n}\n.article-left-sidebar .stats div div div {\n padding: 0px 0 10px 10px;\n color: #666666;\n}\n.article-right-sidebar {\n width: 300px;\n margin-left: 10px;\n font-size: 14px;\n}\n.article-right-sidebar .header {\n font-size: 22px;\n color: white;\n background-color: #ff9900;\n padding: 8px;\n font-weight: 400;\n margin: 0;\n}\n.article-right-sidebar .reading-list-toc {\n max-height: 250px;\n overflow-y: auto;\n padding-right: 6px;\n margin-top: 5px;\n}\n.article-right-sidebar .reading-list-toc .count {\n font-weight: normal;\n font-size: 13px;\n}\n.article-right-sidebar .reading-list-toc .title {\n font-size: 16px;\n}\n.article-right-sidebar .content-list-item {\n display: -webkit-box;\n /* OLD - iOS 6-, Safari 3.1-6 */\n display: -moz-box;\n /* OLD - Firefox 19- (buggy but mostly works) */\n display: -ms-flexbox;\n /* TWEENER - IE 10 */\n display: -webkit-flex;\n /* NEW - Chrome */\n display: flex;\n /* NEW, Spec - Opera 12.1, Firefox 20+ */\n justify-content: space-between;\n font-size: 14px;\n margin: 4px 0;\n}\n.article-right-sidebar .content-list-item .count {\n width: 15%;\n padding: 0px;\n}\n.article-right-sidebar .content-list-item .title {\n width: 85%;\n font-weight: normal;\n}\n.article-right-sidebar .gototop {\n text-align: center;\n padding: 10px 0;\n opacity: 0;\n -webkit-transition: opacity 0.3s linear 0ms;\n -moz-transition: opacity 0.3s linear 0ms;\n -o-transition: opacity 0.3s linear 0ms;\n transition: opacity 0.3s linear 0ms;\n}\n.container-content .edit-links {\n margin: 21px 0 0 10px;\n}\n.article-summary {\n padding: 0px 10px 0px 10px;\n overflow: hidden;\n}\n.article h1 {\n margin: 0 0 15px 0;\n}\n.article 
h2 {\n color: #ff9900;\n}\n.article pre {\n overflow: auto;\n}\n.article .header .title {\n color: #808080;\n}\n.article .header .author {\n font-weight: bold;\n min-width: 100px;\n}\n.article .header .author a {\n color: #808080;\n}\n.article .header .avatar {\n max-width: 48px;\n max-height: 48px;\n overflow: unset;\n}\n.article .header .avatar-wrap {\n width: 48px;\n height: 48px;\n margin-right: 10px;\n text-align: center;\n}\n.article .header .date {\n white-space: nowrap;\n}\n.article .header .license {\n margin-left: 15px;\n}\n.article .header .stats {\n margin-left: 30px;\n}\n.article img.lazyload,\n.article img.lazyloading {\n opacity: 0;\n}\n.article img {\n max-width: 700px;\n height: auto;\n}\n.article img.lazyloaded {\n opacity: 1;\n transition: opacity 300ms;\n}\n.article .summary {\n color: #808080;\n padding: 40px 0 15px 0;\n}\n.article .text {\n padding-top: 10px;\n}\n.article .reading-list-nav {\n border-top: 1px #dedede solid;\n border-bottom: 1px #dedede solid;\n padding: 10px 0;\n}\n.article .reading-list-nav .prev {\n padding-left: 20px;\n}\n.article .reading-list-nav .next {\n margin-left: 5px;\n padding-right: 20px;\n text-align: right;\n}\n.article-nav {\n text-align: right;\n margin-top: 21px;\n vertical-align: middle;\n line-height: 15px;\n}\n.msg-728x90 {\n width: 728px;\n height: 90px;\n overflow: hidden;\n}\n.msg-300x250 {\n width: 300px;\n height: 250px;\n overflow: hidden;\n}\n.content-list {\n margin-bottom: 17px;\n}\n.content-list .count {\n font-weight: bold;\n font-size: 16px;\n color: #ff9900;\n padding: 3px;\n text-align: center;\n}\n.content-list-item {\n margin: 10px 0;\n}\n.content-list-item .title {\n font-size: 14px;\n font-weight: bold;\n padding: 0px 0;\n}\n.content-list-item .title a {\n color: #005782;\n}\n.tags {\n line-height: 190%;\n}\n.tags .t {\n background: none repeat scroll 0 0 transparent;\n border: 1px solid #fbedbb;\n border-radius: 12px 0 0 12px;\n line-height: 1.4;\n padding: 0 2px 2px 3px;\n position: 
relative;\n text-decoration: none;\n margin: 2px 5px 4px 0;\n white-space: nowrap;\n}\n.tags .t a {\n color: #666666;\n display: inline-block;\n margin-right: 3px;\n padding-left: 5px;\n text-overflow: ellipsis;\n}\n.container-breadcrumb {\n font-size: 14px;\n margin-top: 7px;\n color: #808080;\n margin: 12px 0 35px;\n}\n.container-breadcrumb a {\n color: #808080;\n}\n.pre-lang {\n display: -webkit-box;\n /* OLD - iOS 6-, Safari 3.1-6 */\n display: -moz-box;\n /* OLD - Firefox 19- (buggy but mostly works) */\n display: -ms-flexbox;\n /* TWEENER - IE 10 */\n display: -webkit-flex;\n /* NEW - Chrome */\n display: flex;\n /* NEW, Spec - Opera 12.1, Firefox 20+ */\n background-color: #fbedbb;\n justify-content: space-between;\n padding: 4px 8px;\n margin-top: 5px;\n color: #999999;\n border-bottom: solid 1px #ffd044;\n}\n.code-comment {\n color: #008000;\n font-style: italic;\n}\n.code-keyword {\n color: Blue;\n}\n.code-sdkkeyword {\n color: #339999;\n}\n.code-string {\n color: Purple;\n}\n.code-attribute {\n color: red;\n}\n.code-leadattribute {\n color: #800000;\n}\n.pre-action-link {\n font-size: 13px;\n color: #999999;\n}\n.pre-action-link span {\n cursor: pointer;\n margin: 0;\n -webkit-transition: color 0.1s linear;\n -moz-transition: color 0.1s linear;\n -o-transition: color 0.1s linear;\n transition: color 0.1s linear;\n}\n.speech-bubble-container-down,\n.speech-bubble-container-up,\n.speech-bubble-container-up-right,\n.speech-bubble-container-left,\n.speech-bubble-container-right {\n position: relative;\n}\n.speech-bubble-up,\n.speech-bubble-down,\n.speech-bubble-left,\n.speech-bubble-right,\n.speech-bubble-up-right {\n padding: 0.6em;\n border: 1px solid #cccccc;\n background-color: white;\n margin: 15px;\n text-decoration: none;\n font-weight: normal;\n text-align: left;\n white-space: normal;\n color: #333333;\n font-size: 14px;\n line-height: 1.3;\n}\n.speech-bubble-down {\n margin-bottom: 0px;\n}\n.tooltip .speech-bubble-up,\n.tooltip 
.speech-bubble-up-right,\n.tooltip .speech-bubble-down,\n.tooltip .speech-bubble-left,\n.tooltip .speech-bubble-right {\n -moz-box-shadow: 4px 4px 16px 1px rgba(0, 0, 0, 0.25);\n -webkit-box-shadow: 4px 4px 16px 1px rgba(0, 0, 0, 0.25);\n box-shadow: 4px 4px 16px 1px rgba(0, 0, 0, 0.25);\n border-radius: 5px;\n -webkit-border-radius: 5px;\n -moz-border-radius: 5px;\n min-width: 75px;\n}\n.speech-bubble-pointer-down,\n.speech-bubble-pointer-down-inner {\n width: 0;\n height: 0;\n border-bottom-width: 0;\n background: none;\n}\n.speech-bubble-pointer-down {\n border-left: 7px solid transparent;\n border-right: 7px solid transparent;\n border-top: 1px solid #cccccc;\n border-top-width: 14px;\n margin-left: 35px;\n margin-bottom: 15px;\n _display: none;\n}\n.speech-bubble-pointer-up,\n.speech-bubble-pointer-up-right,\n.speech-bubble-pointer-up-inner,\n.speech-bubble-pointer-up-right-inner {\n width: 0;\n height: 0;\n border-top-width: 0;\n background: none;\n}\n.speech-bubble-pointer-up,\n.speech-bubble-pointer-up-right {\n border-left: 5px solid transparent;\n border-right: 5px solid transparent;\n border-bottom: 1px solid #cccccc;\n border-bottom-width: 14px;\n margin-left: 35px;\n position: absolute;\n top: -12px;\n _display: none;\n}\n.speech-bubble-container-up-right .speech-bubble-pointer-up-right {\n margin-left: 0;\n margin-right: 0px;\n right: 35px;\n}\n.tooltip {\n position: relative;\n text-decoration: none;\n}\n.tooltip .speech-bubble-container-up,\n.tooltip .speech-bubble-container-down,\n.tooltip .speech-bubble-container-left,\n.tooltip .speech-bubble-container-right,\n.tooltip .speech-bubble-container-up-right,\n.tooltip .tooltip-flyout {\n display: none;\n opacity: 0;\n -webkit-transition: opacity 0.5s linear 0ms;\n -moz-transition: opacity 0.5s linear 0ms;\n -o-transition: opacity 0.5s linear 0ms;\n transition: opacity 0.5s linear 0ms;\n}\n.micromodal {\n display: none;\n}\n.micromodal .modal__overlay {\n position: fixed;\n top: 0;\n left: 0;\n right: 
0;\n bottom: 0;\n display: flex;\n justify-content: center;\n align-items: center;\n background: rgba(0, 0, 0, 0.65);\n z-index: 1000;\n}\n.micromodal .modal__container {\n box-sizing: border-box;\n overflow-y: auto;\n max-width: 500px;\n max-height: 100vh;\n padding: 30px;\n background-color: #fff;\n border-radius: 4px;\n}\n.micromodal .modal__container,\n.micromodal .modal__overlay {\n will-change: transform;\n}\n.micromodal .modal_header {\n display: flex;\n justify-content: space-between;\n align-items: center;\n margin-bottom: 1rem;\n}\n.micromodal .modal_title {\n margin-top: 0;\n margin-bottom: 0;\n color: #ff9900;\n box-sizing: border-box;\n}\n.bottom-promo {\n height: 90px;\n margin-top: 10px;\n overflow: hidden;\n}\n.bottom-promo .msg-728x90 {\n width: 728px;\n margin: 0 auto;\n}\n.msg-728x90 {\n overflow: hidden;\n position: relative;\n height: 90px;\n min-width: 728px;\n}\ntable.forum {\n table-layout: fixed;\n margin: 0 0 20px 0;\n padding: 0px;\n width: 100%;\n}\n.forum {\n /*\n .indent\n {\n padding-left:5px; padding-right: 5px;\n }\n */\n}\n.forum table {\n border-collapse: separate;\n}\n.forum .header1,\n.forum .header1 TD {\n color: #333333;\n font-size: 14px;\n vertical-align: middle;\n}\n.forum .header2,\n.forum .header2 TD {\n color: white;\n background-color: #ff9900;\n font-size: 14px;\n vertical-align: middle;\n}\n.forum .header2 input,\n.forum .header2 TD input,\n.forum .header2 select,\n.forum .header2 TD select {\n padding: 2px;\n}\n.forum .button {\n border: 1px solid #ffcc66;\n margin: 1px;\n}\n.forum .searchbar {\n border: 1px solid #cccccc;\n padding: 4px 0 0 0;\n margin: 3px 0;\n}\n.forum .searchbar .search {\n width: 200px;\n}\n.forum .searchbar input[type=\"image\"] {\n vertical-align: middle;\n}\n.forum .dropdown {\n background-color: #fffdfa;\n font-size: 95%;\n margin-left: 5px;\n}\n.forum .footer,\n.forum .footer td,\n.forum .navbar,\n.forum .navbar td {\n font-size: 14px;\n padding: 8px 0;\n border: 0;\n border-top: 1px solid 
#808080;\n}\n.author-wrapper {\n position: relative;\n}\n.author-wrapper .profile-pic {\n border: 1px solid #333;\n margin: 0 13px 0 0;\n padding: 10px;\n -moz-box-shadow: 3px 3px 5px 1px rgba(0, 0, 0, 0.2);\n -webkit-box-shadow: 3px 3px 5px 1px rgba(0, 0, 0, 0.2);\n box-shadow: 3px 3px 5px 1px rgba(0, 0, 0, 0.2);\n max-height: 100px;\n}\n.author-wrapper .container-member .author {\n font-size: 29px;\n color: #333333;\n font-weight: 600;\n}\n.member-signin .forgot {\n padding: 0;\n}\n.member-signin a.forgot {\n color: #808080;\n}\n.rating-container {\n /*\n\t.rating-close\n\t{\n\t\tfont-size : @FontSize-MediumSmall;\n\t\tfont-weight : bold;\n\t\t// display : inline-block;\n\t\t// height : 19px;\n\t\tpadding : 0px 7px 3px 5px;\n\t\ttext-decoration : none;\n\t\tborder : 1px solid transparent;\n\t\tposition : absolute;\n\t\tright : 1px;\n\t\ttop : -1px;\n\n\t\t\u0026:hover { border : 1px solid @Colour-Theme1; }\n\t}*/\n}\n.rating-container .rating-prompt {\n padding-right: 5px;\n white-space: nowrap;\n line-height: 25px;\n margin-right: 2px;\n font-weight: bold;\n color: #808080;\n}\n.rating-container .rating-votes {\n margin-left: 5px;\n}\n.rating-container.large-stars,\n.rating-container.medium-stars,\n.rating-container.small-stars {\n margin-left: 7px;\n}\n.rating-container.large-stars .rating-votes,\n.rating-container.large-stars .rating-prompt,\n.rating-container.large-stars .rating-poor,\n.rating-container.large-stars .rating-good {\n line-height: 25px;\n}\n.tablet-only,\n.tablet-block-only {\n display: none;\n}\n.mobile-only,\n.mobile-block-only {\n display: none;\n}\n.desktop-only.tablet-only {\n display: inherit;\n}\n.tablet-only.desktop-only {\n display: inherit;\n}\n.rrssb-buttons,\n.rrssb-buttons li,\n.rrssb-buttons li a {\n -moz-box-sizing: border-box;\n box-sizing: border-box;\n}\n.clearfix:after {\n clear: both;\n}\n.clearfix:before,\n.clearfix:after {\n content: \" \";\n display: table;\n}\n.rrssb-buttons {\n font-family: \"Helvetica Neue\", Helvetica, 
Arial, sans-serif;\n height: 40px;\n margin: 0;\n padding: 0;\n width: 100%;\n}\n.rrssb-buttons li {\n float: left;\n height: 100%;\n list-style: none;\n margin: 0;\n padding: 0 2.5px;\n line-height: 13px;\n}\n.rrssb-buttons li.email a {\n background-color: #0a88ff;\n}\n.rrssb-buttons li.facebook a {\n background-color: #306199;\n}\n.rrssb-buttons li.linkedin a {\n background-color: #007bb6;\n}\n.rrssb-buttons li.twitter a {\n background-color: #26c4f1;\n}\n.rrssb-buttons li.reddit a {\n background-color: #8bbbe3;\n}\n.rrssb-buttons li.pinterest a {\n background-color: #b81621;\n}\n.rrssb-buttons li.compact {\n padding: 0 4.5px;\n}\n.rrssb-buttons li.compact a {\n border-radius: 50%;\n padding: 0px 7px 0 32px;\n}\n.rrssb-buttons li.compact a .icon {\n left: 7px;\n top: -3px;\n width: 5%;\n transform: scale(0.85);\n}\n.rrssb-buttons li.compact a .icon svg {\n height: auto;\n width: auto;\n}\n.rrssb-buttons li a {\n background-color: #ccc;\n border-radius: 3px;\n display: block;\n font-size: 11px;\n font-weight: bold;\n height: 100%;\n padding: 11px 7px 12px 27px;\n position: relative;\n text-align: center;\n text-decoration: none;\n text-transform: uppercase;\n -webkit-font-smoothing: antialiased;\n -moz-osx-font-smoothing: grayscale;\n width: 100%;\n -webkit-transition: background-color 0.2s ease-in-out;\n -moz-transition: background-color 0.2s ease-in-out;\n -o-transition: background-color 0.2s ease-in-out;\n transition: background-color 0.2s ease-in-out;\n}\n.rrssb-buttons li a .icon {\n display: block;\n height: 100%;\n left: 10px;\n padding-top: 9px;\n position: absolute;\n top: 0;\n width: 10%;\n}\n.rrssb-buttons li a .icon svg {\n height: 17px;\n width: 17px;\n}\n.rrssb-buttons li a .icon svg path,\n.rrssb-buttons li a .icon svg polygon {\n fill: #fff;\n}\n.cc-window {\n opacity: 1;\n background-color: #ff9900;\n /*\n -webkit-transition: opacity .25s ease;\n -moz-transition: opacity .25s ease;\n -ms-transition: opacity .25s ease;\n -o-transition: opacity .25s 
ease;\n transition: opacity .25s ease;\n */\n}\n.cc-window.cc-invisible {\n opacity: 0;\n}\n.cc-animate.cc-revoke {\n /*\n -webkit-transition: transform .25s ease;\n -moz-transition: transform .25s ease;\n -ms-transition: transform .25s ease;\n -o-transition: transform .25s ease;\n transition: transform .25s ease;\n */\n}\n.cc-animate.cc-revoke.cc-bottom {\n transform: translateY(2em);\n}\n.cc-window,\n.cc-revoke {\n position: fixed;\n overflow: hidden;\n box-sizing: border-box;\n /* exclude padding when dealing with width */\n font-family: \"Segoe UI\", Arial, Sans-Serif;\n font-size: 13px;\n /* by setting the base font here, we can size the rest of the popup using CSS `em` */\n line-height: 1.5em;\n display: flex;\n flex-wrap: nowrap;\n /* the following are random unjustified styles - just because - should probably be removed */\n z-index: 9999;\n}\n.cc-window.cc-banner {\n padding: 0.7em 1.8em;\n width: 100%;\n flex-direction: row;\n}\n.cc-revoke {\n padding: 0.5em;\n}\n.cc-btn,\n.cc-link,\n.cc-close,\n.cc-revoke {\n cursor: pointer;\n}\n.cc-link {\n opacity: 0.8;\n display: inline-block;\n padding: 0.2em;\n text-decoration: underline;\n}\n.cc-link:active,\n.cc-link:visited {\n color: initial;\n}\n.cc-btn {\n display: block;\n padding: 0.4em 0.8em;\n font-size: 0.9em;\n font-weight: bold;\n border-width: 2px;\n border-style: solid;\n text-align: center;\n white-space: nowrap;\n}\n.cc-banner .cc-btn:last-child {\n min-width: 110px;\n margin-left: 10px;\n}\n.cc-highlight .cc-btn:first-child {\n background-color: transparent;\n border-color: transparent;\n}\n.cc-revoke.cc-bottom {\n bottom: 0;\n left: 3em;\n border-top-left-radius: 0.5em;\n border-top-right-radius: 0.5em;\n}\n.cc-bottom {\n bottom: 1em;\n}\n.cc-window.cc-banner {\n align-items: center;\n}\n.cc-banner.cc-bottom {\n left: 0;\n right: 0;\n bottom: 0;\n}\n.cc-banner .cc-message {\n flex: 1;\n}\n.cc-compliance {\n display: flex;\n align-items: center;\n align-content: space-between;\n}\n.cc-compliance 
\u003e .cc-btn {\n flex: 1;\n}\n.cc-btn + .cc-btn {\n margin-left: 0.5em;\n}\n\n@keyframes showcopied {\n 0% {\n opacity: 0;\n }\n 70% {\n opacity: 1;\n }\n 100% {\n opacity: 0;\n }\n}\n"])</script><script>self.__next_f.push([1,"1a:T9375,"])</script><script>self.__next_f.push([1,"\r\n\u003cul class=\"download\"\u003e\r\n\u003cli\u003e\u003ca href=\"/kb/ip/14207/searcharoo_3/searcharoo3.zip\"\u003eDownload source code - 29 Kb \u003c/a\u003e\u003c/li\u003e\r\n\r\n\u003cli\u003e\u003ca href=\"/kb/ip/14207/searcharoo_3/searcharoo3_aspnet2.zip\"\u003eDownload source code for - 30 Kb \u003c/a\u003e\u003ca href=\"/kb/ip/14207/#aspnet2\"\u003efor ASP.NET 2.0\u003c/a\u003e \u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003cp\u003e\u003cimg height=\"389\" alt=\"Sample image of Searcharoo: shows both the 'initial page' (bottom right) and the results view\" src=\"https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_1.gif\" width=\"500\" border=\"1\" /\u003e\u003c/p\u003e\r\n\r\n\u003ch2\u003eBackground\u003c/h2\u003e\r\n\r\n\u003cp\u003eThis article follows on from the previous two Searcharoo samples:\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003ca href=\"Searcharoo.asp\" target=\"_blank\"\u003eSearcharoo Version 1\u003c/a\u003e describes building a simple search engine that crawls the \u003cem\u003efile system\u003c/em\u003e from a specified folder and indexes all HTML (or other known types) of document. A basic design and object model was developed to support simple, single-word searches, whose results were displayed in a rudimentary query/results page.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003ca href=\"Spideroo.asp\" target=\"_blank\"\u003eSearcharoo Version 2\u003c/a\u003e focused on adding a 'spider' to find data to index by following \u003cem\u003eweb links\u003c/em\u003e (rather than just looking at directory listings in the file system). 
This means downloading files via HTTP, parsing the HTML to find more links and ensuring we don't get into a recursive loop because many web pages refer to each other. That article also discusses how the results for multiple search words are combined into a single set of 'matches'.\u003c/p\u003e\r\n\r\n\u003ch2\u003eIntroduction\u003c/h2\u003e\r\n\r\n\u003cp\u003eThis article (version 3 of Searcharoo) covers three main areas:\u003c/p\u003e\r\n\r\n\u003col\u003e\r\n\u003cli\u003eImplementing a 'save to disk' function for the catalog \u003c/li\u003e\r\n\r\n\u003cli\u003eFeature suggestions, bug fixes and incorporation of code contributed by others on previous articles (mostly via CodeProject - thank you!) \u003c/li\u003e\r\n\r\n\u003cli\u003eImproving the code itself (adding comments, moving classes, improving readability and hopefully making it easier to modify \u0026amp; re-use) \u003c/li\u003e\r\n\u003c/ol\u003e\r\n\r\n\u003ch4\u003eNew 'features' include:\u003c/h4\u003e\r\n\r\n\u003cul\u003e\r\n\u003cli\u003eSaving the catalog (which resides in memory for fast searching) to disk \u003c/li\u003e\r\n\r\n\u003cli\u003eMaking the Spider recognise and follow pages referenced in FRAMESETs and IFRAMEs (suggested by le_mo_mo) \u003c/li\u003e\r\n\r\n\u003cli\u003ePaging results rather than just listing them all on one page (submitted by \u003ca href=\"https://codeproject.org.cn/script/profile/whos_who.asp?id=1125453\" target=\"_blank\"\u003eJim Harkins\u003c/a\u003e) \u003c/li\u003e\r\n\r\n\u003cli\u003eNormalising words and numbers (removing punctuation, etc) \u003c/li\u003e\r\n\r\n\u003cli\u003e(Optional) stemming of English words to reduce catalog size (suggested by \u003ca href=\"http://dotnetjunkies.com/weblog/chris.taylor\" target=\"_blank\"\u003eChris Taylor\u003c/a\u003e and \u003ca href=\"https://codeproject.org.cn/script/profile/whos_who.asp?id=32301\" target=\"_blank\"\u003eTrickster\u003c/a\u003e) \u003c/li\u003e\r\n\r\n\u003cli\u003e(Optional) use of Stop words to 
reduce catalog size \u003c/li\u003e\r\n\r\n\u003cli\u003e(Optional) creation of a Go word list, to specifically catalog domain-specific words like \"C#\", which might otherwise be ignored \u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003ch4\u003eThe bug fixes include:\u003c/h4\u003e\r\n\r\n\u003cul\u003e\r\n\u003cli\u003eCorrectly parsing \u0026lt;TITLE\u0026gt; tags that may have additional attributes eg. an ID= attribute in an ASP.NET environment. (submitted by \u003ca href=\"https://codeproject.org.cn/script/profile/whos_who.asp?id=2416042\" target=\"_blank\"\u003exenomouse\u003c/a\u003e) \u003c/li\u003e\r\n\r\n\u003cli\u003eHandling Cookies if the server has set them to track a 'session' (submitted by \u003ca href=\"https://codeproject.org.cn/script/profile/whos_who.asp?id=804458\" target=\"_blank\"\u003eSimon Jones\u003c/a\u003e) \u003c/li\u003e\r\n\r\n\u003cli\u003eChecking the 'final' URL after redirects to ensure the right page is indexed and linked (submitted by \u003ca href=\"https://codeproject.org.cn/script/profile/whos_who.asp?id=804458\" target=\"_blank\"\u003eSimon Jones\u003c/a\u003e) \u003c/li\u003e\r\n\r\n\u003cli\u003eCorrectly parsing (and obeying!) the ROBOTS meta tag (I found this bug myself). \u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003ch4\u003eCode layout improvements included:\u003c/h4\u003e\r\n\r\n\u003cul\u003e\r\n\u003cli\u003eThe Spider code that was a bit of a mess in SearcharooSpider.aspx being moved into a proper C# class (and implementing an EventHandler to allow monitoring of progress) \u003c/li\u003e\r\n\r\n\u003cli\u003eEncapsulation of Preferences into a single static class \u003c/li\u003e\r\n\r\n\u003cli\u003eLayout of Searcharoo.cs using #regions (easy to read if you have VS.NET) \u003c/li\u003e\r\n\r\n\u003cli\u003eUser control (Searcharoo.ASCX) created for search box - if you want to re-brand it you only have to modify in one place. 
\u003c/li\u003e\r\n\r\n\u003cli\u003ePaging implementation using PagedDataSource means you can easily alter the 'template' for the results (eg link size/color/layout) in Searcharoo3.aspx \u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003ch3\u003eDesign\u003c/h3\u003e\r\n\r\n\u003cp\u003eThe \u003cem\u003efundamental\u003c/em\u003e Catalog-File-Word design remains unchanged (from Version 1), however there are quite a few extra classes implemented in this version.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cimg alt=\"Object Model for Searcharoo3\" src=\"https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_2.png\" border=\"1\" /\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003eTo build the catalog, SearcharooSpider.aspx calls Spider.BuildCatalog() which:\u003c/p\u003e\r\n\r\n\u003col\u003e\r\n\u003cli\u003eAccesses Preferences static object to read settings \u003c/li\u003e\r\n\r\n\u003cli\u003eCreates empty Catalog \u003c/li\u003e\r\n\r\n\u003cli\u003eCreates IGoWord, IStopper and IStemming implementations (based on Preferences) \u003c/li\u003e\r\n\r\n\u003cli\u003eProcesses startPageUri (with a WebRequest) \u003c/li\u003e\r\n\r\n\u003cli\u003eCreates HtmlDocument, populates properties including Link collections \u003c/li\u003e\r\n\r\n\u003cli\u003eParses the content of the page, creating Word and File objects as required \u003c/li\u003e\r\n\r\n\u003cli\u003eRecursively applies steps 4 through 6 for each LocalLink \u003c/li\u003e\r\n\r\n\u003cli\u003eBinarySerializes the Catalog to disk using CatalogBinder \u003c/li\u003e\r\n\r\n\u003cli\u003eAdds the Catalog to Application.Cache[], for use by Searcharoo3.aspx for searching! 
\u003c/li\u003e\r\n\u003c/ol\u003e\r\n\r\n\u003ch3\u003eCode Structure\u003c/h3\u003e\r\n\r\n\u003cp\u003eThese are the files used in this version (and contained in the \u003ca href=\"Searcharoo_3/Searcharoo3.zip\"\u003edownload\u003c/a\u003e).\u003c/p\u003e\r\n\r\n\u003ctable cellspacing=\"0\" width=\"600\" border=\"1\"\u003e\r\n\u003ctbody\u003e\r\n\u003ctr\u003e\r\n\u003cth\u003eweb.config\u003c/th\u003e\r\n\r\n\u003ctd\u003e14 settings that control how the spider \u003cem\u003eand\u003c/em\u003e the search page behave. They are all 'optional' (ie the spider and search page will run if no config settings are provided) but I recommend at least providing\u003cbr /\u003e\u003ccode lang=\"xml\"\u003e\u0026lt;add key=\"Searcharoo_VirtualRoot\" value=\"https:///content/\" /\u0026gt;\u003c/code\u003e \u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003cth\u003eSearcharoo.cs\u003c/th\u003e\r\n\r\n\u003ctd\u003eMost code for the application is in this file. Many classes that were in ASPX files in version 2 have been moved into this file (such as Spider and HtmlDocument) because it's easier to read and maintain. New version 3 features (Stop, Go, Stemming) all added here. \u003cbr /\u003e\u003cimg alt=\"Searcharoo.cs contents viewed in VisualStudio.NET with regions collapsed\" src=\"https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_3.gif\" border=\"1\" /\u003e \u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003cth\u003eSearcharoo3.aspx\u003c/th\u003e\r\n\r\n\u003ctd\u003eSearch page (input and results). 
Checks the Application-Cache for a Catalog, and if none exists, creates one (deserialize OR run SearcharooSpider.aspx)\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003cth\u003eSearcharoo.ascx\u003c/th\u003e\r\n\r\n\u003ctd\u003e\u003cstrong\u003eNEW \u003c/strong\u003euser control that contains two asp:Panels: \r\n\u003cul\u003e\r\n\u003cli\u003ethe 'blank' search box (when page is first loaded, defaults to yellow background) \u003c/li\u003e\r\n\r\n\u003cli\u003ethe populated search box (when results are displayed, defaults to blue background) \u003c/li\u003e\r\n\u003c/ul\u003e\r\n(see the screenshot at the top of the article) \u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003cth\u003eSearcharooSpider.aspx\u003c/th\u003e\r\n\r\n\u003ctd\u003eThe main page (Searcharoo3.aspx) does a Server.Transfer to this page to create a new Catalog (if required).\u003cbr /\u003eAlmost ALL of the code that \u003cem\u003ewas\u003c/em\u003e in this page in version 2 has been migrated to Searcharoo.cs - OnProgressEvent() allows it to still display 'progress' messages as the spidering is taking place. \u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003c/tbody\u003e\r\n\u003c/table\u003e\r\n\r\n\u003ch2\u003eSaving the Catalog to Disk\u003c/h2\u003e\r\n\r\n\u003cp\u003eThere are a couple of reasons why saving the catalog to disk is useful:\u003c/p\u003e\r\n\r\n\u003cul\u003e\r\n\u003cli\u003eIt can be built on a different server to the website (for smaller sites, where the code may not have permission to write to disk on the webserver) \u003c/li\u003e\r\n\r\n\u003cli\u003eIf the server Application restarts, the catalog can be re-loaded rather than rebuilt entirely \u003c/li\u003e\r\n\r\n\u003cli\u003eYou can finally 'see' what information is stored in the catalog - useful for debugging! 
\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003cp\u003eThere are two types of Serialization (Xml and Binary) available in the Framework, and since the Xml is 'human readable', that seemed the logical one to try. The code required to serialize the Catalog is very simple - the code below is from the Catalog.Save() method, so the reference to \u003cstrong\u003ethis\u003c/strong\u003e is the Catalog object. \u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003eXmlSerializer serializerXml = \u003cspan class='cs-keyword'\u003enew\u003c/span\u003e XmlSerializer( \u003cspan class='cs-keyword'\u003etypeof\u003c/span\u003e( Catalog ) );\nSystem.IO.TextWriter writer \n = \u003cspan class='cs-keyword'\u003enew\u003c/span\u003e System.IO.StreamWriter( Preferences.CatalogFileName+\u003cspan class='cpp-string'\u003e\".xml\"\u003c/span\u003e );\nserializerXml.Serialize( writer, \u003cspan class='cs-keyword'\u003ethis\u003c/span\u003e );\nwriter.Close();\n\u003c/pre\u003e\r\n\r\n\u003cp\u003eThe 'test dataset' I've mostly used is the \u003ca href=\"http://www.cia.gov/cia/publications/factbook/\" target=\"_blank\"\u003eCIA World Factbook\u003c/a\u003e (\u003ca href=\"http://www.cia.gov/cia/download.html\"\u003edownload\u003c/a\u003e) which is about \u003cstrong\u003e52.6 Mb\u003c/strong\u003e on disk for the HTML only (not including images and non-searchable data) - so imagine my \"surprise\" when the Xml-Serialized-Catalog itself was three times the size at \u003cstrong\u003e156 Mb\u003c/strong\u003e (yes, megabytes!). I couldn't even \u003cem\u003eopen\u003c/em\u003e it easily, except by 'type'ing it from the Command Prompt.\u003c/p\u003e\r\n\u003cimg height=\"410\" alt=\"Xml Serialization is VERBOSE: 136 Mb!\" src=\"https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_136mb.gif\" width=\"600\" /\u003e \r\n\u003cp\u003eOUCH - what a waste of space! 
And worse, this was the first time I'd noticed the fields defined in the File class were declared public and not private (see the elements beginning with underscores). Firstly, let's get rid of the serialized duplicates (fields that should be private, and their public property counterparts). Rather than changing the visibility (and potentially breaking code), the [XmlIgnore] attribute can be added to the definition. To further reduce the amount of repeated text, the element names are compressed to single letters using the [XmlElement] attribute, and to reduce the number of \u0026lt;\u0026gt; some of the properties are marked to be serialized as [XmlAttribute]s.\u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003e[Serializable]\n\u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003eclass\u003c/span\u003e Word \n{\n [XmlElement(\u003cspan class='cpp-string'\u003e\"t\"\u003c/span\u003e)] \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e Text;\n [XmlElement(\u003cspan class='cpp-string'\u003e\"fs\"\u003c/span\u003e)] \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e File[] Files\n...\n[Serializable]\n\u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003eclass\u003c/span\u003e File\n{\n [XmlIgnore] \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e _Url;\n...\n [XmlAttribute(\u003cspan class='cpp-string'\u003e\"u\"\u003c/span\u003e)] \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e Url { ...\n [XmlAttribute(\u003cspan class='cpp-string'\u003e\"t\"\u003c/span\u003e)] \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e Title { ...\n [XmlElement(\u003cspan class='cpp-string'\u003e\"d\"\u003c/span\u003e)] \u003cspan 
class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e Description { ...\n [XmlAttribute(\u003cspan class='cpp-string'\u003e\"d\"\u003c/span\u003e)] \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e DateTime CrawledDate { ...\n [XmlAttribute(\u003cspan class='cpp-string'\u003e\"s\"\u003c/span\u003e)] \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003elong\u003c/span\u003e Size { ...\n...\n\u003c/pre\u003e\r\n\r\n\u003cp\u003eThe Xml file is now a teeny (not!) \u003cstrong\u003e49 Mb\u003c/strong\u003e in size, still too large for notepad but easily viewed via cmd. As you can see below, the 'compression' of the Xml certainly saved some space - at least the Catalog is now smaller than the source data!\u003c/p\u003e\r\n\u003cimg height=\"195\" alt=\"Xml Serialization is slightly less verbose - 49 Mb\" src=\"https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_49mb.gif\" width=\"600\" /\u003e \r\n\u003cp\u003eEven with the smaller output, 49 Mb of Xml is still a little too verbose to be practical (hardly a surprise really, Xml often is!) 
so let's serialize the index to a Binary format (again, the Framework classes make it really simple).\u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003eSystem.IO.Stream stream = \u003cspan class='cs-keyword'\u003enew\u003c/span\u003e System.IO.FileStream\n (Preferences.CatalogFileName+\u003cspan class='cpp-string'\u003e\".dat\"\u003c/span\u003e , System.IO.FileMode.Create );\nSystem.Runtime.Serialization.IFormatter formatter = \n \u003cspan class='cs-keyword'\u003enew\u003c/span\u003e System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();\nformatter.Serialize (stream, \u003cspan class='cs-keyword'\u003ethis\u003c/span\u003e);\nstream.Close();\n\u003c/pre\u003e\r\n\r\n\u003cp\u003eThe results of changing to Binary Serialization were dramatic - the same catalog data was \u003cstrong\u003e4.6 Mb\u003c/strong\u003e rather than 150! That's about 3% of the Xml size, definitely the way to go.\u003c/p\u003e\r\n\r\n\u003cp\u003eNow that I had the Catalog being saved successfully to disk, it \u003cem\u003eseemed\u003c/em\u003e like it would be a simple matter to re-load it back into memory \u0026amp; the Application Cache...\u003c/p\u003e\r\n\r\n\u003ch2\u003eLoading the Catalog from Disk\u003c/h2\u003e\r\n\r\n\u003cp\u003eUnfortunately, it was NOT that simple. 
Whenever the Application restarted (say web.config or Searcharoo.cs was changed), the code could not de-serialize the file but instead threw this cryptic error: \u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cstrong\u003eCannot find the assembly h4octhiw, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null\u003c/strong\u003e \u003c/p\u003e\r\n\r\n\u003cp\u003e\u003ca href=\"Searcharoo_3/Searcharoo3_4L.png\" target=\"_blank\"\u003e\u003cimg alt=\"Load Binary Catalog Exception - click to see full screen\" src=\"https://cloudfront.codeproject.com/ip/searcharoo_3/searcharoo3_4.png\" border=\"1\" /\u003e\u003c/a\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003eAt first I was stumped - I didn't have any assembly named \u003cem\u003eh4octhiw\u003c/em\u003e, so it wasn't immediately apparent why it could not be found. There are a couple of hints though:\u003c/p\u003e\r\n\r\n\u003cul\u003e\r\n\u003cli\u003eThe 'not found' assembly appears to have a randomly generated name... and what do we know uses randomly generated assembly names? The \\Temporary ASP.NET Files\\ directory where dynamically compiled assemblies (from src=\"\" and ASPX) are saved. \u003c/li\u003e\r\n\r\n\u003cli\u003eThe error line references only 'object' and 'stream' types - surely \u003cem\u003ethey\u003c/em\u003e aren't causing the problem \u003c/li\u003e\r\n\r\n\u003cli\u003eReading through the Stack Trace (\u003ca href=\"Searcharoo3_4L.png\" target=\"_blank\"\u003eclick on the image\u003c/a\u003e) from the bottom up (as always), you can infer that the Deserialize method creates a BinaryParser that creates an ObjectMap with an array of MemberNames which in turn request ObjectReader.GetType() which triggers the GetAssembly() method... but it fails! Hmm - sounds like it might be looking for the Types that have been serialized - why can't it find them? 
\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003cp\u003eIf your Google skills are honed, rather than the dozens of useless links returned when you \u003ca href=\"http://www.google.com.au/search?q=ASP.NET+%22cannot+find+the+assembly%22\"\u003esearch for \u003cem\u003eASP.NET \"Cannot find the assembly\"\u003c/em\u003e\u003c/a\u003e you'll be lucky and stumble across this \u003ca href=\"https://codeproject.org.cn/soap/Serialization_Samples.asp\" target=\"_blank\"\u003eCodeProject article on Serialization\u003c/a\u003e where you will learn a very interesting fact:\u003c/p\u003e\r\n\r\n\u003cblockquote style=\"BACKGROUND-COLOR: #fbedbb\"\u003eType information is also serialized while the class is serialized enabling the class to be deserialized using the type information. Type information consists of namespace, class name, assembly name, culture information, assembly version, and public key token. As long as your deserialized class and the class that is serialized reside in the same assembly it does not cause any problem. But if the serializer is in a separate assembly, .NET cannot find your class' type hence cannot deserialize it.\u003c/blockquote\u003e\r\n\r\n\u003cp\u003eBut what does it mean? Every time the web/IIS 'Application' restarts, all your ASPX and src=\"\" code is recompiled to a NEW, RANDOMLY NAMED assembly in \\Temporary ASP.NET Files\\. So although the Catalog class is based on the same code, its \u003cstrong\u003eType Information\u003c/strong\u003e (namespace, class name, \u003cstrong\u003eassembly name\u003c/strong\u003e, culture information, assembly version, and public key token) is DIFFERENT!\u003c/p\u003e\r\n\r\n\u003cp\u003eAnd, importantly, when a class is binary serialized, its Type Information is stored along with it (aside: this \u003cem\u003edoesn't\u003c/em\u003e happen with Xml Serialization, so we probably would have been OK if we'd stuck with that). 
\u003c/p\u003e\r\n\r\n\u003cp\u003eThe upshot: after every recompile (whatever triggered it: web.config change, code change, IIS restart, machine reboot, etc) our Catalog class has different Type info - and when it tries to load the serialized version we saved earlier, it doesn't match and the Framework can't find the assembly where the previous Catalog Type is defined (since it was only Temporary and has been deleted when the recompile took place). \u003c/p\u003e\r\n\r\n\u003ch3\u003eCustom Formatter implementation\u003c/h3\u003e\r\n\r\n\u003cp\u003eSounds complex? It is, kinda, but the whole 'temporary assemblies' thing is something that happens invisibly and most developers don't need to know or care much about it. Thankfully we don't have to worry too much either, because the \u003ca href=\"https://codeproject.org.cn/soap/Serialization_Samples.asp\" target=\"_blank\"\u003eCodeProject article on Serialization\u003c/a\u003e also contains the solution: a helper class that 'tricks' the Binary Deserializer into using the 'current' Catalog type.\u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003e\u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003eclass\u003c/span\u003e CatalogBinder: System.Runtime.Serialization.SerializationBinder\n{\n \u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003eoverride\u003c/span\u003e Type BindToType (\u003cspan class='cs-keyword'\u003estring\u003c/span\u003e assemblyName, \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e typeName) \n { \n \u003cspan class='cs-comment'\u003e// get the 'fully qualified (ie inc namespace) type name' into an \u003c/span\u003e\n \u003cspan class='cs-comment'\u003e// array\u003c/span\u003e\n \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e[] typeInfo = typeName.Split('.');\n \u003cspan class='cs-comment'\u003e// because the last item is the class name, which we're going to \u003c/span\u003e\n \u003cspan 
class='cs-comment'\u003e// 'look for' in *this* namespace/assembly\u003c/span\u003e\n \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e className=typeInfo[typeInfo.Length -\u003cspan class='cs-literal'\u003e1\u003c/span\u003e];\n \u003cspan class='cs-keyword'\u003eif\u003c/span\u003e (className.Equals(\u003cspan class='cpp-string'\u003e\"Catalog\"\u003c/span\u003e))\n {\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e \u003cspan class='cs-keyword'\u003etypeof\u003c/span\u003e (Catalog);\n }\n \u003cspan class='cs-keyword'\u003eelse\u003c/span\u003e \u003cspan class='cs-keyword'\u003eif\u003c/span\u003e (className.Equals(\u003cspan class='cpp-string'\u003e\"Word\"\u003c/span\u003e))\n {\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e \u003cspan class='cs-keyword'\u003etypeof\u003c/span\u003e (Word);\n }\n \u003cspan class='cs-keyword'\u003eif\u003c/span\u003e (className.Equals(\u003cspan class='cpp-string'\u003e\"File\"\u003c/span\u003e))\n {\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e \u003cspan class='cs-keyword'\u003etypeof\u003c/span\u003e (File);\n }\n \u003cspan class='cs-keyword'\u003eelse\u003c/span\u003e\n { \u003cspan class='cs-comment'\u003e// pass back exactly what was passed in!\u003c/span\u003e\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e Type.GetType(\u003cspan class='cs-keyword'\u003estring\u003c/span\u003e.Format( \u003cspan class='cpp-string'\u003e\"{0}, {1}\"\u003c/span\u003e, typeName, \n assemblyName));\n }\n } \n}\n\u003c/pre\u003e\r\n\r\n\u003cp\u003eEt Voila! Now that the Catalog can be saved/loaded, the search engine is much more robust than before. 
You can save/back-up the Catalog, turn on debugging to see its contents, and even generate it on a different machine (say a local PC) and upload it to your web server!\u003c/p\u003e\r\n\r\n\u003cp\u003eUsing the 'debug' Xml serialized files, for the first time I could see the contents of the Catalog, and I found lots of 'garbage' was being stored that was both wasteful in terms of memory/disk and useless/unsearchable. With the major task for this release complete, it seemed appropriate to do some bugfixes and add some \"real search engine\" features to clean up the Catalog's contents.\u003c/p\u003e\r\n\r\n\u003ch2\u003eNew features \u0026amp; bug fixes\u003c/h2\u003e\r\n\r\n\u003ch4\u003eFRAME and IFRAME support\u003c/h4\u003e\r\n\r\n\u003cp\u003eCodeProject member \u003cstrong\u003ele_mo_mo\u003c/strong\u003e pointed out that the spider did not follow (and index) framed content. Fixing this required only a minor change to the regex that finds links - previously \u003ccode\u003eA\u003c/code\u003e and \u003ccode\u003eAREA \u003c/code\u003etags were supported, so it was simple enough to add \u003ccode\u003eFRAME \u003c/code\u003eand \u003ccode\u003eIFRAME \u003c/code\u003eto the pattern. 
\u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003e\u003cspan class='cs-keyword'\u003eforeach\u003c/span\u003e (Match match \u003cspan class='cs-keyword'\u003ein\u003c/span\u003e Regex.Matches(htmlData\n , @\u003cspan class='cpp-string'\u003e\"(?\u0026lt;anchor\u0026gt;\u0026lt;\\s*(a|area|\u003cstrong\u003eframe|iframe\u003c/strong\u003e)\\\"\u003c/span\u003e + \n @\u003cspan class='cpp-string'\u003e\"s*(?:(?:\\b\\w+\\b\\s*(?:=\\s*(?:\"\u003c/span\u003e\u003cspan class='cpp-string'\u003e\"[^\"\u003c/span\u003e\u003cspan class='cpp-string'\u003e\"]*\"\u003c/span\u003e\u003cspan class='cpp-string'\u003e\"|'[^']\"\u003c/span\u003e + \n @\u003cspan class='cpp-string'\u003e\"*'|[^\"\u003c/span\u003e\u003cspan class='cpp-string'\u003e\"'\u0026lt;\u0026gt; ]+)\\s*)?)*)?\\s*\u0026gt;)\"\u003c/span\u003e\n , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture))\n{\u003c/pre\u003e\r\n\r\n\u003ch4\u003eStop words\u003c/h4\u003e\r\n\r\n\u003cp\u003eLet's start with \u003ca href=\"http://www.google.com/support/bin/answer.py?answer=981\"\u003eGoogle's definition of Stop Words\u003c/a\u003e: \u003c/p\u003e\r\n\r\n\u003cblockquote style=\"BACKGROUND-COLOR: #fbedbb\"\u003eGoogle ignores common words and characters, such as \"where\" and \"how,\" as well as certain single digits and single letters. These terms rarely help narrow a search and can slow search results. 
We call them \"stop words.\"\u003c/blockquote\u003e\r\n\r\n\u003cp\u003eThe basic premise is that we don't want to waste space in the catalog storing data that will never be used; the 'Stop Words' assumption is that you'll never search for words like \"a in at I\" because they appear on almost every page, and therefore don't actually help you find anything!\u003c/p\u003e\r\n\r\n\u003cp\u003eHere's a \u003ca href=\"http://libraries.mit.edu/tutorials/general/stopwords.html\" target=\"_blank\"\u003ebasic definition from MIT\u003c/a\u003e and some \u003ca href=\"http://www.tbray.org/ongoing/When/200x/2003/07/11/Stopwords\" target=\"_blank\"\u003einteresting statistics and Stop Word thoughts\u003c/a\u003e including the 'classic' Stop Word conundrum: should users be able to \u003ca href=\"http://www.google.com/search?q=%22to+be+or+not+to+be%22\"\u003esearch for Hamlet's soliloquy \"to be or not to be\"\u003c/a\u003e? \u003c/p\u003e\r\n\r\n\u003cp\u003eThe Stop Word code supplied with Searcharoo3 is pretty basic - it strips out ALL one and two letter words, plus\u003c/p\u003e\r\n\r\n\u003cpre lang=\"text\"\u003ethe, and, that, you, this, for, but, with, are, have, was, out, not\u003c/pre\u003e\r\n\r\n\u003cp\u003eA more complex implementation is left for others to contribute (or a future version, whichever comes first).\u003c/p\u003e\r\n\r\n\u003ch4\u003eWord normalization\u003c/h4\u003e\r\n\r\n\u003cp\u003eI had noticed words were often being stored \u003cem\u003ewith\u003c/em\u003e any punctuation that was adjacent to them in the source text. 
For example, the Catalog contained Files with Word instances for \u003c/p\u003e\r\n\r\n\u003ctable cellspacing=\"0\" width=\"350\" border=\"1\"\u003e\r\n\u003ctbody\u003e\r\n\u003ctr\u003e\r\n\u003ctd\u003e\"People\u003c/td\u003e\r\n\r\n\u003ctd\u003epeople\u003c/td\u003e\r\n\r\n\u003ctd\u003epeople*\u003c/td\u003e\r\n\r\n\u003ctd\u003epeople\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003c/tbody\u003e\r\n\u003c/table\u003e\r\n\r\n\u003cp\u003eThis prevented the pages containing those words from ever being returned in a search, unless the user had typed the exact punctuation as well - in the above example a search for \u003cstrong\u003epeople\u003c/strong\u003e would only return one page, when you would expect it to return all four pages. \u003cbr /\u003eThe previous version of Searcharoo did have a 'black list' of punctuation [,./?;:()-=etc] but that wasn't sufficient as I could not predict/foresee all possible punctuation characters. Also, it was implemented with the Trim() method which was not parsing out punctuation within words [aside: the handling of parenthesised words is still not satisfactory in version 3]. The following 'white list' of characters that are allowed to be indexed ensures that NO punctuation is accidentally stored as part of a word. \u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003ekey = System.Text.RegularExpressions.Regex.Replace(key, @\u003cspan class='cpp-string'\u003e\"[^a-z0-9,.]\"\u003c/span\u003e\n , \u003cspan class='cpp-string'\u003e\"\"\u003c/span\u003e\n , System.Text.RegularExpressions.RegexOptions.IgnoreCase);\n\u003c/pre\u003e\r\n\r\n\u003cp\u003e\u003cstrong style=\"COLOR: red\"\u003eCulture note:\u003c/strong\u003e this \"white list\" method of removing punctuation is VERY English-language centric, as it will remove \u003cem\u003eat least some\u003c/em\u003e characters from most European languages, and it will strip ALL content from most Asian-language content. 
\u003cbr /\u003eIf you want to use Searcharoo with non-English character sets, you should find the above line of code and REPLACE it with this \"black list\" from Version 2. While it allows more characters to be searched, the results are more likely to be polluted by punctuation which could reduce searchability. \u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003ekey = word.Trim\n (' ','?','\\\"',',','\\'',';',':','.','(',')','[',']','%','*','$','-').ToLower();\u003c/pre\u003e\r\n\r\n\u003ch4\u003eNumber normalization\u003c/h4\u003e\r\n\r\n\u003cp\u003eNumbers are a special case of word normalization: some punctuation is required to interpret the number (eg decimal point), then convert it to a proper number.\u003cbr /\u003eAlthough not perfect, this means phone numbers written as 0412-345-678 or (04)123-45678 would both be Catalogued as 0412345678 and therefore \u003cem\u003esearching\u003c/em\u003e for either 0412-345-678 or (04)123-45678 would match both source documents. \u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003e\u003cspan class='cs-keyword'\u003eprivate\u003c/span\u003e \u003cspan class='cs-keyword'\u003ebool\u003c/span\u003e IsNumber (\u003cspan class='cs-keyword'\u003eref\u003c/span\u003e \u003cspan class='cs-keyword'\u003estring\u003c/span\u003e word)\n{\n \u003cspan class='cs-keyword'\u003etry\u003c/span\u003e\n {\n \u003cspan class='cs-keyword'\u003elong\u003c/span\u003e number = Convert.ToInt64(word); \u003cspan class='cs-comment'\u003e//;int.Parse(word);\u003c/span\u003e\n word = number.ToString();\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e (word!=\u003cspan class='cs-clrtype'\u003eString\u003c/span\u003e.Empty);\u003cspan class='cs-comment'\u003e//true;\u003c/span\u003e\n }\n \u003cspan class='cs-keyword'\u003ecatch\u003c/span\u003e\n {\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e \u003cspan class='cs-keyword'\u003efalse\u003c/span\u003e;\n }\n}\n\u003c/pre\u003e\r\n\r\n\u003ch4\u003eGo 
words\u003c/h4\u003e\r\n\r\n\u003cp\u003eAfter reading the \u003cstrong\u003eWord Normalization\u003c/strong\u003e section above you can see how cataloging and searching for a technical term/phrase (like C# or C++) is impossible - the non-alphanumeric characters are filtered out before they have a chance to be catalogued.\u003c/p\u003e\r\n\r\n\u003cp\u003eTo avoid this, Searcharoo allows a 'Go words' list to be created. A 'Go word' is the opposite of a 'Stop word': instead of being blocked from cataloguing, it is given a free-pass into the catalog, bypassing the Normalization and Stemming code.\u003c/p\u003e\r\n\r\n\u003cp\u003eThe weakness in this approach is that you must know ahead of time all the different Go words that your users might search for. In future, you might want to store each unsuccessful search term for later analysis and expansion of your Go word list. The Go word implementation is \u003cem\u003every\u003c/em\u003e simple:\u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003e\u003cspan class='cs-keyword'\u003epublic\u003c/span\u003e \u003cspan class='cs-keyword'\u003ebool\u003c/span\u003e IsGoWord (\u003cspan class='cs-keyword'\u003estring\u003c/span\u003e word)\n{\n \u003cspan class='cs-keyword'\u003eswitch\u003c/span\u003e (word.ToLower())\n {\n \u003cspan class='cs-keyword'\u003ecase\u003c/span\u003e \u003cspan class='cpp-string'\u003e\"c#\"\u003c/span\u003e:\n \u003cspan class='cs-keyword'\u003ecase\u003c/span\u003e \u003cspan class='cpp-string'\u003e\"vb.net\"\u003c/span\u003e:\n \u003cspan class='cs-keyword'\u003ecase\u003c/span\u003e \u003cspan class='cpp-string'\u003e\"asp.net\"\u003c/span\u003e:\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e \u003cspan class='cs-keyword'\u003etrue\u003c/span\u003e;\n \u003cspan class='cs-keyword'\u003ebreak\u003c/span\u003e;\n }\n \u003cspan class='cs-keyword'\u003ereturn\u003c/span\u003e \u003cspan 
class='cs-keyword'\u003efalse\u003c/span\u003e;\n}\n\u003c/pre\u003e\r\n\r\n\u003ch4\u003eStemming\u003c/h4\u003e\r\n\r\n\u003cp\u003eThe most basic explanation of 'stemming' is that it attempts to identify 'related' words and return them in response to a query. The simplest example is plurals: searching for \"field\" should also find instances of \"fields\" and vice versa. More complex examples are \"realize\" and \"realization\", or \"populate\" and \"population\".\u003c/p\u003e\r\n\r\n\u003cp\u003eThis page on \u003ca href=\"http://www.infotoday.com/searcher/may01/liddy.htm\" target=\"_blank\"\u003eHow a Search Engine Works\u003c/a\u003e contains a brief explanation of Stemming and some of the other techniques described above.\u003c/p\u003e\r\n\r\n\u003cp\u003eThe \u003ca href=\"http://www.tartarus.org/martin/PorterStemmer/\" target=\"_blank\"\u003ePorter Stemming Algorithm\u003c/a\u003e already existed as a \u003ca href=\"http://www.tartarus.org/martin/PorterStemmer/csharp2.txt\" target=\"_blank\"\u003eC# class\u003c/a\u003e, so it was used 'as is' in Searcharoo3 (credit and thanks to \u003ca href=\"http://tartarus.org/~martin/index.html\" target=\"_blank\"\u003eMartin Porter\u003c/a\u003e).\u003c/p\u003e\r\n\r\n\u003ch2\u003eEffect on Catalog size\u003c/h2\u003e\r\n\r\n\u003cp\u003eThe Stop Words, Stemming, and Normalization steps above were all developed to 'tidy up' the Catalog and hopefully reduce its size and increase search speed. 
The results are listed below for our \u003ca href=\"http://www.cia.gov/cia/publications/factbook/\" target=\"_blank\"\u003eCIA World Factbook\u003c/a\u003e:\u003c/p\u003e\r\n\r\n\u003ctable id=\"Table1\" cellspacing=\"0\" border=\"1\"\u003e\r\n\u003ctbody\u003e\r\n\u003ctr\u003e\r\n\u003cth width=\"20%\"\u003esource:\u003cbr /\u003e800 files\u003cbr /\u003e52.6 Mb\u003c/th\u003e\r\n\r\n\u003cth width=\"20%\"\u003eRaw *\u003c/th\u003e\r\n\r\n\u003cth width=\"20%\"\u003e+ Stop words\u003c/th\u003e\r\n\r\n\u003cth width=\"20%\"\u003e+ Stemming \u003c/th\u003e\r\n\r\n\u003cth width=\"20%\"\u003e+'white list'\u003cbr /\u003enormalization\u003c/th\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003ctd\u003eUnique Words\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e30,415\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e30,068\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e26,560\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e26,050\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003ctd\u003eXml Serialized\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e156 Mb ^\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e149 Mb\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e138 Mb\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e136 Mb\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003ctd\u003eBinary Serialized\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e4.6 Mb\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e4.5 Mb\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e4.1 Mb\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e4.0 Mb\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\u003ctr\u003e\r\n\u003ctd\u003eBinary % of source\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e8.75%\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e8.55%\u003c/td\u003e\r\n\r\n\u003ctd align=\"center\"\u003e7.79%\u003c/td\u003e\r\n\r\n\u003ctd 
align=\"center\"\u003e7.60%\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003c/tbody\u003e\r\n\u003c/table\u003e\r\n\r\n\u003cp\u003e\u003cem\u003e* black list normalization, which is commented out in the code, and mentioned in the 'culture note'\u003c/em\u003e \u003cbr /\u003e\u003cem\u003e^ 49 Mb after 'compressing' the Xml output with [Attributes]\u003c/em\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003eThe result was a 14% reduction in the number of words and a 13% decrease in Binary file size (mostly due to the addition of Stemming). Because the whole Catalog stays in memory (in the Application Cache), keeping the size small is important - maybe a future version will be able to persist some 'working copy' of the data to disk and enable spidering of \u003cem\u003ereally large\u003c/em\u003e sites, but for now the catalog seems to take less than 10% of the source data size.\u003c/p\u003e\r\n\r\n\u003ch2\u003e...but what about the UI?\u003c/h2\u003e\r\n\r\n\u003cp\u003eThe search user interface also had some improvements:\u003c/p\u003e\r\n\r\n\u003cul\u003e\r\n\u003cli\u003eMoving the search inputs into the Searcharoo.ascx User Control \u003c/li\u003e\r\n\r\n\u003cli\u003eAdding the same Stemming, Stop and Go word parsing to the search term that is applied during spidering \u003c/li\u003e\r\n\r\n\u003cli\u003eGenerating the result list using the new ResultFile class to construct a DataSource to bind to a Repeater control \u003c/li\u003e\r\n\r\n\u003cli\u003eAdding PagedDataSource and custom paging links rather than one long list of results (thanks to \u003ca href=\"https://codeproject.org.cn/aspnet/spideroo.asp#xx927327xx\"\u003eJim Harkins' feedback/code\u003c/a\u003e and \u003ca href=\"http://www.uberasp.net/ArticlePrint.aspx?id=29\"\u003euberasp.net\u003c/a\u003e) \u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003ch4\u003eResultFile and SortedList\u003c/h4\u003e\r\n\r\n\u003cp\u003eIn version 2, outputting the results was very crude: the code was littered with 
\u003ccode\u003eResponse.Write\u003c/code\u003e calls, making it difficult to reformat the output. \u003ca href=\"https://codeproject.org.cn/script/profile/whos_who.asp?id=1125453\"\u003eJim Harkins\u003c/a\u003e posted some Visual Basic code which is converted to C# below. \u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003e\u003cspan class='cs-comment'\u003e// build each result row\u003c/span\u003e\n\u003cspan class='cs-keyword'\u003eforeach\u003c/span\u003e (\u003cspan class='cs-keyword'\u003eobject\u003c/span\u003e foundInFile \u003cspan class='cs-keyword'\u003ein\u003c/span\u003e finalResultsArray.Keys)\n{\n \u003cspan class='cs-comment'\u003e// Create a ResultFile with its own Rank\u003c/span\u003e\n infile = \u003cspan class='cs-keyword'\u003enew\u003c/span\u003e ResultFile ((File)foundInFile);\n infile.Rank = (\u003cspan class='cs-keyword'\u003eint\u003c/span\u003e)((DictionaryEntry)finalResultsArray[foundInFile]).Value;\n sortrank = infile.Rank * -\u003cspan class='cs-literal'\u003e1000\u003c/span\u003e; \u003cspan class='cs-comment'\u003e// Assume not 'thousands' of results\u003c/span\u003e\n \u003cspan class='cs-keyword'\u003eif\u003c/span\u003e (output.Contains(sortrank) )\n { \u003cspan class='cs-comment'\u003e// rank exists - drop key index one number until it fits\u003c/span\u003e\n \u003cspan class='cs-keyword'\u003efor\u003c/span\u003e (\u003cspan class='cs-keyword'\u003eint\u003c/span\u003e i = \u003cspan class='cs-literal'\u003e1\u003c/span\u003e; i \u0026lt; \u003cspan class='cs-literal'\u003e999\u003c/span\u003e; i++)\n {\n sortrank++;\n \u003cspan class='cs-keyword'\u003eif\u003c/span\u003e (!output.Contains (sortrank))\n {\n output.Add (sortrank, infile);\n \u003cspan class='cs-keyword'\u003ebreak\u003c/span\u003e;\n }\n }\n } \u003cspan class='cs-keyword'\u003eelse\u003c/span\u003e {\n output.Add(sortrank, infile);\n }\n sortrank = \u003cspan class='cs-literal'\u003e0\u003c/span\u003e; \u003cspan class='cs-comment'\u003e// reset for next 
pass\u003c/span\u003e\n}\n\u003c/pre\u003e\r\n\r\n\u003cp\u003eJim's code does some trickery with a new 'sortrank' variable to try and keep the files in 'Searcharoo rank' order, but with unique keys in the output \u003ccode\u003eSortedList\u003c/code\u003e. If \u003cem\u003ethousands\u003c/em\u003e of results were returned, you might run into trouble...\u003c/p\u003e\r\n\r\n\u003ch4\u003ePagedDataSource\u003c/h4\u003e\r\n\r\n\u003cp\u003eOnce the results are in the SortedList, its values are assigned to a \u003ccode\u003ePagedDataSource\u003c/code\u003e, which is then bound to a Repeater control on Searcharoo3.aspx. \u003c/p\u003e\r\n\r\n\u003cpre lang=\"cs\"\u003eSortedList output = \n \u003cspan class='cs-keyword'\u003enew\u003c/span\u003e SortedList (finalResultsArray.Count); \u003cspan class='cs-comment'\u003e// empty sorted list\u003c/span\u003e\n...\npg.DataSource = output.GetValueList();\npg.AllowPaging = \u003cspan class='cs-keyword'\u003etrue\u003c/span\u003e;\npg.PageSize = Preferences.ResultsPerPage; \u003cspan class='cs-comment'\u003e// defaults to 10\u003c/span\u003e\npg.CurrentPageIndex = Request.QueryString[\u003cspan class='cpp-string'\u003e\"page\"\u003c/span\u003e]==\u003cspan class='cs-keyword'\u003enull\u003c/span\u003e?\u003cspan class='cs-literal'\u003e0\u003c/span\u003e:\n Convert.ToInt32(Request.QueryString[\u003cspan class='cpp-string'\u003e\"page\"\u003c/span\u003e])-\u003cspan class='cs-literal'\u003e1\u003c/span\u003e;\n\nSearchResults.DataSource = pg;\nSearchResults.DataBind();\n\u003c/pre\u003e\r\n\r\n\u003cp\u003eThis makes it a LOT easier to reformat the results list however you like!\u003c/p\u003e\r\n\r\n\u003cpre lang=\"html\"\u003e\u0026lt;asp:Repeater id=\"SearchResults\" runat=\"server\"\u0026gt;\r\n\u0026lt;HeaderTemplate\u0026gt;\r\n \u0026lt;p\u0026gt;\u0026lt;%=NumberOfMatches%\u0026gt; results for \u0026lt;%=Matches%\u0026gt; took \r\n 
\u0026lt;%=DisplayTime%\u0026gt;\u0026lt;/p\u0026gt;\r\n\u0026lt;/HeaderTemplate\u0026gt;\r\n\u0026lt;ItemTemplate\u0026gt;\r\n \u0026lt;a href=\"\u0026lt;%# DataBinder.Eval(Container.DataItem, \"Url\") %\u0026gt;\"\u0026gt;\u0026lt;b\u0026gt;\r\n \u0026lt;%# DataBinder.Eval(Container.DataItem, \"Title\") %\u0026gt;\u0026lt;/b\u0026gt;\u0026lt;/a\u0026gt;\r\n \u0026lt;a href=\"\u0026lt;%# DataBinder.Eval(Container.DataItem, \"Url\") %\u0026gt;\" \r\n target=\"_blank\" title=\"open in new window\" \r\n style=\"font-size:x-small\"\u0026gt;\u0026#8593;\u0026lt;/a\u0026gt;\r\n \u0026lt;font color=gray\u0026gt;(\u0026lt;%# DataBinder.Eval(Container.DataItem, \"Rank\") %\u0026gt;)\r\n \u0026lt;/font\u0026gt;\r\n \u0026lt;br\u0026gt;\u0026lt;%# DataBinder.Eval(Container.DataItem, \"Description\") %\u0026gt;...\r\n \u0026lt;br\u0026gt;\u0026lt;font color=green\u0026gt;\u0026lt;%# DataBinder.Eval(Container.DataItem, \"Url\") %\u0026gt;\r\n - \u0026lt;%# DataBinder.Eval(Container.DataItem, \"Size\") %\u0026gt;\r\n bytes\u0026lt;/font\u0026gt;\r\n \u0026lt;font color=gray\u0026gt;- \r\n \u0026lt;%# DataBinder.Eval(Container.DataItem, \"CrawledDate\") %\u0026gt;\u0026lt;/font\u0026gt;\u0026lt;p\u0026gt;\r\n\u0026lt;/ItemTemplate\u0026gt;\r\n\u0026lt;FooterTemplate\u0026gt;\r\n \u0026lt;p\u0026gt;\u0026lt;%=CreatePagerLinks(pg, Request.Url.ToString() )%\u0026gt;\u0026lt;/p\u0026gt;\r\n\u0026lt;/FooterTemplate\u0026gt;\r\n\u0026lt;/asp:Repeater\u0026gt;\r\n\u003c/pre\u003e\r\n\r\n\u003cp\u003eUnfortunately the page links are generated via embedded \u003ccode\u003eResponse.Write\u003c/code\u003e calls in \u003ccode\u003eCreatePagerLinks\u003c/code\u003e... 
maybe this will be templated in a future version...\u003c/p\u003e\r\n\r\n\u003ch2\u003eThe Future...\u003c/h2\u003e\r\n\r\n\u003cp\u003eIf you check the dates below, you'll notice there was almost one and a half years between version 2 and 3, so it might sound optimistic to discuss another 'future' version - but you never know...\u003c/p\u003e\r\n\r\n\u003cp\u003eUnfortunately many of the new features above are English-language specific (although they can be disabled to ensure Searcharoo can still be used on websites in other languages). However, in a future version I'd like to make the code a little more intelligent about handling European, Asian and other languages.\u003c/p\u003e\r\n\r\n\u003cp\u003eIt would also be nice if the user could type boolean OR searches, or group terms with quotes \" \" like Google, Yahoo, etc.\u003c/p\u003e\r\n\r\n\u003cp\u003eAnd finally, indexing of document types besides HTML (mainly other web-types like PDF) would be useful for many sites.\u003c/p\u003e\r\n\u003ca name=\"aspnet2\"\u003e\u003c/a\u003e\r\n\u003ch2\u003eASP.NET 2.0\u003c/h2\u003e\r\n\r\n\u003cp\u003eSearcharoo3 runs on ASP.NET 2.0 pretty much unmodified - just remove \u003ccode lang=\"cs\"\u003esrc=\u003cspan class='cpp-string'\u003e\"Searcharoo.cs\"\u003c/span\u003e\u003c/code\u003e from the @Page attribute, and move the \u003cem\u003eSearcharoo.cs\u003c/em\u003e file into the \u003cem\u003eApp_Code\u003c/em\u003e directory.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003ca title=\"Click to enlarge\" href=\"Searcharoo_3/ClassDiagram.png\" target=\"_blank\"\u003e\u003cimg height=\"566\" alt=\"Visual Studio 2005 Class Diagram - click for larger view\" src=\"https://cloudfront.codeproject.com/ip/searcharoo_3/classdiagram_600x566.png\" width=\"600\" border=\"0\" /\u003e\u003c/a\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003eVisual Studio.NET internal web server warning: the Searcharoo_VirtualRoot setting (where the spider starts looking for pages to index) defaults to 
\u003cem\u003ehttps:///.\u003c/em\u003e VS.NET's internal web server chooses a random port to run on, so if you're using it to test Searcharoo, you may need to set this web.config value accordingly.\u003c/p\u003e\r\n\r\n\u003ch2\u003eHistory\u003c/h2\u003e\r\n\r\n\u003cul\u003e\r\n\u003cli\u003e2004-06-30: \u003ca href=\"Searcharoo.asp\" target=\"_blank\"\u003eVersion 1\u003c/a\u003e on CodeProject \u003c/li\u003e\r\n\r\n\u003cli\u003e2004-07-03: \u003ca href=\"Spideroo.asp\" target=\"_blank\"\u003eVersion 2\u003c/a\u003e on CodeProject \u003c/li\u003e\r\n\r\n\u003cli\u003e2006-05-24: Version 3 (this page) on CodeProject \u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n"])</script><script>self.__next_f.push([1,"9:[\"$\",\"div\",null,{\"className\":\"custom-container flex px-2 gap-8 overflow-hidden\",\"children\":[[\"$\",\"div\",null,{\"className\":\"w-[948px] max-w-full\",\"children\":[[\"$\",\"div\",null,{\"className\":\"flex gap-2 mb-2 flex-wrap\",\"children\":[[\"$\",\"a\",\"Visual Studio .NET 2002\",{\"rel\":\"tag\",\"href\":\"/Tags/VS.NET2002\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"Visual Studio .NET 2002\"}],[\"$\",\"a\",\".NET 1.0\",{\"rel\":\"tag\",\"href\":\"/Tags/.NET1.0\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\".NET 1.0\"}],[\"$\",\"a\",\"Visual Studio .NET 2003\",{\"rel\":\"tag\",\"href\":\"/Tags/VS.NET2003\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"Visual Studio .NET 2003\"}],[\"$\",\"a\",\"Windows 2003\",{\"rel\":\"tag\",\"href\":\"/Tags/Win2003\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline 
hover:bg-yellow-light\",\"children\":\"Windows 2003\"}],[\"$\",\"a\",\"WebForms\",{\"rel\":\"tag\",\"href\":\"/Tags/WebForms\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"WebForms\"}],[\"$\",\"a\",\".NET 1.1\",{\"rel\":\"tag\",\"href\":\"/Tags/.NET1.1\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\".NET 1.1\"}],[\"$\",\"a\",\"Windows 2000\",{\"rel\":\"tag\",\"href\":\"/Tags/Win2K\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"Windows 2000\"}],[\"$\",\"a\",\"Windows XP\",{\"rel\":\"tag\",\"href\":\"/Tags/WinXP\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"Windows XP\"}],[\"$\",\"a\",\"Intermediate\",{\"rel\":\"tag\",\"href\":\"/Tags/Intermediate\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"Intermediate\"}],[\"$\",\"a\",\"Dev\",{\"rel\":\"tag\",\"href\":\"/Tags/Dev\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"Dev\"}],[\"$\",\"a\",\"Visual Studio\",{\"rel\":\"tag\",\"href\":\"/Tags/Visual-Studio\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"Visual Studio\"}],[\"$\",\"a\",\"Windows\",{\"rel\":\"tag\",\"href\":\"/Tags/Windows\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline 
hover:bg-yellow-light\",\"children\":\"Windows\"}],[\"$\",\"a\",\".NET\",{\"rel\":\"tag\",\"href\":\"/Tags/.NET\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\".NET\"}],[\"$\",\"a\",\"ASP.NET\",{\"rel\":\"tag\",\"href\":\"/Tags/ASP.NET\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"ASP.NET\"}],[\"$\",\"a\",\"C#\",{\"rel\":\"tag\",\"href\":\"/Tags/C#\",\"className\":\"px-2 border border-border-yellow rounded-l-full text-gray-light visited:text-gray-light hover:underline hover:bg-yellow-light\",\"children\":\"C#\"}]]}],[\"$\",\"h1\",null,{\"className\":\"text-4xl text-text-gray\",\"children\":\"Adding features to a C# search engine/web spider\"}],[\"$\",\"div\",null,{\"className\":\"max-w-[600px]\",\"children\":[[\"$\",\"div\",null,{\"className\":\"flex pt-4 justify-between\",\"children\":[[\"$\",\"div\",null,{\"className\":\"text-base hover:underline font-bold\",\"children\":[\"$\",\"$L3\",null,{\"href\":\"/script/Membership/View.aspx?mid=38339\",\"className\":\"text-gray-lightest\",\"children\":\"craigd\"}]}],[\"$\",\"div\",null,{\"className\":\"flex items-center 
gap-2\",\"children\":[[\"$\",\"div\",null,{\"className\":\"flex\",\"children\":[[\"$\",\"$L4\",\"0\",{\"src\":\"/icons/starIcon.png\",\"alt\":\"starIcon\",\"width\":24,\"height\":24}],[\"$\",\"$L4\",\"1\",{\"src\":\"/icons/starIcon.png\",\"alt\":\"starIcon\",\"width\":24,\"height\":24}],[\"$\",\"$L4\",\"2\",{\"src\":\"/icons/starIcon.png\",\"alt\":\"starIcon\",\"width\":24,\"height\":24}],[\"$\",\"$L4\",\"3\",{\"src\":\"/icons/starIcon.png\",\"alt\":\"starIcon\",\"width\":24,\"height\":24}],[\"$\",\"div\",\"4\",{\"className\":\"relative\",\"children\":[[\"$\",\"$L4\",null,{\"src\":\"/icons/emptyStarIcon.png\",\"alt\":\"emptyStarIcon\",\"width\":24,\"height\":24}],[\"$\",\"div\",null,{\"className\":\"absolute left-0 top-0 overflow-hidden\",\"style\":{\"maxWidth\":\"74.00000000000003%\"},\"children\":[\"$\",\"$L4\",null,{\"src\":\"/icons/starIcon.png\",\"alt\":\"starIcon\",\"width\":24,\"height\":24}]}]]}]]}],[\"$\",\"p\",null,{\"className\":\"text-sm\",\"children\":[\"4.74\",\"/5 (\",22,\" vote\",\"s\",\")\"]}]]}]]}],[\"$\",\"div\",null,{\"className\":\"flex pt-2 gap-4\",\"children\":[[\"$\",\"p\",null,{\"className\":\"text-gray-lightest text-[13px]\",\"children\":\"May 24, 2006\"}],[\"$\",\"$L3\",null,{\"href\":\"https://codeproject.org.cn/info/cpol10.aspx\",\"className\":\"text-[13px] flex\",\"children\":\"CPOL\"}],[\"$\",\"p\",null,{\"className\":\"text-gray-lightest text-[13px]\",\"children\":[17,\" min read\"]}],[\"$\",\"div\",null,{\"className\":\"flex gap-1 items-center\",\"children\":[[\"$\",\"$L4\",null,{\"src\":\"/icons/viewsIcon.png\",\"alt\":\"viewsIcon\",\"width\":16,\"height\":16,\"className\":\"w-4 h-4\"}],[\"$\",\"p\",null,{\"className\":\"text-gray-lightest text-[13px] pl-0.5\",\"children\":269519}]]}],[\"$\",\"div\",null,{\"className\":\"flex gap-1 items-center\",\"children\":[[\"$\",\"$L4\",null,{\"src\":\"/icons/downloadIcon.png\",\"alt\":\"downloadIcon\",\"width\":16,\"height\":16,\"className\":\"w-4 
h-4\"}],[\"$\",\"p\",null,{\"className\":\"text-gray-lightest text-[13px] pl-0.5\",\"children\":3595}]]}]]}]]}],[\"$\",\"p\",null,{\"className\":\"pt-6 pb-4 text-gray-lightest\",\"children\":\"Adding advanced search-engine features (and persistent catalog) to Searcharoo project\"}],[\"$\",\"$L17\",null,{\"css\":\"$18\",\"children\":[\"$\",\"$L19\",null,{\"children\":[\"$\",\"div\",null,{\"className\":\"article\",\"dangerouslySetInnerHTML\":{\"__html\":\"$1a\"}}]}]}]]}],[\"$\",\"div\",null,{\"children\":[\"$\",\"$L5\",null,{\"adUnit\":\"/67541884/CPLargeSky160600\",\"adSlotId\":\"div-gpt-ad-1738591766860-0\",\"adSize\":[300,600],\"className\":\"mt-40 sticky top-40\"}]}]]}]\n"])</script></body></html>