Lucene Website Crawler and Indexer

stlane

January 31, 2009

CPOL

A Java Lucene website crawler and indexer

Download source code - 5.35 MB

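Introduction

This project uses the Java Lucene indexing library to build a compact yet powerful web crawling and indexing solution. Many powerful open source Internet and enterprise search solutions are built on Lucene, such as Solr and Nutch. These projects are excellent, but they can be overkill for simpler projects.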

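Background

This demo was inspired by a CodeProject article: the .NET searcharoo search engine created by craigd. He built a web search engine that searches an entire website by recursively crawling the links found on the target site's home page. This JSearchEngine Lucene project differs from searcharoo in that it uses the Lucene indexer rather than the custom indexer used in searcharoo. Another difference between the two projects is that searcharoo has a function that uses Windows document iFilters to parse non-HTML pages. If there is enough interest, I may extend this project to use the document filters from the Nutch web crawler to index PDF and Microsoft Office files.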

Using the Code

The solution consists of two projects, one named JSearchEngine and one named JSP, both created with NetBeans IDE version 6.5.

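Indexer/Crawler

The JSearchEngine project is the heart of the operation. The home page of the website to be crawled and indexed is hard-coded in the main method. Because this is a command-line application, the code can easily be modified to take the home page as a command-line argument. The crawler's main control function is listed after the following description of how it works: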

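1. The indexDocs function is called with the first page as its parameter.
2. The URL of the first page is used to build a Lucene Document object. A document object is made up of field and value pairs, for example the <title> tag as the field and its actual content as the value. All of this is handled by the document object's constructor.
3. Once the document is built, Lucene adds it to the index. How Lucene does this is beyond the scope of this article and is covered in the Lucene documentation.
4. After the document is indexed, the links in the document are parsed into an array (URL[] in the listing), and indexDocs is then called recursively on the string form of each link. The HTMLParser library from htmlparser.sourceforge.net is used for the parsing.
5. Only links whose URL contains the domain name of the original page are followed, which stops the crawler from following external links and trying to crawl the entire Internet!
6. The indexer excludes zip files because it cannot index them.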
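
The filtering rules at the end of the list above can also be isolated into one small, testable predicate. The following is only a sketch under stated assumptions, not the article's code: the class name LinkFilter and its methods are hypothetical, and it deliberately compares the URL's host with the home domain instead of doing a substring match on the whole URL, so that a host such as example.com.evil.test is not mistaken for example.com.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the JSearchEngine project.
final class LinkFilter {
    private final String beginDomain;                     // home domain, e.g. "example.com"
    private final Set<String> indexed = new HashSet<>();  // URLs already visited

    LinkFilter(String beginDomain) {
        this.beginDomain = beginDomain.toLowerCase(Locale.ROOT);
    }

    // Remember that a URL has been indexed so it is not crawled twice.
    void markIndexed(String url) {
        indexed.add(url);
    }

    // Returns true only for unvisited, same-site links without a query string or .zip suffix.
    boolean shouldCrawl(String url) {
        if (indexed.contains(url)) return false;                           // already indexed
        if (url.contains("?")) return false;                               // skip query strings
        if (url.toLowerCase(Locale.ROOT).endsWith(".zip")) return false;   // skip zip files
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false;                                                  // not a parsable URL
        }
        if (host == null) return false;                                    // relative or opaque URL
        host = host.toLowerCase(Locale.ROOT);
        // accept the home domain itself and any of its subdomains
        return host.equals(beginDomain) || host.endsWith("." + beginDomain);
    }
}
```

A crawler loop would call shouldCrawl for every extracted link and markIndexed for every page it adds to the index.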

    // indexed, writer, and beginDomain are static fields defined elsewhere in the class
    private static void indexDocs(String url) throws Exception {

        // index the page itself
        Document doc = HTMLDocument.Document(url);
        System.out.println("adding " + doc.get("path"));
        try {
            indexed.add(doc.get("path"));
            writer.addDocument(doc);          // add docs unconditionally
            //TODO: only add HTML docs
            //and create other doc types

            // get all links on the page, then index them
            LinkParser lp = new LinkParser(url);
            URL[] links = lp.ExtractLinks();

            for (URL l : links) {
                // convert once instead of repeating the conversion in every test
                String link = l.toURI().toString();
                // follow a link only if it has not been indexed yet,
                // contains the home domain, has no query string ("?"),
                // and is not a zip file
                if (!indexed.contains(link)
                        && link.contains(beginDomain)
                        && !link.contains("?")
                        && !link.endsWith(".zip")) {
                    System.out.print(link);
                    indexDocs(link);
                }
            }

        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
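
As noted earlier, the hard-coded home page could instead come from the command line. The sketch below shows one way that might look; it is an assumption, not the project's code. The class CrawlConfig, the method fromArgs, and the fallback URL are hypothetical, and deriving beginDomain from the URL's host is a guess at how that field could be populated.

```java
import java.net.URI;

// Hypothetical helper, not part of the JSearchEngine project.
final class CrawlConfig {
    final String startUrl;     // page where crawling begins
    final String beginDomain;  // host that every followed link must contain

    private CrawlConfig(String startUrl, String beginDomain) {
        this.startUrl = startUrl;
        this.beginDomain = beginDomain;
    }

    // Uses args[0] as the home page when present, otherwise the supplied default.
    static CrawlConfig fromArgs(String[] args, String defaultUrl) {
        String url = (args.length > 0 && !args[0].trim().isEmpty())
                ? args[0].trim()
                : defaultUrl;
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("Not an absolute URL with a host: " + url);
        }
        return new CrawlConfig(url, host);
    }

    public static void main(String[] args) {
        // "http://www.example.com/" is a placeholder default, not the article's target site
        CrawlConfig config = fromArgs(args, "http://www.example.com/");
        System.out.println("start=" + config.startUrl + " domain=" + config.beginDomain);
    }
}
```

The crawler's main method could then pass config.startUrl to indexDocs and use config.beginDomain for the domain check.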

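JSP Search Client

Once the target website has been fully indexed, the index can be queried; more websites can also be added to the index before querying. Because the index is Lucene-based, it can be queried with any compatible Lucene library, such as the Java or .NET implementations. This demo uses the Java implementation. The JSP project is a set of JavaServer Pages that run the search and display the results. To run this web application, deploy the compiled .war file to a J2EE-compatible server such as GlassFish or Tomcat. The markup below is the entry point of the web application; it accepts the search terms and passes them to the results.jsp page, which queries the index and displays the results.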

    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Search Page</title>
    </head>
    <body>
        <form name="search" action="results.jsp" method="get">
            <p>
                <input name="query" size="44"/> Search Criteria
            </p>
            <p>
                <input name="maxresults" size="4" value="100"/> Results Per Page
                <input type="submit" value="Search"/>
            </p>
        </form>
    </body>

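The following is the main Java code from the results page. The variables are initialized, and the query parameter passed from the search page is read, so that a Lucene index searcher can be built and queried: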

String indexName = "/opt/lucene/index";   // location of the Lucene index
IndexSearcher searcher = null;
Query query = null;
Hits hits = null;
int startindex = 0;                        // index of the first hit shown on this page
int maxpage = 50;
String queryString = null;
String startVal = null;
String maxresults = null;
int thispage = 0;                          // number of hits shown on this page

searcher = new IndexSearcher(indexName);
queryString = request.getParameter("query");   // search terms from the form
Analyzer analyzer = new StandardAnalyzer();
// parse the search terms against the "contents" field
QueryParser qp = new QueryParser("contents", analyzer);
query = qp.parse(queryString);

hits = searcher.search(query);
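
The listing above passes request parameters straight to the parser, so a missing query or a non-numeric maxresults value would raise an exception. A defensive way to read such parameters might look like the sketch below. It is an illustration only: the class SearchParams and both of its methods are hypothetical names, and the limits shown in the usage note are arbitrary choices.

```java
// Hypothetical helper, not part of the JSP project.
final class SearchParams {

    // Parses an integer parameter, falling back when it is missing or malformed,
    // and clamps the result to the range [min, max].
    static int parseBounded(String raw, int fallback, int min, int max) {
        if (raw == null) {
            return fallback;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return Math.max(min, Math.min(max, value));
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Returns the trimmed query, or null when there is nothing to search for.
    static String cleanQuery(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }
}
```

In results.jsp this could be used as SearchParams.parseBounded(request.getParameter("maxresults"), 50, 1, 100), with the page skipping the search entirely when cleanQuery returns null.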

Once the hits object has been populated with the search results, the page loops over the hits and displays them as HTML.

    for (int i = startindex; i < (thispage + startindex); i++) {  // for each element
%>
    <tr>
        <%
        Document doc = hits.doc(i);          //get the next document
        String doctitle = doc.get("title");  //get its title
        String url = doc.get("path");        //get its path field
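
The loop above relies on startindex and thispage describing a window that stays inside the result set; otherwise hits.doc(i) would be asked for a document that does not exist. The excerpt does not show how those values are computed, so the following sketch is an assumption about one safe way to do it. PageWindow and its members are hypothetical names.

```java
// Hypothetical helper, not part of the JSP project.
final class PageWindow {
    final int start;  // index of the first hit on the page (startindex in the listing)
    final int count;  // number of hits on the page (thispage in the listing)

    private PageWindow(int start, int count) {
        this.start = start;
        this.count = count;
    }

    // Clamps the requested window so it never runs past the available hits.
    static PageWindow of(int totalHits, int startIndex, int maxPerPage) {
        int start = Math.max(0, Math.min(startIndex, totalHits));
        int count = Math.max(0, Math.min(maxPerPage, totalHits - start));
        return new PageWindow(start, count);
    }

    // Loop bound, equivalent to (thispage + startindex) in the listing.
    int endExclusive() {
        return start + count;
    }
}
```

With 120 hits, a start index of 100, and a page size of 50, the window holds the last 20 hits, so a loop from start to endExclusive() stops at 120.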

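Points of Interest

Because there are two separate projects, they can be mixed and matched with other Lucene-compatible programming environments; for example, the JSP project could easily be modified to query an index created by Lucene.Net.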

History

January 31, 2009: Initial release
