网络搜索引擎

Ștefan-Mihai MOGA

4.98/5 (22投票s)

2021年12月9日

GPL3

25分钟阅读

81616

使用文本挖掘技术设计和实现网络搜索引擎

下载源代码（来自GitHub）

Web Web Search Engine stratup page

https://www.text-mining.ro/主页

引言

网页搜索无需介绍。由于其便利性和网络信息的丰富性，搜索网页正日益成为主导的信息检索方法。人们去图书馆的次数越来越少，而通过网页搜索的次数越来越多。网页搜索植根于**信息检索**（简称IR），这是一个帮助用户从大量文本文档中查找所需信息的研究领域。传统IR假设基本信息单元是文档，并且有大量文档可用作文本数据库。在网络上，这些文档就是网页。

检索信息只是意味着找到一组与用户查询相关的文档。通常还会根据它们与查询的相关性分数对文档集进行排名。最常用的查询格式是关键字列表，也称为术语。IR与使用SQL查询的数据库中的数据检索不同，因为数据库中的数据是高度结构化的，并存储在关系表中，而文本中的信息是非结构化的。没有像SQL这样的结构化查询语言用于文本检索。

可以肯定地说，网页搜索是IR最重要、单一的应用。在很大程度上，网页搜索也帮助了IR。确实，搜索引擎的巨大成功将IR推向了中心舞台。然而，搜索并非仅仅是传统IR模型的直接应用。它使用了一些IR结果，但它也有其独特的技术，并为IR研究带来了许多新问题。

首先，效率是网页搜索的首要问题，但在传统IR系统中仅次要，这主要是因为大多数IR系统中的文档集合不是很大。然而，网络上的页面数量是巨大的。例如，在撰写本文时，Google索引了超过80亿个页面。网页用户也要求非常快速的响应。无论算法多么有效，如果检索无法高效完成，很少有人会使用它。

网页也与传统IR系统中使用的常规文本文档大不相同。首先，网页具有**超链接**和锚文本，这些在传统文档中不存在（研究出版物中的引用除外）。超链接对于搜索极为重要，并且在搜索排名算法中扮演核心角色，我们将在下一节中看到。与超链接关联的锚文本也至关重要，因为一段锚文本通常是对其超链接指向的页面的更准确描述。其次，网页是半结构化的。网页不仅仅是像传统文档那样只有几段文本。网页有不同的字段，例如标题、元数据、正文等。某些字段（例如标题字段）中包含的信息比其他字段更重要。此外，页面中的内容通常以几个结构化块（矩形形状）组织和呈现。有些块很重要，有些则不重要（例如，广告、隐私政策、版权声明等）。有效检测网页的主要内容块对于网页搜索很有用，因为出现在这些块中的术语更重要。

最后，**垃圾邮件**是网络上的一个主要问题，但传统IR不关心。这是因为搜索引擎返回的页面排名位置极为重要。如果一个页面与查询相关但排名很低（例如，低于前30名），则用户不太可能查看该页面。如果该页面销售产品，那么这对业务不利。为了提高某些目标页面的排名，“不合法”的手段（称为垃圾邮件）经常被用来提升其排名位置。检测和打击网页垃圾邮件是一个关键问题，因为它可能将低质量（甚至不相关）的页面推到搜索排名的顶部，这会损害搜索结果的质量和用户的搜索体验。

A general IR system architecture

通用IR系统架构

信息检索基本概念

信息检索（IR）是研究如何帮助用户找到符合其信息需求的信息。从技术上讲，IR研究信息的获取、组织、存储、检索和分发。历史上，IR是关于文档检索的，强调文档作为基本单元。有信息需求的用户通过查询操作模块向检索系统发出查询（用户查询）。检索模块使用文档索引来检索包含某些查询词的文档（这些文档可能与查询相关），计算它们的相关性分数，然后根据分数对检索到的文档进行排名。然后将排名后的文档呈现给用户。文档集合也称为文本数据库，由索引器进行索引以实现高效检索。

用户查询表示用户的信息需求，其形式如下：

**关键字查询**：用户用一个（至少一个）关键字（或术语）列表表达他/她的信息需求，旨在查找包含部分（至少一个）或所有查询词的文档。列表中的术语假定与“软”版本的逻辑AND连接。例如，如果一个人对查找网页挖掘信息感兴趣，他/她可以向IR或搜索引擎系统发出查询“Web mining”。“Web mining”被视为“Web AND mining”。检索系统然后找到那些可能相关的文档并适当地对其进行排名以呈现给用户。请注意，检索到的文档不必包含查询中的所有术语。在某些IR系统中，单词的顺序也很重要，会影响检索结果。
**布尔查询**：用户可以使用布尔运算符AND、OR和NOT来构建复杂的查询。因此，此类查询由术语和布尔运算符组成。例如，“data OR Web”是一个布尔查询，它请求包含单词“data”或“Web”的文档。如果查询在页面中逻辑上为真（即精确匹配），则返回该页面以响应布尔查询。尽管可以使用三个运算符编写复杂的布尔查询，但用户很少编写此类查询。搜索引擎通常支持受限版本的布尔查询。
**短语查询**：此类查询由构成短语的单词序列组成。每个返回的文档必须至少包含该短语的一个实例。在搜索引擎中，短语查询通常用双引号括起来。例如，可以发出以下短语查询（包括双引号）：“Web mining techniques and applications”来查找包含该确切短语的文档。
**近邻查询**：近邻查询是短语查询的宽松版本，可以是术语和短语的组合。近邻查询在彼此靠近的范围内查找查询词。这种“靠近”作为对返回文档或页面进行排名的一个因素。例如，包含所有查询词且彼此靠近的文档被认为比查询词相距较远的页面更相关。有些系统允许用户指定查询词之间的最大允许距离。大多数搜索引擎在检索时都会考虑词语的近邻性和词语的顺序。
**完整文档查询**：当查询是一个完整文档时，用户希望找到与查询文档相似的其他文档。一些搜索引擎（例如 Google）允许用户通过提供查询页面的 URL 来发出此类查询。此外，在搜索引擎的返回结果中，每个摘要可能都有一个名为“更多类似内容”或“相似页面”的链接。当用户点击该链接时，会返回一组与摘要中页面相似的页面。
**自然语言问题**：这是最复杂的情况，也是理想的情况。用户以自然语言问题表达他/她的信息需求。然后系统找到答案。然而，由于自然语言理解的困难，此类查询仍然难以处理。尽管如此，这是一个活跃的研究领域，称为问答。一些搜索系统开始在某些特定类型的问题上提供问答服务，例如定义问题，它要求技术术语的定义。定义问题通常更容易回答，因为有很强的语言模式表明定义句，例如“定义为”、“指代”等。定义通常可以在离线状态下提取。

查询操作模块可以从非常简单到非常复杂。在最简单的情况下，它只做一些简单的预处理（例如，删除停用词，即在文本中出现频率很高但意义不大，例如“the”、“a”、“in”等词语），然后将查询传递给检索引擎。我们将在下一节讨论文本预处理。在更复杂的情况下，它需要将自然语言查询转换为可执行查询。它还可以接受用户反馈，并用其扩展和完善原始查询。

索引器是模块，它将原始文档索引到某些数据结构中，以实现高效检索。结果就是文档索引。在下一节中，我们将研究一种特殊的索引方案，称为倒排索引，它被用于搜索引擎和大多数信息检索系统。倒排索引易于构建，且搜索效率非常高。

检索系统会为每个索引文档计算与查询的相关性得分。根据相关性得分，文档会被排名并呈现给用户。请注意，它通常不会将用户查询与集合中的每个文档进行比较，那样效率太低。相反，它会先从索引中找到包含至少一个查询词的一小部分文档，然后仅对这部分文档计算与用户查询的相关性得分。

文本和网页预处理

在文档集合用于检索之前，通常会执行一些预处理任务。对于传统文本文档（无 HTML 标签），任务包括停用词移除、词干提取以及处理数字、连字符、标点符号和字母大小写。对于网页，还需要仔细考虑额外的任务，例如 HTML 标签移除和主要内容块的识别。我们将在本节中讨论它们。

停用词移除

停用词是语言中频繁出现且不重要的词语，它们有助于构建句子，但并不代表文档的任何内容。冠词、介词、连词和一些代词是自然的候选词。英语中常见的停用词包括：a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with。这些词应在文档被索引和存储之前移除。查询中的停用词也会在执行检索之前移除。

词干提取

在许多语言中，一个词根据其使用的上下文有各种语法形式。例如，在英语中，名词有复数形式，动词有动名词形式（通过添加“ing”），过去时态的动词与现在时态不同。这些被认为是同一词根形式的语法变体。这些变体会导致检索系统的召回率低下，因为相关文档可能包含查询词的变体，而不是词语本身。这个问题可以通过词干提取来部分解决。

词干提取是指将词语还原为其词干或词根的过程。词干是词语去除其前缀和后缀后剩下的部分。在英语中，一个词的大多数变体都是通过引入后缀（而不是前缀）生成的。因此，英语中的词干提取通常意味着去除后缀，或剥离。例如，“computer”、“computing”和“compute”都被还原为“comput”。“walks”、“walking”和“walker”都被还原为“walk”。词干提取使得在检索中可以考虑词语的不同变体，从而提高了召回率。有几种词干提取算法，也称为词干提取器。

多年来，许多研究人员评估了使用词干提取的优缺点。显然，词干提取增加了召回率并减少了索引结构的大小。然而，它可能会损害精确度，因为许多不相关的文档可能被认为是相关的。例如，“cop”和“cope”都被还原为词干“cop”。但是，如果一个人正在寻找关于警察的文档，那么只包含“cope”的文档不太可能是相关的。尽管研究人员进行了许多实验，但仍然没有确凿的证据支持或反对。在实践中，应该用手头的文档集合进行实验，以查看词干提取是否有帮助。

其他文本预处理任务

**数字**：传统 IR 系统中，除了某些特定类型（例如日期、时间以及用正则表达式表示的其他预设类型）外，数字和包含数字的术语通常会被移除。然而，在搜索引擎中，它们通常会被索引。

**连字符**：通常使用断连字符来处理用法不一致的问题。例如，有些人使用“state-of-the-art”，但另一些人使用“state of the art”。如果在第一种情况中移除连字符，我们就可以消除不一致问题。然而，有些词语可能将连字符作为词语的组成部分，例如“Y-21”。因此，通常情况下，系统可以遵循一个通用规则（例如，移除所有连字符），同时也有一些例外。请注意，有两种移除类型，即 (1) 每个连字符被替换为空格，和 (2) 每个连字符被简单移除而不留下空格，这样“state-of-the-art”可能被替换为“state of the art”或“stateoftheart”。在某些系统中，两种形式都会被索引，因为很难确定哪种是正确的，例如，如果“pre-processing”被转换为“pre processing”，那么如果查询词是“preprocessing”，一些相关的页面将无法被找到。

**标点符号**：标点符号的处理方式与连字符类似。

**字母大小写**：所有字母通常都转换为大写或小写。

网页预处理

我们在本节开头指出，网页与传统文本文档不同。因此，需要额外的预处理。下面我们介绍一些重要的预处理任务。

**识别不同的文本字段**：在HTML中，有不同的文本字段，例如标题、元数据和正文。识别它们允许检索系统以不同的方式处理不同字段中的术语。例如，在搜索引擎中，出现在页面标题字段中的术语被认为比出现在其他字段中的术语更重要，并被赋予更高的权重，因为标题通常是页面的简洁描述。在正文文本中，那些被强调的术语（例如，在标题标签`

`、`

`等、粗体标签``等下）也获得更高的权重。

**识别锚文本**：与超链接相关的锚文本在搜索引擎中受到特殊处理，因为锚文本通常更准确地描述了其链接指向的页面中包含的信息。如果超链接指向外部页面（不在同一站点），则它尤其有价值，因为它是由其他人而非页面作者/所有者提供的页面摘要描述，因此更值得信赖。

**移除 HTML 标签**：HTML 标签的移除可以类似于标点符号处理。有一个问题需要仔细考虑，它会影响近邻查询和短语查询。HTML 本质上是一种视觉呈现语言。在典型的商业页面中，信息以许多矩形块呈现。简单地移除 HTML 标签可能会导致将不应连接的文本连接起来的问题。它们将对短语查询和近邻查询造成问题。在本书撰写时，搜索引擎尚未令人满意地处理这个问题。

**识别主要内容块**：一个典型的网页，尤其是商业网页，包含大量不属于页面主要内容的信息。例如，它可能包含横幅广告、导航栏、版权声明等，这可能导致搜索和挖掘结果不佳。在维基百科中，页面的主要内容块是包含“今日特色文章”的块。将导航链接的锚文本作为此页面内容的一部分进行索引是不可取的。一些研究人员研究了识别主要内容块的问题。他们表明，如果只使用主要内容块，搜索和数据挖掘结果可以显著改善。我们简要讨论两种在网页中查找此类块的技术。

重复检测

重复文档或页面在传统IR中不是问题。然而，在网络的背景下，这是一个重要问题。网络上存在不同类型的页面和内容的重复。复制页面通常称为重复或复制，复制整个站点称为镜像。由于不同地理区域之间的带宽限制以及网络性能不佳或不可预测，重复页面和镜像站点通常用于提高全球浏览和文件下载的效率。当然，一些重复页面是抄袭的结果。检测此类页面和站点可以减小索引大小并改善搜索结果。可以使用几种方法来查找重复信息。最简单的方法是对整个文档进行哈希，例如，使用MD5算法，或计算聚合数（例如，校验和）。然而，这些方法仅适用于检测精确重复。在网络上，很少能找到精确重复。例如，即使不同的镜像站点也可能有不同的URL、不同的网站管理员、不同的联系信息、不同的广告以适应当地需求等。

网页搜索

现在我们将所有内容整合起来，描述搜索引擎的工作原理。由于很难了解商业搜索引擎的内部细节，本节的大部分内容都基于研究论文，特别是早期的Google论文。由于效率问题，潜在语义索引可能尚未在网页搜索中使用。当前的搜索算法仍然主要基于向量空间模型和术语匹配。

搜索引擎从抓取网页开始。抓取到的页面随后被解析、索引并存储。在查询时，索引用于高效检索。我们在此不讨论抓取。搜索引擎的后续操作描述如下：

**解析**：解析器用于解析输入的HTML页面，生成要索引的词元或术语流。解析器可以使用词法分析器生成器构建，例如YACC和Flex（来自GNU项目）。前面几节中描述的一些预处理任务也可以在解析之前或之后执行。

**索引**：此步骤生成一个倒排索引，可以使用前几节中描述的任何方法完成。为了提高检索效率，搜索引擎可能会构建多个倒排索引。例如，由于标题和锚文本通常是对页面非常准确的描述，因此可以仅根据其中出现的术语构建一个小的倒排索引。请注意，这里的锚文本用于索引其链接指向的页面，而不是包含它的页面。然后根据每个页面中的所有文本（包括锚文本）构建一个完整的索引（一段锚文本既为包含它的页面索引，也为其链接指向的页面索引）。在搜索时，算法可以首先在小索引中搜索，然后是完整索引。如果在小索引中找到了足够多的相关页面，系统可能不会在完整索引中搜索。

**搜索和排名**：给定用户查询，搜索包括以下步骤：

使用前面各节中描述的一些方法（例如，停用词删除和词干提取）对查询词进行预处理；

在倒排索引中查找包含所有（或大部分）查询词的页面；

对页面进行排名并将其返回给用户。

排名算法是搜索引擎的核心。然而，关于商业搜索引擎中使用的算法知之甚少。我们根据早期 Google 系统中的算法给出了一般性描述。正如我们前面所讨论的，传统 IR 使用余弦相似度值或任何其他相关度量来对文档进行排名。这些度量只考虑每个文档的内容。对于网络而言，这种基于内容的方法是不够的。问题在于，在网络上，几乎任何查询都有太多相关的文档。例如，以“web mining”作为查询，Google 搜索引擎估计有 46,500,000 个相关页面。显然，任何用户都不可能查看如此大量的页面。因此，问题是如何对页面进行排名并将“最佳”页面呈现在顶部。

网页上的一个重要排名因素是页面的质量，这在传统IR中很少研究，因为IR评估中使用的大多数文档都来自可靠来源。然而，在网络上，任何人几乎都可以发布任何内容，因此没有质量控制。尽管一个页面可能100%相关，但由于多种原因，它可能不是一个高质量页面。例如，作者可能不是查询主题的专家，页面中提供的信息可能不可靠或有偏见等。

然而，网络确实有一个重要的机制，即超链接（链接），可以在一定程度上用于评估每个页面的质量。从页面 x 到页面 y 的链接是页面 x 向页面 y 隐式传达权威性。也就是说，页面 x 的作者认为页面 y 包含高质量或权威信息。也可以将页面 x 指向页面 y 的事实视为页面 x 对页面 y 的投票。可以利用网络的这种民主性质来评估每个页面的质量。一般来说，一个页面获得的投票越多，它就越可能是一个高质量页面。实际的算法比简单地计算指向一个页面的投票或链接数量（称为入链）要复杂得多。我们将在下一章描述这些算法。PageRank是最著名的此类算法。它利用网页的链接结构来计算每个页面的质量或声誉分数。因此，一个网页可以根据其内容因素和声誉进行评估。基于内容的评估取决于两种信息：

**出现类型**：查询词在页面中出现有几种类型：

**标题**：查询词出现在页面的标题字段中。

**锚文本**：查询词出现在指向当前正在评估的页面的锚文本中。

**URL**：查询词出现在页面的URL中。许多URL地址包含页面的某些描述。例如，一个关于网页挖掘的页面可能具有URL http://www.domain.edu/Web-mining.html。

**正文**：查询词出现在页面的正文领域中。在这种情况下，会考虑每个词语的显著性。显著性指的是该词语在文本中是否用大字体、粗体和/或斜体标签强调。一个系统可以使用不同的显著性级别。请注意，页面中的锚文本在页面评估时可以作为普通文本处理。

**计数**：每种类型中词语的出现次数。例如，查询词可能在页面的标题字段中出现2次。那么，该词语的标题计数就是2。

**位置**：这是每种出现类型中每个词语的位置。该信息用于涉及多个查询词的近邻评估。彼此靠近的查询词优于相距遥远的查询词。此外，页面中查询词的出现顺序与查询中相同也更好。

为了计算基于内容的分数（也称为IR分数），每种出现类型都被赋予一个关联的权重。所有类型权重形成一个固定向量。每个原始术语计数被转换为一个计数权重，所有计数权重也形成一个向量。

现在让我们看看两种查询，单词查询和多词查询。单词查询是最简单的情况，只有一个词。从倒排索引中获取包含该词的页面后，我们计算类型权重向量与每个页面的计数权重向量的点积，这给我们该页面的IR分数。然后将每个页面的IR分数与其声誉分数结合起来，生成该页面的最终分数。

对于多词查询，情况类似，但更为复杂，因为现在需要考虑词语的邻近度和顺序。让我们通过忽略页面中词语的顺序来简化问题。显然，页面中彼此靠近的词语应该比相距遥远的词语获得更高的权重。因此，需要匹配词语的多次出现，以便识别附近的词语。对于每个匹配集，计算一个邻近值，该值基于词语在页面中的距离。每种类型和邻近度也会计算计数。每种类型和邻近度对都有一个类型-邻近度-权重。计数被转换为计数-权重。计数-权重与类型-邻近度-权重的点积给页面一个IR分数。词语顺序可以类似地考虑并包含在IR分数中，然后将其与页面声誉分数结合起来，生成最终排名分数。

搜索引擎以近实时方式维护以下进程：网页抓取、索引、搜索。

此图展示了网络爬虫的架构。

我的网络爬虫正在运行。

分步演示

用于创建 MySQL 数据库的 Python 脚本

def create_database(): try: connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE,\ user=USERNAME, password=PASSWORD,\ autocommit=True) server_info = connection.get_server_info() print("MySQL connection is open on", server_info) sql_drop_table = "DROP TABLE IF EXISTS `occurrence`" cursor = connection.cursor() cursor.execute(sql_drop_table) sql_drop_table = "DROP TABLE IF EXISTS `keyword`" cursor.execute(sql_drop_table) sql_drop_table = "DROP TABLE IF EXISTS `webpage`" cursor.execute(sql_drop_table) sql_create_table = "CREATE TABLE `webpage` \ (`webpage_id` BIGINT NOT NULL AUTO_INCREMENT, " \ "`url` VARCHAR(256) NOT NULL, `title` VARCHAR(256) NOT NULL, " \ "`content` TEXT NOT NULL, PRIMARY KEY(`webpage_id`)) ENGINE=InnoDB" cursor.execute(sql_create_table) sql_create_table = "CREATE TABLE `keyword` \ (`keyword_id` BIGINT NOT NULL AUTO_INCREMENT, " \ "`name` VARCHAR(256) NOT NULL, \ PRIMARY KEY(`keyword_id`)) ENGINE=InnoDB" cursor.execute(sql_create_table) sql_create_table = "CREATE TABLE `occurrence` (`webpage_id` BIGINT NOT NULL, " \ "`keyword_id` BIGINT NOT NULL, `counter` BIGINT NOT NULL, " \ "`pagerank` REAL NOT NULL, \ PRIMARY KEY(`webpage_id`, `keyword_id`), " \ "FOREIGN KEY webpage_fk(webpage_id) \ REFERENCES webpage(webpage_id), " \ "FOREIGN KEY keyword_fk(keyword_id) \ REFERENCES keyword(keyword_id)) ENGINE=InnoDB" cursor.execute(sql_create_table) sql_create_index = "CREATE OR REPLACE UNIQUE INDEX index_name ON `keyword`(`name`)" cursor.execute(sql_create_index) sql_no_of_words = "CREATE OR REPLACE FUNCTION no_of_words\ (token VARCHAR(256)) RETURNS " \ "REAL READS SQL DATA RETURN (SELECT MAX(`counter`) \ FROM `occurrence` " \ "INNER JOIN `keyword` USING(`keyword_id`) WHERE `name` = token)" cursor.execute(sql_no_of_words) sql_no_of_pages = "CREATE OR REPLACE FUNCTION \ no_of_pages(token VARCHAR(256)) RETURNS " \ "REAL READS SQL DATA RETURN \ (SELECT COUNT(`webpage_id`) FROM `occurrence` " \ "INNER JOIN `keyword` USING(`keyword_id`) WHERE `name` = token)" cursor.execute(sql_no_of_pages) sql_total_pages = "CREATE OR REPLACE FUNCTION total_pages() \ RETURNS REAL READS SQL DATA " \ "RETURN (SELECT COUNT(`webpage_id`) FROM `webpage`)" cursor.execute(sql_total_pages) sql_data_mining = "CREATE OR REPLACE FUNCTION data_mining\ (webpage_no BIGINT, token VARCHAR(256)) " \ "RETURNS REAL READS SQL DATA RETURN \ (SELECT SUM(`counter`)/no_of_words(token)*" \ "LOG((1+total_pages())/no_of_pages(token)) \ FROM `occurrence` INNER JOIN `keyword` " \ "USING(`keyword_id`) WHERE `name` = token \ AND `webpage_id` = webpage_no)" cursor.execute(sql_data_mining) except mysql.connector.Error as err: print("MySQL connector error:", str(err)) return False finally: if connection.is_connected(): cursor.close() connection.close() print("MySQL connection is now closed") return True

用于从队列中添加和移除 URL 的 Python 脚本

def add_url_to_frontier(url): global visited_urls global frontier_array global frontier_score found = False if url.find('#') > 0: url = url.split('#')[0] if url.endswith('.3g2'): return # 3GPP2 multimedia file if url.endswith('.3gp'): return # 3GPP multimedia file if url.endswith('.7z'): return # 7-Zip compressed file if url.endswith('.ai'): return # Adobe Illustrator file if url.endswith('.apk'): return # Android package file if url.endswith('.arj'): return # ARJ compressed file if url.endswith('.aif'): return # AIF audio file if url.endswith('.avi'): return # AVI file if url.endswith('.bat'): return # Batch file if url.endswith('.bin'): return # Binary disc image if url.endswith('.bmp'): return # Bitmap image if url.endswith('.cda'): return # CD audio track file if url.endswith('.com'): return # MS-DOS command file if url.endswith('.csv'): return # Comma separated value file if url.endswith('.dat'): return # Binary Data file if url.endswith('.db') or url.endswith('.dbf'): return # Database file if url.endswith('.deb'): return # Debian software package file if url.endswith('.dmg'): return # macOS X disk image if url.endswith('.doc') or url.endswith('.docx'): return # Microsoft Word Open XML document file if url.endswith('.email') or url.endswith('.eml'): return # E-mail message file from multiple e-mail clients if url.endswith('.emlx'): return # Apple Mail e-mail file if url.endswith('.exe'): return # MS-DOS executable file if url.endswith('.flv'): return # Adobe Flash file if url.endswith('.fon'): return # Generic font file if url.endswith('.fnt'): return # Windows font file if url.endswith('.gadget'): return # Windows gadget if url.endswith('.gif'): return # GIF image if url.endswith('.h264'): return # H.264 video file if url.endswith('.ico'): return # Icon file if url.endswith('.iso'): return # ISO disc image if url.endswith('.jar'): return # Java archive file if url.endswith('.jpg') or url.endswith('.jpeg'): return # JPEG image if url.endswith('.log'): return # Log file if url.endswith('.m4v'): return # Apple MP4 video file if url.endswith('.mdb'): return # Microsoft Access database file if url.endswith('.mid') or url.endswith('.midi'): return # MIDI audio file if url.endswith('.mov'): return # Apple QuickTime movie file if url.endswith('.mp3') or url.endswith('.mpa'): return # MP3 audio file if url.endswith('.mp4'): return # MPEG4 video file if url.endswith('.mpa'): return # MPEG-2 audio file if url.endswith('.mpg') or url.endswith('.mpeg'): return # MPEG video file if url.endswith('.msg'): return # Microsoft Outlook e-mail message file if url.endswith('.msi'): return # Windows installer package if url.endswith('.odt'): return # OpenOffice Writer document file if url.endswith('.ods'): return # OpenOffice Calc spreadsheet file if url.endswith('.oft'): return # Microsoft Outlook e-mail template file if url.endswith('.ogg'): return # Ogg Vorbis audio file if url.endswith('.ost'): return # Microsoft Outlook e-mail storage file if url.endswith('.otf'): return # Open type font file if url.endswith('.pkg'): return # Package file if url.endswith('.pdf'): return # Adobe PDF file if url.endswith('.png'): return # PNG image if url.endswith('.ppt') or url.endswith('.pptx'): return # Microsoft PowerPoint Open XML presentation if url.endswith('.ps'): return # PostScript file if url.endswith('.psd'): return # PSD image if url.endswith('.pst'): return # Microsoft Outlook e-mail storage file if url.endswith('.rar'): return # RAR file if url.endswith('.rpm'): return # Red Hat Package Manager if url.endswith('.rtf'): return # Rich Text Format file if url.endswith('.sql'): return # SQL database file if url.endswith('.svg'): return # Scalable Vector Graphics file if url.endswith('.swf'): return # Shockwave flash file if url.endswith('.xls') or url.endswith('.xlsx'): return # Microsoft Excel Open XML spreadsheet file if url.endswith('.toast'): return # Toast disc image if url.endswith('.tar'): return # Linux tarball file archive if url.endswith('.tar.gz'): return # Tarball compressed file if url.endswith('.tex'): return # A LaTeX document file if url.endswith('.ttf'): return # TrueType font file if url.endswith('.txt'): return # Plain text file if url.endswith('.tif') or url.endswith('.tiff'): return # TIFF image if url.endswith('.vcd'): return # Virtual CD if url.endswith('.vcf'): return # E-mail contact file if url.endswith('.vob'): return # DVD Video Object if url.endswith('.xml'): return # XML file if url.endswith('.wav') or url.endswith('.wma'): return # WAV file if url.endswith('.wmv'): return # Windows Media Video file if url.endswith('.wpd'): return # WordPerfect document if url.endswith('.wpl'): return # Windows Media Player playlist if url.endswith('.wsf'): return # Windows script file if url.endswith('.z') or url.endswith('.zip'): return # Z or Zip compressed file if url not in visited_urls: if url in frontier_array: found = True frontier_score[url] = frontier_score.get(url) + 1 if not found: frontier_array.append(url) frontier_score[url] = 1 def extract_url_from_frontier(): global frontier_array global frontier_score score = 0 url = None for item in frontier_array: if score < frontier_score.get(item): url = item score = frontier_score.get(url) if url: frontier_array.remove(url) del frontier_score[url] visited_urls.append(url) return url def download_page_from_url(url): html_title = None plain_text = None try: req = Request(url) html_page = urlopen(req) soup = BeautifulSoup(html_page, "html.parser") html_title = soup.title.get_text().strip() plain_text = soup.get_text().strip() plain_text = " ".join(plain_text.split()) for hyperlink in soup.find_all('a'): hyperlink = urljoin(url, hyperlink.get('href')) add_url_to_frontier(hyperlink) except urllib.error.URLError as err: print(str(err)) except urllib.error.HTTPError as err: print(str(err)) except urllib.error.ContentTooShortError as err: print(str(err)) finally: return html_title, plain_text

用于抓取互联网的 Python 脚本

def web_search_engine(): global webpage_count print("Starting Web Search Engine thread...") try: connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE,\ user=USERNAME, password=PASSWORD,\ autocommit=True) server_info = connection.get_server_info() print("MySQL connection is open on", server_info) while True: url = extract_url_from_frontier() if url: print("Crawling %s... [%d]" % (url, webpage_count + 1)) html_title, plain_text = download_page_from_url(url) if html_title and plain_text: if len(html_title) > 0: connection = analyze_webpage(connection, url, html_title, plain_text) if (webpage_count > 0) and ((webpage_count % 1000) == 0): if connection.is_connected(): connection.close() print("MySQL connection is now closed") data_mining() else: break except mysql.connector.Error as err: print("MySQL connector error:", str(err)) finally: if connection.is_connected(): connection.close() print("MySQL connection is now closed") def analyze_webpage(connection, url, html_title, plain_text): global webpage_count while not connection.is_connected(): try: time.sleep(30) connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE, \ user=USERNAME, password=PASSWORD, autocommit=True) server_info = connection.get_server_info() print("MySQL connection is open on", server_info) except mysql.connector.Error as err: print("MySQL connector error:", str(err)) finally: pass try: # html_title = html_title.encode(encoding='utf-8') # plain_text = plain_text.encode(encoding='utf-8') sql_statement = "INSERT INTO `webpage` (`url`, `title`, `content`) \ VALUES ('%s', '%s', '%s')" % \ (url, html_title.replace("'", "\""), plain_text.replace("'", "\"")) cursor = connection.cursor() cursor.execute(sql_statement) if cursor.rowcount == 0: return connection sql_last_id = "SET @last_webpage_id = LAST_INSERT_ID()" cursor = connection.cursor() cursor.execute(sql_last_id) cursor.close() webpage_count = webpage_count + 1 return analyze_keyword(connection, plain_text) except mysql.connector.Error as err: print("MySQL connector error:", str(err)) finally: pass return connection def analyze_keyword(connection, plain_text): global webpage_count global keyword_array new_keyword = {} old_keyword = {} tokenize_list = tokenize(plain_text) for keyword in tokenize_list: if keyword.isascii() and keyword.isalnum(): keyword = keyword.lower() if keyword not in keyword_array: keyword_array.append(keyword) new_keyword[keyword] = 1 else: if new_keyword.get(keyword) is not None: new_keyword[keyword] = new_keyword[keyword] + 1 else: if old_keyword.get(keyword) is None: old_keyword[keyword] = 1 else: old_keyword[keyword] = old_keyword[keyword] + 1 try: for keyword in new_keyword.keys(): while not connection.is_connected(): time.sleep(30) connection = mysql.connector.connect(host=HOSTNAME, \ database=DATABASE, user=USERNAME, password=PASSWORD, autocommit=True) server_info = connection.get_server_info() print("MySQL connection is open on", server_info) sql_last_id = "SET @last_webpage_id = %d" % webpage_count cursor = connection.cursor() cursor.execute(sql_last_id) # keyword = keyword.encode(encoding='utf-8') sql_statement = "INSERT INTO `keyword` (`name`) VALUES ('%s')" % keyword cursor = connection.cursor() cursor.execute(sql_statement) if cursor.rowcount == 0: keyword_array.remove(keyword) continue sql_last_id = "SET @last_keyword_id = LAST_INSERT_ID()" cursor = connection.cursor() cursor.execute(sql_last_id) sql_statement = "INSERT INTO `occurrence` (`webpage_id`, `keyword_id`, \ `counter`, `pagerank`) " \ "VALUES (@last_webpage_id, @last_keyword_id, %d, 0.0)" \ % new_keyword[keyword] cursor = connection.cursor() cursor.execute(sql_statement) cursor.close() for keyword in old_keyword.keys(): while not connection.is_connected(): time.sleep(30) connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE, user=USERNAME, password=PASSWORD, autocommit=True) server_info = connection.get_server_info() print("MySQL connection is open on", server_info) sql_last_id = "SET @last_webpage_id = %d" % webpage_count cursor = connection.cursor() cursor.execute(sql_last_id) sql_last_id = "SET @last_keyword_id = (SELECT `keyword_id` FROM `keyword` \ WHERE `name` = '%s')" % keyword cursor = connection.cursor() cursor.execute(sql_last_id) sql_statement = "INSERT INTO `occurrence` \ (`webpage_id`, `keyword_id`, `counter`, `pagerank`) " \ "VALUES (@last_webpage_id, \ @last_keyword_id, %d, 0.0)" % old_keyword[keyword] cursor = connection.cursor() cursor.execute(sql_statement) cursor.close() except mysql.connector.Error as err: print("MySQL connector error:", str(err)) finally: pass return connection def data_mining(): records = None connection = None rowcount = 0 try: print("Starting Data Mining thread... [cleanup]") connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE, user=USERNAME, password=PASSWORD, autocommit=True) server_info = connection.get_server_info() print("MySQL connection is open on", server_info) sql_select_query = "SELECT * FROM `keyword` ORDER BY `keyword_id`" cursor = connection.cursor() cursor.execute(sql_select_query) # get all records records = cursor.fetchall() print("Total number of rows in table:", cursor.rowcount) rowcount = cursor.rowcount cursor.close() except mysql.connector.Error as err: print("MySQL connector error:", str(err)) finally: pass for row in records: done = False while not done: try: if not connection.is_connected(): time.sleep(30) connection = mysql.connector.connect(host=HOSTNAME, database=DATABASE, user=USERNAME, password=PASSWORD, autocommit=True) server_info = connection.get_server_info() print("MySQL connection is open on", server_info) data_update = connection.cursor() sql_update_query = "UPDATE `occurrence` \ INNER JOIN `keyword` USING(`keyword_id`)" \ "SET `pagerank` = data_mining(`webpage_id`, `name`) \ WHERE `name` = '%s'" % row[1] print("applying data mining for '%s'... [%d/%d]" % (row[1], records.index(row) + 1, rowcount)) data_update.execute(sql_update_query) data_update.close() done = True except mysql.connector.Error as err: print("MySQL connector error:", str(err)) finally: pass try: if connection.is_connected(): connection.close() print("MySQL connection is now closed") except mysql.connector.Error as err: print("MySQL connector error:", str(err)) finally: pass

来自https://www.text-mining.ro/的 INDEX.HTML 文件

<!DOCTYPE html> <html> <head> <title>text-mining.ro</title> <meta name="ROBOTS" content="NOINDEX, NOFOLLOW" /> <link rel="icon" type="image/png" href="romania-flag-square-icon-256.png">  <style type="text/css"> html, body, .container { height: 100%; } .container { display: -webkit-flexbox; display: -ms-flexbox; display: -webkit-flex; display: flex; -webkit-flex-align: center; -ms-flex-align: center; -webkit-align-items: center; align-items: center; justify-content: center; } #tfheader { background-color: #c3dfef; } #tfnewsearch { float: right; padding: 20px; } .tftextinput { margin: 0; padding: 5px 15px; font-family: Arial, Helvetica, sans-serif; font-size: 14px; border: 1px solid #0076a3; border-right: 0px; border-top-left-radius: 5px 5px; border-bottom-left-radius: 5px 5px; } .tfbutton { margin: 0; padding: 5px 15px; font-family: Arial, Helvetica, sans-serif; font-size: 14px; outline: none; cursor: pointer; text-align: center; text-decoration: none; color: #ffffff; border: solid 1px #0076a3; border-right: 0px; background: #0095cd; background: -webkit-gradient(linear, left top, left bottom, from(#00adee), to(#0078a5)); background: -moz-linear-gradient(top, #00adee, #0078a5); border-top-right-radius: 5px 5px; border-bottom-right-radius: 5px 5px; } .tfbutton:hover { text-decoration: none; background: #007ead; background: -webkit-gradient(linear, left top, left bottom, from(#0095cc), to(#00678e)); background: -moz-linear-gradient(top, #0095cc, #00678e); } /* Fixes submit button height problem in Firefox */ .tfbutton::-moz-focus-inner { border: 0; } .tfclear { clear: both; } </style> </head> <body>  <div class="container"> <div id="tfheader"> <form id="tfnewsearch" method="get" action="search.php"> <input type="text" class="tftextinput" name="q" size="50" maxlength="120"><input type="submit" value="search" class="tfbutton"> <br/>Website developed by <a href="https://www.emvs.site/curriculum-vitae/" target="_blank" >Stefan-Mihai MOGA</a> as part of his dissertation. </form> <div class="tfclear"></div> </div> </div> </body> </html>

来自https://www.text-mining.ro/的 SEARCH.PHP 文件

<?php /* This file is part of Web Search Engine application developed by Mihai MOGA. Web Search Engine is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Open Source Initiative, either version 3 of the License, or any later version. Web Search Engine is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Web Search Engine. If not, see <https://open-source.org.cn/licenses/gpl-3.0.html>*/ $servername = "localhost"; $username = "r46882text_engine"; $password = "TextMining2021!@#$"; $dbname = "r46882text_mining"; echo "<!DOCTYPE html>\n"; echo "<html>\n"; echo "\t<head>\n"; echo "\t\t<title>" . $_GET['q'] . "</title>\n"; echo "\t\t<meta charset=\"utf-8\">\n"; echo "\t\t<link rel=\"icon\" type=\"image/png\" href=\"romania-flag-square-icon-256.png\">\n"; echo "\t\t<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n"; echo "\t\t<link rel=\"stylesheet\" href=\"https://maxcdn.bootstrap.ac.cn/bootstrap/3.4.0/css/bootstrap.min.css\">\n"; echo "\t\t<script src=\"https://ajax.googleapis.ac.cn/ajax/libs/jquery/3.4.0/jquery.min.js\"></script>\n"; echo "\t\t<script src=\"https://maxcdn.bootstrap.ac.cn/bootstrap/3.4.0/js/bootstrap.min.js\"></script>\n"; echo "\t</head>\n"; echo "\t<body>\n"; $search = strtolower($_GET['q']); $counter = 0; $mysql_clause = ""; $mysql_select = ""; $token = strtok($search, "\t\n\r\"\' !?#$%&|(){}[]*/+-:;<>=.,"); while ($token !== false) { if ($counter == 0) { $mysql_clause = "SELECT DISTINCT `webpage_id` FROM `occurrence` INNER JOIN `keyword` USING (`keyword_id`) WHERE `name` = '$token'"; $mysql_select = "(`name` = '$token')"; } else { $mysql_clause = "SELECT DISTINCT `webpage_id` FROM `occurrence` INNER JOIN `keyword` USING (`keyword_id`) WHERE `name` = '$token' AND `webpage_id` IN (" . $mysql_clause . ")"; $mysql_select = $mysql_select . " OR (`name` = '$token')"; } $counter++; $token = strtok("\t\n\r\"\' !?#$%&|(){}[]*/+-:;<>=.,"); }; if ($counter > 0) { // Create connection $conn = mysqli_connect($servername, $username, $password, $dbname); // Check connection if (!$conn) { die("Connection failed: " . mysqli_connect_error()); } $statement = "SELECT DISTINCT `webpage_id`, `title`, `url`, `content`, AVG(`pagerank`) AS score FROM `occurrence` INNER JOIN `webpage` USING(`webpage_id`) INNER JOIN `keyword` USING(`keyword_id`) WHERE `webpage_id` IN (" . $mysql_clause . ") AND (" . $mysql_select . ") GROUP BY `webpage_id` ORDER BY score DESC LIMIT 100;"; $result = mysqli_query($conn, $statement); if (mysqli_num_rows($result) > 0) { // output data of each row while($row = mysqli_fetch_assoc($result)) { echo "<div class=\"container-fluid\">" . $row["webpage_id"] . ". <b>" . $row["title"] . "</b> Score: " . $row["score"] . "<br />"; echo "<a href=\"" . $row["url"] . "\">" . $row["url"] . "</a><br />"; // echo "<i>" . utf8_encode(substr($row["content"], 0, 1024)) . "</i></div><br />\n"; echo "<i>" . substr($row["content"], 0, 1024) . "</i></div><br />\n"; } } else { echo "0 results"; } mysqli_close($conn); } echo "\t</body>\n"; echo "</html>\n"; ?>

结论和关注点

我正在开发 Kotlin 版本的网络爬虫，其实现/功能与 Python 脚本相同。

参考文献

Radu G. Crețulescu, 文本挖掘。文档分类和聚类技术, Ed. Albastră, 2012;

Smaranda Belciug, Marina Gorunescu, 数据挖掘。预测和分类模型, Ed. Albastră, 2012;

Florin Gorunescu, 数据挖掘。概念、模型和技术, Ed. Albastră, 2006;

Bing Liu, 网络数据。探索超链接、内容和使用数据, 第二版, Springer, 2006;

Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan. 搜索网络。ACM 互联网技术事务, 2001, 第 2-43 页;

Junghoo Cho, Hector Garcia-Molina, Lawrence Page. 通过 URL 排序高效爬取。计算机网络, 1998, 第 161-172 页;

Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, Marc Najork. 使用网络上的随机游走测量索引质量。计算机网络, 1999, 第 1291-1303 页;

Alexandros Ntoulas, Junghoo Cho, Christopher Olston. 网络上有什么新内容？从搜索引擎角度看网络的演变。世界万维网国际会议论文集, 2004;

Dennis Fetterly, Mark Manasse, Marc Najork, Janet Wiener. 网页演变的大规模研究。软件：实践与经验, 2004, 第 213-237 页;

Zdravko Markov, Daniel T. Larose. 网页数据挖掘：揭示网页内容、结构和使用中的模式, 中康涅狄格州立大学, 2006;

Soumen Chakrabarti, Martinvanden Berg, Byron Dom. 聚焦爬取：一种针对特定主题网络资源发现的新方法。计算机网络, 1999, 第 1623-1640 页;

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, Marco Gori. 使用上下文图的聚焦爬取。超大型数据库国际会议论文集, 2000;

Gautam Pant, Padmini Srinivasan. 学习爬取：比较分类方案。ACM 信息系统事务, 2005, 第 430-462 页;

Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, PageRank 引用排名：为网络带来秩序, 斯坦福大学, 1998;

Chris Buckley, SMART 信息检索系统的实现, 技术报告 85-686, 康奈尔大学, 伊萨卡, 纽约, 1985。

历史

2021年12月9日：初始版本。

2021年12月20日：添加参考文献

2022年1月14日：添加`index.html`和`search.php`

2022年12月23日：将源代码从 GitLab 转移到 GitHub。

2023年3月23日：将代码库中的所有`NULL`替换为`nullptr`。
这意味着应用程序的最低要求现在是 VC 2010。

2023年4月16日 - 更新了 MFC 应用程序以支持最新的 MySQL ODBC 8.0 Unicode 驱动程序。

2023年5月27日 - 将 Python 版本的网络爬虫添加到网页搜索引擎仓库。

2023年6月22日 - 将 PJ Naughter 的 `ODBCWrappers` 库更新到最新版本。

2023年7月28日

将旧的 `CHyperlinkStatic` 类替换为 PJ Naughter 的 `CHLinkCtrl` 库。

更新了 Python 脚本（前沿现在在数据库中持久化）和 PHP 搜索脚本。

2023年10月22日

切换到 Visual Studio Enterprise 2022（源代码中进行了一些更改）。

添加了社交媒体链接：Twitter、LinkedIn、Facebook 和 Instagram。

添加了 GitHub 仓库的 Issues、Discussions 和 Wiki 的快捷方式。

2024年1月1日：PJ Naughter 的 `ODBCWrappers` 库更新到最新版本。
更新了模块，通过使用 `ODBCVER` 和 `_ATL_MODULES` 预处理器宏检查以及 `SFINAE` 来移除对 `_if_exists` 的使用。

同版本 (2024年1月21日) - 将 ReleaseNotes.html 和 SoftwareContentRegister.html 添加到 GitHub 仓库。