Full-Text Search Using dtSearch with AWS Aurora

Mike V Baker

1 Aug 2019

CPOL

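In this article, we extend our dtSearch Engine-based example to use the Amazon Aurora storage service, a managed MySQL solution available through AWS.

Download source code - 1.7 MB

In the previous article, we demonstrated how to combine the power of the dtSearch Engine with the global accessibility and storage capacity of Amazon Web Services (AWS) to index and search Microsoft Office documents. In that example, we used an EBS volume to store the source documents and the search index. However, the same indexing and search functionality extends easily to other cloud storage services.

In this article, we extend our dtSearch Engine-based example to use the Amazon Aurora storage service, a managed MySQL solution available through AWS. We build on the indexing and search example created in the previous article, "dtSearch Searching on Amazon Web Services with EC2 and EBS", which used EC2 and an attached EBS volume, so we recommend completing that example first.

MySQL is great at many things, but full-text search is not one of them. That makes the dtSearch Engine a perfect complement to Aurora. We'll briefly discuss how to set up the Aurora database and other services from AWS, then walk through the implementation of two applications. One application reads documents, inserts them into the Aurora database, and then creates an index. The other application lets end users search the index.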

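Project Prerequisites

To set up the project, we'll use the EC2 instance created for the previous article. We'll also set up an Aurora MySQL database to store the document and index data.

This article assumes that we already have an AWS account, so start by signing in to the AWS Management Console. Once in the console, we see a list of available services, with recently used services shown at the top for easy access.

Creating the Aurora Database

We'll start by setting up the Aurora database. You can find the documentation in the Amazon Aurora User Guide.

When you reach the AWS Management Console, click "RDS".

Click "Create database", make sure "Aurora (MySQL)" is selected as the database engine, and click "Next". Follow the steps to finish creating the database. We chose the "Serverless" capacity type and set the ID to "dtsearchtest".

Pay attention to the security groups. We need to add the security group used by the EC2 instance from the previous article to the security groups used by the Aurora database, so that applications running on the EC2 instance can reach the database.

Clear the "Enable deletion protection" check box so that the database can be deleted when we're finished. Then click "Create database".

Next, we'll create the table that holds the data to be indexed. Click the "Query Editor" link on the left to bring up the "Connect to database" dialog. The query editor is supported only for databases set up in a Serverless environment.

Once connected, we see a window where we can enter SQL statements and execute them against the database. The SQL statements to select the database and create the table are: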

USE dtSearchTest;
CREATE TABLE ShakespeareDoc ( doc_id INT AUTO_INCREMENT, 
                              doc_name VARCHAR(255), 
                              doc_file VARCHAR(2048), 
                              doc_content MEDIUMTEXT, 
                              PRIMARY KEY (doc_id) );

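This statement defines doc_id (an auto-increment ID field), a friendly name, and the file name that points to the source data. doc_content holds the actual content of the file.

Server Setup Application: dtSearchSetupApp

We created two simple application projects, which we describe in detail here. Download the project source code to get started.

Let's look at the first of the two applications, dtSearchSetupApp. As with the console application in the previous article, the project sits in a folder that is a sibling of the /lib folder, which contains dtSearchEngine.dll.

We created a .NET Core Web Application project and used the default settings, except that we did not enable the HTTPS option. After Visual Studio created the project, we deleted all pages except "Index" and the partial view for the cookie policy. This keeps all of the code for the button handlers and for cross-site request forgery protection. We also opened the "_Layout" partial view and removed the navigation bar, the reference to the cookie policy partial view, and everything else except the code.

The application needs a connector to work with MySQL. We chose the MySql.Data connector package from NuGet. Documentation on using the connector is available on the MySQL Connector/NET website.

We also added the AWS Toolkit for Visual Studio, which lets us browse the services attached to our AWS account. It is especially useful for connecting to the EC2 instance. Install the toolkit through the "Extensions" > "Manage Extensions" menu option in Visual Studio.

Open the AWS Explorer from the "View" menu. Once it is configured, click "Amazon EC2" > "Instances" and connect to the EC2 instance configured earlier.

Like the console application created in the previous article, this application needs a reference to dtSearchEngine.dll. Because .NET Core is cross-platform, we might want to deploy it on a system other than Windows Server. To make the reference cross-platform, edit the .csproj file directly and paste in the following lines. (For more information, see Native libraries in the dtSearch .NET documentation.)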

<ItemGroup Condition="'$([System.Runtime.InteropServices.RuntimeInformation]::IsOSPlatform($([System.Runtime.InteropServices.OSPlatform]::Linux)))' == 'true'">
  <Content Include="..\..\lib\engine\linux\x64\libdtSearchEngine.so" 
           Link="libdtSearchEngine.so">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>

<ItemGroup Condition="'$([System.Runtime.InteropServices.RuntimeInformation]::IsOSPlatform($([System.Runtime.InteropServices.OSPlatform]::OSX)))' == 'true'">
  <Content Include="..\..\lib\engine\macos\x64\libdtSearchEngine.dylib" 
           Link="libdtSearchEngine.dylib">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>

<ItemGroup Condition="'$(OS)' == 'Windows_NT'">
  <Content Include="..\..\lib\engine\win\x64\dtSearchEngine.dll" 
           Link="dtSearchEngine.dll">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>

<ItemGroup>
  <Reference Include="dtSearchNetStdApi">
    <HintPath>..\..\lib\engine\NetStd\dtSearchNetStdApi.dll</HintPath>
  </Reference>
</ItemGroup>

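Note that all of these entries reference the x64 versions of the libraries.

Deploying the Application

The program reads text files and populates the database. The text files are included in the shakespeare-text.zip file. A handy feature of connecting through the AWS Toolkit is that you can select a check box to map a local drive as a resource that is accessible from the remote system. Extract the files to C:\dtSearch.

We also need to set up the dtSearch Engine so that it is available when the search program runs. Instructions for setting up an application with the dtSearch Engine can be found in the "Installing the dtSearch Engine" help topic.

Once the files are on the EC2 instance and the dtSearch Engine is installed, we can deploy the application. We published the application to a "publish" folder and then used the Remote Desktop connection to copy the files to the EC2 instance. For details, see the documentation under "How to Create a New Web Application".

The user account "IIS AppPool\dtSearchSetupApp" needs permissions on that folder. Use the Security tab of the folder properties and grant the "Read & execute", "List folder contents", and "Read" permissions.

We assigned a different port to each application. You need to add the ports to the AWS security group and open them in the firewall settings of the EC2 instance. The application can then be run from a local computer using the public DNS name of the EC2 instance and the port number.

dtSearchSetupApp Application Details

The setup application (dtSearchSetupApp) finds the files to index, loads them into the database, and indexes the database using the dtSearch Engine's DataSource API. In Index.cshtml, we see four buttons on the form.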

<form method="post">
  <button type="submit" class="btn btn-default" 
    asp-page-handler="EnumFiles">Find Files</button>
  <button type="submit" class="btn btn-default" 
    asp-page-handler="ClearDB">Clear DB</button>
  <button type="submit" class="btn btn-default" 
    asp-page-handler="ImportFiles">Import</button>
  <button type="submit" class="btn btn-default" 
    asp-page-handler="IndexContent">Index</button>
</form>

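Here is what each option does:

"Find Files" reads the list of files in the folder. If no files are shown, check that they are in the correct folder.
"Clear DB" runs a query that deletes all items from the database.
"Import" loads the text from the supplied files and inserts one record per file.
"Index" reads the records from the database and builds the index.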
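
The article does not list the Import handler itself, so the following is only a minimal sketch of how such a step could insert one record per file with the MySql.Data connector. The ImportSketch class, the ImportFolder method, and the "*.txt" filter are assumptions made for illustration; the table and column names come from the CREATE TABLE statement above.

```csharp
using System.IO;
using MySql.Data.MySqlClient;

public static class ImportSketch
{
    // Inserts one row per .txt file found in the folder and returns the
    // number of rows inserted.
    // Assumptions: the connection is already open, the ShakespeareDoc table
    // exists, and the source files use a .txt extension.
    public static int ImportFolder(MySqlConnection conn, string folder)
    {
        const string sql =
            "INSERT INTO ShakespeareDoc (doc_name, doc_file, doc_content) " +
            "VALUES (@name, @file, @content)";
        int inserted = 0;
        foreach (string path in Directory.EnumerateFiles(folder, "*.txt"))
        {
            using (var cmd = new MySqlCommand(sql, conn))
            {
                // Parameters keep quotes in the text from breaking the statement.
                cmd.Parameters.AddWithValue("@name", Path.GetFileNameWithoutExtension(path));
                cmd.Parameters.AddWithValue("@file", path);
                cmd.Parameters.AddWithValue("@content", File.ReadAllText(path));
                inserted += cmd.ExecuteNonQuery();
            }
        }
        return inserted;
    }
}
```

Using parameters rather than string concatenation matters here because the plays contain plenty of apostrophes and quotation marks.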

Let's look at the Index operation in more detail.

/// React to the Index button. Create the index from the database contents

public void OnPostIndexContent()
{
    bool result = false;
    // get connection
    MySqlConnection conn = GetConnection();
    try
    {
        conn.Open();

        // create our custom data source, pass in connection
        DBDataSource dataSource = new DBDataSource(conn);

        // create the index job and set basic params
        IndexJob indexJob = new IndexJob();
        indexJob.ActionAdd = true;
        indexJob.ActionCreate = true;
        indexJob.IndexingFlags |= IndexingFlags.dtsIndexCacheOriginalFile;
        indexJob.IndexingFlags |= IndexingFlags.dtsIndexCacheText;

        // Instead of "FoldersToIndex" we use "DataSourceToIndex" 
        // and set it to our derived class
        indexJob.DataSourceToIndex = dataSource;
        // Index destination is hard coded here for this example
        indexJob.IndexPath = "H" + Path.VolumeSeparatorChar 
            + Path.DirectorySeparatorChar
            + "dtSearch" + Path.DirectorySeparatorChar
            + "index" + Path.DirectorySeparatorChar;

        // execute the job and capture the result
        result = indexJob.Execute();

        indexErrors = indexJob.Errors != null ? 
            indexJob.Errors.ToString() : "";
    }
    catch (Exception ex)
    {
        dbError = ex.ToString();
        System.Diagnostics.Debug.WriteLine(ex.ToString());
    }
    finally
    {
        // release the database connection whether or not indexing succeeded
        conn.Close();
    }

    Message("DONE INDEXING with result = " + result.ToString());
}

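This function sets up an IndexJob (see the previous article for details). In this case, however, we use the supplied IndexJob class rather than extending it.

When indexing a database, it is often useful to cache the documents in the index, which makes it quick and easy to display hit-highlighted search results.

Caching the plain text lets SearchReportJob efficiently generate brief hits-in-context displays for the search results.
Caching the original document lets FileConverter efficiently generate a hit-highlighted version of the full document, for display when the user selects an item in the search results.

To enable both kinds of caching, set the flags dtsIndexCacheText and dtsIndexCacheOriginalFile in the IndexJob. For more about caching, see the Caching documents topic in the dtSearch documentation.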

/// GetNextDoc override. The engine calls this to see if it 
/// should continue indexing, and to set up the next item

public bool GetNextDoc()
{
    skip++;
    string sql = "SELECT doc_id, doc_file, doc_name, doc_content FROM ShakespeareDoc ORDER BY doc_id LIMIT " + skip + ", 1";

    // create command, read database
    MySqlCommand cmd = new MySqlCommand(sql, connection);
    MySqlDataReader rdr = cmd.ExecuteReader();
    // set basic settings about index item
    DocIsFile = false;

    // we know in this case that all records have data
    // if rdr returns true then we're good, otherwise we're done
    if (rdr.Read())
    {
        DocId = (int)rdr[0];
        DocName = rdr[1].ToString();
        DocDisplayName = rdr[2].ToString();
        DocText = rdr[3].ToString();
        rdr.Close();
        return true;
    }
    else
    {
        // no more rows: release the reader before signaling completion
        rdr.Close();
        return false;
    }
}

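We extended the dtSearchEngine.DataSource class so that we control the text fed to the IndexJob. We use the "skip" variable, together with the SQL LIMIT clause, to track our position in the records. Each time the IndexJob calls GetNextDoc, our class reads another record from the database and sets the document data accordingly. When the data in the database is exhausted, we return false to let the IndexJob know that the job is done.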
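
To make the cursor pattern easier to follow in isolation, here is a small self-contained sketch in which an in-memory list stands in for the ShakespeareDoc table. The ListCursor class is hypothetical and is not part of dtSearch. Because GetNextDoc increments skip before reading, the sketch assumes skip starts at -1 so that the first call reads offset 0; the article does not show how the real field is initialized.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the database-backed data source: an in-memory
// list plays the role of the ShakespeareDoc table.
public class ListCursor
{
    private readonly IReadOnlyList<string> rows;
    private int skip = -1;          // assumed start value, so the first call reads offset 0

    public string DocText { get; private set; }

    public ListCursor(IReadOnlyList<string> rows)
    {
        this.rows = rows;
    }

    // Mirrors the shape of GetNextDoc: advance, load one row, and report
    // whether a row was found.
    public bool GetNextDoc()
    {
        skip++;
        if (skip >= rows.Count)
            return false;           // data exhausted, tells the caller to stop
        DocText = rows[skip];
        return true;
    }
}
```

A caller would simply loop with while (cursor.GetNextDoc()) and read DocText on each pass, which is the same contract the IndexJob relies on.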

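With that done, the next step is to search the index.

Creating the Search Application

Open the dtSearchWebApp solution in your development folder.

As with the setup application, we started with a .NET Core Web Application and removed the components we didn't need.

VersionInfo.cs was explained in the previous article: it checks the version information of the dtSearch Engine.

There are a few things to point out in Startup.cs.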

public class WebDemoIndexCache : IndexCache
{
  public WebDemoIndexCache(IOptions<AppSettings> settings) : 
    base(settings.Value.IndexCache.MaxIndexCount)
  {
    AutoReopenTime = settings.Value.IndexCache.AutoReopenTime;
    AutoCloseTime = settings.Value.IndexCache.AutoCloseTime;
  }
}

public class Startup
{
  private void EnableDebugLogging()
  {
    string DebugLogName = Path.Combine(Path.GetTempPath(), 
      "dtSearchWebApp.log");
    Server.SetDebugLogging(DebugLogName, DebugLogFlags.dtsLogDefault);
  }
  public Startup(IConfiguration configuration)
  {
    // Comment out this call to stop generating the diagnostic log
    EnableDebugLogging();
    Configuration = configuration;
  }
  ...
}

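The IndexCache object used here is part of the dtSearch Engine API and improves performance in applications that run a large number of searches. It maintains a cache of open indexes that can be reused across searches. Here we set a few cache options, along with the options for the log file. An AppSettings class holds the options, but the actual values are stored in appsettings.json.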
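
The AppSettings class is not listed in the article, so here is a rough sketch of what the cache portion might look like. Only the property names (IndexCache, MaxIndexCount, AutoReopenTime, AutoCloseTime) come from the Startup.cs snippet; the int types, the IndexCacheSettings class name, and the sample JSON values are assumptions. To keep the sketch self-contained it parses the JSON with System.Text.Json, whereas an ASP.NET Core application would normally bind the section through the configuration system.

```csharp
using System.Text.Json;

// Assumed shapes: only the property names come from the Startup.cs snippet;
// the types and the JSON layout are guesses for illustration.
public class IndexCacheSettings
{
    public int MaxIndexCount { get; set; }
    public int AutoReopenTime { get; set; }
    public int AutoCloseTime { get; set; }
}

public class AppSettings
{
    public IndexCacheSettings IndexCache { get; set; } = new IndexCacheSettings();
}

public static class SettingsSketch
{
    // Hypothetical appsettings.json fragment with made-up values.
    public const string SampleJson =
        "{ \"IndexCache\": { \"MaxIndexCount\": 10, \"AutoReopenTime\": 60, \"AutoCloseTime\": 600 } }";

    // Parses the JSON text into the settings classes above.
    public static AppSettings Load(string json)
    {
        return JsonSerializer.Deserialize<AppSettings>(json);
    }
}
```

Keeping the values in appsettings.json means the cache size and timeouts can be tuned on the server without recompiling the application.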

Let's look at part of Index.cshtml and the corresponding code in Index.cshtml.cs.

<input asp-for="SearchRequest" id="SearchRequest"
  class="typeahead form-control" autocomplete="off" 
  type="text" placeholder="Search request"
  value="@Model.SearchRequest" />

  [BindProperty(SupportsGet = true)]
  public string SearchRequest { set; get; }

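The cshtml file has an input for SearchRequest, and the code-behind file has a matching bound property. The search terms and every option the search job needs follow this same pattern.

This example includes only a few of the available search options.

SearchType controls whether the search job looks for index items matching any of the words, all of the words, or a Boolean condition (such as "dream AND caesar").
Stemming lets the search job match terms that share a word stem, so a search for "dream" also locates dreamer, dream, and dreaming.
Phonic searching finds words that sound like the words in the search request.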
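
As a sketch of how the options listed above could follow the same binding pattern as SearchRequest, the page model might declare one bound property per option. The IndexModelSketch class name is hypothetical, and the SearchType enum is an assumption: AllWords and AnyWords appear in the search code, while the Boolean member is a guess.

```csharp
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Mvc.RazorPages;

// Assumed enum: AllWords and AnyWords appear in the article's search code;
// the Boolean member is a guess for illustration.
public enum SearchType { AnyWords, AllWords, Boolean }

public class IndexModelSketch : PageModel
{
    // Each property is filled from the matching form field or query string
    // value, on GET as well as POST requests.
    [BindProperty(SupportsGet = true)]
    public string SearchRequest { get; set; }

    [BindProperty(SupportsGet = true)]
    public SearchType SearchType { get; set; }

    [BindProperty(SupportsGet = true)]
    public bool Stemming { get; set; }

    [BindProperty(SupportsGet = true)]
    public bool PhonicSearching { get; set; }
}
```

Setting SupportsGet = true lets a search be expressed entirely in the URL, which makes result pages easy to bookmark or link to.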

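Search!

The SearchJob class finds the documents in the index that contain matches.

A search job can search multiple indexes, so the top part of the code builds a list of indexes. This example uses only one index, so the index property is a hidden input on the form.

Next, we set the options for the search job. The path of the index we created goes into the IndexesToSearch property. We put the search terms into Request, along with any Boolean conditions.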

/// Run the search using the words entered on the form and some options.

private IActionResult DoSearch()
{
  ...

  // all values for IxId into one comma-delimited string
  string IxIdString = "";
  foreach (var id in IxId)
  {
    if (IxIdString.Length > 0)
      IxIdString += ",";
    IxIdString = IxIdString + id;
  }
  if (string.IsNullOrWhiteSpace(IxIdString))
    IxIdString = Settings.IndexTable.GetDefaultIndexIds();
  IndexIds = IxIdString.Split(",");
  IndexesToSearch = Settings.IndexTable.GetIndexPaths(IxIdString);

  using (SearchJob searchJob = new SearchJob())
  {
    searchJob.IndexCache = indexCache;
    searchJob.IndexesToSearch = IndexesToSearch;
    searchJob.Request = SearchRequest;
    searchJob.BooleanConditions = BooleanConditions;

    searchJob.SearchFlags = dtsSearchDelayDocInfo;
    if (SearchType == SearchType.AllWords)
      searchJob.SearchFlags |= dtsSearchTypeAllWords;
    else if (SearchType == SearchType.AnyWords)
      searchJob.SearchFlags |= dtsSearchTypeAnyWords;
    if (Stemming)
      searchJob.SearchFlags |= dtsSearchStemming;
    if (PhonicSearching)
      searchJob.SearchFlags |= dtsSearchPhonic;

    searchJob.SearchFlags |= (SearchFlags)SearchFlags;

    bool ok = ExecuteSearch(searchJob);
    if (!ok)
    {
      string message = searchJob.Errors.ToString();
      return ShowError(message);
    }
  }
  stopwatch.Stop();

  // optionally generate a synopsis for the results
  if (Settings.Synopsis.GenerateSynopsis)
    GenerateSynopsisForThisPage();

  return Page();
}

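SearchFlags is a set of separate flags that can be combined, and this example demonstrates only a few of them. For details, see the SearchFlags enumeration documentation.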
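
For readers new to flag enumerations, the following sketch shows the general C# mechanism behind the |= lines in DoSearch. DemoSearchFlags is a hypothetical enum defined only for this illustration: its member names echo the article, but the numeric values are not dtSearch's real values.

```csharp
using System;

// Hypothetical flags enum for illustration only; the names echo the article,
// but the numeric values are NOT dtSearch's real values.
[Flags]
public enum DemoSearchFlags
{
    None = 0,
    Stemming = 1,
    Phonic = 2,
    DelayDocInfo = 4
}

public static class FlagsSketch
{
    // Starts from a base flag and ORs in one bit per enabled option,
    // the same way DoSearch builds up searchJob.SearchFlags.
    public static DemoSearchFlags Build(bool stemming, bool phonic)
    {
        DemoSearchFlags flags = DemoSearchFlags.DelayDocInfo;
        if (stemming) flags |= DemoSearchFlags.Stemming;
        if (phonic) flags |= DemoSearchFlags.Phonic;
        return flags;
    }
}
```

Because each member occupies its own bit, options can be switched on independently and later tested with HasFlag.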

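When the index search completes, the ExecuteSearch function returns. If it returns false, any errors are placed in the message shown to the user. If it returns true (success!), the program optionally builds a synopsis for each item. (In this example it does, because that setting is true.)

GenerateSynopsisForThisPage uses SearchReportJob to generate a brief hits-in-context display for each document, showing a few hits with some surrounding context. Because we enabled text caching in the index, SearchReportJob can produce this information quickly without going back to the database for the original documents.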

private void GenerateSynopsisForThisPage()
{
  Stopwatch stopwatch = new Stopwatch();
  stopwatch.Start();
  using (SearchReportJob reportJob = new SearchReportJob())
  {
    reportJob.SetResults(SearchResults);
    reportJob.OutputFormat = OutputFormat.itUnformattedHTML;
    reportJob.BeforeHit = "<b>";
    reportJob.AfterHit = "</b>";

    reportJob.WordsOfContextExact = Settings.Synopsis.WordsOfContext;
    reportJob.ContextFooter = Settings.Synopsis.ContextFooter;
    reportJob.ContextHeader = Settings.Synopsis.ContextHeader;
    reportJob.ContextSeparator = Settings.Synopsis.ContextSeparator;
    reportJob.MaxContextBlocks = Settings.Synopsis.MaxContextBlocks;
    reportJob.MaxWordsToRead = Settings.Synopsis.MaxWordsToRead;
    reportJob.SelectItems(0, SearchResults.Count);
    reportJob.Flags = ReportFlags.dtsReportGetFromCache | 
                      ReportFlags.dtsReportLimitContiguousContext |
                      ReportFlags.dtsReportStoreInResults;
    if (Settings.Synopsis.IncludeFileStart)
      reportJob.Flags |= ReportFlags.dtsReportIncludeFileStart;
    reportJob.Execute();
  }
  stopwatch.Stop();
  _log.LogInformation(EventId.SearchReport,
    "SearchReport: \"{SearchRequest}\" Time: {SearchTime} Results count: {Count}",
    SearchRequest, stopwatch.ElapsedMilliseconds, SearchResults.Count);
}

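The synopsis is generated by the SearchReportJob class. We use SetResults to focus the report on the SearchResults returned by the SearchJob. We use BeforeHit and AfterHit to make the hit words bold in the output.

Note that the GenerateSynopsisForThisPage function processes all of the search results at once. In a production environment, you would probably want a paging mechanism or some other limit on the results displayed.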
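
As a minimal sketch of the arithmetic a paging mechanism needs, the helper below computes which slice of the results belongs on a given page. PagingSketch and PageRange are hypothetical names, and nothing here calls the dtSearch API.

```csharp
using System;

public static class PagingSketch
{
    // Returns the zero-based index of the first item on the page and the
    // number of items on that page. pageNumber is zero-based.
    // A page that lies past the end of the results yields (0, 0).
    public static (int First, int Count) PageRange(int totalCount, int pageSize, int pageNumber)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));
        if (pageNumber < 0) throw new ArgumentOutOfRangeException(nameof(pageNumber));

        long start = (long)pageNumber * pageSize;   // long avoids int overflow
        if (totalCount <= 0 || start >= totalCount)
            return (0, 0);

        int first = (int)start;
        int count = Math.Min(pageSize, totalCount - first);
        return (first, count);
    }
}
```

For 25 results shown 10 per page, PageRange(25, 10, 2) returns (20, 5), the short final page. A range like this could narrow the span passed to SelectItems instead of the full result set; check the dtSearch documentation for whether its second argument is a count or an end index before wiring it in.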

Displaying the Search Results

At the bottom of Index.cshtml, we see a call to a partial view named SearchResults. Open the SearchResults.cshtml file.

<div class="panel-heading">
  <h4>@Model.SearchResults.Request</h4>
</div>
<div class="panel-body">
  @Model.SearchResults.TotalFileCount.ToString("#,#") files with @Model.SearchResults.TotalHitCount.ToString("#,#") hits
</div>

<table class="table table-hover">
  <thead class="blue-grey lighten-4">
    <tr>
      <th>Hits</th>
      <th>Document</th>
      <th>Sample</th>
    </tr>
  </thead>
  <tbody>
    <!-- Show each item in a table row with the hit count and the synopsis which shows some sample hits from the file -->
    @{
      for (int i = 0; i < Model.SearchResults.Count; ++i)
      {
        SearchResultsItem item = new SearchResultsItem();
        if (Model.SearchResults.GetNthDoc(i, item))
        {
          <tr>
            <td class="HitsColumn">@item.HitCount</td>
            <td class="HitsColumn">@item.DisplayName</td>
            <td class="HitsColumn">
              @if (!string.IsNullOrWhiteSpace(item.Synopsis))
              {
                @Html.Raw(item.Synopsis)
              }
            </td>
          </tr>
        }
      }
    }
  </tbody>
</table>

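The view first checks whether an error occurred while loading the results. If there was no error, it checks whether there are any results. (These two checks are not shown.) It then uses the SearchResults item in the model to build the rest of the screen.

@Model.SearchResults.Request shows the search request as it was entered
@Model.SearchResults.TotalFileCount gives the number of files with at least one hit
@Model.SearchResults.TotalHitCount gives the total number of hits across all files

The last thing to set up on the screen is the results table. For details on what we can display, see the SearchResultsItem class.

Summary

In this demonstration, we set up an Aurora MySQL database, created a table, and then deployed a program that populated the table with text data from files and created an index from it. We used a second program to search the index and display the search results on a web page.

The samples that accompany this article are derived from the WebDemo application in the dtSearch Engine installation folder, at \Program Files (x86)\dtSearch Developer\examples\NetStd\WebDemo. WebDemo demonstrates more index search features, including faceted search and paging through results. Browse the Program Files (x86)\dtSearch Developer\examples\ folder to find many more examples that use the dtSearch Engine.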