RediSearch in Action

Redis Labs

July 12, 2021

CPOL

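Learn how to use RediSearch with Azure Cache for Redis to ingest tweets in real time and query them flexibly.

Redis has a rich set of data structures, from simple strings all the way to powerful abstractions such as Redis Streams. The native data types can take you a long way, but some use cases call for a workaround. One example is secondary indexing, which you need in order to go beyond key-based lookups and get richer query capabilities. You can build this yourself with sorted sets, lists, and so on, but you will have to factor in some trade-offs.

Enter RediSearch! RediSearch is a Redis module that provides flexible search capabilities, thanks to a first-class secondary indexing engine. It offers powerful features such as full-text search, auto-completion, geographical indexing, and more.

To demonstrate the power of RediSearch, this blog post offers a practical example of how to use RediSearch with Azure Cache for Redis from a Go service built with the RediSearch Go client. It is designed to give you a set of applications that let you ingest tweets in real time and query them flexibly using RediSearch.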

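Specifically, you will learn how to:

Work with RediSearch indexes
Use different RediSearch data types such as TEXT, NUMERIC, TAG, and others
Build an application that shows off RediSearch capabilities
Deploy the service components to Azure with just a few commands
Analyze tweet data by querying RediSearch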

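Application overview

As mentioned, the example service lets you consume tweets in real time and makes them available for querying through RediSearch.

It has two components:

Consumer/Indexer: Reads from the Twitter Streaming API, creates the index, and continuously adds tweet data (as Redis HASHes) as tweets arrive.
Search service: A REST API that lets you search tweets using the RediSearch query syntax.

At this point, I am going to dive into how to get the solution up and running so that you can see it in action. However, if you are interested in how the individual components work, refer to the Code walkthrough section below and the GitHub repo for this blog: https://github.com/abhirockzz/redisearch-tweet-analysis.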

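Prerequisites

To begin with, you will need a Microsoft Azure account: get one for free here!
The service components listed above will be deployed to Azure Container Instances using native Docker CLI commands. This capability is enabled by the integration between Docker and Azure.
You will need Docker Desktop version 2.3.0.5 or later for Windows or macOS, or the Docker ACI Integration CLI for Linux.
To use the Twitter Streaming API, you will also need a Twitter developer account. If you do not have one already, please follow these instructions.

RediSearch in action!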

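Start off by using this quick-start tutorial to set up a Redis Enterprise cache on Azure. Once you finish the setup, make sure you have the Redis host name and access key handy.

Both components of our service are available as Docker containers: the tweet indexing service and the search API service. (If you need to build your own Docker images, use the respective Dockerfile available in the GitHub repo.)

You will now see how convenient it is to deploy these to Azure Container Instances, which lets you run Docker containers on demand in a managed, serverless Azure environment.

Deploy to Azure

A docker-compose.yml file defines the individual components (tweets-search and tweets-indexer). All you need to do is update it with the values for your Azure Redis instance and your Twitter developer account credentials. Here is the file in its entirety: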

version: "2"
services:
  tweets-search:
    image: abhirockzz/redisearch-tweets-search
    ports:
      - 80:80
    environment:
      - REDIS_HOST=<azure redis host name>
      - REDIS_PASSWORD=<azure redis access key>
      - REDISEARCH_INDEX_NAME=tweets-index
  tweets-indexer:
    image: abhirockzz/redisearch-tweets-consumer
    environment:
      - TWITTER_CONSUMER_KEY=<twitter api consumer key>
      - TWITTER_CONSUMER_SECRET_KEY=<twitter api consumer secret>
      - TWITTER_ACCESS_TOKEN=<twitter api access token>
      - TWITTER_ACCESS_SECRET_TOKEN=<twitter api access secret>
      - REDIS_HOST=<azure redis host name>
      - REDIS_PASSWORD=<azure redis access key>
      - REDISEARCH_INDEX_NAME=tweets-index

Create an Azure context:

docker login azure 
docker context create aci aci-context 
docker context use aci-context

Clone the GitHub repo:

git clone https://github.com/abhirockzz/redisearch-tweet-analysis 
cd redisearch-tweet-analysis

Deploy both service components as part of a container group:

docker compose up -p azure-redisearch-app

(Note that the Docker Compose commands currently available in an ACI context start with docker compose. That is different from docker-compose with a hyphen.)

You will see output similar to this:

[+] Running 1/3
⠿ Group azure-redisearch-app  Created   8.3s
⠸ tweets-search               Creating  6.3s
⠸ tweets-indexer              Creating  6.3s

Wait for the services to start; you can also check the Azure portal. Once both services are up and running, you can check their respective logs:

docker logs azure-redisearch-app_tweets-indexer 
docker logs azure-redisearch-app_tweets-search

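If all goes well, the tweet consumer service should have kicked off. It will read a stream of tweets and persist them to Redis.

The moment of truth!

It is time to query the tweet data. To do so, you can access the REST API in Azure Container Instances through its IP address or its fully qualified domain name (FQDN) (read more in Container Access). To find the IP address, run docker ps and check the PORTS section of the output (as shown below).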

docker ps
//output
CONTAINER ID                          IMAGE                                   COMMAND   STATUS    PORTS
azure-redisearch-app_tweets-search    abhirockzz/redisearch-tweets-search               Running   20.197.96.54:80->80/tcp
azure-redisearch-app_tweets-indexer   abhirockzz/redisearch-tweets-consumer             Running

You can now run all sorts of queries! Before diving in, here is a quick look at the indexed attributes that you can use in your search queries.

id - the Tweet ID (TEXT attribute)
user - the screen name (TEXT attribute)
text - tweet contents (TEXT attribute)
source - tweet source, e.g. Twitter for Android, Twitter Web App, Twitter for iPhone (TEXT attribute)
hashtags - hashtags (if any) in the tweet (available in CSV format as a TAG attribute)
location - tweet location (if available). This is a user-defined location, not the exact location per se (TEXT attribute)
created - timestamp (epoch) of the tweet. This is a NUMERIC field and can be used for range queries
coordinates - geographic location (longitude, latitude) if made available by the client (GEO attribute)

(Note that I use curl in the examples below, but I would highly recommend the "REST Client" for VS Code.)

Set the base URL for the search service API:

export REDISEARCH_API_BASE_URL=<for example, http://20.197.96.54:80/search>

Start simple and query all the documents (using *):

curl -i "$REDISEARCH_API_BASE_URL?q=*"

You will see output similar to this:

HTTP/1.1 200 OK 
Page-Size: 10 
Search-Hits: 12 
Date: Mon, 25 Jan 2021 13:21:52 GMT 
Content-Type: text/plain; charset=utf-8 
Transfer-Encoding: chunked 
//JSON array of documents (omitted)

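Notice the Page-Size and Search-Hits headers: these are custom headers passed from the application, mainly to demonstrate pagination and limits. For our "get me all the documents" query, we found 12 results in Redis, but the JSON body returned only 10 entries. This is the default behavior of the RediSearch Go API, which you can change with a different query parameter, such as: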

curl -i "$REDISEARCH_API_BASE_URL?q=*&offset_limit=0,100" 

offset_limit=0,100 will return up to 100 documents ( limit ) starting with the first one ( offset = 0).

Or, for example, search for tweets sent from an iPhone:

curl -i "$REDISEARCH_API_BASE_URL?q=@source:iphone"

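You may not always want all the attributes in the query result. For example, this is how to get back just the user (Twitter screen name) and the tweet text: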

curl -i "$REDISEARCH_API_BASE_URL?q=@location:india&fields=user,text"

How about a query on the user name (e.g. starting with jo)?

curl -i "$REDISEARCH_API_BASE_URL?q=@user:jo*"

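You can also use a combination of attributes in the query (the space between the two filters is URL-encoded as %20):

curl -i "$REDISEARCH_API_BASE_URL?q=@location:India%20@source:android"

How about looking for tweets with specific hashtags? It is possible to use multiple hashtags (separated by |).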

curl -i "$REDISEARCH_API_BASE_URL?q=@hashtags:\{potus|cov*\}"
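If the tag filter is assembled in code rather than typed by hand, a small helper keeps the braces and separators consistent. This is only a sketch: tagQuery is a hypothetical name, and it does not escape special characters inside the tags.

```go
package main

import (
	"fmt"
	"strings"
)

// tagQuery builds a RediSearch TAG filter such as @hashtags:{potus|cov*}.
// Hypothetical helper: it does not escape special characters in the tags.
func tagQuery(field string, tags ...string) string {
	return fmt.Sprintf("@%s:{%s}", field, strings.Join(tags, "|"))
}

func main() {
	fmt.Println(tagQuery("hashtags", "potus", "cov*")) // prints @hashtags:{potus|cov*}
}
```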

Want to find out how many tweets with the biden hashtag were created recently? Use a range query:

curl -g -i "$REDISEARCH_API_BASE_URL?q=@hashtags:{biden}%20@created:[1611556920000000000%201711556930000000000]"
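The bounds in the range query above have 19 digits, which suggests that created holds nanoseconds since the Unix epoch. Under that assumption (it is inferred from the sample values, not confirmed by the service code), a range covering the last 24 hours could be computed like this; createdRange is a hypothetical helper:

```go
package main

import (
	"fmt"
	"time"
)

// createdRange builds a NUMERIC range filter on the created field.
// Assumption: created stores nanoseconds since the Unix epoch.
func createdRange(fromNanos, toNanos int64) string {
	return fmt.Sprintf("@created:[%d %d]", fromNanos, toNanos)
}

func main() {
	now := time.Now()
	dayAgo := now.Add(-24 * time.Hour)
	// Combine the range with a tag filter, as in the curl example.
	fmt.Println("@hashtags:{biden} " + createdRange(dayAgo.UnixNano(), now.UnixNano()))
}
```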

If you were lucky enough to grab some coordinate info on the tweets, you can try extracting them and then query on the coordinates attribute:

curl -i "$REDISEARCH_API_BASE_URL?q=*&fields=coordinates" 
curl -g -i "$REDISEARCH_API_BASE_URL?q=@coordinates:[-122.41%2037.77%2010%20km]"

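These are just a few examples. Feel free to experiment further and try out other queries. This section of the RediSearch documentation might come in handy!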
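The same queries can be issued from a program instead of curl. The sketch below only builds the request URL with Go's net/url package, which takes care of escaping spaces, braces, and brackets. searchURL is a hypothetical helper, and the parameter names q, fields, and offset_limit are the ones used in the examples above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// searchURL appends an escaped query string to the search endpoint.
// fields and offsetLimit are optional; pass nil or "" to omit them.
func searchURL(base, q string, fields []string, offsetLimit string) string {
	params := url.Values{}
	params.Set("q", q)
	if len(fields) > 0 {
		params.Set("fields", strings.Join(fields, ","))
	}
	if offsetLimit != "" {
		params.Set("offset_limit", offsetLimit)
	}
	// Encode sorts the keys and percent-encodes the values.
	return base + "?" + params.Encode()
}

func main() {
	base := "http://20.197.96.54:80/search" // sample address from the docker ps output
	fmt.Println(searchURL(base, "@location:India @source:android", []string{"user", "text"}, "0,100"))
}
```

The resulting string can be passed to http.Get; the service decodes the parameters again with url.ParseQuery, as shown in the code walkthrough.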

Important: After you finish, don't forget to stop the services and the respective containers in Azure Container Instances:

docker compose down -p azure-redisearch-app

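Use the Azure Portal to delete the Azure Redis instance that you created.

Code walkthrough

This section provides a high-level overview of the code for the individual components. This should make it easier to navigate the source code in the GitHub repo.

Tweets consumer/indexer

The go-twitter library is used to interact with Twitter.

It authenticates to the Twitter Streaming API: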

config := oauth1.NewConfig(GetEnvOrFail(consumerKeyEnvVar), GetEnvOrFail(consumerSecretKeyEnvVar)) 
token := oauth1.NewToken(GetEnvOrFail(accessTokenEnvVar), GetEnvOrFail(accessSecretEnvVar)) 
httpClient := config.Client(oauth1.NoContext, token) 
client := twitter.NewClient(httpClient)
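The GetEnvOrFail helper used above is not part of the excerpt. A minimal sketch of how such a helper might behave, assuming it simply aborts when a required variable is missing or empty (lookupEnv is a hypothetical name introduced here to keep the check testable):

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// lookupEnv returns the value of a required variable, or an error
// when the variable is unset or empty.
func lookupEnv(key string) (string, error) {
	val, ok := os.LookupEnv(key)
	if !ok || val == "" {
		return "", fmt.Errorf("missing required environment variable %s", key)
	}
	return val, nil
}

// GetEnvOrFail is an assumed implementation: it stops the process
// if the variable is not available.
func GetEnvOrFail(key string) string {
	val, err := lookupEnv(key)
	if err != nil {
		log.Fatal(err)
	}
	return val
}

func main() {
	// Set the variable here only so that the sketch runs on its own.
	os.Setenv("REDISEARCH_INDEX_NAME", "tweets-index")
	fmt.Println(GetEnvOrFail("REDISEARCH_INDEX_NAME"))
}
```

Failing fast at startup fits a containerized service: a missing credential in docker-compose.yml shows up immediately in docker logs instead of as a connection error later.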

And listens to a stream of tweets in a separate goroutine:

demux := twitter.NewSwitchDemux() 
demux.Tweet = func(tweet *twitter.Tweet) { 
  if !tweet.PossiblySensitive { 
    go index.AddData(tweetToMap(tweet)) 
    time.Sleep(3 * time.Second) 
  } 
} 
go func() { 
  for tweet := range stream.Messages { 
    demux.Handle(tweet) 
  } 
}()
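The pattern above (drain a channel in a goroutine, pick out the tweets, skip the sensitive ones) can be tried without Twitter credentials. The sketch below is a simplified stand-in and not the go-twitter API: Tweet is a hypothetical trimmed-down type, and a type assertion plays the role of the demux.

```go
package main

import (
	"fmt"
	"sync"
)

// Tweet is a hypothetical, trimmed-down stand-in for twitter.Tweet.
type Tweet struct {
	Text              string
	PossiblySensitive bool
}

// consume drains messages, forwards non-sensitive tweets to index,
// and ignores every other message type. It returns when the channel closes.
func consume(messages <-chan interface{}, index func(Tweet)) {
	for msg := range messages {
		if tweet, ok := msg.(Tweet); ok && !tweet.PossiblySensitive {
			index(tweet)
		}
	}
}

func main() {
	messages := make(chan interface{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		consume(messages, func(t Tweet) { fmt.Println("indexing:", t.Text) })
	}()

	messages <- Tweet{Text: "hello redis"}
	messages <- Tweet{Text: "hidden", PossiblySensitive: true} // skipped
	messages <- "keep-alive"                                   // not a tweet, ignored
	close(messages)
	wg.Wait()
}
```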

Notice go index.AddData(tweetToMap(tweet)): this is where the indexing component is invoked. It connects to Azure Cache for Redis:

host := GetEnvOrFail(redisHost) 
password := GetEnvOrFail(redisPassword) 
indexName = GetEnvOrFail(indexNameEnvVar) 
pool = &redis.Pool{Dial: func() (redis.Conn, error) { 
  return redis.Dial("tcp", host, redis.DialPassword(password), redis.DialUseTLS(true), redis.DialTLSConfig(&tls.Config{MinVersion: tls}
  }

It drops the index (along with the existing documents) before re-creating it:

rsClient := redisearch.NewClientFromPool(pool, indexName) 
err := rsClient.DropIndex(true) 
schema := redisearch.NewSchema(redisearch.DefaultOptions). 
AddField(redisearch.NewTextFieldOptions("id", redisearch.TextFieldOptions{})). 
AddField(redisearch.NewTextFieldOptions("user", redisearch.TextFieldOptions{})). 
AddField(redisearch.NewTextFieldOptions("text", redisearch.TextFieldOptions{})). 
AddField(redisearch.NewTextFieldOptions("source", redisearch.TextFieldOptions{})). 
//tags are comma-separated by default 
AddField(redisearch.NewTagFieldOptions("hashtags", redisearch.TagFieldOptions{})). 
AddField(redisearch.NewTextFieldOptions("location", redisearch.TextFieldOptions{})). 
AddField(redisearch.NewNumericFieldOptions("created", redisearch.NumericFieldOptions{Sortable: true})). 
AddField(redisearch.NewGeoFieldOptions("coordinates", redisearch.GeoFieldOptions{})) 
indexDefinition := redisearch.NewIndexDefinition().AddPrefix(indexDefinitionHashPrefix) 
err = rsClient.CreateIndexWithIndexDefinition(schema, indexDefinition)

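The index and its associated documents are dropped to let you start from a clean state, which makes it easier to experiment and demo. You can choose to comment out this part if you wish.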
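For readers more familiar with the Redis CLI, the schema built with the Go client corresponds roughly to an FT.CREATE command. The sketch below only assembles and prints that command. The tweet: prefix is an assumption based on the hash naming used by the indexer (the value of indexDefinitionHashPrefix is not shown here), and any extra options the client may send are not reproduced.

```go
package main

import (
	"fmt"
	"strings"
)

// createIndexArgs returns the arguments of an FT.CREATE command that
// mirrors the field list of the schema defined with the Go client.
func createIndexArgs(indexName, hashPrefix string) []string {
	return []string{
		"FT.CREATE", indexName,
		"ON", "HASH",
		"PREFIX", "1", hashPrefix, // index only hashes whose key starts with the prefix
		"SCHEMA",
		"id", "TEXT",
		"user", "TEXT",
		"text", "TEXT",
		"source", "TEXT",
		"hashtags", "TAG", // tags are comma-separated by default
		"location", "TEXT",
		"created", "NUMERIC", "SORTABLE",
		"coordinates", "GEO",
	}
}

func main() {
	// "tweet:" is an assumed prefix, see the note above.
	fmt.Println(strings.Join(createIndexArgs("tweets-index", "tweet:"), " "))
}
```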

Information for each tweet is stored in a HASH (named tweet:<tweet ID>) using the HSET operation:

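func AddData(tweetData map[string]interface{}) {
    conn := pool.Get()
    defer conn.Close() // return the connection to the pool
    hashName := fmt.Sprintf("tweet:%s", tweetData["id"])
    // flatten the map into: key field1 value1 field2 value2 ...
    val := redis.Args{hashName}.AddFlat(tweetData)
    if _, err := conn.Do("HSET", val...); err != nil {
        log.Println("failed to store tweet", hashName, err)
    }
}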
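The tweetToMap function is not shown in this walkthrough. As a rough illustration of the shape of the data that ends up in the HASH, here is a sketch with a hypothetical Tweet type. It assumes that hashtags are joined with commas (matching the TAG field), that created is stored as epoch nanoseconds, and that coordinates use the "longitude,latitude" string form that RediSearch expects for GEO fields; the real function may differ.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Tweet is a hypothetical subset of the fields the indexer cares about.
type Tweet struct {
	ID        string
	User      string
	Text      string
	Hashtags  []string
	CreatedAt time.Time
	HasCoords bool
	Lon, Lat  float64
}

// hashKey returns the Redis key for a tweet, e.g. tweet:42.
func hashKey(id string) string { return "tweet:" + id }

// tweetToMap converts a tweet into the field/value pairs stored in the HASH.
func tweetToMap(t Tweet) map[string]interface{} {
	m := map[string]interface{}{
		"id":       t.ID,
		"user":     t.User,
		"text":     t.Text,
		"hashtags": strings.Join(t.Hashtags, ","), // CSV for the TAG field
		"created":  t.CreatedAt.UnixNano(),        // assumption: nanoseconds
	}
	if t.HasCoords {
		m["coordinates"] = fmt.Sprintf("%f,%f", t.Lon, t.Lat) // "lon,lat"
	}
	return m
}

func main() {
	t := Tweet{ID: "42", User: "jo", Text: "hello", Hashtags: []string{"redis", "azure"},
		CreatedAt: time.Unix(1611556920, 0), HasCoords: true, Lon: -122.41, Lat: 37.77}
	fmt.Println(hashKey(t.ID), tweetToMap(t))
}
```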

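Tweets search

The search service exposes a REST API to query RediSearch. All the options (including the query itself) are passed as query parameters, for example https://:8080/search?q=@source:iphone. It extracts the required query parameters: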

qParams, err := url.ParseQuery(req.URL.RawQuery) 
if err != nil { 
    log.Println("invalid query params") 
    http.Error(rw, err.Error(), http.StatusBadRequest) 
    return 
} 
searchQuery := qParams.Get(queryParamQuery) 
query := redisearch.NewQuery(searchQuery)
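The excerpt reads q but does not show what happens when it is absent. The sketch below assumes the service rejects such requests with 400 Bad Request; this is an assumption, not confirmed by the code shown. It uses net/http/httptest so it runs without a server, and searchHandler and statusFor are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// searchHandler rejects requests that lack the q parameter.
// The actual search call is replaced by a placeholder response.
func searchHandler(rw http.ResponseWriter, req *http.Request) {
	q := req.URL.Query().Get("q")
	if q == "" {
		http.Error(rw, "missing required query parameter: q", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(rw, "would search for: %s", q)
}

// statusFor runs the handler against an in-memory request and
// returns the HTTP status code it produced.
func statusFor(target string) int {
	rec := httptest.NewRecorder()
	searchHandler(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor("/search"))                  // prints 400
	fmt.Println(statusFor("/search?q=@source:iphone")) // prints 200
}
```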

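The q parameter is mandatory. However, you can also use the following parameters for search:

fields: to specify which attributes you want returned in the result, and
offset_limit: to specify the offset from which to search and the number of documents to include in the result (by default, the offset is 0 and the limit is 10, as per the RediSearch Go client).

For example: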

https://:8080/search?q=@source:Web&fields=user,source&offset_limit=5,100 
fields := qParams.Get(queryParamFields) 
offsetAndLimit := qParams.Get(queryParamOffsetLimit)
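How the two raw strings are turned into usable values is not shown. One way to parse them, as a sketch: parseFields and parseOffsetLimit are hypothetical helpers, and the defaults of 0 and 10 follow the description above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFields splits "user,source" into its attribute names,
// trimming whitespace and skipping empty entries.
func parseFields(s string) []string {
	var fields []string
	for _, f := range strings.Split(s, ",") {
		if f = strings.TrimSpace(f); f != "" {
			fields = append(fields, f)
		}
	}
	return fields
}

// parseOffsetLimit parses "offset,limit" (for example "5,100").
// An empty string yields the defaults: offset 0 and limit 10.
func parseOffsetLimit(s string) (offset, limit int, err error) {
	if s == "" {
		return 0, 10, nil
	}
	parts := strings.Split(s, ",")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("offset_limit must look like offset,limit; got %q", s)
	}
	if offset, err = strconv.Atoi(strings.TrimSpace(parts[0])); err != nil {
		return 0, 0, fmt.Errorf("invalid offset: %w", err)
	}
	if limit, err = strconv.Atoi(strings.TrimSpace(parts[1])); err != nil {
		return 0, 0, fmt.Errorf("invalid limit: %w", err)
	}
	if offset < 0 || limit < 0 {
		return 0, 0, fmt.Errorf("offset and limit must not be negative")
	}
	return offset, limit, nil
}

func main() {
	offset, limit, err := parseOffsetLimit("5,100")
	fmt.Println(parseFields("user,source"), offset, limit, err)
}
```

In the service, the parsed values would then be applied to the query object, presumably through the Go client's Limit and SetReturnFields methods.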

Finally, the results are iterated over and passed back as JSON (an array of documents):

docs, total, err := rsClient.Search(query) 
response := []map[string]interface{}{} 
for _, doc := range docs { 
response = append(response, doc.Properties) 
} 
rw.Header().Add(responseHeaderSearchHits, strconv.Itoa(total)) 
err = json.NewEncoder(rw).Encode(response)
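To see how the custom headers and the JSON body fit together, the response-writing step can be exercised in isolation with httptest. This is a sketch: writeResults and render are hypothetical names, the header names are taken from the sample output earlier in the post, and unlike that output (which reports text/plain) the sketch sets Content-Type to application/json.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// writeResults sends the documents as a JSON array and reports the total
// number of hits and the size of the returned page in custom headers.
func writeResults(rw http.ResponseWriter, docs []map[string]interface{}, total int) error {
	rw.Header().Set("Search-Hits", strconv.Itoa(total))
	rw.Header().Set("Page-Size", strconv.Itoa(len(docs)))
	rw.Header().Set("Content-Type", "application/json")
	return json.NewEncoder(rw).Encode(docs)
}

// render runs writeResults against an in-memory recorder and returns
// the two header values and the body.
func render(docs []map[string]interface{}, total int) (hits, pageSize, body string) {
	rec := httptest.NewRecorder()
	if err := writeResults(rec, docs, total); err != nil {
		return "", "", ""
	}
	return rec.Header().Get("Search-Hits"), rec.Header().Get("Page-Size"), rec.Body.String()
}

func main() {
	docs := []map[string]interface{}{{"user": "jo", "text": "hello"}}
	hits, pageSize, body := render(docs, 12)
	fmt.Println(hits, pageSize, body)
}
```

Headers have to be set before the first write to the body, which is why the sketch fills them in before encoding.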

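That's all for this section!

Redis Enterprise tiers on Azure Cache for Redis

Redis Enterprise is available as a native service on Azure in the form of two new tiers for Azure Cache for Redis, which are operated and supported by Microsoft and Redis Labs. This service gives developers access to a rich set of Redis Enterprise features, including modules like RediSearch. For more information, see these resources:

Conclusion

This end-to-end application demonstrates how to work with indexes, ingest real-time data to create documents (tweet information) that are indexed by the RediSearch engine, and then use the rich query syntax to extract insights from those tweets.

Want to understand what happens behind the scenes when you search for a topic in the Redis Labs documentation? Check out this blog post to learn how the Redis Labs site incorporated real-time full-text search with RediSearch! Or perhaps you are interested in exploring how to use RediSearch in a serverless application?

If you are still getting started, visit the RediSearch Quick Start page.

If you want to learn more about the enterprise capabilities in Azure Cache for Redis, you can check out the following resources: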