Named Entity Recognition with the OpenVINO™ Toolkit

Ragesh_Hajela

July 18, 2022

CPOL

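An NER system can extract entities from a collection of unstructured natural-language documents known as a knowledge base. In this article, we show how to use the OpenVINO™ toolkit to recognize entities in simple text and how to build a pipeline that performs entity extraction.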

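Named entity recognition (NER) is a standard natural language processing problem in the area of information extraction. The main goal is to locate named entities in text and classify them into predefined categories such as person names, organizations, and locations. It is widely used across industries for use cases such as data enrichment, content recommendation, document information retrieval, customer support, text summarization, and advanced search algorithms.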

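An NER system can extract entities from a collection of unstructured natural-language documents known as a knowledge base. The demo notebook 204-named-entity-recognition.ipynb shows how to use the OpenVINO™ toolkit to recognize entities in simple text and how to build a pipeline that performs entity extraction.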

This sample uses a small BERT-large-like model that was distilled from the larger BERT-large model and quantized to INT8 on the SQuAD v1.1 training set. The model comes from Open Model Zoo.

# desired precision
precision = "FP16-INT8"

# model name as named in Open Model Zoo
model_name = "bert-small-uncased-whole-word-masking-squad-int8-0002"

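The input to this entity extraction model is text of varying length, that is, a dynamic input shape. Starting with OpenVINO™ 2022.1, you can take advantage of dynamic shape support on CPU. This means you can compile the model with a shape that has an upper bound or is left undefined, which makes it possible to run inference on texts of different lengths in each iteration without any additional manipulation of the network or the data. Let's start by initializing the inference engine and reading the model.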

from openvino.runtime import Core

# initialize inference engine
ie_core = Core()

# read the network and corresponding weights from file
model = ie_core.read_model(model=model_path, weights=model_weights_path)
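
For contrast, a model compiled with a fixed input shape would need every tokenized text padded to the same length before inference. The following is a minimal sketch of that extra step. The helper name pad_to_length, the pad id 0, and the stand-in token ids are assumptions made for illustration and are not part of the notebook. With dynamic shapes, this step can be skipped.

```python
def pad_to_length(token_ids, target_len, pad_id=0):
    """Right-pad token_ids with pad_id up to target_len (static-shape style)."""
    if len(token_ids) > target_len:
        raise ValueError("input is longer than the static shape allows")
    return token_ids + [pad_id] * (target_len - len(token_ids))


# Stand-in ids for a 20-token text (made-up values, not real vocabulary ids).
short_input = list(range(1, 21))

# Static shape: every input is stretched to 384 positions.
padded = pad_to_length(short_input, 384)
wasted = len(padded) - len(short_input)
print(f"static shape: {len(padded)} positions, {wasted} of them padding")

# Dynamic shape: the tensor is only as long as the text itself.
print(f"dynamic shape: {len(short_input)} positions, no padding")
```

The shorter the text, the larger the share of positions a static shape spends on padding, which is the overhead that dynamic shapes are meant to avoid.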

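When using dynamic input shapes, you need to specify the input dimensions before the model is compiled for the device. You can do this either by assigning -1 to the input dimension or by setting an upper bound on it. Within this notebook, the longest allowed input is 384 tokens (380 content tokens + 1 for the entity + 3 special (separator) tokens), so it is preferable to assign the dynamic shape with a bounded dimension, Dimension(lower bound, upper bound), here Dimension(1, 384), which uses memory more efficiently.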

from openvino.runtime import Dimension

# assign dynamic shapes to every input layer on the last dimension
for input_layer in model.inputs:
    input_shape = input_layer.partial_shape
    input_shape[1] = Dimension(1, 384)
    model.reshape({input_layer: input_shape})

# compile the model for the CPU
compiled_model = ie_core.compile_model(model=model, device_name="CPU")
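
The 384-token upper bound can be checked with simple arithmetic. The sketch below assumes the usual BERT question-answering layout, [CLS] entity [SEP] context [SEP], which accounts for the 3 special tokens; the helper name build_sequence is hypothetical and the notebook's own input preparation may differ.

```python
MAX_INPUT_LEN = 384          # upper bound used in Dimension(1, 384)
CLS, SEP = "[CLS]", "[SEP]"  # assumed BERT special tokens


def build_sequence(entity_tokens, context_tokens, max_len=MAX_INPUT_LEN):
    """Lay out one model input and verify that it fits the upper bound."""
    sequence = [CLS] + entity_tokens + [SEP] + context_tokens + [SEP]
    if len(sequence) > max_len:
        raise ValueError(
            f"input has {len(sequence)} tokens, upper bound is {max_len}")
    return sequence


# 1 entity token + 380 context tokens + 3 special tokens = 384
longest = build_sequence(["city"], ["word"] * 380)
print(len(longest))

# A short text yields a short input; nothing is padded up to 384.
short = build_sequence(["city"], ["paris", "is", "large"])
print(len(short))
```

Any input between 1 and 384 tokens satisfies the bounded dimension, so the compiled model can accept each text at its natural length.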

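NLP models usually take an entity, a context, and a vocabulary as standard inputs. First, you create a list of tokens from the context and the entity; then you extract the best entity by trying different parts of the context and comparing the prediction confidence scores. Only entities with a prediction confidence score above 0.4 are captured in the final output, and you can adjust this value as needed.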

def get_best_entity(entity, context, vocab):
    # convert context string to tokens
    context_tokens, context_tokens_end = tokens.text_to_tokens(
        text=context.lower(), vocab=vocab)
    # convert entity string to tokens
    entity_tokens, _ = tokens.text_to_tokens(text=entity.lower(), vocab=vocab)

    network_input = prepare_input(entity_tokens, context_tokens)
    # total input length: context + entity + 3 special tokens
    input_size = len(context_tokens) + len(entity_tokens) + 3

    # openvino inference
    output_start_key = compiled_model.output("output_s")
    output_end_key = compiled_model.output("output_e")
    result = compiled_model(network_input)

    # postprocess the result getting the score and context range for the entity
    score_start_end = postprocess(output_start=result[output_start_key][0],
                                  output_end=result[output_end_key][0],
                                  entity_tokens=entity_tokens,
                                  context_tokens_start_end=context_tokens_end,
                                  input_size=input_size)

    # return the part of the context as an extracted entity
    return context[score_start_end[1]:score_start_end[2]], score_start_end[0]
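
The postprocess helper is not shown here. As a rough illustration of how a confidence score and a span can be derived from the start and end outputs ("output_s" and "output_e"), the sketch below uses a common extractive question-answering approach: apply softmax to both outputs and pick the pair of positions with the highest product of probabilities. The function names, the max_span_len limit, and the sample logits are assumptions, and the notebook's actual implementation may differ.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_span_len=16):
    """Return (score, start, end) maximizing P(start) * P(end), start <= end.

    Indices are token positions, with end inclusive.
    """
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0.0, 0, 0)
    for i, ps in enumerate(p_start):
        # Only consider spans that begin at i and stay within max_span_len.
        for j in range(i, min(i + max_span_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[0]:
                best = (score, i, j)
    return best


# Made-up logits: the model is confident the answer spans tokens 1..2.
score, start, end = best_span([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0])
print(f"span=({start}, {end}) score={score:.2f}")
```

The resulting score is what gets compared against the 0.4 threshold; mapping token positions back to character offsets in the context is a separate step.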

For this example, you can try named entity recognition for the following entities, defined as a simple template.

template = ["building", "company", "persons", "city", "state", "height", "floor", "address"]

It works! The output is shown as a JSON object containing the extracted entities, the context, and the extraction confidence scores.
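
The loop over the template, the 0.4 threshold, and the JSON output can be sketched as follows. This is a simplified illustration: extract_entities and fake_get_entity are hypothetical names, the JSON keys are assumptions, and the canned answers stand in for real model inference.

```python
import json


def extract_entities(context, template, get_entity, threshold=0.4):
    """Keep only entities whose confidence score exceeds the threshold.

    get_entity(entity, context) must return a (text, score) pair.
    """
    extraction = {}
    for entity in template:
        text, score = get_entity(entity, context)
        if score > threshold:
            extraction[entity] = {"value": text, "score": round(score, 2)}
    return {"Extraction": extraction, "Text": context}


def fake_get_entity(entity, context):
    # Hypothetical stand-in for the model: canned answers, no inference.
    canned = {
        "building": ("Example Tower", 0.93),
        "city": ("Springfield", 0.35),  # below the threshold, so dropped
    }
    return canned.get(entity, ("", 0.0))


context = "Example Tower is a building in Springfield."
result = extract_entities(
    context, ["building", "city", "height"], fake_get_entity)
print(json.dumps(result, indent=2))
```

Raising the threshold trades recall for precision: fewer entities are reported, but those that remain carry higher confidence.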

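Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.intel.com/PerformanceIndex

No product or component can be absolutely secure.

Intel technologies may require enabled hardware, software, or service activation.

The products described may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.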