Speech-to-Text Recognition with OpenVINO™

Adrian Boguszewski

July 20, 2022

CPOL

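In this article, I will show you how to run inference on a speech-to-text recognition model with OpenVINO™, so that you can start using this capability in your own applications.

Speech-to-text is quickly becoming a key part of everyday life. Whether you want to help drivers send messages safely without taking their hands off the wheel, or you are a business that wants to be more accessible to its customers, it is a capability AI developers need to have.

Today, the most common use cases for speech-to-text include automatic transcription of phone calls and meetings. There is also a trend toward using it as one part of a larger service. For example, speech-to-text can be combined with a machine translation service to automatically create video subtitles in other languages.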

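In this demo, we use the QuartzNet 15x5 model to perform automatic speech recognition. This model is based on Jasper, a neural acoustic end-to-end architecture trained with Connectionist Temporal Classification (CTC) loss.

The first step of the demo imports the required functions and declares the program's variables. It also specifies the model precision (FP16 in this case) and the model name.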
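
The setup step is not reproduced in this article, so here is a minimal sketch of what the variable declarations might look like. The precision and model name come from the text; the folder names `output` and `model` are my assumptions, as is the `ir_xml` helper variable, which only illustrates the directory layout that the later snippets rely on.

```python
from pathlib import Path

# Values stated in the article
precision = "FP16"
model_name = "quartznet-15x5-en"

# Assumed folder names (not given in the article)
download_folder = "output"
model_folder = "model"

# Where the later snippets look for the downloaded weights and the converted IR
path_to_model_weights = Path(download_folder) / "public" / model_name / "models"
ir_xml = Path(model_folder) / "public" / model_name / precision / f"{model_name}.xml"

print(path_to_model_weights.as_posix())
print(ir_xml.as_posix())
```

Keeping these values in one place means the download, conversion, and loading steps all agree on the same paths.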

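The next part of the code checks whether the model needs to be downloaded and creates a subdirectory structure for it. In this case, the quartznet-15x5-en model is downloaded from Open Model Zoo and must be converted to Intermediate Representation (IR).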

from pathlib import Path

# Check if model is already downloaded in download directory
path_to_model_weights = Path(f'{download_folder}/public/{model_name}/models')
downloaded_model_file = list(path_to_model_weights.glob('*.pth'))

if not path_to_model_weights.is_dir() or len(downloaded_model_file) == 0:
    download_command = f"omz_downloader --name {model_name} --output_dir {download_folder} --precision {precision}"
    # "!" is Jupyter/IPython shell syntax; it runs the command in a subshell
    ! $download_command

# Check if model is already converted in model directory
path_to_converted_weights = Path(f'{model_folder}/public/{model_name}/{precision}/{model_name}.bin')

if not path_to_converted_weights.is_file():
    convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {download_folder} --output_dir {model_folder}"
    ! $convert_command

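The conversion is handled by the Model Converter, omz_converter, a command-line tool from the openvino-dev package. It converts the pretrained PyTorch model to ONNX format and then to Intel's OpenVINO format (Intermediate Representation, or IR, files). Both steps are handled by the same command.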
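
The `!` prefix only works inside a Jupyter notebook. If you want to run the same two commands from a plain Python script, one possible approach is to build them as argument lists for `subprocess.run`. This is only a sketch: `build_omz_commands` is a hypothetical helper of my own, the folder names `output` and `model` are assumptions, and the flags are copied unchanged from the commands shown earlier.

```python
import shlex


def build_omz_commands(model_name, precision, download_folder, model_folder):
    """Return the downloader and converter commands as argument lists."""
    download = shlex.split(
        f"omz_downloader --name {model_name} --output_dir {download_folder} "
        f"--precision {precision}"
    )
    convert = shlex.split(
        f"omz_converter --name {model_name} --precisions {precision} "
        f"--download_dir {download_folder} --output_dir {model_folder}"
    )
    return download, convert


download_cmd, convert_cmd = build_omz_commands(
    "quartznet-15x5-en", "FP16", "output", "model"
)
print(download_cmd)
print(convert_cmd)

# To actually run them (requires the openvino-dev package to be installed):
#   import subprocess
#   subprocess.run(download_cmd, check=True)
#   subprocess.run(convert_cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a folder name ever contains spaces.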

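Once this is done, the demo loads an audio file and defines an alphabet for speech recognition. Multiple audio formats are supported, including WAV, FLAC, and OGG.

After the model has been converted to OpenVINO format, the preprocessed audio is converted into a Mel spectrogram. The Mel scale spaces frequencies according to how humans perceive pitch, so it helps the computer "listen" the way a person does. This makes the data easier to process and yields better performance. For a full deep dive into Mel spectrograms and their uses in deep learning, read this article.

Note: The audio must be sampled at 16 kHz for the conversion to work.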

import librosa
import numpy as np
import scipy.signal


def audio_to_mel(audio, sampling_rate):
    assert sampling_rate == 16000, "Only 16 KHz audio supported"
    # Pre-emphasis boosts high frequencies before the transform
    preemph = 0.97
    preemphased = np.concatenate([audio[:1], audio[1:] - preemph * audio[:-1].astype(np.float32)])

    # Calculate window length (20 ms)
    win_length = round(sampling_rate * 0.02)

    # Based on previously calculated window length run short-time Fourier transform (10 ms hop)
    spec = np.abs(librosa.stft(preemphased, n_fft=512, hop_length=round(sampling_rate * 0.01), win_length=win_length, center=True, window=scipy.signal.windows.hann(win_length), pad_mode='reflect'))

    # Create mel filter-bank, produce transformation matrix to project current values onto Mel-frequency bins
    # (keyword arguments, because recent librosa releases no longer accept sr and n_fft positionally)
    mel_basis = librosa.filters.mel(sr=sampling_rate, n_fft=512, n_mels=64, fmin=0.0, fmax=8000.0, htk=False)
    return mel_basis, spec


def mel_to_input(mel_basis, spec, padding=16):
    # Convert to logarithmic scale
    log_melspectrum = np.log(np.dot(mel_basis, np.power(spec, 2)) + 2 ** -24)

    # Normalize output
    normalized = (log_melspectrum - log_melspectrum.mean(1)[:, None]) / (log_melspectrum.std(1)[:, None] + 1e-5)

    # Pad the time axis to a multiple of `padding`, then add a batch dimension
    remainder = normalized.shape[1] % padding
    if remainder != 0:
        return np.pad(normalized, ((0, 0), (0, padding - remainder)))[None]
    return normalized[None]
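
To get a feel for the numbers involved, the arithmetic in the two functions above can be worked through without any audio libraries. The sketch below assumes the same 20 ms window, 10 ms hop, and padding of 16 used in the code; `frame_layout` is a hypothetical helper of my own, and the frame count relies on the fact that a centered STFT produces `1 + n_samples // hop_length` frames.

```python
def frame_layout(n_samples, sampling_rate=16000, padding=16):
    """Return (win_length, hop_length, n_frames, padded_frames) for a clip."""
    win_length = round(sampling_rate * 0.02)  # 20 ms window
    hop_length = round(sampling_rate * 0.01)  # 10 ms hop
    # A centered STFT yields one frame per hop, plus one
    n_frames = 1 + n_samples // hop_length
    # Round the frame count up to the next multiple of `padding`
    remainder = n_frames % padding
    padded = n_frames if remainder == 0 else n_frames + padding - remainder
    return win_length, hop_length, n_frames, padded


# One second of 16 kHz audio
print(frame_layout(16000))  # (320, 160, 101, 112)
```

So one second of audio becomes 101 spectrogram frames, which are padded to 112 before being handed to the model.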

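This code block runs the necessary transformations before the model is loaded. End users can target CPU, GPU, or MYRIAD (Neural Compute Stick 2). If the device is set to AUTO, the system picks the target device itself for the best performance.

To initialize and load the network, refer to the code sample below. By default, the model runs on the CPU, but you can manually choose to run the workload on CPU, GPU, or MYRIAD. The print(ie.available_devices) call below lists every device that can execute the workload. To change the target device, change device_name (currently set to CPU).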

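from openvino.runtime import Core

ie = Core()

# List every device that can execute the workload
print(ie.available_devices)

model = ie.read_model(model=f"{model_folder}/public/{model_name}/{precision}/{model_name}.xml")
model_input_layer = model.input(0)
# Make the time axis dynamic so that clips of any length are accepted
shape = model_input_layer.partial_shape
shape[2] = -1
model.reshape({model_input_layer: shape})
compiled_model = ie.compile_model(model=model, device_name="CPU")

output_layer_ir = compiled_model.output(0)

# `audio` is expected to hold the model input prepared by mel_to_input
character_probabilities = compiled_model([audio])[output_layer_ir]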

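output_layer_ir is a handle to the network's output node. Once inference is complete, the data must be read and converted into a more understandable format.

The default output is a per-frame probability for each symbol in the alphabet. These probabilities must be decoded with a Connectionist Temporal Classification (CTC) function. The alphabet is encoded as 0 = space, 1 to 26 = "a" to "z", 27 = apostrophe, and 28 = the CTC blank symbol.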
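
The alphabet definition itself is not shown in this article, so here is a minimal sketch that follows the index layout just described. The `~` placeholder for the CTC blank is my assumption (any character works, since the blank is never emitted), and `frame_to_index` is a hypothetical helper that picks the most likely symbol for a single frame.

```python
import string

# 0 = space, 1-26 = "a"-"z", 27 = apostrophe, 28 = CTC blank
# "~" is an assumed placeholder for the blank symbol
alphabet = " " + string.ascii_lowercase + "'" + "~"


def frame_to_index(frame_scores):
    """Return the index of the highest-scoring symbol in one frame."""
    return max(range(len(frame_scores)), key=frame_scores.__getitem__)


# A toy frame in which index 8 ("h") has the highest score
toy_frame = [0.0] * len(alphabet)
toy_frame[8] = 0.9
print(len(alphabet), repr(alphabet[frame_to_index(toy_frame)]))
```

Applying `frame_to_index` to every frame of the output gives the sequence of symbol indices that the decoder consumes.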

The code below handles decoding against the alphabet.

def ctc_greedy_decode(predictions):
    previous_letter_id = blank_id = len(alphabet) - 1
    transcription = list()
    for letter_index in predictions:
        if previous_letter_id != letter_index != blank_id:
            transcription.append(alphabet[letter_index])
        previous_letter_id = letter_index
    return ''.join(transcription)
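
To see what the greedy decoder does, it helps to run it on a hand-made sequence of frame indices. The sketch below repeats the function so that it runs on its own, and assumes the 29-symbol alphabet layout described above with `~` as a placeholder for the blank. The index sequences are invented for illustration and are not real model output.

```python
import string

# Assumed alphabet: space, a-z, apostrophe, and "~" standing in for the CTC blank
alphabet = " " + string.ascii_lowercase + "'" + "~"


def ctc_greedy_decode(predictions):
    previous_letter_id = blank_id = len(alphabet) - 1
    transcription = list()
    for letter_index in predictions:
        # Keep a symbol only if it differs from the previous frame and is not blank
        if previous_letter_id != letter_index != blank_id:
            transcription.append(alphabet[letter_index])
        previous_letter_id = letter_index
    return ''.join(transcription)


# h h _ e l l _ l o o _   (where _ is the blank, index 28)
with_blank = [8, 8, 28, 5, 12, 12, 28, 12, 15, 15, 28]
# Same letters, but no blank between the two runs of "l"
without_blank = [8, 8, 5, 12, 12, 12, 15, 15]

print(ctc_greedy_decode(with_blank))     # hello
print(ctc_greedy_decode(without_blank))  # helo
```

Repeated indices collapse into a single letter, so a genuine double letter only survives when a blank frame separates the two runs.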

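With that, the recognition process is complete!

Speech-to-text and text-to-speech are expected to become even more widespread in the coming years as more businesses adopt AI for customer-facing features. We hope this blog post and the accompanying code samples help you in your own exploration of the topic. To learn more about OpenVINO and build your AI developer skills, we invite you to take part in our 30-Day Dev Challenge.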
