Speech Recognition in Mono and .NET C#

Speech recognition in Mono and .NET C# using an open-source ASR framework

DaveMathews

4.96/5 (27 votes)

March 29, 2015

CPOL

8 min read

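Speaker-independent speech recognition in Mono and .NET C#

Download the project from GitHub (about 34.1 MB)

(Includes the Mono project files, all the required acoustic models, and 2 additional sample WAVE audio files. Just click the "Download ZIP" button at the lower right.)

The framework used in this article is available as an open-source project. You can find the repository link below.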

https://github.com/SynHub/syn-speech

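Introduction

This article explains continuous, speaker-independent speech recognition in Mono, using two different approaches to transcribe an audio file (encoded in WAVE format) to text.

Background

About doing speech recognition in Mono on Linux: I had been patiently waiting for a breakthrough, especially since I was working on a home automation project and did not want Windows as its primary operating system. I was using Linux and the Mono framework, and I really needed a speech recognition library I could use. So, after months of complaining around the web, I finally found a suitable candidate.

A glowing amber halo appeared over my head, and I decided to share my experience and some useful code on CodeProject.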

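GitHub Repository

Recently (as of this edit), the framework was open-sourced. You can find the complete source code of the speech recognition engine in the repository linked above.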

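Getting Started

Getting down to business: I won't be creating a sci-fi GUI to demonstrate the features and usage patterns. Instead, I'll stick with an old-school console interface.

In MonoDevelop

I'll be using MonoDevelop, but Visual Studio developers should not find this hard to follow. If anything, it should be easier for developers using Visual Studio.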

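Launch MonoDevelop -> File -> New -> Solution, and select Console C#.

Once the console project is created, you need to import the required NuGet package.

To do so (provided you have the NuGet package manager installed in MonoDevelop), right-click your project name and select Manage NuGet Packages...

You will be presented with a Manage Packages... dialog. In the search box, type Syn.Speech and press Enter.

Once the library is found, click Add.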

In Visual Studio

Select File -> New -> Project -> Visual C# -> Console Application.

To import the library into your project, click Tools -> NuGet Package Manager -> Package Manager Console, and type:

PM> Install-Package Syn.Speech

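Great! We have imported the required library.

Next, the library cannot transcribe a given audio file on its own, without supporting data. In the speech recognition world, this data is known as acoustic models.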

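Getting the Required Files

Most models are quite large, because they are trained on large amounts of data and describe a complex language. That is one of the reasons I did not upload them with this article.

So be a little patient and download the project files attached to this article.

After downloading, extract the archive. Browse to the Bin/Debug directory, where you will find the Models and Audio folders. Copy and paste these folders into the Bin/Debug directory of your Mono project.

Right, we now have the required models (acoustic models) and the audio files to transcribe.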

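Let's start coding...

Transcribing an Audio File without a Grammar

The speech recognition engine we are using can (at the time of writing) only process WAVE audio files.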
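Since only WAVE input is accepted, it can save some head-scratching to check a file's container before handing it to the recognizer. The following is a minimal sketch that uses only the standard library; the WaveCheck class and its LooksLikeWave method are hypothetical helper names, not part of Syn.Speech. It only inspects the 12-byte RIFF/WAVE header and says nothing about sample rate or encoding.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Syn.Speech): checks the RIFF/WAVE container header.
static class WaveCheck
{
    // Returns true when the stream starts with "RIFF" and declares the "WAVE" form type.
    public static bool LooksLikeWave(Stream stream)
    {
        var header = new byte[12];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) return false; // shorter than a RIFF header
            read += n;
        }
        return Encoding.ASCII.GetString(header, 0, 4) == "RIFF"
            && Encoding.ASCII.GetString(header, 8, 4) == "WAVE";
    }

    // Convenience overload that opens the file read-only and closes it afterwards.
    public static bool LooksLikeWave(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            return LooksLikeWave(stream);
        }
    }

    static void Main()
    {
        var audioFile = Path.Combine(Directory.GetCurrentDirectory(), "Audio", "Long Audio 2.wav");
        if (!File.Exists(audioFile))
        {
            Console.WriteLine("Audio file not found: " + audioFile);
            return;
        }
        Console.WriteLine(LooksLikeWave(audioFile) ? "Looks like a WAVE file." : "Not a WAVE file.");
    }
}
```

A check like this would run before the recognizer is started, so that a wrongly named or converted file is reported early.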

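In the Audio directory, we have two audio files named Long Audio.wav and Long Audio 2.wav. Before we try to transcribe them, listen to Long Audio 2.wav. You will hear someone say "the time is now exactly twenty five to one".

For any offline speech recognition engine (working with a limited set of acoustic models), that is a fairly long sentence to transcribe. But we are going to transcribe it anyway, so hang on.

In the MainClass of your console application, add the following C# code: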

static Configuration speechConfiguration;
static StreamSpeechRecognizer speechRecognizer;

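The code above declares 2 important objects, which we will initialize later.

The first is the Configuration class. It holds information such as the locations of the acoustic model, the language model, and the dictionary. It also tells the speech recognizer whether we intend to use a grammar file.

Next, the StreamSpeechRecognizer class is the main class that lets you direct an audio stream to the speech recognition engine. Once the computation is done, we use this class to get the result.

Getting Log Information

We shouldn't blindly let StreamSpeechRecognizer transcribe an audio file for us without knowing what is happening inside. Instead, we should try to learn more about it. End of story.

To get the internal logs generated by the speech recognition engine, add the following C# code below the static variables above.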

static void LogReceived (object sender, LogReceivedEventArgs e)
{
     Console.WriteLine (e.Message);
}

Now, in the Main method, add the following line:

Logger.LogReceived += LogReceived;

Great! From now on, any message received by the logger is written to the console.
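The engine logs quite a lot, so it can help to timestamp the messages or send them to a file instead of the console. Here is a small sketch of that idea; the TimestampedLog class is a hypothetical helper of mine that uses only the standard library and accepts any TextWriter. How it would be wired to the logger is shown in a comment, using the Logger.LogReceived event and e.Message from the handler above.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Syn.Speech): prefixes each message with a timestamp.
sealed class TimestampedLog
{
    private readonly TextWriter writer;

    public TimestampedLog(TextWriter writer)
    {
        if (writer == null) throw new ArgumentNullException("writer");
        this.writer = writer;
    }

    // Writes one line of the form "[HH:mm:ss.fff] message".
    public void Write(string message)
    {
        writer.WriteLine("[" + DateTime.Now.ToString("HH:mm:ss.fff") + "] " + message);
    }
}

static class LogDemo
{
    static void Main()
    {
        // Console.Out could be swapped for a StreamWriter to log to a file.
        var log = new TimestampedLog(Console.Out);

        // In the article's program, the wiring would look like this:
        //   Logger.LogReceived += (sender, e) => log.Write(e.Message);
        log.Write("Sample message");
    }
}
```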

Let's initialize and set up the Configuration and StreamSpeechRecognizer classes.

Add the following code to the Main method.

var modelsDirectory = Path.Combine (Directory.GetCurrentDirectory (), "Models");
var audioDirectory = Path.Combine (Directory.GetCurrentDirectory (), "Audio");
var audioFile = Path.Combine (audioDirectory, "Long Audio 2.wav");

if (!Directory.Exists (modelsDirectory)||!Directory.Exists(audioDirectory)) {
    Console.WriteLine ("No Models or Audio directory found!! Aborting...");
    Console.ReadLine ();
    return;
}

speechConfiguration = new Configuration ();
speechConfiguration.AcousticModelPath=modelsDirectory;
speechConfiguration.DictionaryPath = Path.Combine (modelsDirectory, "cmudict-en-us.dict");
speechConfiguration.LanguageModelPath = Path.Combine (modelsDirectory, "en-us.lm.dmp");

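In the code above, I create a few variables that hold the locations of the Models and Audio directories inside the Bin/Debug folder.

Next comes an annoying check that verifies you have copied the Audio and Models directories into the right folder.

After that, we come to the speechConfiguration variable and initialize its properties.

speechConfiguration.AcousticModelPath - the location where most of our acoustic model files reside
speechConfiguration.DictionaryPath - the path to the dictionary file (inside the Models directory in this case)
speechConfiguration.LanguageModelPath - the path to the language model file (also inside the Models directory)

All of the above properties must be assigned before we pass the configuration to the speech recognizer.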
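The directory check in the snippet only tells us that the folders exist, not that the individual model files made it across. As a rough sketch, the files the configuration points at could be verified too. ModelCheck and FindMissing are hypothetical names of mine; the two file names are the ones used in the configuration above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Syn.Speech): lists expected model files that are absent.
static class ModelCheck
{
    // Returns the full paths of the given files that do not exist in modelsDirectory.
    public static List<string> FindMissing(string modelsDirectory, params string[] fileNames)
    {
        var missing = new List<string>();
        foreach (var name in fileNames)
        {
            var path = Path.Combine(modelsDirectory, name);
            if (!File.Exists(path)) missing.Add(path);
        }
        return missing;
    }

    static void Main()
    {
        var modelsDirectory = Path.Combine(Directory.GetCurrentDirectory(), "Models");
        var missing = FindMissing(modelsDirectory, "cmudict-en-us.dict", "en-us.lm.dmp");

        if (missing.Count == 0)
        {
            Console.WriteLine("Dictionary and language model found.");
            return;
        }
        foreach (var path in missing)
        {
            Console.WriteLine("Missing: " + path);
        }
    }
}
```

Reporting the exact missing path tends to be more useful than a generic abort message when a copy step was skipped.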

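Why so many Path.Combine(s)?

Well, the path separator character differs between Windows and Linux: Windows uses a backslash (\), while Linux uses a forward slash (/). Path.Combine takes care of this mess and makes sure our code works on both Windows and Linux when combining paths.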
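To see this for yourself, here is a tiny standalone sketch. PathDemo and ModelFile are names I made up for the illustration; Path.Combine and Path.DirectorySeparatorChar are standard .NET members. The printed paths depend on the platform you run it on, which is exactly the point.

```csharp
using System;
using System.IO;

// Illustration only: builds a path to a file inside a "Models" folder.
static class PathDemo
{
    // Joins the parts with whatever separator the current platform uses.
    public static string ModelFile(string baseDirectory, string fileName)
    {
        return Path.Combine(baseDirectory, "Models", fileName);
    }

    static void Main()
    {
        Console.WriteLine("Separator on this platform: " + Path.DirectorySeparatorChar);
        Console.WriteLine(ModelFile(Directory.GetCurrentDirectory(), "cmudict-en-us.dict"));
    }
}
```

Hard-coding "Models\\cmudict-en-us.dict" instead would work on Windows and fail to resolve on Linux.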

Starting the StreamSpeechRecognizer

Add the following code right after the code above.

speechRecognizer = new StreamSpeechRecognizer (speechConfiguration);
speechRecognizer.StartRecognition (new FileStream (audioFile, FileMode.Open));

Console.WriteLine ("Transcribing...");
var result = speechRecognizer.GetResult ();

if (result != null) {
    Console.WriteLine ("Result: " + result.GetHypothesis ());
} 
else {
    Console.WriteLine ("Sorry! Couldn't Transcribe");
}

Console.ReadLine ();

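In the code above, we first instantiate the speechRecognizer object, then call the StartRecognition method, passing a FileStream that points to the audio file we are trying to transcribe.

Speech recognition does not actually begin right after the call to StartRecognition; the computation starts only when the GetResult method is called.

Once the result is computed, we call the GetHypothesis method to retrieve the hypothesis as a string.

Overall Code for Transcribing an Audio File without a Grammar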

using System;
using Syn.Speech.Api;
using System.IO;
using Syn.Logging;

namespace Speech.Recognition.Example
{
    class MainClass
    {
        static Configuration speechConfiguration;
        static StreamSpeechRecognizer speechRecognizer;

        static void LogReceived (object sender, LogReceivedEventArgs e)
        {
            Console.WriteLine (e.Message);
        }

        public static void Main (string[] args)
        {
            Logger.LogReceived += LogReceived;

            var modelsDirectory = Path.Combine (Directory.GetCurrentDirectory (), "Models");
            var audioDirectory = Path.Combine (Directory.GetCurrentDirectory (), "Audio");
            var audioFile = Path.Combine (audioDirectory, "Long Audio 2.wav");

            if (!Directory.Exists (modelsDirectory)||!Directory.Exists(audioDirectory)) {
                Console.WriteLine ("No Models or Audio directory found!! Aborting...");
                Console.ReadLine ();
                return;
            }

            speechConfiguration = new Configuration ();
            speechConfiguration.AcousticModelPath=modelsDirectory;
            speechConfiguration.DictionaryPath = Path.Combine (modelsDirectory, "cmudict-en-us.dict");
            speechConfiguration.LanguageModelPath = Path.Combine (modelsDirectory, "en-us.lm.dmp");

            speechRecognizer = new StreamSpeechRecognizer (speechConfiguration);
            speechRecognizer.StartRecognition (new FileStream (audioFile, FileMode.Open));

            Console.WriteLine ("Transcribing...");
            var result = speechRecognizer.GetResult ();

            if (result != null) {
                Console.WriteLine ("Result: " + result.GetHypothesis ());
            } else {
                Console.WriteLine ("Sorry! Couldn't Transcribe");
            }

            Console.ReadLine ();
        }
    }
}

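If you run the code above (Ctrl+F5 in MonoDevelop), you should see a console with lots of information. Finally (after a few seconds), the result should appear on screen.

Hopefully you have successfully transcribed your audio file.

You may have noticed that the application took a few seconds to transcribe Long Audio 2.wav. That is because we did not use a grammar file in our configuration. A grammar file narrows the search domain down to a limited number of specified tokens.

Next, we will learn how to use a grammar file to specify the set of words and sentences we wish to recognize.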

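Transcribing an Audio File Using a Grammar (JSGF)

To transcribe an audio file or stream using a grammar, we first need to create a grammar file, of course! The library supports JSGF (JSpeech Grammar Format) grammar files.

The syntax of JSGF is pretty simple, and it is the format from which SRGS (Speech Recognition Grammar Specification) was originally derived.

JSGF syntax is beyond the scope of this article. However, I will give you a simple example.

Suppose you wish to recognize 2 isolated, odd sentences such as

the time is now exactly twenty five to one
this three left on the left side the one closest to us

The contents of your JSGF grammar file would look like this.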

#JSGF V1.0;
grammar hello;
public <command> = ( the time is now exactly twenty five to one | 
                     this three left on the left side the one closest to us );

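For more information on creating JSGF grammars, see the JSpeech Grammar Format (JSGF) specification.

To keep things simple, I have already created a JSGF grammar file for you and placed it in the Models directory (the file is named "hello.gram"). What you saw above is its content.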
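If you later want a grammar for your own sentences, a file with the same shape as hello.gram can be generated rather than typed by hand. The following is a minimal sketch under a few assumptions: JsgfWriter and Build are hypothetical names of mine, the output mirrors the single-rule layout shown above, and the sentences are assumed to be plain words with no JSGF special characters (nothing is escaped).

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Syn.Speech): builds a one-rule JSGF grammar.
static class JsgfWriter
{
    // Produces a grammar whose single public rule matches any one of the sentences.
    public static string Build(string grammarName, string ruleName, params string[] sentences)
    {
        if (sentences == null || sentences.Length == 0)
            throw new ArgumentException("At least one sentence is required.", "sentences");

        return "#JSGF V1.0;\n"
             + "grammar " + grammarName + ";\n"
             + "public <" + ruleName + "> = ( " + string.Join(" | ", sentences) + " );\n";
    }

    static void Main()
    {
        var text = Build("hello", "command",
            "the time is now exactly twenty five to one",
            "this three left on the left side the one closest to us");
        Console.Write(text);

        // To use it with the setup in this article, it would be saved into Models, e.g.:
        //   File.WriteAllText(Path.Combine("Models", "hello.gram"), text);
    }
}
```

The generated text puts the whole rule on one line, which is equivalent to the two-line layout of hello.gram.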

For the StreamSpeechRecognizer to use the grammar file we created, we need to set 3 important properties before passing the Configuration object as an argument to the StreamSpeechRecognizer.

speechConfiguration.UseGrammar = true;
speechConfiguration.GrammarPath = modelsDirectory;
speechConfiguration.GrammarName = "hello";

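speechConfiguration.UseGrammar = true; - tells the speech recognizer that we intend to use a grammar file
speechConfiguration.GrammarPath - the location where the grammar file resides
speechConfiguration.GrammarName - the name of the grammar we wish to use (case-sensitive on Linux). Omit the .gram extension, as it is appended automatically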
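Because the grammar name is case-sensitive on Linux but a Windows file system will happily match "Hello" to hello.gram, a mismatch may only show up after deployment. Below is a rough sketch of a check that behaves the same on both systems by comparing the actual file names exactly. GrammarCheck and ExistsWithExactCase are hypothetical names of mine; the ".gram" suffix follows the description above.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Syn.Speech): case-exact lookup of a grammar file.
static class GrammarCheck
{
    // True only if "<grammarName>.gram" exists in grammarPath with exactly this casing.
    public static bool ExistsWithExactCase(string grammarPath, string grammarName)
    {
        if (!Directory.Exists(grammarPath)) return false;

        var expected = grammarName + ".gram";
        foreach (var file in Directory.GetFiles(grammarPath))
        {
            // Ordinal comparison keeps the check case-sensitive even on Windows.
            if (string.Equals(Path.GetFileName(file), expected, StringComparison.Ordinal))
                return true;
        }
        return false;
    }

    static void Main()
    {
        var modelsDirectory = Path.Combine(Directory.GetCurrentDirectory(), "Models");
        Console.WriteLine(ExistsWithExactCase(modelsDirectory, "hello")
            ? "hello.gram found."
            : "hello.gram not found (check the spelling and casing).");
    }
}
```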

Overall Code for Transcribing an Audio File Using a Grammar

using System;
using Syn.Speech.Api;
using System.IO;
using Syn.Logging;

namespace Speech.Recognition.Example
{
    class MainClass
    {
        static Configuration speechConfiguration;
        static StreamSpeechRecognizer speechRecognizer;

        static void LogReceived (object sender, LogReceivedEventArgs e)
        {
            Console.WriteLine (e.Message);
        }

        public static void Main (string[] args)
        {
            Logger.LogReceived += LogReceived;

            var modelsDirectory = Path.Combine (Directory.GetCurrentDirectory (), "Models");
            var audioDirectory = Path.Combine (Directory.GetCurrentDirectory (), "Audio");
            var audioFile = Path.Combine (audioDirectory, "Long Audio 2.wav");

            if (!Directory.Exists (modelsDirectory)||!Directory.Exists(audioDirectory)) {
                Console.WriteLine ("No Models or Audio directory found!! Aborting...");
                Console.ReadLine ();
                return;
            }

            speechConfiguration = new Configuration ();
            speechConfiguration.AcousticModelPath=modelsDirectory;
            speechConfiguration.DictionaryPath = Path.Combine (modelsDirectory, "cmudict-en-us.dict");
            speechConfiguration.LanguageModelPath = Path.Combine (modelsDirectory, "en-us.lm.dmp");

            speechConfiguration.UseGrammar = true;
            speechConfiguration.GrammarPath = modelsDirectory;
            speechConfiguration.GrammarName = "hello";

            speechRecognizer = new StreamSpeechRecognizer (speechConfiguration);
            speechRecognizer.StartRecognition (new FileStream (audioFile, FileMode.Open));

            Console.WriteLine ("Transcribing...");
            var result = speechRecognizer.GetResult ();

            if (result != null) {
                Console.WriteLine ("Result: " + result.GetHypothesis ());
            } else {
                Console.WriteLine ("Sorry! Couldn't Transcribe");
            }

            Console.ReadLine ();
        }
    }
}

If you run the application, you will get the audio file transcribed within a few milliseconds, and the output remains the same.
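If you would like to measure the difference between the two runs on your own machine rather than take my word for it, a stopwatch around the GetResult call is enough. This is a small sketch with a hypothetical Timing.Measure helper built on the standard Stopwatch class. The demo times a stand-in workload so that it runs without the models; how it would wrap the recognizer call is shown in a comment.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not part of Syn.Speech): times a single call.
static class Timing
{
    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    static void Main()
    {
        // Stand-in workload. In the article's program, the call to time would be:
        //   var elapsed = Timing.Measure(() => result = speechRecognizer.GetResult());
        // with "result" declared beforehand.
        var elapsed = Measure(() => Thread.Sleep(50));
        Console.WriteLine("Elapsed: " + elapsed.TotalMilliseconds + " ms");
    }
}
```

Since the computation only starts when GetResult is called, timing that one call should capture most of the recognition cost.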

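So, this marks the end of my exploration in this initial release.

External Resources

GitHub repository - external demo projects can be found here
Acoustic models from CMU - contains the latest acoustic models released by CMU
Project files on GitHub - link to the GitHub repository where I uploaded the project files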

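Points of Interest

Transcribing an audio file is a slow, time-consuming process. I recommend using a JSGF grammar file for faster speech recognition. But remember that grammar file names are case-sensitive on Linux.

Performance-wise, the library runs slightly better under the .NET Framework than under Mono, but the difference was almost negligible when I used a custom, limited-size grammar file.

The speech recognition engine can make use of all the acoustic models released by Carnegie Mellon University. They are available on SourceForge, but I would rather encourage using the latest acoustic model data from the Sphinx4 project's GitHub repository.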

History

Sunday, March 29, 2015 - Initial release
Monday, October 9, 2017 - Minor edit
Wednesday, August 26, 2020 - Minor edit (added a link to the open-source library code)