C# 中的正则表达式引擎

honey the codewitch

5.00/5 (12投票s)

2019年3月27日

CPOL

6分钟阅读

28281

729

.NET (Core) 的非回溯正则表达式引擎

下载源代码 - 445.1 KB

FA Visualizer

FA Test

引言

既然 .NET 已经包含了功能丰富的正则表达式库，为什么还要创建正则表达式引擎呢？

首先，正则表达式引擎主要有两大类——回溯型和非回溯型。大多数现代正则表达式引擎都采用回溯。回溯引擎比非回溯引擎具有更强的表达能力，但代价是性能的下降。

为什么要创建一个非回溯引擎呢？

嗯，首先，它们在处理短字符串时，性能一再优于回溯引擎。其次，它们更适合直接操作流式源，而回溯正则表达式引擎通常需要将文本预先加载到内存中。

更普遍地说，.NET 提供的正则表达式引擎风格更侧重于匹配长 string 中的模式。它不太适合从 stream 中词法分析出标记。词法分析对解析器至关重要，因此用于词法分析终端的正则表达式通常使用非回溯引擎来实现。

背景

词法分析是把输入的字符流分解成“标记”的过程——即具有相关“符号”的“字符串”。符号表明了该 string 的类型。例如，字符串 "124.35" 可能被报告为符号 "NUMBER"，而字符串 "foo" 可能被报告为符号 "IDENTIFIER"。

解析器通常在底层使用词法分析器，然后将传入的符号组合成语法树。由于词法分析器在核心解析代码中被调用，因此词法分析操作必须相当高效。.NET 正则表达式在这里并不太合适，虽然也能工作，但实际上会增加代码的复杂性，同时降低性能。

本项目包含一个名为 "FA.cs" 的文件，其中包含使用有限自动机（最终归结为简单但无处不在的状态机）实现的正则表达式引擎的核心代码。

有限状态机由状态图组成。每个状态可以通过“输入转移”或“epsilon 转移”引用另一个状态。输入转移在移动到目标状态之前会消耗指定的输入。epsilon 转移在不消耗或检查任何输入的情况下移动到目标状态。包含一个或多个 epsilon 转移的机器被认为是“非确定性”（NFA），并且可能在任何给定时间处于多个状态。非确定性状态机更复杂且效率较低，但幸运的是，对于任何非确定性状态机都存在一个确定性状态机（DFA），并且有一个算法可以通过一种称为幂集构造的技术将 NFA 转换为 DFA——这部分的数学内容在此不详述。

状态的闭包是该状态本身以及从该状态通过 epsilon 转移或输入转移可达的所有状态的唯一集合。闭包代表由根状态指示的状态机。这些图可以是递归的。

匹配 "foo" 的状态机的图如下所示

"foo" state machine

q₀ 的闭包是 { q₀, q₁, q₂, q₃ }，因为其中每个状态直接或间接连接到 q₀。

每次跨越黑色箭头时，输入必须匹配箭头上的字符才能沿着该路径继续。一旦遇到双圆圈（q₃），机器就会“接受”输入作为有效/匹配，并返回状态名称下的标记，在本例中是 "foo"，一旦匹配输入就会返回该标记。

现在，让我们匹配一个形式为 [A-Z_a-z][0-9A-Z_a-z]* 的标识符

identifier state machine

如您所见，循环相当直接。

现在让我们开始实现有用的功能。词法分析器（又名标记生成器）区分输入中的多种不同模式。让我们构建一个代表标识符、整数或空格的词法分析器。

Lexer state machine

您会注意到有两种匹配整数的方式（q₂, q₄），但它们都返回相同的符号 "int" 作为结果。

上面提出的每个机器都是确定性的（每个都是DFA）。DFA 机器一次只会在一个状态。还有非确定性机器，或NFA，它们可以同时处于多个状态！

An NFA lexer machine

您会注意到这个机器中有灰色虚线。这些是 epsilon 转移或 epsilon 上的转移。在此上下文中，epsilon 简单地表示“无输入”。每当机器遇到虚线箭头时，它自动处于箭头指向的状态以及自身。可以说它在没有输入的情况下进行转移。

从 q₀ 开始也意味着您同时处于 q₁, q₄, q₅, q₇, q₈ 和 q₁₁。这些状态的集合，即从该状态通过 epsilon 转移可达的状态集合，称为epsilon 闭包。

这些 NFA 机器更容易组合/构建自其他机器，但运行起来更难，因为它们更复杂。

如前所述，每个 NFA 机器都有一个 DFA 机器，并且有一个将 NFA 转换为 DFA 的算法。

这段代码允许您轻松地创建这些机器，绘制漂亮的图表（借助 Graphviz），生成超快的代码来运行给定的状态机，在不先生成代码的情况下在运行时运行状态机，将正则表达式表达式转换为机器，反之亦然，并以其他方式检查和组合机器。

Using the Code

下面包含在示例项目中，并带有大量注释，向您展示了使用此库的基础知识

//
// See the included project for a full sample
//
var lexer = FA.Lexer(
    // our id from the article
    FA.Parse(@"[A-Z_a-z][0-9A-Z_a-z]*", "id"),
    // our int from the article
    FA.Parse(@"0|([\-]?[1-9][0-9]*)", "int"),
    // our space from the article
    FA.Parse(@" ", "space")
    );

// clean up any purely pass through states in our graph - not required
lexer.TrimNeutrals();

// transform the lexer into its DFA equivalent - not required
var dfaLexer = lexer.ToDfa();

// make sure there are no duplicate states left after the transformation.
// this can be an expensive op and isn't strictly necessary, so it's a 
// separate step. It *should* be done before code generation.
// If code generation is done on an NFA, it is automatically transformed to
// a DFA and duplicates are removed during that process
dfaLexer.TrimDuplicates();  // not required - but recommended on any DFA 
                            // prior to code generation

Console.WriteLine("Rendering graphs");
try
{
    // try to render the graphs (requires GraphViz - specifically dot.exe)
    // to change the format, simply use a different extension like PNG or SVG
    // pass additional options using FA.DotGraphOptions if desired
    lexer.RenderToFile(@"..\..\..\lexer_nfa.jpg");
    dfaLexer.RenderToFile(@"..\..\..\lexer_dfa.jpg");
}
catch
{
    Console.WriteLine("GraphViz is not installed.");
}

// generate the code (can generate from either an NFA or DFA)
// if using an NFA, it will first be transformed to a DFA
// and then any duplicates will be trimmed from the result

// generate code direct to C# 
// we can use the CodeDOM here if CODEDOM compile constant is defined
// but to keep things simple, we won't here.
Console.WriteLine("Generating C# Code");
using (var tw = new StreamWriter(@"..\..\..\Generated.cs"))
{
    tw.WriteLine("namespace {0}", typeof(Program).Namespace);
    tw.WriteLine("{");
    tw.WriteLine("partial class Program {");
    // generate TryLexValue
    dfaLexer.WriteCSharpTryLexMethodTo(tw, "Value");
    tw.WriteLine();
    // generate LexValue
    dfaLexer.WriteCSharpLexMethodTo(tw, "Value");
    tw.WriteLine();
    tw.WriteLine("}");
    tw.WriteLine("}");
}

// our test string
var test = "foo 456 bar 123 baz";

Console.WriteLine("Runtime Lexing:");

// Create a ParseContext over the string
// See https://codeproject.org.cn/Articles/1280230/Easier-Hand-Rolled-Parsers
// ParseContext handles the input Cursor, the Position, Line and Column counting 
// as well as error handing. It operates over an enumeration of characters, a string
// or a TextReader. The main members are Advance(), Current, Expecting(), Position
// Line, Column, Capture and CaptureCurrent()

// It's easy enough to add code to generate lexers that do not require this class
// but we would lose error handling and position tracking, so it's not implemented.
// to use other interfaces, wrap them with ParseContext.Create()
var pc = ParseContext.Create(test);

pc.EnsureStarted(); // make sure we're on the first character.

try
{
    // ParseContext.Current works like a TextReader.Peek() does, returning an int
    // without advancing the cursor. Unline Peek() it is not problematic.
    while (-1 != pc.Current) // loop until end of input. 
    {
        // runtime lexing (can be NFA or DFA - DFA is intrinsically more efficient)
        // if we pass it a StringBuilder, it will reuse it - perf is better, 
        // but we don't bother here.
        var tok = dfaLexer.Lex(pc); // lex the next token
                                    // symbol is in tok.Key, value is in tok.Value
        Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
    }
}
catch(ExpectingException eex)
{
    // We got a lexing error, so report it. The message might be long, but you can always
    // build your own using ExpectingException members.
    Console.WriteLine("Error: {0}", eex.Message);
}
Console.WriteLine();

Console.WriteLine("Compiled Lexing:");

pc = ParseContext.Create(test);    // restart the cursor
pc.EnsureStarted();
try
{                
    // reusing a stringbuilder allows the lex routine to run a little faster
    // don't use it for anything else if you're passing it to lex.
    var sb = new StringBuilder();  // don't have to use this, can pass null
                
    while (-1 != pc.Current)       // loop until end of input. 
    {
        // compiled lexing (always DFA - most efficient option)
        // Generated.cs must be present and valid
        var tok = LexValue(pc,sb); // lex the next token
        Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
    }
}
catch (ExpectingException eex)
{
    Console.WriteLine("Error: {0}", eex.Message);
}

现在让我们看看生成的代码

q1:
    if((pc.Current>='0'&& pc.Current<='9')||
        (pc.Current>='A'&& pc.Current<='Z')||
        (pc.Current=='_')||
        (pc.Current>='a'&& pc.Current<='z')) {
        sb.Append((char)pc.Current);
        pc.Advance();
        goto q1;
    }
    return new System.Collections.Generic.KeyValuePair<System.String,string>("id",sb.ToString());

如您所见，这是状态 q₁，对应于上面图形中 "id" 标记的最后一部分。

您可以看到从该状态的输出转移被编码到 if 语句中。

您可以看到使用 goto 指向自身的箭头。

您可以看到，如果 if 条件不满足，则以成功的结果结束，返回一个标记。

所有这些属性都反映在上面的图形中，如果您仔细查看的话。

关注点

从 FA 生成正则表达式（文本表示）并非易事。从文本表示解析它们则非常简单。这几乎与语法树的编码方式完全相反。

生成的 DFA 在发布版本中的速度比调试版本快一个数量级。我仍然不知道为什么。无论如何，在任何构建配置中，它们通常仍然是最快的选择。

对于大型状态机，基于表的词法分析器可能比纯生成的词法分析器更快，但它需要更多的内存。对于较小的词法分析器，生成的词法分析器速度很快。超过 100 个状态后，最好测试性能。

GraphViz Dot 具有欺骗性的复杂性和表达能力。

经过多年对该设计的大量迭代，我终于得到了一个能够很好地处理通用用例（包括错误处理）的版本。我最终在正确的组合下使其快速、一致且易于使用。

历史

2019 年 3 月 26 日：首次发布
2019 年 3 月 30 日：性能改进、功能添加、FA 可视化器项目和简单的表达式求值器
2019 年 3 月 31 日：添加了基于表的词法分析器和生成方法