
Pck: Code Roundup and Quick Start Guide

honey the codewitch


20 Aug 2019

MIT


Using PCK to create grammars, parsers, and lexers for C# and other .NET languages

Download pck.zip - 4.1 MB

Download the latest version from GitHub


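Introduction

The Parser Construction Kit (PCK) is a parser generator written in C# that targets the .NET platform. It was designed with C# in mind. It can render parsers in other .NET languages through Microsoft's CodeDOM, although no effort has been made to ensure compatibility with VB, because a bug in Microsoft's VBCodeProvider class prevents it from rendering. Other CodeDOM providers may work if they are available.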

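PCK provides tools that cover three major parsing paradigms:

LL(1) parsers: the preferred parsing mechanism whenever it can do what you need.
LALR(1) parsers: a more powerful parser that accepts more grammars, but with some drawbacks compared to LL(1), such as additional complexity and clumsier error recovery due to the nature of the algorithm.
Hand-written parsers, which are often useful in the small. For example, using an entire context-free grammar to parse an integer would be very heavy-handed!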
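To make the last point concrete, here is a minimal sketch of a hand-written integer parser. It uses only System.IO and is not part of PCK; the IntParser class and ParseInt method are names invented for this illustration.

```csharp
using System;
using System.IO;

public static class IntParser
{
    // Reads an optional '-' followed by one or more decimal digits.
    // Stops at the first non-digit and leaves it unread.
    // Throws FormatException if no digit is present.
    public static int ParseInt(TextReader reader)
    {
        bool negative = false;
        if (reader.Peek() == '-')
        {
            negative = true;
            reader.Read();
        }
        if (reader.Peek() < '0' || reader.Peek() > '9')
            throw new FormatException("Expected a digit");
        int result = 0;
        while (reader.Peek() >= '0' && reader.Peek() <= '9')
        {
            // checked: overflow raises OverflowException instead of wrapping
            result = checked(result * 10 + (reader.Read() - '0'));
        }
        return negative ? -result : result;
    }

    public static void Main()
    {
        Console.WriteLine(ParseInt(new StringReader("-42 apples"))); // prints -42
    }
}
```

A dozen lines of loop are enough here, which is why a grammar, a lexer table, and a parse table would be overkill for this job. As a simplification, the sketch cannot read int.MinValue because it accumulates the magnitude as a positive number.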

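The runtime library needed to use the generated parsers, simply called pck (pck.dll), is a small library that supports the generated LL(1) and LALR(1) parsers, and supports hand-written parsers through the ParseContext class.

The tools that ship with pck handle parser and lexer/tokenizer generation.

They can generate FA-based lexers/tokenizers and PDA-based parsers that use the LL(1) algorithm, with an eye toward supporting LL(1) conflict resolution in the future. They can also generate LALR(1) parsers, computed by way of an SLR(1) transformation. If needed, the tools can easily be extended to support LR(0), and perhaps LR(1) and SLR(1) in the future.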

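Installation and Updates

To install, simply place the binaries somewhere in your PATH, or navigate to their folder whenever you need to use them. Note that pckedit registers itself with the shell for the .xbnf and .pck extensions the first time it runs, so it does not need to be in the PATH. If you don't want to use the command line, you don't need to set the PATH at all.

To keep the tools up to date, close all PCK programs, navigate to the folder that contains the binaries (this is required regardless of your PATH variable), and then type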

pckver /update

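Using the Tools

By far the easiest way to generate a parser on Windows is to create an XBNF grammar in pckedit and then use the menus to test or build the parser.

pckedit is a Windows application that acts as a multi-document container with syntax highlighting for XBNF grammars, PCK specifications, and other files. It has menus for generating code and testing grammars, but otherwise it works much like Notepad++. The menus operate on the currently active document, so keep that in mind if your menus are grayed out. For example, it won't let you convert a PCK specification file into another PCK specification file, nor will it let you build or generate code from anything other than an XBNF or PCK file. Below the document is a list of the warnings and errors from the last operation. The Test menu lets you test your grammar with the LL(1) and LALR(1) parsers without generating code. Generating code does not create a file: unlike Visual Studio and similar tools, a document must be saved explicitly before it is stored.

Besides pckedit, pckw and pckp (the .NET Framework and .NET Core builds of the same application, with identical functionality) expose a broad set of operations on the command line, including grammar transformation, export, and code generation for both parser types and for lexers.

PCK also lets you work with other parser generator tools.

The "XBNF" grammar format can be used to build grammars that are handed to other parser generators, including YACC, and an extensible translation facility allows additional grammar and lexer file conversions to be created in the future.

Here is the usage screen for the pckw/pckp utility, which breaks the different parser and lexer generation tasks into operations that can be run from the console. pckw and pckp are different builds of the same executable: pckw is the Windows build and pckp is the .NET Core build. Functionally, they are identical.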

Usage: pckw <command> [<arguments>]

Commands:

pckw fagen [<specfile> [<outputfile>]] [/class <classname>] [/namespace <namespace>] [/language <language>]

  <specfile>    The pck specification file to use (or stdin)
  <outputfile>  The file to write (or stdout)
  <classname>   The name of the class to generate (or taken from the filename or from the start symbol of the grammar)
  <namespace>   The namespace to generate the code under (or none)
  <language>    The .NET language to generate the code for (or draw from filename or C#)

  Generates an FA tokenizer/lexer in the specified .NET language.

pckw ll1gen [<specfile> [<outputfile>]] [/class <classname>] [/namespace <namespace>] [/language <language>]

  <specfile>    The pck specification file to use (or stdin)
  <outputfile>  The file to write (or stdout)
  <classname>   The name of the class to generate (or taken from the filename or from the start symbol of the grammar)
  <namespace>   The namespace to generate the code under (or none)
  <language>    The .NET language to generate the code for (or draw from filename or C#)

  Generates an LL(1) parser in the specified .NET language.

pckw ll1factor [<specfile> [<outputfile>]]

  <specfile>    The pck specification file to use (or stdin)
  <outputfile>  The file to write (or stdout)

  Factors a pck grammar spec so that it can be used with an LL(1) parser.

pckw ll1tree <specfile> [<inputfile>]

  <specfile>    The pck specification file to use
  <inputfile>   The file to parse (or stdin)

  Prints a tree from the specified input file using the specified pck specification file.

pckw lalr1gen [<specfile> [<outputfile>]] [/class <classname>] [/namespace <namespace>] [/language <language>]

  <specfile>    The pck specification file to use (or stdin)
  <outputfile>  The file to write (or stdout)
  <classname>   The name of the class to generate (or taken from the filename or from the start symbol of the grammar)
  <namespace>   The namespace to generate the code under (or none)
  <language>    The .NET language to generate the code for (or draw from filename or C#)

  Generates an LALR(1) parser in the specified .NET language.

pckw lalr1tree <specfile> [<inputfile>]

  <specfile>    The pck specification file to use
  <inputfile>   The file to parse (or stdin)

  Prints a tree from the specified input file using the specified pck specification file.

pckw xlt [<inputfile> [<outputfile>]] [/transform <transform>] [/assembly <assembly>]

  <inputfile>   The input file to use (or stdin)
  <outputfile>  The file to write (or stdout)
  <transform>   The name of the transform to use (or taken from the input and/or output filenames)
  <assembly>    The assembly to reference

  Translates an input format to an output format.

  Available transforms include:

   cgtToPck     Translates a Gold Parser cgt file into a pck spec. (requires manual intervention)
   pckToLex     Translates a pck spec to a lex/flex spec. (requires manual intervention)
   pckToYacc    Translates a pck spec to a yacc spec
   xbnfToPck    Translates an xbnf grammar to a pck spec.

So, if you want to take an XBNF document like this:

grammar<start>= productions;

productions<collapsed> = production productions | production;
production= identifier [ "<" attributes ">" ] "=" expressions ";";

expressions<collapsed>= expression { "|" expression }; 
expression= { symbol };
symbol= literal | regex | identifier | 
    "(" expressions ")" | 
    "[" expressions "]" |
    "{" expressions ("}"|"}+");
...

and convert it to a pck specification like this (which you must do before you can do much else with it):

grammar:start
productions:collapsed
expressions:collapsed
symbollist:collapsed
attributelisttail:collapsed
grammar-> productions
productions-> production productions
productions-> production
production-> identifier lt attributes gt eq expressions semi
production-> identifier eq expressions semi
expressions-> expression expressionlisttail
expressions-> expression
expression-> symbollist
expression->
symbol-> literal
symbol-> regex
symbol-> identifier
symbol-> lparen expressions rparen
symbol-> lbracket expressions rbracket
symbol-> lbrace expressions rbrace
symbol-> lbrace expressions rbracePlus
...

you would use the command line like this:

pckw xlt xbnf.xbnf xbnf.pck

Now, before you can use the grammar with an LL(1) parser (including Pck's own, or Coco/R), you have to "factor" it:

pckw ll1factor xbnf.pck xbnf.ll1.pck
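One reason a grammar may need factoring is left recursion, which an LL(1) parser cannot handle directly. The sketch below shows the textbook removal of immediate left recursion, rewriting A -> A alpha | beta as A -> beta A' and A' -> alpha A' | (empty). It is an illustration of the general idea only: the LeftRecursion class and the "tail" naming scheme are assumptions made for this example, and the rules that ll1factor actually emits may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeftRecursion
{
    // Removes immediate left recursion from the rules of one non-terminal.
    // Each rule is a right-hand side given as a list of symbols; an empty
    // list stands for the empty (epsilon) rule.
    // Returns a map from non-terminal name to its rewritten right-hand sides.
    public static Dictionary<string, List<List<string>>> Eliminate(
        string name, List<List<string>> rules)
    {
        // Split into A -> A alpha (recursive) and A -> beta (everything else).
        var alphas = rules.Where(r => r.Count > 0 && r[0] == name)
                          .Select(r => r.Skip(1).ToList()).ToList();
        var betas = rules.Where(r => r.Count == 0 || r[0] != name).ToList();

        var result = new Dictionary<string, List<List<string>>>();
        if (alphas.Count == 0)
        {
            result[name] = rules; // no immediate left recursion, nothing to do
            return result;
        }

        string tail = name + "tail"; // hypothetical name for the generated rule

        // A -> beta A'
        result[name] = betas.Select(b => b.Concat(new[] { tail }).ToList()).ToList();

        // A' -> alpha A' | (empty)
        var tailRules = alphas.Select(a => a.Concat(new[] { tail }).ToList()).ToList();
        tailRules.Add(new List<string>());
        result[tail] = tailRules;
        return result;
    }

    public static void Main()
    {
        // expr -> expr add term | term
        var rules = new List<List<string>>
        {
            new List<string> { "expr", "add", "term" },
            new List<string> { "term" }
        };
        foreach (var pair in Eliminate("expr", rules))
            foreach (var rhs in pair.Value)
                Console.WriteLine(pair.Key + "-> " + string.Join(" ", rhs));
    }
}
```

For the sample rule, the result contains expr-> term exprtail, exprtail-> add term exprtail, and an empty exprtail rule. A generated helper such as exprtail is the kind of non-terminal that is typically marked collapsed so that it does not clutter the parse tree.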

Then you can generate code from it, export it, or do whatever else you like:

pckw fagen xbnf.ll1.pck XbnfTokenizer.cs

pckw ll1gen xbnf.ll1.pck XbnfParser.cs

Alternatively, you can pipe these operations together. For example, to turn an xbnf grammar into a parser:

pckw xlt xbnf.xbnf /transform xbnfToPck | pckw ll1factor | pckw ll1gen /class XbnfParser > XbnfParser.cs

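There is an API for building additional xlt transforms, and they are relatively easy to implement. Currently, only Lex and YACC are supported, along with partial support for Gold.

The Gold lexer conversion will not happen until I can implement Arden's theorem in C#.

The XBNF Attributed Grammar Format

The XBNF format is designed to be easy to learn if you already know a little about writing grammars.

Productions take the form

identifier [ < attributes > ] = expressions ;

For example, here is a simple "compliant enough" JSON grammar that works well for a tutorial: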

json<start>= object | array;
object= "{" "}" | "{" fields "}";
fields<collapsed>= field | field "," fields;
field= string ":" value;
array= "[" "]" | "[" values "]";
values<collapsed>= value | value "," values;
value= string | number | object | array | boolean | null;
boolean= true|false;

// terminals
number= '\-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][\+\-]?[0-9]+)?';
// below: string is not compliant, should make sure the escapes are valid JSON escapes rather than accepting everything
string = '"([^"\\]|\\.)*"';
true="true";
false="false";
null="null";
lbracket<collapsed>="[";
rbracket<collapsed>="]";
lbrace<collapsed>="{";
rbrace<collapsed>="}";
colon<collapsed>=":";
comma<collapsed>=",";
whitespace<hidden>='[ \t\r\n\f\v]+';

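The first thing to notice is that the json production is marked with the start attribute. Since no value was specified, it is implicitly start=true.

This tells the parser that json is the start production. If no start production is specified, the first non-terminal in the grammar is used. Leaving it implicit may also produce a warning during generation, because doing so is not a good idea. Only the first occurrence of start is honored.

object | array tells us that the json production derives either an object or an array.
The object production contains several literals and a reference to fields.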

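Expressions

( ) parentheses let you create subexpressions, like `foo (bar|baz)`
[ ] optional expressions allow the subexpression to occur zero or one time
{ } this repetition construct repeats the subexpression zero or more times
{ }+ this repetition construct repeats the subexpression one or more times
| this alternation construct derives any one of the subexpressions
Concatenation is implicit and is delimited by whitespace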

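Terminals

In the grammar above, all of the terminals are defined at the bottom, but they can appear anywhere in the document. XBNF treats any production that does not reference another production as a terminal. This is similar to how ANTLR distinguishes terminals in its grammars.

Regular expressions are enclosed in single quotes ' and literal expressions are enclosed in double quotes ". You may declare a terminal using XBNF constructs or using a regular expression. The regular expressions follow the POSIX paradigm plus common extensions, but they do not currently support all of POSIX, only most of it. If a POSIX expression doesn't work, consider it a bug.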
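Before putting a regular expression into a grammar, it can help to try it against a few samples. The sketch below checks the body of the number terminal from the JSON grammar using .NET's System.Text.RegularExpressions. That is not PCK's own regular expression engine, so this rests on the assumption that the two dialects agree for this particular pattern; the NumberTerminalCheck class is a name made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberTerminalCheck
{
    // The body of the "number" terminal, wrapped in \A ... \z so that the
    // whole input has to match rather than just a prefix.
    private static readonly Regex Number = new Regex(
        @"\A\-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][\+\-]?[0-9]+)?\z");

    public static bool IsNumber(string text)
    {
        return Number.IsMatch(text);
    }

    public static void Main()
    {
        // "01", "1." and ".5" are rejected: no leading zeros, and a
        // decimal point needs digits on both sides.
        foreach (var sample in new[] { "0", "-12.5e+3", "01", "1.", ".5" })
            Console.WriteLine("{0,-10} {1}", sample, IsNumber(sample));
    }
}
```

The first two samples match and the last three do not, which agrees with the JSON rules for numbers.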

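Attributes

The collapsed attribute tells Pck that the node should not appear in the parse tree. Instead, its children are propagated to its parent. This is helpful when the grammar needs a non-terminal or a terminal in order to parse a construct, but that symbol is of no use to the consumer of the parse tree. During LL(1) factoring, generated rules must be created, and their associated non-terminals are typically collapsed. Above, we used it to significantly trim the parse tree of nodes we don't need, including collapsing unnecessary terminals such as : in the JSON grammar. Such symbols don't help us define anything; they only help the parser recognize the input, so we can throw them away to make the parse tree smaller. If ShowCollapsed is set to false, the LL(1) parser can remove collapsed nodes during the underlying read. This can significantly speed up parsing of large documents that contain many collapsed nodes. This performance feature is not available in the other parsers, but the parse trees they produce still have collapsed nodes removed.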
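The effect of collapsing can be sketched on a toy tree. The Node and TreeTrimmer types below are hypothetical and are not PCK's own parse tree classes; the sketch only illustrates the rule described above, where a collapsed non-terminal is replaced by its children and a collapsed terminal simply disappears.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree node for illustration; not a PCK type.
public sealed class Node
{
    public string Symbol;
    public bool Collapsed;
    public List<Node> Children = new List<Node>();

    public Node(string symbol, bool collapsed = false, params Node[] children)
    {
        Symbol = symbol;
        Collapsed = collapsed;
        Children.AddRange(children);
    }
}

public static class TreeTrimmer
{
    // Returns the nodes that take the place of 'node' in its parent's child list.
    // A collapsed node contributes only its (already trimmed) children, which is
    // an empty list for a collapsed terminal. Any other node is copied with
    // trimmed children.
    public static List<Node> Trim(Node node)
    {
        var kids = new List<Node>();
        foreach (var child in node.Children)
            kids.AddRange(Trim(child));
        if (node.Collapsed)
            return kids;
        var copy = new Node(node.Symbol);
        copy.Children = kids;
        return new List<Node> { copy };
    }

    // Renders a tree as symbol(child child ...).
    public static string Render(Node node)
    {
        if (node.Children.Count == 0) return node.Symbol;
        var parts = new List<string>();
        foreach (var child in node.Children) parts.Add(Render(child));
        return node.Symbol + "(" + string.Join(" ", parts) + ")";
    }

    public static void Main()
    {
        // Roughly the shape of {"a":1} with braces omitted for brevity.
        var tree = new Node("object", false,
            new Node("fields", true,
                new Node("field", false,
                    new Node("string"),
                    new Node("colon", true),
                    new Node("value", false, new Node("number")))));
        Console.WriteLine(Render(Trim(tree)[0])); // prints object(field(string value(number)))
    }
}
```

The fields wrapper and the colon token are gone from the output, while field, string, value, and number keep their relative order.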

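The hidden attribute tells Pck that the terminal should be skipped. This is useful for things like comments and whitespace. Hidden terminals can still be surfaced by setting ShowHidden to true on the parser, for example when the presence or position of comments is needed during the parse.

The blockEnd attribute is intended for terminals that have a multi-character ending condition, such as C block comments, XML CDATA sections, and SGML/XML/HTML comments. If present, the lexer continues until it matches the literal specified as the blockEnd.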
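The idea behind blockEnd can be sketched as "consume characters until the text read so far ends with a given literal". The code below is an assumption-laden illustration, not PCK's lexer: BlockEndScanner and ReadUntil are invented names, and the sketch presumes the opening delimiter has already been matched by an ordinary lexer rule.

```csharp
using System;
using System.IO;
using System.Text;

public static class BlockEndScanner
{
    // Reads characters until the consumed text ends with blockEnd and returns
    // everything consumed, including blockEnd itself.
    // Returns null if the input runs out first (an unterminated block).
    public static string ReadUntil(TextReader reader, string blockEnd)
    {
        if (string.IsNullOrEmpty(blockEnd))
            throw new ArgumentException("blockEnd must not be empty", "blockEnd");
        var consumed = new StringBuilder();
        int ch;
        while ((ch = reader.Read()) != -1)
        {
            consumed.Append((char)ch);
            if (EndsWith(consumed, blockEnd))
                return consumed.ToString();
        }
        return null;
    }

    // Compares the tail of the buffer with the suffix, character by character.
    private static bool EndsWith(StringBuilder buffer, string suffix)
    {
        if (buffer.Length < suffix.Length) return false;
        int offset = buffer.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
            if (buffer[offset + i] != suffix[i]) return false;
        return true;
    }

    public static void Main()
    {
        // Pretend the lexer has just matched the opening "/*".
        var rest = new StringReader(" a * b **/ int x;");
        Console.WriteLine("/*" + ReadUntil(rest, "*/")); // prints /* a * b **/
    }
}
```

Checking the tail of the buffer after every character keeps inputs such as **/ correct, where a naive two-state matcher can miss the terminator. A plain regular expression for the same job is considerably harder to write, which is the motivation for a dedicated attribute.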

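The terminal attribute explicitly declares a production to be a terminal. Such a production is treated as a terminal even if it references other productions. If it does reference other productions, they are included in their terminal form, as if they were part of the original expression. This lets you build composite terminals out of several terminal definitions.

Other attributes can be applied, but they are ignored. However, they can be retrieved during the parse with GetAttribute(), because the parser exposes them.

Using the Generated Parsers

Consider the following simple expression grammar, expr.xbnf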

expr= expr "+" term | term;
term= term "*" factor | factor;
factor= "(" expr ")" | int;
add= "+";
mul= "*";
lparen= "(";
rparen= ")";
int= '[0-9]+';

Generate a parser:

pckw xlt expr.xbnf /transform xbnfToPck | pckw ll1factor | pckw ll1gen /class ExprParser /namespace Demo > ExprParser.cs

Generate a tokenizer:

pckw xlt expr.xbnf /transform xbnfToPck | pckw ll1factor | pckw fagen /class ExprTokenizer /namespace Demo > ExprTokenizer.cs

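What we just did was take the XBNF, convert it to a PCK specification, factor it, and then generate code from it.

We have to factor the grammar for the tokenizer as well; otherwise the symbols in the parser and the tokenizer won't match. That is why we factor both times.

Now, create a new .NET console project, add a reference to pck, add the two newly generated files, and import the Demo namespace in your using directives.

In the Main() function, write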

var parser = new ExprParser(new ExprTokenizer("3*(4+7)"));

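This gives you a new parser and tokenizer over the expression 3*(4+7).

From here, you can use some of the parser's methods.

If you want streaming access to the pre-transformed parse tree, you can call the parser's Read() method in a loop, much like Microsoft's XmlReader: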

while(parser.Read())
{
   switch(parser.NodeType)
   {
      case LLNodeType.Terminal:
      case LLNodeType.Error:
         Console.WriteLine("{0}: {1}",parser.Symbol,parser.Value);
         break;
      case LLNodeType.NonTerminal:
         Console.WriteLine(parser.Symbol);
         break;
   }
}

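Often, though, that's not what you want. You want a parse tree, and you want the collapsed nodes and the like to be gone.

In that case, use the following instead of the loop above: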

var tree = parser.ParseSubtree(); // pass true if you want the tree to be trimmed.
// write an ascii tree
Console.WriteLine(tree); // calls tree.ToString()

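The tree and each of its nodes contain all of the information about the node at that position in the parse.

Generating and using an LALR(1) parser is very similar.

Generate a parser: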

pckw xlt expr.xbnf /transform xbnfToPck | pckw lalr1gen /class ExprParser /namespace Demo > ExprParser.cs

Generate a tokenizer:

pckw xlt expr.xbnf /transform xbnfToPck | pckw fagen /class ExprTokenizer /namespace Demo > ExprTokenizer.cs

Note that we did not factor the grammar, because LALR(1) doesn't require it.

Now, almost the same as before:

while(parser.Read())
{
   switch(parser.NodeType)
   {
      case LRNodeType.Shift:
      case LRNodeType.Error:
         Console.WriteLine("{0}: {1}",parser.Symbol,parser.Value);
         break;
      case LRNodeType.Reduce:
         Console.WriteLine(parser.Rule);
         break;
   }
}

or

var tree = parser.ParseReductions(); // pass true if you want the tree to be trimmed.
Console.WriteLine(tree);

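Hand-Written Parsers with ParseContext

If you have ever tried to parse a document with a TextReader or an IEnumerator<char>, you have probably run into these common frustrations: an enumerator cannot tell you after the fact whether it has reached the end of the enumeration, and a text reader's Peek() function is unreliable on some sources, such as a NetworkStream. Both limitations require extra bookkeeping to overcome, which complicates the parsing code and distracts from its core responsibility: parsing the input!

Another significant limitation is the lack of lookahead. Normally you can only look ahead one character, but sometimes you need to look further in order to complete the parse.

Nothing helps with error handling and reporting during the parse, either.

ParseContext helps with all of these problems, and it also provides a number of useful helper functions for skipping whitespace and comments, parsing numbers and strings, and so on.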
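To give a feel for the bookkeeping involved, here is a minimal sketch of a cursor that adds end-of-input tracking and multi-character lookahead on top of an IEnumerator<char>. It is explicitly not the ParseContext API: LookAheadCursor and its members are hypothetical names, and the real class offers far more than this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not PCK's ParseContext.
public sealed class LookAheadCursor
{
    private readonly IEnumerator<char> _source;
    private readonly List<char> _buffer = new List<char>(); // read but not yet consumed
    private bool _ended;                                     // remembers that MoveNext() returned false

    // Number of characters consumed so far, useful for error reporting.
    public int Position { get; private set; }

    public LookAheadCursor(IEnumerable<char> input)
    {
        _source = input.GetEnumerator();
    }

    // Returns the character k positions ahead (0 is the current one),
    // or -1 if that position lies past the end of the input.
    public int Peek(int k = 0)
    {
        if (k < 0) throw new ArgumentOutOfRangeException("k");
        while (!_ended && _buffer.Count <= k)
        {
            if (_source.MoveNext()) _buffer.Add(_source.Current);
            else _ended = true;
        }
        return k < _buffer.Count ? _buffer[k] : -1;
    }

    // Consumes and returns the current character, or -1 at the end of input.
    public int Advance()
    {
        int ch = Peek();
        if (ch != -1)
        {
            _buffer.RemoveAt(0);
            Position++;
        }
        return ch;
    }

    public bool IsEnd
    {
        get { return Peek() == -1; }
    }

    public static void Main()
    {
        var cursor = new LookAheadCursor("<!--x");
        // Four characters of lookahead tell a comment apart from a tag
        // without consuming anything.
        bool isComment = cursor.Peek(0) == '<' && cursor.Peek(1) == '!'
                      && cursor.Peek(2) == '-' && cursor.Peek(3) == '-';
        Console.WriteLine(isComment);       // prints True
        while (!cursor.IsEnd) cursor.Advance();
        Console.WriteLine(cursor.Position); // prints 5
    }
}
```

Even this small wrapper needs a buffer, an end flag, and a position counter, which is the kind of plumbing that otherwise ends up tangled into the parsing logic itself.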

See this CodeProject article for details.