词法解析所有主题高级中级开发 .NET C#

Cfg：加速你的解析器生成器项目

honey the codewitch

5.00/5 (2投票s)

2020年1月31日

MIT

2分钟阅读

7008

一个小型库，用于提供基本的上下文无关文法计算

下载源代码 - 41.1 KB

引言

大多数主要的解析器生成器都基于上下文无关文法理论的某种变体，这是一种严格的数学模型，旨在描述乔姆斯基Type-2语言。该库为查询符号和为文法计算first/follows/predicts提供基本支持，k=1。

基本上，大多数解析器生成器都需要相同的基本服务：一种跟踪规则和符号的方法，以及一种计算FIRSTS/FOLLOWS/PREDICT集合的方法。这就是这个小库所做的。

它非常专业化，但如果你正在构建一个解析器生成器，那么从这个库开始可能不会比其他选择更差。

概念化这个混乱的局面

首先，关于我的API中集合的说明。我不使用“getter”属性来获取集合。我使用“Fill”方法，它们更灵活，因为它们可以返回填充了项目的新的集合，或者填充现有的集合。你会在我的代码中看到各种FillXXXX()方法，这就是它们的用途。像这样调用它们...

var syms = cfg.FillSymbols();

...相当于使用getter属性。

CfgDocument是主要的CFG对象，基本上代表一个文法，而CfgRule代表一个单独的规则。所有符号都表示为简单的字符串。 CfgDocument.Rules属性保存构成文法的规则集合。填充这些规则后，就可以对它们执行各种计算。

之后，编码就非常简单了，并且我包含了一个广泛的演示项目，该项目包含一个由这个小库驱动的运行时解析器。

编写这个混乱的程序

我更喜欢让演示代码来说明问题。

static void Main(string[] args)
{
    if(0==args.Length)
    {
        Console.Error.WriteLine("Must specify input CFG");
        return;
    }
    var cfg = CfgDocument.ReadFrom(args[0]);

    // not-necessary but faster access since we're not modifying:

    cfg.RebuildCache();
    Console.WriteLine("See: http://hackingoff.com/compilers/ll-1-parser-generator");
    Console.WriteLine();
    Console.WriteLine("CFG has {0} rules composed of {1} non-terminals and 
                       {2} terminals for a total of {3} symbols" ,cfg.Rules.Count, 
                       cfg.FillNonTerminals().Count, cfg.FillTerminals().Count, 
                       cfg.FillSymbols().Count);
    Console.WriteLine();

    Console.Write("Terminals:");
    foreach(var t in cfg.FillTerminals())
    {
        Console.Write(" ");
        Console.Write(t);
    }
    Console.WriteLine();
    Console.WriteLine();

    // compute the various aspects of the CFG
    var predict = cfg.FillPredict();
    // var firsts = cfg.FillFirsts(); // we don't need this because we have predict
    var follows = cfg.FillFollows();

    // enum some stuff
    foreach(var nt in cfg.FillNonTerminals())
    {
        Console.WriteLine(nt+" has the following rules:");
        foreach(var ntr in cfg.FillNonTerminalRules(nt))
        {
            Console.Write("\t");
            Console.WriteLine(ntr);
        }
        Console.WriteLine();
        Console.WriteLine( nt + " has the following PREDICT:");
        foreach (var t in predict[nt])
        {
            Console.Write("\t");
            Console.WriteLine((t.Symbol??"<empty>")+" - "+t.Rule);
        }
        Console.WriteLine();
        // PREDICT makes this redundant
        //Console.WriteLine(nt + " has the following FIRSTS:");
        //foreach (var t in firsts[nt])
        //{
        //    Console.Write("\t");
        //    Console.WriteLine(t);
        //}
        //Console.WriteLine();
        Console.WriteLine(nt + " has the following FOLLOWS:");
        foreach (var t in follows[nt])
        {
            Console.Write("\t");
            Console.WriteLine(t);
        }
        Console.WriteLine();
    }

    // now let's parse some stuff

    Console.WriteLine("Building simple parse table");

    // the parse table is simply nested dictionaries where each outer key is a non-terminal
    // and the inner key is each terminal, where they map to a single rule.
    // lookups during parse are basically rule=parseTable[<topOfStack>][<currentToken>]
    var parseTable = new Dictionary<string, Dictionary<string, CfgRule>>();
    foreach (var nt in cfg.FillNonTerminals())
    {
        var d = new Dictionary<string, CfgRule>();
        parseTable.Add(nt, d);
        foreach(var p in predict[nt])
        {
            if(null!=p.Symbol)
            {
                CfgRule or;
                if(d.TryGetValue(p.Symbol,out or))
                {
                    Console.Error.WriteLine("First-first conflict between " + 
                                             p.Rule + " and " + or);
                } else
                    d.Add(p.Symbol, p.Rule);
            } else
            {
                foreach(var f in follows[nt])
                {
                    CfgRule or;
                    if (d.TryGetValue(f, out or))
                    {
                        Console.Error.WriteLine("First-follows conflict between " + 
                                                 p.Rule + " and " + or);
                    }
                    else
                        d.Add(f, p.Rule);
                }
            }
        }
    }

    #region Build a Lexer for our parser - out of scope of the CFG project but necessary
    Console.WriteLine("Building simple lexer");
    var fas = new FA[] 
    {
        FA.Literal("+","add"),
        FA.Literal("*","mul"),
        FA.Literal("(","lparen"),
        FA.Literal(")","rparen"),
        FA.Repeat(FA.Set("0123456789"), "int") 
    };
            
    var lexer = new FA();
    for(var i = 0;i<fas.Length;i++)
        lexer.EpsilonTransitions.Add(fas[i]);
    Console.WriteLine();
    #endregion

    var text = "(1+3)*2";
            
    Console.WriteLine("Creating tokenizer");
    var tokenizer = new Tokenizer(lexer, text);
    Console.WriteLine("Creating parser");
    var parser = new Parser(parseTable, tokenizer, "Expr");
    Console.WriteLine();
    Console.WriteLine("Parsing " + text);
    Console.WriteLine(parser.ParseSubtree());
}

这就是你的概述。显然，如果你正在构建一个解析器生成器，你将使用PREDICT/FOLLOWS集合来构建代码中的解析表，或者像Parsley那样用它来渲染递归下降解析器。

其他资源

历史

2020年1月31日 - 初始提交