动态评估的 SQL LINQ 查询

Don Kackman

4.95/5 (35投票s)

2009年11月10日

CPOL

8分钟阅读

211975

2650

用于在 IEnumerable 集合上评估纯文本 SQL 查询的扩展方法。

介绍

现在有了一个 NuGet 包。

你是否曾想过对内存中的集合和列表执行简单的 SQL SELECT 语句？我曾想过，本文介绍了一种利用 LINQ 和运行时生成的 Lambda 表达式进行编译和执行的机制。其结果是支持运行时查询评估，例如：

var result = source.Query<Person, Tuple<string, double>>(
               "SELECT Address, Avg(Age) FROM this GROUP BY Address"); 
var result2 = source.Query<Person, Family>(
              "SELECT Address, Avg(Age) AS AverageAge FROM this GROUP BY Address"); 
var result3 = source.Query<Person>("SELECT * FROM this ORDER BY age");
var result4 = source.Query<Person, string>("SELECT DISTINCT address FROM this")

我非常喜欢 LINQ，尤其是 LINQ to Objects。通过 `foreach` 迭代和评估可能需要数十行代码的逻辑，通常可以缩短到 1-2 行。任何能减少代码行的东西，对可维护性和质量来说都是一个巨大的加分项。随着我在生产代码和业余代码中越来越多地使用 LINQ，我发现虽然 `Enumerable` 扩展方法很容易理解和使用，但我仍然会在“几乎 SQL”的内联 LINQ C# 关键字上遇到麻烦。也许我在这方面花了太长时间，但我的手指就是无法用 `SELECT` 以外的任何词来开始一个查询语句。

撇开我个人的不足之处，我发现 LINQ 有一个更实际的限制，那就是那些“几乎 SQL”的语句仍然与应用程序的静态结构紧密耦合。我过去非常喜欢 `VARIANT`，并且一直认为 `IDispatchEx` 在 COM 过时之前，在微软应用程序之外从未真正发挥其真正的潜力。动态类型有很多优点，尤其是在日益庞大、复杂和分布式的系统中。也许我应该转用 Python，但 C# 才是我的饭碗。

幸运的是，微软一直在为 .NET 和 C# 添加动态类型功能。.NET/C# 4.0 增加了一些有趣的功能：F# 的 Tuple 类型已成为 BCL 的一部分；C# 获得了 dynamic 关键字以及对 Dynamic Language Runtime 的访问权限，而 BCL 则获得了 `ExpandoObject`，这使得像 VB.NET 和 C# 这样的静态类型语言也能拥有 Duck Typing 语言的某些特性。动态类型和静态类型在 C# 中的结合可能是一个强大的新补充，也可能最终会变成一个融合了两方面最糟糕特性的“弗兰肯斯坦”。随着时间的推移，看看所有这些是如何发展的将会很有趣。

但言归正传。本文的真正目的是我一直想编写一个运行时评估引擎。.NET 3.5（通过添加 `System.Linq.Expressions`）和 C# 4.0 的动态类型功能提供了合适的工具集。所以我试了一下。

最终得到的结果是，你可以把这样的东西

var result = from p in source
group p by p.Address into g 
where g.Average(p => p.Age) > 40
select new { Address = g.Key, AverageAge = g.Average(p => p.Age) };

替换成这样的东西

var result = source.Query<Person, dynamic>("SELECT Address, Avg(Age) AS AverageAge
         FROM this GROUP BY Address HAVING AverageAge > 40")

你为什么会在微软费了这么大力气创建 Linq 后还要这种东西呢？嗯，大多数时候你不需要。Linq 对于大多数数据模型是静态的、应用程序对其的视图只缓慢变化的应用程序来说是极好的。我开始研究这个想法，是因为那些数据模型是动态的，或者应用程序对其数据的视图无法在编译时定义的情况。

背景

本文的基础是之前的篇章：将 LINQ to SQL 推向另一个方向。该文章描述了所使用的基本解析和评估基础设施。大部分底层代码是相同的（或者至少当初是这样开始的），尤其是在使用 GOLD 解析器解析 SQL 方面。之前代码的一个限制是，要评估的数据必须是 `IDictionary<string, object>` 的形式，并且返回给调用者的数据也是这种形式。

它还借鉴了微软在 VS2008 示例集中发布的 Dynamic LINQ 示例的一些灵感。

买者自负

附加的代码和项目是用 Microsoft Visual Studio 2010 Beta 2 创建的，以及 C# 和 .NET 4.0 的一些新功能，例如：

Tuple 类型
dynamic 关键字
ExpandObject

总而言之，你将无法使用 VS.NET 2008 或 .NET 3.5 来玩转附加的代码。

Using the Code

Class Diagram

API 的入口是一组小的扩展方法，它们扩展了 `IEnumerable<T>` 并接受字符串形式的 SQL 语句。有两组重载：

QueryScalar - 返回单个标量值（如 int）
Query - 返回一个 `IEnumerable<T>`

SQL 字符串立即被转换为一个 `Query<,>` 派生对象（一个用于标量查询，一个用于枚举查询）。`Query` 类的职责是：

将 SQL 文本解析为解析树
将解析树转换为表达式树
将表达式树转换为 lambda 函数并进行缓存，以便用于评估数据集

解析在基类中进行，与上一篇文章中的基本相同。

创建评估函数，就是确定 SQL 查询的哪些部分存在，并为每个部分生成一个 lambda 函数

protected override void OnCompile()
{
    if (SyntaxNode.GroupByClause == null)
        Select = SyntaxNode.CreateSelector<TSource, TResult>();
    else
        GroupBySelect = SyntaxNode.CreateGroupBySelector<TSource, TResult>();

    if (SyntaxNode.WhereClause != null)
        Where = SyntaxNode.WhereClause.CreateEvaluator<TSource>();

    if (SyntaxNode.OrderByClause != null)
        OrderBy = SyntaxNode.OrderByClause.CreateFunction<TSource>();

    if (SyntaxNode.HavingClause != null)
        Having = SyntaxNode.HavingClause.CreateEvaluator<TResult>();

由此产生的评估方法按顺序执行每个 lambda 函数

public override IEnumerable<TResult> Evaluate(IEnumerable<TSource> source)
{
    if (Where != null)
        source = source.Where<TSource>(Where);

    if (OrderBy != null)
        source = OrderBy(source);

    IEnumerable<TResult> result = Enumerable.Empty<TResult>();
    if (Select != null)
        result = source.Select<TSource, TResult>(Select);
    else
        result = GroupBySelect(source);

    if (Having != null)
        result = result.Where<TResult>(Having);

    if (SyntaxNode.Columns.Distinct)
        return result.Distinct();

    return result;
}

关于 SQL 语法的几点说明

SQL 语法基于 SQL 89 的 DML 的 `SELECT` 部分。未来可能会支持 `INSERT`、`DELETE` 和 `UPDATE` 功能。

为了同时支持查询类型的属性和值，内置了一个 `value()` 函数。这允许区分如下查询：

List<string> l = new List<string>();
l.Add("don");
l.Add("phillip");
l.Add("kackman");

IEnumerable<string> result = l.Query<string>("SELECT * FROM this WHERE value() = 'don'");
IEnumerable<string> result2 = l.Query<string>("SELECT * FROM this WHERE Length > 3");

此外，`FROM` 子句的连接链部分中的 `this` 条目仅是一个占位符。它引用了传递给扩展方法或传递给 `Evaluate` 方法的 `IEnumerable<T>` 的同一个 `this`。在未来的更新中，我想添加对跨多个集合的连接的支持，但目前尚不具备此功能。

关注点

这其中最具挑战性的是创建结果选择器。SQL 的返回类型是多态的，取决于查询。`SELECT` 查询可以在不进行类型转换的情况下返回输入数据的子集，例如 `SELECT * FROM this WHERE Age = 40`。它可以返回输入数据类型的字段子集：`SELECT name, age FROM this`。在某些情况下（如 `SELECT Avg(Age) FROM this`），它可以返回单个值。或者查询可以返回完全转换的类型，这些类型是对输入数据的聚合：`SELECT name, Avg(age) FROM this GROUP BY name`。

传递给 `Query` 方法的类型参数指示了可枚举中包含的类型和要创建并返回的类型。

public static IEnumerable<TResult> Query<TSource, TResult>
                       (this IEnumerable<TSource> enumerable, string sql)

子选择

对于将返回与传入类型相同的类型的查询，`TSource` 和 `TResult` 都将是相同的。在这种情况下有一个只接受一个类型参数的重载。

IEnumerable<int> source = TestData.GetInts();
IEnumerable<int> result = source.Query<int>("SELECT * FROM this WHERE value() > 3");

在这种情况下，选择器是一个简单的标识 lambda 函数

public static Expression<Func<TSource, TResult>> CreateIdentitySelector<TSource, TResult>()
{
    ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");
    return Expression.Lambda<Func<TSource, TResult>>(arg, arg);
}

单个属性

为了选择源类型的单个属性，会创建一个 lambda 函数，该函数从每个源对象返回该属性的值。这不需要创建和初始化新对象。

IEnumerable<Person> source = TestData.GetPeople();
IEnumerable<int> result = source.Query<Person, int>("SELECT Age FROM this WHERE age > 40"); 

...

public static Expression<Func<TSource, TResult>> 
       CreateSinglePropertySelector<TSource, TResult>(string propertyName)
{
    ParameterExpression param = Expression.Parameter(typeof(TSource), "item");
    return Expression.Lambda<Func<TSource, TResult>>(
           Expression.PropertyOrField(param, propertyName), param);
}

多个属性

返回多个属性需要创建 `TResult` 的新实例并用结果数据填充它们。有三种方法可以做到这一点。

在大多数情况下，预期返回类型具有与 `SELECT` 语句中的每个字段相对应的读/写属性。在这种情况下，如果源属性名与结果属性名不同，则可以使用 `AS` 关键字来映射它们。在下面的示例中，`Person` 类有一个名为 `Address` 的属性，而 `OtherPerson` 有一个名为 `Location` 的属性。

var result = source.Query<Person, OtherPerson>("SELECT name, address AS location FROM this")

...

public static Expression<Func<TSource, TResult>> CreateMemberInitSelector<TSource, TResult>
                (IEnumerable<string> sourceFields, IEnumerable<string> resultFields)
{
      ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");

      // loop through all of the result fields and generate an expression
      // that will assign it from the source fields of the param
      var bindings = new List<MemberAssignment>();
      int i = 0;
      foreach (string field in resultFields)
      {
          MemberInfo member = typeof(TResult).GetPropertyOrField(field);
          MemberExpression memberExpression = 
             Expression.PropertyOrField(arg, sourceFields.ElementAt(i++));
          bindings.Add(Expression.Bind(member, memberExpression));
      }

      var init = Expression.MemberInit(Expression.New(typeof(TResult)), bindings);
      return Expression.Lambda<Func<TSource, TResult>>(init, arg);
}

旁注：所有属性名都以**不区分大小写**的方式进行评估。如果 `TSource` 或 `TResult` 同时具有 `Property` 和 `property`，则会抛出异常。

为 `Tuple` 类型创建了一个特殊选择器，该类型有一个构造函数，它接受每个组成属性的参数。`Select` 语句中字段的顺序必须与构造函数声明中参数的顺序一致。

var result = source.Query<Person, Tuple<string, string>>("SELECT name, address FROM this")

...

public static Expression<Func<TSource, TResult>> 
       CreateConstructorCallSelector<TSource, TResult>
      (IEnumerable<string> fields, Type[] constructorTypes)
{
      ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");

      var bindings = fields.Select(field => Expression.PropertyOrField(arg, field));
      // the values that will intialize a TResult

      ConstructorInfo constructor = typeof(TResult).GetConstructor(constructorTypes);
      // the constructor for a new TResult

      NewExpression _new = Expression.New(constructor, bindings);
      return Expression.Lambda<Func<TSource, TResult>>(_new, arg);
}

动态和 Expando

当 `dynamic` 指定为 `TResult` 时，会创建另一个特殊选择器。在这种情况下，你将始终收到 `ExpandoObject` 的集合。`ExpandoObject` 实现 `IDictionary<string, object>` 以存储动态分配的属性集，并使用此接口填充返回对象。正是通过这种机制，该 API 才从静态类型转向动态类型。我注意到 `ExpandoObject` 的一个特点是其属性名是区分大小写的。我不知道这是好是坏，但不知何故，我期望它们不区分大小写，因为这样似乎更能与动态类型环境相协调。

var result = source.Query<Person, dynamic>("SELECT age, address FROM this")

...

public static Expression<Func<TSource, TResult>> CreateExpandoSelector<TSource, TResult>
                      (IEnumerable<string> sourceFields, IEnumerable<string> resultFields)
{
      ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");

      // loop through all of the result fields and generate an expression that will 
      // add a new property to the result expando object using its IDictionary interface
      var bindings = new List<ElementInit>();
      MethodInfo addMethod = typeof(IDictionary<string, object>).GetMethod(
                        "Add", new Type[] { typeof(string), typeof(object) });
      int i = 0;
      foreach (string field in resultFields)
      {
          MemberExpression memberExpression = 
             Expression.PropertyOrField(arg, sourceFields.ElementAt(i++));
          bindings.Add(Expression.ElementInit(addMethod, 
                   Expression.Constant(field), 
                   Expression.Convert(memberExpression, typeof(object))));
      }

      var expando = Expression.New(typeof(ExpandoObject));
      return Expression.Lambda<Func<TSource, TResult>>(
             Expression.ListInit(expando, bindings), arg);
}

分组

到目前为止，最具挑战性的选择器是 `GROUP BY`。这不仅涉及 `TSource` 和 `TResult` 之间的类型转换，还涉及对返回类型属性的计算，而不是仅仅赋值。我花了一段时间才弄清楚如何在不计算和缓存每个聚合的中间状态的情况下实现这一点。最后，我创建了一个类型 `GroupByCall`，以便在编译时缓存每个聚合的委托，以便在评估期间稍后调用。

var result = source.Query<Person, Family>(
   "SELECT Address, Avg(Age) AS AverageAge FROM this GROUP BY Address")

...

public static Expression<Func<IEnumerable<TSource>, IEnumerable<TResult>>> 
                    CreateGroupBySelector<TSource, TResult>(string groupByField,
                    Type keyType, IEnumerable<AggregateNode> aggregates)
{
      // create the key selector
      // the parameter passsed to the keySelector
      var keyLambdaArg = Expression.Parameter(typeof(TSource), "keyLambdaArg");      
      var keyLambda = Expression.Lambda(
                 Expression.PropertyOrField(keyLambdaArg, groupByField), keyLambdaArg);

      // the grouped subset passed to resultSelector
      var group = Expression.Parameter(typeof(IEnumerable<TSource>), "group");
            
      // create an object to cache some state for the result selector
      GroupByCall<TSource, TResult> groupingCall = 
                 new GroupByCall<TSource, TResult>(groupByField);
      // for each aggregate in the query create a lambda expression
      // and add it to the cache
      foreach (AggregateNode aggregate in aggregates)
      {
          var aggregateExpression = aggregate.GetCallExpression(
              typeof(IEnumerable<TSource>), typeof(TSource), group);
          groupingCall.Aggregates.Add(aggregate.Alias, 
              Expression.Lambda(aggregateExpression, group).Compile());
      }

      // create the call to the result selector
      var key = Expression.Parameter(keyType, "key");
      var groupingFunc = Expression.Call(Expression.Constant(groupingCall), 
                           "GroupingFunc", new Type[] { keyType }, key, group);
      var resultSelectorLambda = Expression.Lambda(groupingFunc, key, group);

      // package all of that up in a call to Enumerable.GroupBy
      var data = Expression.Parameter(typeof(IEnumerable<TSource>), "data"); // the input data
      var groupByExpression = Expression.Call(typeof(Enumerable), "GroupBy",
                     new Type[] { typeof(TSource), keyType, typeof(TResult) }, 
                     data, keyLambda, resultSelectorLambda);

      // create the lambda
      return Expression.Lambda<Func<IEnumerable<TSource>, 
                   IEnumerable<TResult>>>(groupByExpression, data);
}

Joining (汇合)

更新后的代码还支持两个 IEnumerables 之间的简单内连接。鉴于需要识别外部可枚举的 SQL 测试，我发明了一个“that”关键字。也许有点牵强，但它满足了我对这种可能性可行性的好奇心。

[TestMethod]
public void JoinAndGroup()
{
    IEnumerable<Person> source = TestData.GetPeople();
    IEnumerable<Family> families = TestData.GetFamilies();

    var answer = from p in source
                 join f in families on p.Address equals f.Address
                 group f by f.Name into g
                 select new Tuple<string, int>(g.Key, g.Count());

    var result = source.Query<Person, Family, Tuple<string, int>>("SELECT that.Name, COUNT(*)" + 
          "FROM this INNER JOIN that ON this.Address = that.Address GROUP BY that.Name", families);

    Assert.IsTrue(result.Any());
    Assert.IsTrue(answer.Any());
    Assert.IsTrue(result.SequenceEqual(answer));
}

结论

附加的单元测试包含上面各种组合的许多示例，但大多数 SQL SELECT 的基本语法应该都可以工作。到目前为止，我对这段代码进行了很多有趣的尝试，并计划对其进行更新以增加更多功能。希望你能觉得它有用，或者至少很有趣。

历史

初始上传 - 2009 年 11 月 11 日。
添加了连接、错误修复、杂项其他内容 - 2013 年 5 月 8 日
修复了一些与可空类型比较相关的错误 - 2013 年 5 月 25 日
增加了将 ExpandoObject 用作枚举源和连接目标的功能 - 2013 年 8 月 2 日