Using LINQ to Calculate Basic Statistics

Don Kackman

September 19, 2009

CPOL

Extension methods for variance, standard deviation, range, median, mode, and some other basic descriptive statistics

Download source - 27.4 KB
The latest source code is on GitHub
A NuGet package is now available

Introduction

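While working on another project, I found myself needing to calculate basic statistics on data sets of various underlying types. LINQ provides Count, Min, Max, and Average, but no other statistical aggregates. As I usually do in a case like this, I started with Google, figuring someone must already have written some handy extension methods. There are plenty of statistical and numerical processing packages out there, but what I wanted was a simple, lightweight implementation of the basic statistics: variance (sample and population), standard deviation (sample and population), covariance, Pearson (chi squared), range, median, least squares, root mean square, histogram, and mode.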

Background

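I modeled the API on the various overloads of Enumerable.Average, so you can use these methods on the same types of collections that Average accepts. Hopefully, this makes the usage familiar and easy.

This means providing overloads for collections of the common numeric data types and their Nullable counterparts, as well as convenient selector overloads.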

public static decimal? StandardDeviation(this IEnumerable<decimal?> source);
public static decimal StandardDeviation(this IEnumerable<decimal> source);
public static double? StandardDeviation(this IEnumerable<double?> source);
public static double StandardDeviation(this IEnumerable<double> source);
public static float? StandardDeviation(this IEnumerable<float?> source);
public static float StandardDeviation(this IEnumerable<float> source);
public static double? StandardDeviation(this IEnumerable<int?> source);
public static double StandardDeviation(this IEnumerable<int> source);
public static double? StandardDeviation(this IEnumerable<long?> source);
public static double StandardDeviation(this IEnumerable<long> source);
public static decimal? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, decimal?> selector);
public static decimal StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, decimal> selector);
public static double? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, double?> selector);
public static double StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, double> selector);
public static float? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, float?> selector);
public static float StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, float> selector);
public static double? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, int?> selector);
public static double StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, int> selector);
public static double? StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, long?> selector);
public static double StandardDeviation<TSource>
    (this IEnumerable<TSource> source, Func<TSource, long> selector);
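
As a rough illustration of how a selector overload can relate to the plain one, the sketch below projects each element and then defers to the non-selector method. The class name SelectorSketch and both method bodies are assumptions made for this example; the library's own implementation may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: not the library's actual source.
public static class SelectorSketch
{
    // Sample standard deviation of a sequence of doubles (simple two-pass version).
    public static double StandardDeviation(this IEnumerable<double> source)
    {
        List<double> values = source.ToList();
        double mean = values.Average();
        double sumSq = values.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumSq / (values.Count - 1));
    }

    // Selector overload: project each element, then reuse the overload above.
    public static double StandardDeviation<TSource>(
        this IEnumerable<TSource> source, Func<TSource, double> selector)
    {
        return source.Select(selector).StandardDeviation();
    }
}
```

With this in place, a call such as words.StandardDeviation(w => (double)w.Length) computes the statistic over a projected property without building an intermediate collection by hand.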

All of the overloads that take a collection of Nullable types include only the actual values in the calculated result. For example:

public static double? StandardDeviation(this IEnumerable<double?> source)
{
    IEnumerable<double> values = source.AllValues();
    if (values.Any())
        return values.StandardDeviation();

    return null;
}

Where the AllValues method is:

public static IEnumerable<T> AllValues<T>(this IEnumerable<T?> source) where T : struct
{
    Debug.Assert(source != null);
    return source.Where(x => x.HasValue).Select(x => (T)x);
}
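
To make the null handling concrete, here is a small sketch that applies the same pattern to a mean. AllValues repeats the filter shown above; the NullableSketch class and the MeanOrNull helper are hypothetical names used only for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NullableSketch
{
    // Same filter as the AllValues method shown above.
    public static IEnumerable<T> AllValues<T>(this IEnumerable<T?> source) where T : struct
    {
        return source.Where(x => x.HasValue).Select(x => x.Value);
    }

    // Hypothetical helper following the pattern of the nullable overloads:
    // drop the nulls, then return null if nothing is left.
    public static double? MeanOrNull(this IEnumerable<double?> source)
    {
        IEnumerable<double> values = source.AllValues();
        if (values.Any())
            return values.Average();

        return null;
    }
}
```

For the sequence { 1.0, null, 3.0 } this returns 2.0, because the null is skipped rather than counted as zero; for a sequence containing only nulls it returns null.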

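A Note About Mode

Since a distribution of values may not have a mode, all of the Mode methods return a Nullable type. For instance, in the series { 1, 2, 3, 4 }, no single value appears more than once. In cases such as this, the return value will be null.

When there are multiple modes, Mode returns the maximum mode (i.e., the value that appears the most times). If there is a tie for the maximum mode, it returns the smallest value in the set of maximum modes.

There are also two methods for calculating all of the modes in a series. These return an IEnumerable of all of the modes, in descending order of the number of occurrences.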
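
One way the "all modes" behavior described above could be expressed with standard LINQ operators is sketched below. The AllModes name, the restriction to int, and the ascending tie-break are assumptions for this example and are not taken from the library.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ModesSketch
{
    // Hypothetical sketch: every value that occurs more than once,
    // most frequent first; ties are ordered by ascending value (an assumption).
    public static IEnumerable<int> AllModes(this IEnumerable<int> source)
    {
        return source
            .GroupBy(x => x)
            .Where(g => g.Count() > 1)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key)
            .Select(g => g.Key);
    }
}
```

For { 1, 2, 2, 3, 3, 3, 5, 5 } this yields 3, 2, 5, and for { 1, 2, 3, 4 } it yields an empty sequence, which matches the case where Mode returns null.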

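The Statistical Calculations

Links, descriptions, and mathematical images are from Wikipedia.

Variance

Variance is a measure of the amount of variation across all of the scores for a variable (not just the extremes, which give the range).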

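Population variance is typically denoted by the lowercase sigma squared, σ², while sample variance is usually written s².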
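
For reference, the standard textbook definitions are written out below. The population form corresponds to the VarianceP method and the sample form to Variance; the symbols are the conventional ones and are not taken from the original article.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
```

Here μ and x̄ are the population and sample means, and N and n are the number of values.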

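public static double Variance(this IEnumerable<double> source) 
{ 
    // Single pass over the data: keep a running mean and a running
    // sum of squared deviations from that mean (M2).
    int n = 0;
    double mean = 0;
    double M2 = 0;

    foreach (double x in source)
    {
        n = n + 1;
        double delta = x - mean;
        mean = mean + delta / n;
        M2 += delta * (x - mean);
    }
    // Sample variance divides by n - 1, so the result is not
    // meaningful for fewer than two elements.
    return M2 / (n - 1);
}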
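
The program in Using the Code also calls VarianceP. A minimal sketch of what a population version might look like is shown below, assuming it reuses the same running update and changes only the divisor. The class name and the body are illustrative and are not the library's source.

```csharp
using System.Collections.Generic;

public static class VariancePSketch
{
    // Assumed population variant: same single-pass update, divided by n.
    public static double VarianceP(this IEnumerable<double> source)
    {
        int n = 0;
        double mean = 0;
        double m2 = 0;

        foreach (double x in source)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;          // running mean
            m2 += delta * (x - mean);   // running sum of squared deviations
        }

        return m2 / n;
    }
}
```

The only difference from the sample version is the final division by n instead of n - 1.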

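Standard Deviation

The standard deviation of a statistical population, a data set, or a probability distribution is the square root of its variance.

Standard deviation is typically denoted by the lowercase sigma: σ.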

public static double StandardDeviation(this IEnumerable<double> source) 
{ 
    return Math.Sqrt(source.Variance());
}

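Median

The median is the number separating the higher half of a sample, a population, or a probability distribution from the lower half.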

public static double Median(this IEnumerable<double> source) 
{ 
    var sortedList = from number in source 
        orderby number 
        select number; 
        
    int count = sortedList.Count(); 
    int itemIndex = count / 2; 
    if (count % 2 == 0) // Even number of items. 
        return (sortedList.ElementAt(itemIndex) + 
                sortedList.ElementAt(itemIndex - 1)) / 2; 
        
    // Odd number of items. 
    return sortedList.ElementAt(itemIndex); 
}
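
Because the LINQ query in Median is deferred, each Count() and ElementAt() call may enumerate and order the source again. A possible variant that sorts once into a list is sketched here; MedianOnce and the empty-sequence check are my own additions, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MedianSketch
{
    // Sketch: sort once into a list so the count and the index lookups
    // do not re-run the ordering query.
    public static double MedianOnce(this IEnumerable<double> source)
    {
        List<double> sorted = source.OrderBy(x => x).ToList();
        if (sorted.Count == 0)
            throw new InvalidOperationException("source sequence contains no elements");

        int mid = sorted.Count / 2;
        if (sorted.Count % 2 == 0) // even: average the two middle values
            return (sorted[mid - 1] + sorted[mid]) / 2;

        return sorted[mid];        // odd: the middle value
    }
}
```

For { 3, 1, 2 } this returns 2, and for { 4, 1, 3, 2 } it returns 2.5.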

Mode

The mode is the value that occurs most frequently in a data set or a probability distribution.

public static T? Mode<T>(this IEnumerable<T> source) where T : struct
{
    var sortedList = from number in source
                     orderby number
                     select number;

    int count = 0;
    int max = 0;
    T current = default(T);
    T? mode = new T?();

    foreach (T next in sortedList)
    {
        if (current.Equals(next) == false)
        {
            current = next;
            count = 1;
        }
        else
        {
            count++;
        }

        if (count > max)
        {
            max = count;
            mode = current;
        }
    }

    if (max > 1)
        return mode;

    return null;
}

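Histogram

A histogram is a representation of a continuous distribution of data. Given a continuous data set, a histogram counts how many of its data points fall into a set of contiguous ranges of values (also known as bins). There is no single approach to determining the number of bins, as this depends on the data and the analysis being performed. There are some standard mechanisms for calculating the bin size based on the number of data points. Three of these are included in a set of BinCount extension methods.

There are also different approaches to determining the range of each bin. These are indicated by the BinningMode enum. In all but one case, a bin range includes values >= the range minimum and < the range maximum: [min, max). When the BinningMode is MaxValueInclusive, the maximum bin range will include the max value rather than exclude it: [min, max].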
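
The half-open versus inclusive distinction is easiest to see with equal-width bins. The sketch below counts values per bin, treating every bin as [lo, hi) except the last, which includes the maximum. CountBins and BinSketch are hypothetical names, and this is not the library's BinFactory or Bin type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BinSketch
{
    // Hypothetical helper: counts per equal-width bin over [min, max],
    // with every bin half-open except the last, which includes max.
    public static int[] CountBins(IEnumerable<double> source, int binCount)
    {
        List<double> values = source.ToList();
        double min = values.Min();
        double max = values.Max();
        double binSize = (max - min) / binCount;
        int[] counts = new int[binCount];

        foreach (double v in values)
        {
            // A value equal to max would index one past the end, so clamp it
            // into the last bin: that is what makes the last bin [lo, max].
            int index = binSize == 0 ? 0 : (int)((v - min) / binSize);
            counts[Math.Min(index, binCount - 1)]++;
        }
        return counts;
    }
}
```

With the data { 1, 2, 5, 6, 6, 8, 9, 9, 9 } and four bins, the ranges are [1, 3), [3, 5), [5, 7), and [7, 9], and the counts are 2, 0, 3, and 4.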

/// <summary>
/// Controls how the range of the bins are determined
/// </summary>
public enum BinningMode
{
    /// <summary>
    /// The minimum will be equal to the sequence min and the maximum equal to infinity
    /// such that:
    /// [min, min + binSize), [min + binSize * i, min + binSize * (i + 1)), ... , [min + binSize * n, positiveInfinity)
    /// </summary>
    Unbounded,

    /// <summary>
    /// The minimum will be the sequence min and the maximum equal to the sequence max.
    /// The last bin will be max inclusive instead of exclusive:
    /// [min, min + binSize), [min + binSize * i, min + binSize * (i + 1)), ... , [min + binSize * n, max]
    /// </summary>
    MaxValueInclusive,

    /// <summary>
    /// The total range will be expanded such that the min is
    /// less than the sequence min and the max is greater than the sequence max:
    /// [min - (binSize / 2), min + (binSize / 2)), 
    /// [min - (binSize / 2) + binSize * i, min + (binSize / 2) + binSize * i), ... , 
    /// [min - (binSize / 2) + binSize * n, max + (binSize / 2))
    /// </summary>
    ExpandRange
}

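Creating a histogram involves creating an array of Bins with the appropriate ranges and then determining how many data points fall into each range.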

public static IEnumerable<Bin> Histogram
(this IEnumerable<double> source, int binCount, BinningMode mode = BinningMode.Unbounded)
{
    if (source == null)
        throw new ArgumentNullException("source");

    if (!source.Any())
        throw new InvalidOperationException("source sequence contains no elements");

    var bins = BinFactory.CreateBins(source.Min(), source.Max(), binCount, mode);
    source.AssignBins(bins);

    return bins;
}

Range

The range is the length of the smallest interval that contains all of the data.

public static double Range(this IEnumerable<double> source)
{
    return source.Max() - source.Min();
}

Covariance

Covariance is a measure of how much two variables change together.

public static double Covariance(this IEnumerable<double> source, IEnumerable<double> other)
{
    int len = source.Count();

    double avgSource = source.Average();
    double avgOther = other.Average();
    double covariance = 0;
    
    for (int i = 0; i < len; i++)
        covariance += (source.ElementAt(i) - avgSource) * (other.ElementAt(i) - avgOther);

    return covariance / len; 
}
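
The loop above calls ElementAt for every index, which can be slow for sequences that are not lists. A possible alternative that pairs the elements with Zip is sketched below. CovarianceZip is a hypothetical name, and throwing on mismatched lengths is an assumption of this sketch rather than the library's documented behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CovarianceSketch
{
    // Sketch of the same population covariance, pairing elements with Zip.
    public static double CovarianceZip(this IEnumerable<double> source, IEnumerable<double> other)
    {
        List<double> xs = source.ToList();
        List<double> ys = other.ToList();
        if (xs.Count != ys.Count)
            throw new InvalidOperationException("Both sequences must have the same length");

        double avgX = xs.Average();
        double avgY = ys.Average();

        // Zip walks both lists in step, so no per-element ElementAt lookup is needed.
        return xs.Zip(ys, (x, y) => (x - avgX) * (y - avgY)).Sum() / xs.Count;
    }
}
```

Like the original, this divides by the number of elements, so it is the population covariance.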

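Pearson's Chi-Squared Test

Pearson's chi-squared test is used to assess two types of comparisons: tests of goodness of fit and tests of independence.

In other words, it is a measure of how well a sample distribution matches a predicted distribution, or of the degree of correlation between two sample distributions. Pearson's is often used in scientific applications to test the validity of hypotheses. Note that the Pearson method shown here divides the covariance by the product of the two population standard deviations, which is the formula for Pearson's correlation coefficient (a measure of linear correlation between the two sequences) rather than the chi-squared statistic.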

public static double Pearson(this IEnumerable<double> source, 
                             IEnumerable<double> other)
{
    return source.Covariance(other) / (source.StandardDeviationP() * 
                             other.StandardDeviationP());
}
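
Written as a formula, the return expression above is the following. This is the standard form of Pearson's correlation coefficient; the symbols are chosen here for illustration and do not come from the original article.

```latex
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y},
\qquad
\operatorname{cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})
```

Here σ_X and σ_Y are the population standard deviations (StandardDeviationP), which matches the division by len in Covariance. The result lies between -1 and 1.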

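Linear Least Squares

Least squares is an approach for determining an approximate solution for a distribution of data in regression analysis. In other words, given a two-dimensional distribution of data, which equation best predicts y as a function of x, in the form y = mx + b, where m is the slope of the line and b is its y-intercept on a two-dimensional graph?

For this calculation, a struct indicating m and b is returned.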

public static LeastSquares LeastSquares(this IEnumerable<Tuple<double, double>> source)
{
    int numPoints = 0;
    double sumX = 0;
    double sumY = 0;
    double sumXX = 0;
    double sumXY = 0;

    foreach (var tuple in source)
    {
        numPoints++;
        sumX += tuple.Item1;
        sumY += tuple.Item2;
        sumXX += tuple.Item1 * tuple.Item1;
        sumXY += tuple.Item1 * tuple.Item2;
    }

    if (numPoints < 2)
        throw new InvalidOperationException("Source must have at least 2 elements");

    double b = (-sumX * sumXY + sumXX * sumY) / (numPoints * sumXX - sumX * sumX);
    double m = (-sumX * sumY + numPoints * sumXY) / (numPoints * sumXX - sumX * sumX);

    return new LeastSquares(m, b);
}
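
The two assignments at the end of the method are the usual closed-form solution for a straight-line fit. Written out with n for numPoints and the sums named as in the code:

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
```

As a quick check, the points (0, 1), (1, 3), (2, 5) give n = 3, Σx = 3, Σy = 9, Σx² = 5, and Σxy = 13, so the denominator is 15 - 9 = 6, m = (39 - 27) / 6 = 2, and b = (45 - 39) / 6 = 1, which is the line y = 2x + 1 that passes through all three points.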

Root Mean Square

Root mean square is a measure of the magnitude of a varying series. It is especially useful for waveforms.

public static double RootMeanSquare(this IEnumerable<double> source)
{
    if (source.Count() < 2)
        throw new InvalidOperationException("Source must have at least 2 elements");

    // Sum of the squares of all values.
    double s = source.Aggregate(0.0, (sum, d) => sum + d * d);

    return Math.Sqrt(s / source.Count());
}
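
In formula form, the value computed above is:

```latex
x_{\mathrm{rms}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}
```

For example, for the series { 3, 4 } this is the square root of (9 + 16) / 2 = 12.5, or approximately 3.5355.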

Using the Code

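The included unit tests should provide plenty of examples of how to use these methods, but at its simplest, they behave like the other enumerable extension methods. The following program...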

static void Main(string[] args)
{
      IEnumerable<int> data = new int[] { 1, 2, 5, 6, 6, 8, 9, 9, 9 };

      Console.WriteLine("Count = {0}", data.Count());
      Console.WriteLine("Average = {0}", data.Average());
      Console.WriteLine("Median = {0}", data.Median());
      Console.WriteLine("Mode = {0}", data.Mode());
      Console.WriteLine("Sample Variance = {0}", data.Variance());
      Console.WriteLine("Sample Standard Deviation = {0}", data.StandardDeviation());
      Console.WriteLine("Population Variance = {0}", data.VarianceP());
      Console.WriteLine("Population Standard Deviation = {0}", 
                    data.StandardDeviationP());
      Console.WriteLine("Range = {0}", data.Range());
}

...yields:

Count = 9
Average = 6.11111111111111
Median = 6
Mode = 9
Sample Variance = 9.11111111111111
Sample Standard Deviation = 3.01846171271247
Population Variance = 8.09876543209877
Population Standard Deviation = 2.8458329944146
Range = 8
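
The variance figures can be checked by hand from the nine data values, which sum to 55 and whose squares sum to 409:

```latex
\sum_{i}(x_i - \bar{x})^2 = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}
  = 409 - \frac{55^2}{9} = \frac{656}{9} \approx 72.8889

s^2 = \frac{656/9}{8} \approx 9.1111,
\qquad
\sigma^2 = \frac{656/9}{9} \approx 8.0988
```

Taking square roots gives approximately 3.0185 and 2.8458, matching the sample and population standard deviations printed above.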

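Points of Interest

I didn't spend much time optimizing the calculations, so be careful if you are evaluating extremely large data sets. If you come up with an optimization to any of the attached code, drop me a note, and I'll update the source.

Hopefully, you'll find this code handy the next time you need some simple statistical calculations.

A Note About T4 Templates

I had never found a great deal of use for code generation templates, but in working on this library, they simplified one use case immensely: arithmetic operators cannot be represented directly in C# generics. Since operators are implemented as static methods, and there is no mechanism to require a Type to have specific static methods, the compiler is unable to generically resolve the "-" in this code block: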

public static T Range<T>(this IEnumerable<T> source)
{
    // error CS0019: Operator '-' cannot be applied to operands of type 'T' and 'T'
    return source.Max() - source.Min(); 
}

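If you look at the pattern set by Average and Sum in the framework classes (which I tried to emulate here), they operate on enumerations of int, long, float, double, and decimal. To avoid a lot of "copy, paste, modify the operand type in code and comments", T4 templates come in very handy.

Basically, for each operation that can operate on a set of intrinsic types, the template:

Declares a list of the supported types
Iterates over the list and generates the code comments, method signature, and body for all of the overloads supporting the given intrinsic type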

    public static partial class EnumerableStats
    {
    <# var types = new List<string>()
    {
        "int", "long", "float", "double", "decimal"
    };

    foreach(var type in types)
    {#>	
    	/// <summary>
    	/// Computes the Range of a sequence of nullable <#= type #> values.
    	/// </summary>
        /// <param name="source">The sequence of elements.</param>
        /// <returns>The Range.</returns>
        public static <#= type #>? Range(this IEnumerable<<#= type #>?> source)
        {
            IEnumerable<<#= type #>> values = source.AllValues();
            if (values.Any())
                return values.Range();

            return null;
        }

    	/// <summary>
    	/// Computes the Range of a sequence of <#= type #> values.
    	/// </summary>
        /// <param name="source">The sequence of elements.</param>
        /// <returns>The Range.</returns>
        public static <#= type #> Range(this IEnumerable<<#= type #>> source)
        {
            return source.Max() - source.Min();
        }

	...
	etc etc
	...
<# } #>
   }

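This is nice because it ensures that all of the types support the same set of overloads and have identical implementations and code comments.

History

September 19, 2009: Version 1.0 - Initial upload
October 26, 2009: Version 1.1 - Added Covariance and Pearson, plus some fixes and optimizations
December 3, 2013: Version 1.2 - Updated the variance implementation and added GitHub and NuGet links
August 30, 2014: Version 1.3 - Added descriptions of least squares and histogram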