A Generic Frequency Table and Descriptive Statistics in C#

V. Thieme

18 Jan 2007

CPOL

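Frequencies, descriptive statistics and normality testing

What's new?

GetData: the new parameter Pristine indicates whether you want the data returned in the same order in which it was entered. This can be important for time-dependent data. To support this, the new Add method uses a Dictionary<T, List<int>> to store the positions of each value.

Introduction

Exploring empirical data is a common task in many fields. Computational analysis of such data in particular is often hampered by the nature (type) of the data. So, while writing my own statistical routines, I decided to develop a "frequency table class" that accepts data of any possible type.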

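Requirements

My class had to meet the following requirements:

Accept data of any type (especially multiprecision types)
Accept a given array
Provide methods to add and remove single values
Update the absolute frequencies automatically when a value is added or removed
Provide simple access to the mode, the highest frequency, ...
Provide a method to get the table as an array
The returned array must be sortable by frequency and by value
Provide fields/properties to describe the table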

Code

The backbone of this class is the FrequencyTableEntry<T> structure:

// A generic structure storing the frequency information for each value
public struct FrequencyTableEntry<T> where T : IComparable<T>
{
    // Constructor
    // val: The value counted
    // absFreq: The absolute frequency
    // relFreq: The relative frequency
    // percentage: The percentage
    public FrequencyTableEntry(T val, int absFreq, double relFreq, double percentage)
    {
        Value = val;
        AbsoluteFreq = absFreq;
        RelativeFreq = relFreq;
        Percentage = percentage;
    }
    public T Value;
    public int AbsoluteFreq;
    public double RelativeFreq;
    public double Percentage;
}

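The specified type T must implement the IComparable<T> interface (required by the sorting routine). The class stores its data in a generic Dictionary, _entries = new Dictionary<T,int>(): the _entries.Keys collection contains the values to count, and the _entries.Values collection contains the absolute frequency of each particular value.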

public FrequencyTable(int initialCapacity)
{
    _entries = new Dictionary<T, int>(initialCapacity);
    .
    .
}
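To make the storage idea concrete, here is a minimal sketch of counting values with a Dictionary<T, int>. The class FrequencySketch and its Count method are hypothetical names used only for illustration; they are not part of the FrequencyTable class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's FrequencyTable class.
public static class FrequencySketch
{
    // Counts how often each distinct value occurs in the input.
    public static Dictionary<T, int> Count<T>(IEnumerable<T> values)
        where T : IComparable<T>
    {
        var entries = new Dictionary<T, int>();
        foreach (T value in values)
        {
            int current;
            // TryGetValue leaves current at 0 when the key is missing.
            entries.TryGetValue(value, out current);
            entries[value] = current + 1;
        }
        return entries;
    }
}
```

The keys of the returned dictionary are the distinct values and the dictionary values are their absolute frequencies, which mirrors the role of _entries described above.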

To make the table entries easy to access, the implemented enumerator returns the structure shown above.

public IEnumerator<FrequencyTableEntry<T>> GetEnumerator()
{
    // the structure to return
    FrequencyTableEntry<T> _output;
    // the frequency
    int int_f;
    // the "double-typed" frequency
    double dbl_f;
    foreach (T key in _entries.Keys)
    {
        int_f = _entries[key];
        dbl_f = (double)int_f;
        // fill the structure
        _output = new FrequencyTableEntry<T>
        (key, int_f, dbl_f / _count, dbl_f / _count * 100.0);
        // yielding - cool thing that
        yield return _output;
    }
}
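The same yield-based pattern can be tried in isolation. The sketch below is only an illustration under assumptions: EntrySketch and Entries are hypothetical names, and a value tuple stands in for FrequencyTableEntry<T>. It iterates over key/value pairs directly, which avoids a second dictionary lookup per key.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's FrequencyTable class.
public static class EntrySketch
{
    // Lazily yields (value, absolute, relative, percentage) for each key.
    public static IEnumerable<(T Value, int Absolute, double Relative, double Percentage)>
        Entries<T>(Dictionary<T, int> counts)
    {
        // The sample size is the sum of all absolute frequencies.
        int total = 0;
        foreach (int f in counts.Values) total += f;

        foreach (KeyValuePair<T, int> pair in counts)
        {
            double relative = (double)pair.Value / total;
            yield return (pair.Key, pair.Value, relative, relative * 100.0);
        }
    }
}
```

Because the method is an iterator, nothing is computed until the caller starts a foreach over the result.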

The general Add(T value) method looks like this:

public void Add(T value)
{
    List<int> _tempPos;
    // if the Dictionary already contains value, then
    // we have to update frequency and _count
    if (_entries.ContainsKey(value))
    {
        // update the frequency
        _entries[value]++;
        // update mode and highest frequency
        if (_entries[value] > _high)
        {
            _high = _entries[value];
            _mode = value;
        }
        // add 1 to sample size
        _count++;
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        // store the actual position of the entry in the dataset
        _positions.TryGetValue(value, out _tempPos);
        // the position is equal to _count
        _tempPos.Add(_count);
        // remove old entry
        _positions.Remove(value);
        // store new entry
        _positions.Add(value, _tempPos);
    }
    else /* if the dictionary does not contain value add a new entry */
    {
        // if the highest frequency is still zero, set it to one
        if (_high < 1)
        {
            _high = 1;
            _mode = value;
        }
        // add a new entry - frequency is one
        _entries.Add(value, 1);
        // add 1 to table length
        _length++;
        // add 1 to sample size
        _count++;
        // update relative frequencies
        _relFrequencies.Add(value, 0.0);
        foreach (T key in _entries.Keys)
        {
            _relFrequencies[key] = (double)_entries[key] / _count;
        }
        UpdateSumAndMean(value);
        // create a new entry and set position to _count
        _tempPos = new List<int>();
        _tempPos.Add(_count);
        // store it
        _positions.Add(value, _tempPos);
    }
}
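The Add method above refreshes every relative frequency on each call. As a possible alternative (a sketch, not the author's implementation), the relative frequency can be computed on demand while the mode and highest frequency are still tracked incrementally. ModeTracker<T> and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's FrequencyTable class.
public sealed class ModeTracker<T>
{
    private readonly Dictionary<T, int> _entries = new Dictionary<T, int>();

    public int Count { get; private set; }            // sample size
    public int HighestFrequency { get; private set; } // frequency of the mode
    public T Mode { get; private set; }

    public void Add(T value)
    {
        int frequency;
        _entries.TryGetValue(value, out frequency); // stays 0 if value is new
        frequency++;
        _entries[value] = frequency;
        Count++;

        // Only a strictly higher frequency replaces the current mode.
        if (frequency > HighestFrequency)
        {
            HighestFrequency = frequency;
            Mode = value;
        }
    }

    // Computed when asked for, instead of being stored for every key.
    public double RelativeFrequency(T value)
    {
        int frequency;
        return _entries.TryGetValue(value, out frequency)
            ? (double)frequency / Count
            : 0.0;
    }
}
```

The trade-off is a division per query instead of a loop over all keys per insertion, which may matter for tables with many distinct values.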

To simplify the analysis of a given text, I implemented a special constructor.

// Constructor - the created instance analyzes the 
// frequency of characters in a given string
// Text: String to analyze
public FrequencyTable(T Text, TextAnalyzeMode mode)
{
    _positions = new Dictionary<T, List<int>>();
    // if T is not string -> Exception
    if (!(Text is string))
        throw new ArgumentException();
    // the table itself
    _entries = new Dictionary<T, int>();
    _relFrequencies = new Hashtable();
    // number of entries in _entries
    _length = 0;
    // sample size
    _count = 0;
    // description of the table
    _description = "";
    // a user defined tag
    _tag = 0;
    // the highest frequency
    _high = 0;
    _dblSum = double.NaN;
    _mean = double.NaN;
    _alpha = double.NaN;
    AnalyzeString(Text, mode);
    _p = double.NaN;
}

The associated Add method:

public void Add(T Text, TextAnalyzeMode mode)
{
    if (!(Text is string))
        throw new ArgumentException();
    AnalyzeString(Text, mode);
}

In my opinion, it is useful to offer different modes for text analysis. These modes are provided by TextAnalyzeMode.

public enum TextAnalyzeMode
{
    AllCharacters,
    NoNumerals,
    NoSpecialCharacters,
    LettersOnly,
    NumeralsOnly,
    SpecialCharactersOnly
}

The analysis itself is performed by AnalyzeString(T Text, TextAnalyzeMode mode).

private void AnalyzeString(T Text, TextAnalyzeMode mode)
{
    // character strings
    string str_specialChars = @"""!§$%&/()=?@€<>|µ,.;:-_#'*+~²³ ";
    string str_Numbers = "0123456789";
    // Adding the entries according to mode
    switch (mode)
    {
        case TextAnalyzeMode.AllCharacters:
            foreach (char v in Text.ToString())
                this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            break;
        case TextAnalyzeMode.LettersOnly:
            foreach (char v in Text.ToString())
            {
                if ((str_specialChars.IndexOf(v) == -1) &&
                    (str_Numbers.IndexOf(v) == -1))
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoNumerals:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NoSpecialCharacters:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) == -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.NumeralsOnly:
            foreach (char v in Text.ToString())
            {
                if (str_Numbers.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
        case TextAnalyzeMode.SpecialCharactersOnly:
            foreach (char v in Text.ToString())
            {
                if (str_specialChars.IndexOf(v) != -1)
                    this.Add((T)Convert.ChangeType((object)v, Text.GetType()));
            }
            break;
    }
}
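AnalyzeString classifies characters with two hard-coded strings, so any character missing from both lists is treated as a letter. A different technique, sketched below under assumptions, is to classify with char.IsLetter and an ASCII digit check and to treat everything else as special. SketchMode, CharFilterSketch, Keep and Filter are hypothetical names, and the results can differ from the author's method for characters that are not in his lists.

```csharp
using System;
using System.Linq;

// Hypothetical copy of the mode list, so the sketch is self-contained.
public enum SketchMode
{
    AllCharacters, NoNumerals, NoSpecialCharacters,
    LettersOnly, NumeralsOnly, SpecialCharactersOnly
}

// Hypothetical helper, not part of the article's FrequencyTable class.
public static class CharFilterSketch
{
    // Only the ASCII digits 0-9 count as numerals, as in str_Numbers.
    private static bool IsNumeral(char c) { return c >= '0' && c <= '9'; }

    // Assumption: "special" means neither a letter nor a numeral.
    private static bool IsSpecial(char c) { return !char.IsLetter(c) && !IsNumeral(c); }

    public static bool Keep(char c, SketchMode mode)
    {
        switch (mode)
        {
            case SketchMode.NoNumerals: return !IsNumeral(c);
            case SketchMode.NoSpecialCharacters: return !IsSpecial(c);
            case SketchMode.LettersOnly: return char.IsLetter(c);
            case SketchMode.NumeralsOnly: return IsNumeral(c);
            case SketchMode.SpecialCharactersOnly: return IsSpecial(c);
            default: return true; // AllCharacters
        }
    }

    // Returns the characters of text that the given mode would count.
    public static string Filter(string text, SketchMode mode)
    {
        return new string(text.Where(c => Keep(c, mode)).ToArray());
    }
}
```

Putting the decision into a single predicate keeps the six modes in one place, so the loop that adds characters to the table does not have to be repeated per mode.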

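Normality test

The question of whether given data are "Gaussian distributed" comes up frequently. There are several robust and valid tests that answer this question. I implemented the "old-fashioned" Kolmogorov-Smirnov test (KS test). Alternatively, the D'Agostino-Pearson test could be used. Two new properties relate to the normality test:

IsGaussian: returns true if the data are numeric and the computed p-value is greater than Alpha (see below).
Alpha: defines the "significance level" of the KS test.

The KS test method is shown below. It returns true if the test is applicable, and false if the data are not numeric. On exit, the out parameter p contains the p-value, which is also available through the P_Value property.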

private bool KS_Test(out double p)
{
   // D-statistic
   double D = double.NaN;
   CumulativeFrequencyTableEntry<T>[] empCDF =
       GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapointOnce);
   // store the test CDF
   double testCDF;
   // array to store datapoints
   double[] data = new double[empCDF.Length];
   FrequencyTableEntry<T>[] table =
       GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
   int i = 0;
   // prevent exceptions if T is not numerical
   try
   {
      foreach (FrequencyTableEntry<T> entry in table)
      {
         data[i] = (double)Convert.ChangeType(entry.Value, TypeCode.Double);
         i++;
      }
   }
   catch
   {
      p = double.NaN;
      return false;
   }
   // estimate the parameters of the expected Gaussian distribution
   // first: compute the mean
   double mean = Mean;
   // compute the bias-corrected variance
   // as an estimator for the population variance (actually we need the
   // square root)
   double _sqrt_var = Math.Sqrt(VariancePop);
   // now we have to determine the greatest difference between the
   // sample cumulative distribution function (empCDF) and
   // the distribution function to test (testCDF)
   double _sqrt2 = Math.Sqrt(2.0);
   double _erf;
   double max1 = 0.0;
   double max2 = 0.0;
   double _temp;
   for (i = 0; i < empCDF.Length; i++)
   {
      // compute the expected distribution using the error function
      _erf = Erf(((data[i] - mean) / _sqrt_var) / _sqrt2);
      testCDF = 0.5 * (1.0 + _erf);
      _temp = Math.Abs(empCDF[i].CumulativeRelativeFrequency - testCDF);
      if (_temp > max1)
        max1 = _temp;
      if (i > 0)
        _temp = Math.Abs(empCDF[i - 1].CumulativeRelativeFrequency - testCDF);
      else
        _temp = testCDF;
      if (_temp > max2)
        max2 = _temp;
   }
   // the statistics to use is
   // max{diff1,diff2}
   D = max1 > max2 ? max1 : max2;
   // now compute the p-value using a z-transformation
   if (!Double.IsNaN(D))
   {
      double z = Math.Sqrt((double)SampleSize) * D;
      p = KS_Prob_Smirnov(z);
   }
   else
      p = double.NaN;
   return true;
}

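To compute the "test distribution" (in this case the Gaussian CDF), we need the so-called error function. I used the Erf implementation written by Miroslav Stampar (see Special Functions for C#), which is a translation of Stephen L. Moshier's Cephes Math Library.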
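For readers who want to experiment with the steps of the KS test outside the class, here is a self-contained sketch. It rests on several assumptions: Erf uses the Abramowitz-Stegun 7.1.26 polynomial approximation (absolute error around 1.5e-7) rather than the Cephes-based implementation mentioned above, and KolmogorovQ evaluates the standard asymptotic Kolmogorov series, which may or may not match what KS_Prob_Smirnov does internally. All names in KsSketch are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of the article's FrequencyTable class.
public static class KsSketch
{
    // Abramowitz-Stegun 7.1.26 approximation of the error function.
    public static double Erf(double x)
    {
        double sign = x < 0 ? -1.0 : 1.0;
        x = Math.Abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                        - 0.284496736) * t + 0.254829592) * t;
        return sign * (1.0 - poly * Math.Exp(-x * x));
    }

    // Gaussian CDF expressed through the error function.
    public static double NormalCdf(double x, double mean, double sd)
    {
        return 0.5 * (1.0 + Erf((x - mean) / (sd * Math.Sqrt(2.0))));
    }

    // Largest distance between the empirical CDF and the Gaussian CDF.
    // Assumes the data are already sorted in ascending order.
    public static double DStatistic(double[] sorted, double mean, double sd)
    {
        int n = sorted.Length;
        double d = 0.0;
        for (int i = 0; i < n; i++)
        {
            double f = NormalCdf(sorted[i], mean, sd);
            double above = (i + 1.0) / n - f; // step value after the jump
            double below = f - (double)i / n; // step value before the jump
            d = Math.Max(d, Math.Max(above, below));
        }
        return d;
    }

    // Asymptotic tail probability Q(z) = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 z^2).
    public static double KolmogorovQ(double z)
    {
        if (z < 0.2) return 1.0; // the probability is practically 1 here
        double sum = 0.0;
        for (int k = 1; k <= 100; k++)
        {
            double term = Math.Exp(-2.0 * k * k * z * z);
            sum += (k % 2 == 1) ? term : -term;
            if (term < 1e-12) break;
        }
        return Math.Min(1.0, Math.Max(0.0, 2.0 * sum));
    }
}
```

With D from DStatistic and a sample size n, the p-value would be approximated as KolmogorovQ(Math.Sqrt(n) * D), the same z-transformation used in KS_Test. Note that this asymptotic distribution assumes a fully specified reference distribution; when the mean and standard deviation are estimated from the same data, the resulting p-value tends to be conservative.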

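Descriptive statistics

I thought it would be useful to implement some basic statistical properties in the class.

Cumulative frequencies

First, we need a method that returns the empirical distribution function (cumulative distribution function) of the given data.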

public CumulativeFrequencyTableEntry<T>[]
    GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
{
   CumulativeFrequencyTableEntry<T>[] _output = null;
   // get the frequency table as array for easier processing
   FrequencyTableEntry<T>[] _freqTable = 
       GetTableAsArray(FrequencyTableSortOrder.Value_Ascending);
   // temporary values
   double tempCumRelFreq = 0.0;
   int tempCumAbsFreq = 0;
   int i, k;
   switch (Format)
   {
      // each datapoint will be returned
      case CumulativeFrequencyTableFormat.EachDatapoint:
        // initialize the result
        _output = new CumulativeFrequencyTableEntry<T>[SampleSize];
        for (i = 0; i < _freqTable.Length; i++)
        {
          // update the cumulative frequency - relative and absolute
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          // fill the array
          for (k = tempCumAbsFreq - _freqTable[i].AbsoluteFreq; k < tempCumAbsFreq; k++)
          {
            _output[k] = new CumulativeFrequencyTableEntry<T>
                (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
          }
        }
        break;
      // here each different entry will be returned once
      case CumulativeFrequencyTableFormat.EachDatapointOnce:
        // initialize the result
        _output = new CumulativeFrequencyTableEntry<T>[Length];
        for (i = 0; i < _freqTable.Length; i++)
        {
          // update the cumulative frequency - relative and absolute
          tempCumAbsFreq += _freqTable[i].AbsoluteFreq;
          tempCumRelFreq += _freqTable[i].RelativeFreq;
          // fill the array
          _output[i] = new CumulativeFrequencyTableEntry<T>
              (_freqTable[i].Value, tempCumRelFreq, tempCumAbsFreq);
        }
        break;
   }
   // done
   return _output;
}
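The "each distinct value once" format can be illustrated with a short standalone sketch. CumulativeSketch and Cumulate are hypothetical names, and a value tuple stands in for CumulativeFrequencyTableEntry<T>. Unlike the method above, this sketch divides the running absolute count by the sample size instead of summing relative frequencies, which is a design choice of the sketch to keep the last entry at exactly 1.0.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's FrequencyTable class.
public static class CumulativeSketch
{
    // Returns one entry per distinct value, sorted ascending by value.
    public static List<(T Value, int CumulativeAbsolute, double CumulativeRelative)>
        Cumulate<T>(Dictionary<T, int> counts) where T : IComparable<T>
    {
        var keys = new List<T>(counts.Keys);
        keys.Sort(); // relies on IComparable<T>

        int total = 0;
        foreach (int f in counts.Values) total += f;

        var result = new List<(T Value, int CumulativeAbsolute, double CumulativeRelative)>();
        int running = 0;
        foreach (T key in keys)
        {
            running += counts[key];
            result.Add((key, running, (double)running / total));
        }
        return result;
    }
}
```

The cumulative relative frequency of each entry is the value of the empirical distribution function at that data point.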

(Sorry for the strange formatting - it's the editor tool...)

Where is my data??

OK, you need an array of the data you added? Here is the method.

public T[] GetData(bool Pristine)
{
    T[] result = new T[SampleSize];
    // if the order is not important
    if (!Pristine)
    {
        CumulativeFrequencyTableEntry<T>[] cf =
            GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat.EachDatapoint);
        for (int i = 0; i < SampleSize; i++)
            result[i] = cf[i].Value;
    }
    else /* return the data in same order as entered */
    {
        List<int> l;
        foreach (T key in _positions.Keys)
        {
            _positions.TryGetValue(key, out l);
            foreach (int k in l)
            {
                result[k - 1] = key;
            }
        }
    }
    return result;
}
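The idea behind the Pristine branch, recording a 1-based position for every added value and later writing each value back to its slots, can be tried with the following sketch. PositionSketch, Record and Restore are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's FrequencyTable class.
public static class PositionSketch
{
    // Stores, for every distinct value, the 1-based positions where it occurred.
    public static Dictionary<T, List<int>> Record<T>(IEnumerable<T> values)
    {
        var positions = new Dictionary<T, List<int>>();
        int position = 0;
        foreach (T value in values)
        {
            position++;
            List<int> list;
            if (!positions.TryGetValue(value, out list))
            {
                list = new List<int>();
                positions.Add(value, list);
            }
            list.Add(position);
        }
        return positions;
    }

    // Rebuilds the data in its original order from the recorded positions.
    public static T[] Restore<T>(Dictionary<T, List<int>> positions)
    {
        int size = 0;
        foreach (List<int> list in positions.Values) size += list.Count;

        var result = new T[size];
        foreach (KeyValuePair<T, List<int>> pair in positions)
            foreach (int p in pair.Value)
                result[p - 1] = pair.Key; // positions are 1-based
        return result;
    }
}
```

Since List<int> is a reference type, adding to the list found by TryGetValue updates the dictionary entry in place, so no remove-and-add step is needed in this sketch.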

What else?

There are some public properties related to descriptive statistics:

Mean
Median
Mode
Minimum
Maximum
VarianceSample
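VariancePop (unbiased estimator)
StandardDevSample
StandardDevPop (unbiased estimator)
StandardError
Sum
SampleSize - the number of data points (read-only)
HighestFrequency - the highest observed frequency
SmallestFrequency - the smallest frequency
ScarcestValue - the least frequent value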
Kurtosis
KurtosisExcess
Skewness

If the data are not numeric, all of the properties above return double.NaN.
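As a rough illustration of a few of these statistics, the sketch below computes the mean, the median and the variance with both divisors for a non-empty array of doubles. StatsSketch and its method names are hypothetical. Judging from the "unbiased estimator" note, VariancePop presumably corresponds to the n - 1 divisor and VarianceSample to the n divisor, but that mapping is an assumption and not confirmed by the code shown here.

```csharp
using System;

// Hypothetical helper, not part of the article's FrequencyTable class.
// All methods assume a non-empty array.
public static class StatsSketch
{
    public static double Mean(double[] data)
    {
        double sum = 0.0;
        foreach (double x in data) sum += x;
        return sum / data.Length;
    }

    public static double Median(double[] data)
    {
        // Sort a copy so the caller's array stays untouched.
        double[] sorted = (double[])data.Clone();
        Array.Sort(sorted);
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    private static double SumOfSquaredDeviations(double[] data)
    {
        double mean = Mean(data);
        double sum = 0.0;
        foreach (double x in data) sum += (x - mean) * (x - mean);
        return sum;
    }

    // Variance with divisor n.
    public static double VarianceDividedByN(double[] data)
    {
        return SumOfSquaredDeviations(data) / data.Length;
    }

    // Variance with divisor n - 1 (the unbiased estimator); needs n > 1.
    public static double VarianceDividedByNMinusOne(double[] data)
    {
        return SumOfSquaredDeviations(data) / (data.Length - 1);
    }
}
```

The two-pass approach (mean first, then squared deviations) is used here because it is simple and numerically better behaved than the one-pass sum-of-squares formula.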

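Miscellaneous

Here is a list of the remaining public properties and methods.

Properties

Length - the number of table entries (read-only)
Tag - an object that can be set by the user (writable)
Description - a description of the table (writable)
P_Value - the p-value computed by the Kolmogorov-Smirnov test

Methods

Add(T Value) and Add(T Text, TextAnalyzeMode mode)
Remove(T Value)
GetTableAsArray() and GetTableAsArray(FrequencyTableSortOrder order) (sorting uses the Quicksort algorithm)
GetEnumerator()
ContainsValue(T value)
GetCumulativeFrequencyTable(CumulativeFrequencyTableFormat Format)
GetData(bool Pristine) - returns the data as an array (sorted or in the order of entry)
GetRelativeFrequency(T value, out double relFreq)

I think the code is (should be) documented well enough that you can use it to study my solution in detail. I am sure this solution is not perfect, but it is a good starting point.

For a better overview, I added a compiled help file (see the downloads at the top of the page).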

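History

Version 1.0 - 18 Jan 2007
- First release
Version 1.5 - 4 Feb 2007
- Minor bug fix (the highest frequency was not set correctly)
- Added the normality test
- Added descriptive statistics
9 Feb 2007
- Added P_Value without changing the release number
Version 2.0 - 26 Feb 2007
- Added GetData(bool Pristine)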