65.9K
CodeProject 正在变化。 阅读更多。
Home

使用 SQLite、Leverage 的 create_aggregate() 和 SQL 的 Between 操作符创建正态概率分布直方图,或者更常见地称为钟形曲线

starIconstarIconstarIconstarIconstarIcon

5.00/5 (4投票s)

2016年8月7日

CPOL

3分钟阅读

viewsIcon

33692

使用 SQLite,利用 create_aggregate() 和 SQL 的 Between 操作符创建一个正态概率分布直方图,或者更常见地称为钟形曲线。

我在工作中扮演的角色之一是绩效分析员。我工作在一个多平台的环境中,因此,我自然对保留那些可以可靠地在不同平台使用而无需过多麻烦的技术感兴趣,Python 是一个合乎逻辑的选择。只需稍加留意,就可以设计一个 Python 脚本,使其能够在从 Andrioid 到 Windows 的任何支持的平台上运行。此外,还可以动态地在运行时处理不同的 Python 版本。

Python 开箱即用地支持 SQLite3,使用这种组合来构建一套用于绩效分析中常见的统计处理工具集是很有吸引力的。

与 Oracle、IBM DB2、Microsoft SQL Server 和 MySQL 等企业级客户端-服务器市场中的 SQLite 的强大竞争对手不同,它们提供了丰富的内置聚合函数,SQLite 只提供基本的聚合函数,如 AverageMaxMinCountgroup_concatSumTotal。更复杂的统计聚合函数则令人遗憾地缺失了。

然而,这并不是 SQLite 人员的疏忽,而是有意为之。SQLite 被设计成占用的空间很小,在许多应用中,它通常被嵌入到硬件中,或者驻留在内存中。

在我的情况下,我需要这些统计函数,尤其是标准差,因为它是许多计算的基础。

幸运的是,SQLite 允许通过 create_aggregate() 方法在运行时将自定义聚合函数导入数据库,这可以被描述为一种“ à la carte” 的方式。

这非常有益,因为它允许在不需要“糟糕的”将数据拉取、处理和推回的范式,以及数据层之外的 CPU 密集型循环和迭代的情况下,添加自定义的临时功能。数据函数是在层内执行的,而不是跨层执行的。数据库可以被视为一个容器对象,它在内部处理数据,这个对象可以非常聪明,也可以非常愚蠢。选择取决于您在运行时根据您的需求而定。

此脚本演示了如何使用 SQLite 的 create_aggregate() 作为基础函数来生成最常见的统计分析工具之一,即分布直方图,或者更常见地称为钟形曲线。

 
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sqlite3
import math
import random
import os
import sys
import traceback
import random

class StdevFunc:
    def __init__(self):
        self.M = 0.0    #Mean
        self.V = 0.0    #Used to Calculate Variance
        self.S = 0.0    #Standard Deviation
        self.k = 1      #Population or Small 

    def step(self, value):
        try:
            if value is None:
                return None

            tM = self.M
            self.M += (value - tM) / self.k
            self.V += (value - tM) * (value - self.M)
            self.k += 1
        except Exception as EXStep:
            pass
            return None    

    def finalize(self):
        try:
            if ((self.k - 1) < 3):
                return None
            
            #Now with our range Calculated, and Multiplied finish the Variance Calculation
            self.V = (self.V / (self.k-2))

            #Standard Deviation is the Square Root of Variance
            self.S = math.sqrt(self.V)

            return self.S
        except Exception as EXFinal:
            pass
            return None 

def Histogram(Population):
    try:
        BinCount = 6 
        More = 0

        #a = 1          #For testing Trapping
        #b = 0          #and Trace Back
        #c = (a / b)    #with Detailed Info
        
        #If you want to store the Database
        #uncDatabase = os.path.join(os.getcwd(),"BellCurve.db3")
        #con = sqlite3.connect(uncDatabase)
        
        #If you want the database in Memory
        con = sqlite3.connect(':memory:')    

        #row_factory allows accessing fields by Row and Col Name
        con.row_factory = sqlite3.Row

        #Add our Non Persistent, Runtime Standard Deviation Function to the Database
        con.create_aggregate("Stdev", 1, StdevFunc)

        #Lets Grab a Cursor
        cur = con.cursor()

        #Lets Initialize some tables, so each run with be clear of previous run
        cur.executescript('drop table 
        if exists MyData;') #executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS MyData('Val' FLOAT)")
        cur.executescript('drop table 
        if exists Bins;')   #executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS Bins('Bin' 
        UNSIGNED INTEGER, 'Val' FLOAT, 'Frequency' UNSIGNED BIG INT)")

        #Lets generate some random data, and insert in to the Database
        for n in range(0,(Population)):
            sql = "insert into MyData(Val) values ({0})".format(random.uniform(-1,1))
            #If Whole Number Integer greater that value of 2, Range Greater that 1.5
            #sql = "insert into MyData(Val) values ({0})".format(random.randint(-1,1))
            cur.execute(sql)
            pass

        #Now let’s calculate some built in Aggregates, that SQLite comes with
        cur.execute("select Avg(Val) from MyData")
        Average = cur.fetchone()[0]
        cur.execute("select Max(Val) from MyData")
        Max = cur.fetchone()[0]
        cur.execute("select Min(Val) from MyData")
        Min = cur.fetchone()[0]
        cur.execute("select Count(Val) from MyData")
        Records = cur.fetchone()[0]

        #Now let’s get Standard Deviation using our function that we added
        cur.execute("select Stdev(Val) from MyData")
        Stdev = cur.fetchone()[0]

        #And Calculate Range
        Range = float(abs(float(Max)-float(Min)))

        if (Stdev == None):
            print("================================   Data Error ===================================")
            print("                 Insufficient Population Size, Or Bad Data.")   
            print("**********************************************************************************")
        elif (abs(Max-Min) == 0):
            print("================================   Data Error ===================================")
            print(" The entire Population Contains Identical values, Distribution Incalculable.")
            print("**********************************************************************************")            
        else:  
            Bin = []        #Holds the Bin Values
            Frequency = []  #Holds the Bin Frequency for each Bin

            #Establish the 1st Bin, which is based on (Standard Deviation * 3) 
             being subtracted from the Mean
            Bin.append(float((Average - ((3 * Stdev)))))
            Frequency.append(0)
            
            #Establish the remaining Bins, which is basically adding 1 Standard Deviation
            #for each interation, -3, -2, -1, 1, 2, 3             
            for b in range(0,(BinCount) + 1):
                Bin.append((float(Bin[(b)]) + Stdev))
                Frequency.append(0)

            for b in range(0,(BinCount / 1) + 1):
                #Lets exploit the Database and have it do the hard work calculating distribution
                #of all the Bins, with SQL's between operator, but making it left inclusive, 
                 right exclusive.
                sqlBinFreq = "select count(*) as Frequency from MyData 
                where val between {0} and {1} and Val < {2}". \
                             format(float((Bin[b])), float(Bin[(b + 1)]), float(Bin[(b + 1)]))

                #If the Database Reports Values that fall between the Current Bin, 
                 Store the Frequency to a Bins Table. 
                for rowBinFreq in cur.execute(sqlBinFreq):
                    Frequency[(b + 1)] = rowBinFreq['Frequency']
                    sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                                       format(b, float(Bin[b]), Frequency[(b)])
                    cur.execute(sqlBinFreqInsert)

                #Allthough this Demo is not likley produce values that
                #fall outside of Standard Distribution
                #if this demo was to Calculate with real data, we want to know
                #how many non-Standard data points we have. 
                More = (More + Frequency[b])

            More = abs((Records - More))
            
            sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                                format((BinCount + 1), float(0), More)
            cur.execute(sqlBinFreqInsert)
               
            #Now Report the Analysis
            print("================================ The Population ==================================")
            print("             {0} {1} {2} {3} {4} {5}". \
                  format("Size".rjust(10, ' '), \
                         "Max".rjust(10, ' '), \
                         "Min".rjust(10, ' '), \
                         "Mean".rjust(10, ' '), \
                         "Range".rjust(10, ' '), \
                         "Stdev".rjust(10, ' ')))
            print("Aggregates:  {0:10d} {1:10.4f} {2:10.4f} {3:10.4f} {4:10.4f} {5:10.4f}". \
                  format(Population, Max, Min, Average, Range, Stdev))
            print("================================= The Bell Curve =================================")  

            LabelString = "{0} {1}  {2}  {3}". \
                          format("Bin".ljust(8, ' '), \
                                 "Ranges".rjust(8, ' '), \
                                 "Frequency".rjust(8, ' '), \
                                 "Histogram".rjust(6, ' '))

            print(LabelString)
            print("----------------------------------------------------------------------------------")
            
            #Let's Paint a Histogram
            sqlChart = "select * from Bins order by Bin asc"
            for rowChart in cur.execute(sqlChart):
                if (rowChart['Bin'] == 7):
                    #Bin 7 is not really a bin, but where we place the values that did not fit into the
                    #Normal Distribution. This script was tested against Excel's Bell Curve Example
                    #https://support.microsoft.com/en-us/kb/213930
                    #and produces the same results. Feel free to test it.
                    BinName = "More"
                    ChartString = "{0:<6} {1:<10} {2:10.0f}". \
                                format(BinName, \
                                        "", \
                                        More)
                else:
                    BinName = (rowChart['Bin'] + 1)
                    #Scale the Chart
                    fPercent = ((float(rowChart['Frequency']) / float(Records) * 100))
                    iPrecent = int(math.ceil(fPercent))
                
                    ChartString = "{0:<6} {1:10.4f} {2:10.0f} {3}". \
                                  format(BinName, \
                                         rowChart['Val'], \
                                         rowChart['Frequency'], \
                                         "".rjust(iPrecent, '#'))
                print(ChartString)
                
            print("**********************************************************************************")

            #Commit to Database
            con.commit()

            #Clean Up
            cur.close()
            con.close()

    except Exception as EXBellCurve:
        pass
        TraceInfo = traceback.format_exc()       
        raise Exception(TraceInfo)  

            
print("**********************************************************************************")
print("Using SQLite, leverage the create_aggregate(), and SQL's Between Operator to")
print("calculate distribution, then Demonstrate and Store the results in a Database.\n")
print("This demo creates a Population of values between -1 and 1 generated with a")
print("Pseudo Random Number Generator. The Mean value should be very close to zero,")
print("with a Range value of about 2, contained within a Standard Distribution.")

PythonVersion = sys.version_info[0] 
  
#Population = 2             #To Test Insufficient Population Size 
Population = (16 ** 2)      #Population of 256
#Population = (64 ** 2)     #Population of 4096
#Population = (128 ** 2)    #Population of 16384
#Population = (256 ** 2)    #Population of 65536
#Population = (1024 ** 2)   #Population of 1048576

Histogram(Population)       #Do a Distribution Analysis

if (PythonVersion == 3):
    kb = input("Press Enter to continue...")
else:
    kb = raw_input("Press Enter to continue...")    

关于脚本的几点说明。

脚本将演示如何向 SQLite3 数据库添加标准差函数。为了演示,脚本将首先运行一个伪随机数生成器来构建一个待分析的数据集。

此脚本已在 Windows 和 Linux 平台以及 Python 2.6、3.4 上进行了测试。

关于钟形曲线的几点说明。

由于分布分析是钟形曲线的核心,我们将生成一个足够大的数据集,其中包含从 -11 的随机种子数。这将确保一个标准分布,并且结果易于解释。

基本上,如果您生成一个从 -11 的长系列数字,那么期望该数据集产生的平均值将接近 0,范围非常接近 2。结果本质上只是显示了 Python 随机数生成器实际上有多随机。

直方图没有官方的 bin 分配规则。此演示使用基于 Excel 的数据分析附加组件、直方图功能的 bin 分配规则。
 


下载脚本


https://support.microsoft.com/en-us/kb/213930

 

© . All rights reserved.