Multi-Class Logistic Classifier in Python

pi19404

22 September 2014

CPOL

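A supervised ML algorithm for multi-class classification.

Introduction

This article covers the basics of the multi-class logistic regression classifier and its implementation in Python.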

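Background

Logistic regression is a discriminative probabilistic classification model that can be used to predict the probability of an event occurring.
It is a supervised learning algorithm that can be applied to binary or multinomial classification problems in which the classes are exhaustive and mutually exclusive.
The input to the classifier is a real-valued vector of fixed dimension, and the output is the probability that the input vector belongs to a specified class.
Logistic regression is a type of regression that predicts the probability of an event by fitting the data to a logistic function.
It can also be viewed as a linear model for classification.
The logistic function is defined as

$f(z) = \frac{1}{1 + e^{-z}}$

For any real input z, the value of the logistic function lies in the open interval (0, 1).

It can therefore be used to represent a cumulative distribution function.

A function that applies the logistic function element-wise to a real-valued input vector "X" is defined in Python as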

   import numpy

   # function applies the logistic function element-wise to a real-valued input vector X
   def sigmoid(X):
      # numpy.exp replaces the undefined name `e` used in the original listing
      den = 1.0 + numpy.exp(-1.0 * X)
      d = 1.0 / den
      return d
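
As a quick sanity check, the following sketch evaluates the logistic function at a few points. It redefines `sigmoid` with `numpy.exp` so the snippet runs on its own; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(X):
    # element-wise logistic function 1 / (1 + exp(-X))
    return 1.0 / (1.0 + np.exp(-1.0 * X))

z = np.array([-5.0, 0.0, 5.0])
values = sigmoid(z)
# every output lies strictly between 0 and 1, and sigmoid(0) is exactly 0.5
print(values)
```

The outputs increase with the input and never reach 0 or 1, which is what lets them be read as probabilities.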

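A logistic regression classifier is parameterized by a weight matrix and a bias vector, $W, b$.
Classification is performed by projecting a data point onto a set of hyperplanes; the distance of the point from each hyperplane determines the class membership probability.
Mathematically, this can be expressed as $ \begin{eqnarray*} P(Y=i|x, W,b) =\frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \\ \end{eqnarray*}$
The logistic classifier for each class $y_i$ is characterized by its own set of parameters $W_i, b_i$.

The function above is also known as the softmax function. The logistic function applies to binary classification problems, while the softmax function applies to multi-class classification problems.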

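    import numpy

    # softmax function for multi-class logistic regression
    # x is (n_samples, N), W is (M, N), b has M entries; returns (n_samples, M)
    def softmax(W,b,x):
       vec=numpy.dot(x,W.T)               # class scores W_i x for every sample
       vec=numpy.add(vec,b)               # add the bias of each class
       vec1=numpy.exp(vec)
       res=vec1.T/numpy.sum(vec1,axis=1)  # normalize each sample's scores
       return res.T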

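The parameters (W, b) are the weight matrix and the bias vector, respectively. Let N be the dimension of the input vector and M the number of classes.

W has dimension $M \times N$ and b has dimension $M \times 1$. For a single input vector, the output of the softmax function is an $M \times 1$ vector of class probabilities.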
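
These shapes can be checked on a small example. The sketch below is not the article's `softmax`; it adds one common modification, subtracting each row's maximum score before exponentiating, which leaves the probabilities unchanged but avoids overflow for large scores. The name `stable_softmax` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def stable_softmax(W, b, x):
    # x: (n_samples, N), W: (M, N), b: (M,) -> probabilities of shape (n_samples, M)
    scores = np.dot(x, W.T) + b
    # shifting by the row maximum does not change the ratio of exponentials
    scores = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

rng = np.random.RandomState(0)
M, N = 3, 4                      # 3 classes, 4 input dimensions
W = rng.randn(M, N)
b = rng.randn(M)
x = rng.randn(5, N) * 1000.0     # deliberately large inputs

probs = stable_softmax(W, b, x)
print(probs.shape)               # (5, 3)
print(probs.sum(axis=1))         # each row sums to 1 up to rounding
```

With inputs this large, exponentiating the raw scores directly would likely overflow, which is the reason for the shift.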

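These parameters are used to compute the class probabilities.
Given an unknown vector (x), the prediction is performed as follows:
$\begin{eqnarray*} y_{pred} = \arg\max_i P(Y=i|x,W,b) \\ y_{pred} = \arg\max_i \frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \end{eqnarray*}$
The Python code for the prediction function is as follows: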

    # function predicts the probability of input vector 
    # the output y is MX1 vector (M is no of classes) 
    def predict(self,x):
        y=softmax(self.W,self.b,x);
        return y;

The Python code for the classification function is as follows:

    # function returns the label corresponding to the input index y
    def lable(self,y):
        return self.labels[y];

    # function classifies the input vector x into one of the output labels
    # if the input is an NxD matrix, the output is a list of N labels

    def classify(self,x):
        result=self.predict(x);
        indices=result.argmax(axis=1);
        # convert indices to labels (a list on both Python 2 and Python 3)
        lablels=[self.lable(i) for i in indices];
        return lablels;
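
To make the index-to-label step concrete, here is a minimal standalone sketch of the same idea using NumPy indexing instead of a per-element lookup. The probability matrix and label names are made up for the example.

```python
import numpy as np

# hypothetical class labels, in the order used by the columns of `probs`
labels = np.array(["cat", "dog", "bird"])

# class probabilities for 4 samples (each row sums to 1)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.4, 0.4, 0.2]])

indices = probs.argmax(axis=1)   # most probable class per row
predicted = labels[indices]      # map indices to labels in one step
print(predicted)                 # ['cat' 'bird' 'dog' 'cat']
```

Note that `argmax` returns the first index when two classes tie, as in the last row.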

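The validation function takes the validation data and reports the model's prediction error (one minus the accuracy) on that dataset.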

    # validation function to test the accuracy of the model
    # met refers to sklearn.metrics

    def validate(self,x,y):
        # classify the input vector x
        result=self.classify(x);
        y_test=y
        # compute the prediction score
        accuracy=met.accuracy_score(y_test,result)
        # compute the error in prediction
        error=1.0-accuracy;
        # print(...) with one argument works on both Python 2 and Python 3
        print("Validation error   " + repr(error))
        return error;
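
The value reported by `validate` is one minus the fraction of correct predictions. A minimal NumPy-only sketch of that computation, with made-up labels, is shown below; `sklearn.metrics.accuracy_score` returns the same accuracy value for these inputs.

```python
import numpy as np

y_true = np.array([3, 1, 4, 1, 5, 9, 2, 6])
y_pred = np.array([3, 1, 4, 7, 5, 9, 2, 0])   # two mistakes out of eight

accuracy = np.mean(y_true == y_pred)   # fraction of matching labels
error = 1.0 - accuracy
print("Validation error   " + repr(error))
```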

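Given a set of labeled training data $\{X_i, Y_i\}$, where $i \in \{1,\ldots,N\}$, we need to estimate these parameters.

The basic components of a machine learning framework are:

Model (logistic regression) - a parameterized family of functions
Loss function - a quantitative measure of performance
Learning algorithm - the training criterion
Optimization algorithm - the numerical method used to optimize the training criterion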

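Learning Algorithm

In general, any machine learning algorithm consists of a model, a loss function, a learning algorithm, and an optimizer.
The learning algorithm estimates the parameters of the model. It does so by optimizing an objective function with an optimization technique. For a machine learning algorithm, the objective function to be minimized is called the loss function.
An optimization technique works by defining an objective function and then finding the model parameters that maximize or minimize it.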

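Loss Function

Ideally, we would like to compute the parameters that minimize the (0-1) loss
$\begin{eqnarray*} \ell_{0,1} = \sum_{i=1}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}} \\ f(x)= \arg\max_k P(Y=y_k |x,\theta) \end{eqnarray*}$
$P(Y=y_k |x,\theta)$ is modeled using the softmax (multi-class logistic) function.
The (0-1) loss function is not differentiable, so optimizing it for large models is computationally infeasible.
In the present application, the negative log-likelihood is used as the loss function.
The optimal parameters are learned by minimizing this loss function.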
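
The difference between the two losses can be seen on a toy example. In the sketch below (probabilities invented for illustration), two hypothetical models make identical predictions and so have the same (0-1) loss, but the negative log-likelihood distinguishes the model that assigns higher probability to the correct classes.

```python
import numpy as np

y = np.array([0, 1, 1])   # true classes of three samples

# predicted class probabilities from two hypothetical models
probs_a = np.array([[0.6, 0.4], [0.4, 0.6], [0.45, 0.55]])
probs_b = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]])

def zero_one_loss(probs, y):
    # number of samples whose most probable class is wrong
    return int(np.sum(probs.argmax(axis=1) != y))

def negative_log_likelihood(probs, y):
    # mean of -log(probability assigned to the true class)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

print(zero_one_loss(probs_a, y), zero_one_loss(probs_b, y))
print(negative_log_likelihood(probs_a, y), negative_log_likelihood(probs_b, y))
```

Because the negative log-likelihood changes smoothly with the probabilities, it gives a gradient-based optimizer a direction to move in even when the (0-1) loss is flat.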

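Estimation Technique / Learning Algorithm

The parameters are estimated using a technique called maximum likelihood estimation.
This method chooses the parameters that maximize the likelihood of the training data $\mathcal{D}$ under the model.
The data samples are assumed to be independent, so the probability of the whole set is the product of the probabilities of the individual samples.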

$\begin{eqnarray*} \theta^{*} &=& \arg\max_{\theta} \prod_{i=1}^N P(Y=y_i | X=x_i,W,b) \\ &=& \arg\max_{\theta} \sum_{i=1}^N \log P(Y=y_i | X=x_i,W,b) \\ &=& \arg\min_{\theta} \left( - \sum_{i=1}^N \log P(Y=y_i | X=x_i,W,b) \right) \\ L(\theta=\{W,b\},\mathcal{D}) &=& - \sum_{i=1}^N \log P(Y=y_i | X=x_i,W,b) \end{eqnarray*}$

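Note that the likelihood of the correct class is not the same as the number of correct predictions.
The negative log-likelihood can be viewed as a differentiable surrogate for the (0-1) loss function.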
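
Working with the sum of logarithms rather than the product of probabilities is also a practical necessity, because the product underflows in floating point once the dataset is large enough. A small sketch with synthetic probabilities:

```python
import numpy as np

rng = np.random.RandomState(0)

# a few probabilities: the product and the sum of logs agree
p_small = rng.uniform(0.1, 0.9, size=10)
nll_from_product = -np.log(np.prod(p_small))
nll_from_sum = -np.sum(np.log(p_small))

# many probabilities: the product underflows to 0.0, the log-sum stays finite
p_large = rng.uniform(0.1, 0.9, size=5000)
product_large = np.prod(p_large)
nll_large = -np.sum(np.log(p_large))
print(nll_from_product, nll_from_sum, product_large, nll_large)
```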

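Optimizer - Gradient-Based Minimization

Consider a single data sample for the gradient computation, and let the sample belong to class $i$. Each output class has a corresponding weight vector $W_i$ and bias $b_i$.

We need to compute the gradient with respect to every element of $W_i$ and $b_i$.

If N is the dimension of the input vector and M is the number of output classes, then $W$ is an $M \times N$ matrix whose rows are the $W_i$ (each $1 \times N$), and $b$ is an $M \times 1$ vector with entries $b_i$.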

In the present application, a gradient-based method is used for the minimization.
The cost function is expressed as
$\begin{eqnarray*} L(\theta,\mathcal{D}) = - \log P(Y=y_i | X=x_i,W,b) \\ L(\theta,\mathcal{D}) = - \log \frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \\ L(\theta,\mathcal{D}) = - \log {e^{W_i x + b_i}} + \log {\sum_j e^{W_j x + b_j}} \\ L(\theta,\mathcal{D}) = - (W_i x + b_i) + \log {\sum_j e^{W_j x + b_j}} \\ \end{eqnarray*}$

Writing $\ell(\mathbf{w}) = -L$ for the log-likelihood of the sample and omitting the bias terms for brevity, the gradient with respect to the weights $\mathbf{w}_i$ of the true class $i$ is (here $K = M$ is the number of classes)
$\begin{align} \nabla_{\mathbf{w}_i}\ell(\mathbf{w}) &= \frac{\partial}{\partial \mathbf{w}_i}\left(\mathbf{w}_{i}^T\mathbf{x} - \log\left( \sum_{k=1}^K \exp(\mathbf{w}_k^T\mathbf{x}) \right) \right) \\ &= \frac{\partial}{\partial \mathbf{w}_i}\mathbf{w}_i^T\mathbf{x} - \frac{\partial}{\partial \mathbf{w}_i}\log\left( \sum_{k=1}^K \exp(\mathbf{w}_k^T\mathbf{x}) \right) \\ &= \mathbf{x} - \frac{\mathbf{x}\exp(\mathbf{w}_i^T\mathbf{x})}{\sum_{k=1}^K \exp(\mathbf{w}_k^T\mathbf{x})} \end{align}$
For all other $W_j$, where $j \ne i$,
$\begin{align} \nabla_{\mathbf{w}_j}\ell(\mathbf{w}) &= \frac{\partial}{\partial \mathbf{w}_j}\left(\mathbf{w}_{i}^T\mathbf{x} - \log\left( \sum_{k=1}^K \exp(\mathbf{w}_k^T\mathbf{x}) \right) \right) \\ &= \frac{\partial}{\partial \mathbf{w}_j}\mathbf{w}_i^T\mathbf{x} - \frac{\partial}{\partial \mathbf{w}_j}\log\left( \sum_{k=1}^K \exp(\mathbf{w}_k^T\mathbf{x}) \right) \\ &= - \frac{\mathbf{x}\exp(\mathbf{w}_j^T\mathbf{x})}{\sum_{k=1}^K \exp(\mathbf{w}_k^T\mathbf{x})} \end{align}$
The gradient of the loss $L$ is the negative of these expressions.
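
The gradient can be verified numerically. The sketch below compares the analytic gradient of the negative log-likelihood $L$ for one sample, whose row $k$ is $(P(Y=k|x) - 1_{k=i})\,x$, against a central finite-difference approximation. The bias is omitted for brevity, all values are random, and the helper names are illustrative rather than taken from the article's code.

```python
import numpy as np

def nll(W, x, i):
    # negative log-likelihood of one sample x with true class i (no bias term)
    scores = np.dot(W, x)
    scores = scores - scores.max()           # numerical stability
    return -(scores[i] - np.log(np.sum(np.exp(scores))))

def analytic_gradient(W, x, i):
    scores = np.dot(W, x)
    p = np.exp(scores - scores.max())
    p = p / p.sum()                          # softmax probabilities
    p[i] -= 1.0                              # subtract 1 for the true class
    return np.outer(p, x)                    # row k is (p_k - 1[k == i]) * x

rng = np.random.RandomState(0)
W = rng.randn(3, 4)
x = rng.randn(4)
i = 1

numeric = np.zeros_like(W)
eps = 1e-6
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        step = np.zeros_like(W)
        step[r, c] = eps
        numeric[r, c] = (nll(W + step, x, i) - nll(W - step, x, i)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_gradient(W, x, i)))
print(max_diff)
```

A very small `max_diff` indicates that the closed-form expression and the numerical derivative agree.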

The negative log-likelihood over the dataset can be computed in Python as:

    # function computes the negative log likelihood over the input dataset
    #  params is an optional argument to pass parameters to the classifier
    #  Useful for iterative optimization routines that evaluate the function,
    # like the scipy.optimize package
    def negative_log_likelihood(self,params):
        # args contains the training data
        x,y=self.args;

        self.update_params(params);
        # softmax output: one row of class probabilities per sample
        sigmoid_activation = pyvision.softmax(self.W,self.b,x);
        # select the probability of the true class of each sample; a tuple is used
        # because recent NumPy no longer treats a list of sequences as a multi-dimensional index
        index=(np.arange(np.shape(sigmoid_activation)[0]),y);
        p=sigmoid_activation[index]
        l=-np.mean(np.log(p));
        return l;

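The first term of the sum is affine and the second is the logarithm of a sum of exponentials, which is convex. The loss function is therefore convex, so any local minimum is also a global minimum.
We therefore compute the derivatives of the loss function $L(\theta,\mathcal{D})$ with respect to $\theta$, that is, $\partial{\ell}/\partial{W}$ and $\partial{\ell}/\partial{b}$.
Several gradient-based minimization methods exist, such as gradient descent, stochastic gradient descent, and conjugate gradient.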
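
The convexity claim can be spot-checked numerically. The sketch below evaluates the mean negative log-likelihood along the line segment between two random parameter vectors and counts violations of the convexity inequality. The data, the parameter layout (W followed by b), and the function name `mean_nll` are assumptions made for illustration; a check like this is evidence, not a proof.

```python
import numpy as np

def mean_nll(theta, X, y, n_classes):
    # theta holds W (n_classes x n_features) followed by b (n_classes)
    n_features = X.shape[1]
    W = theta[:n_classes * n_features].reshape(n_classes, n_features)
    b = theta[n_classes * n_features:]
    scores = np.dot(X, W.T) + b
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(y)), y])

rng = np.random.RandomState(0)
X = rng.randn(20, 4)
y = rng.randint(0, 3, size=20)
theta1 = rng.randn(3 * 4 + 3)
theta2 = rng.randn(3 * 4 + 3)

# convexity: L(t*a + (1-t)*b) <= t*L(a) + (1-t)*L(b) for every t in [0, 1]
violations = 0
for t in np.linspace(0.0, 1.0, 11):
    lhs = mean_nll(t * theta1 + (1 - t) * theta2, X, y, 3)
    rhs = t * mean_nll(theta1, X, y, 3) + (1 - t) * mean_nll(theta2, X, y, 3)
    if lhs > rhs + 1e-12:
        violations += 1
print(violations)   # 0
```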

The Python code to compute the gradient is:

    # function to compute the gradient of parameters for a single data sample
    # out is the softmax output, y the true class index, x the input vector
    def compute_gradients(self,out,y,x):
        # copy so that the caller's softmax output is not modified in place
        out=np.reshape(out,(np.shape(out)[0],1)).copy();
        out[y]=out[y]-1;
        W=out*x.T;
        res=np.vstack((W.T,out.flatten()))
        return res;

    # function to compute the gradient of loss function over all input samples
    #    params is optional input parameter passed to the classifier, which is
    #   useful in cases of iterative optimization routines, added for compatibility with
    #    the scipy.optimize package
    def gradients(self,params=None):
        # args contains the training data
        x,y=self.args;
        self.update_params(params);
        sigmoid_activation = pyvision.softmax(self.W,self.b,x);
        # zip replaces itertools.izip, which exists only on Python 2
        e = [ self.compute_gradients(a,c,b) for a, c,b in zip(sigmoid_activation,y,x)]
        mean1=np.mean(e,axis=0);
        mean1=mean1.T.flatten();
        return mean1;
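
The per-sample loop above can also be written as a single matrix product. The following sketch, on random data, computes the averaged gradient both ways and shows that they agree. It is an alternative formulation for illustration, not the article's implementation, and `softmax_rows` is a helper defined only for this snippet.

```python
import numpy as np

def softmax_rows(scores):
    # row-wise softmax with the usual max shift for stability
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
n, N, M = 6, 4, 3
X = rng.randn(n, N)
y = rng.randint(0, M, size=n)
W = rng.randn(M, N)
b = rng.randn(M)

P = softmax_rows(np.dot(X, W.T) + b)

# per-sample loop, mirroring the structure of compute_gradients
grad_W_loop = np.zeros((M, N))
grad_b_loop = np.zeros(M)
for p, label, x in zip(P, y, X):
    delta_one = p.copy()
    delta_one[label] -= 1.0
    grad_W_loop += np.outer(delta_one, x) / n
    grad_b_loop += delta_one / n

# vectorized form: subtract a one-hot matrix, then use one matrix product
delta = P.copy()
delta[np.arange(n), y] -= 1.0
grad_W = np.dot(delta.T, X) / n
grad_b = delta.mean(axis=0)
print(np.abs(grad_W - grad_W_loop).max(), np.abs(grad_b - grad_b_loop).max())
```

For a dataset the size of MNIST, the vectorized form avoids a Python-level loop over tens of thousands of samples per gradient evaluation.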

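The goal of the gradient descent algorithm is to take small steps against the direction of the gradient in order to reach the global minimum.

The gradient descent algorithm therefore alternates between update and evaluation steps. In each iteration, the parameter values (W, b) are updated and the logistic loss function is then evaluated on the training dataset. This process is repeated until we are confident that the parameters obtained correspond to the global minimum of the negative log-likelihood function.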
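
A minimal sketch of this update-and-evaluate loop on synthetic data is shown below. It is plain batch gradient descent with a fixed, untuned step size, not the optimizer used later in the article; the data, the learning rate, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def loss_and_grads(W, b, X, y):
    # softmax probabilities, mean negative log-likelihood, and its gradients
    scores = np.dot(X, W.T) + b
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)
    n = len(y)
    loss = -np.mean(np.log(P[np.arange(n), y]))
    delta = P.copy()
    delta[np.arange(n), y] -= 1.0
    return loss, np.dot(delta.T, X) / n, delta.mean(axis=0)

# synthetic, well-separated 2-D data with three classes
rng = np.random.RandomState(0)
centers = np.array([[0.0, 4.0], [4.0, -2.0], [-4.0, -2.0]])
y = rng.randint(0, 3, size=150)
X = centers[y] + rng.randn(150, 2)

W = np.zeros((3, 2))
b = np.zeros(3)
learning_rate = 0.1          # assumed step size, not tuned
losses = []
for iteration in range(200):
    # evaluation step: loss and gradient at the current parameters
    loss, grad_W, grad_b = loss_and_grads(W, b, X, y)
    losses.append(loss)
    # update step: move against the gradient
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(losses[0], losses[-1])
```

With all parameters at zero, every class has probability 1/3, so the first recorded loss equals log(3); the later values show the loss falling as the parameters are updated.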

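Optimizer

Rather than using a custom implementation of gradient descent, we use the conjugate gradient algorithm provided by the fmin_cg function in the SciPy optimize package. The inputs to the optimizer are the function to evaluate, the initial parameters, and the function that computes the partial derivatives; the output is the set of optimized parameters that minimize the input function.

Callback functions are also specified; they run validation tests to compute the accuracy as training proceeds.

To use the SciPy optimize package, the functions we supply must accept the current values of the parameters being optimized as input. The functions evaluated at those values are the loss function and its gradient (or Hessian). The output of the optimization process is the optimized parameter values. The output of the derivative function is the vector of partial derivatives of the loss evaluated at the point given by the input parameter values.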
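
The calling convention can be illustrated with a small self-contained sketch that runs `scipy.optimize.fmin_cg` on a toy softmax regression problem. The synthetic data, the flat parameter layout (W followed by b), and the helper names are assumptions for illustration and do not reproduce the article's Optimizer class.

```python
import numpy as np
from scipy import optimize

rng = np.random.RandomState(0)
M, N = 3, 2
centers = np.array([[0.0, 4.0], [4.0, -2.0], [-4.0, -2.0]])
y = rng.randint(0, M, size=150)
X = centers[y] + 2.0 * rng.randn(150, N)

def unpack(params):
    # assumed layout: all of W (M x N) first, then b (M)
    return params[:M * N].reshape(M, N), params[M * N:]

def log_probabilities(params):
    W, b = unpack(params)
    scores = np.dot(X, W.T) + b
    scores = scores - scores.max(axis=1, keepdims=True)
    return scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))

def cost(params):
    # mean negative log-likelihood at the given flat parameter vector
    return -np.mean(log_probabilities(params)[np.arange(len(y)), y])

def gradient(params):
    # partial derivatives of cost, flattened in the same layout as params
    delta = np.exp(log_probabilities(params))
    delta[np.arange(len(y)), y] -= 1.0
    return np.concatenate([np.dot(delta.T, X).ravel(), delta.sum(axis=0)]) / len(y)

history = []
def callback(current_params):
    # fmin_cg calls this after each iteration with the current parameters
    history.append(cost(current_params))

initial = np.zeros(M * N + M)
best = optimize.fmin_cg(cost, initial, fprime=gradient, maxiter=5,
                        callback=callback, disp=False)
print(cost(initial), cost(best), len(history))
```

The optimizer only ever sees a flat vector, which is why both `cost` and `gradient` begin by unpacking it.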

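We define a class named Optimizer to perform the optimization. It currently supports the conjugate gradient and stochastic gradient descent algorithms.

This article covers the use of the conjugate gradient algorithm.

The optimization routine in the optimizer.py file that corresponds to the conjugate gradient algorithm is as follows: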

            
self.iter=0;
index=self.iter;
batch_size=self.batch_size;

# training data
x=self.training[0];
y=self.training[1];

# training data batch for each iteration
train_input=x[index * batch_size:(index + 1) * batch_size];
train_output=y[index * batch_size:(index + 1) * batch_size];
train_output=np.array(train_output,dtype=int);

self.args=(train_input,train_output);
args=self.args;
# setting the training data in LogisticRegression class
self.setdata(args);
# getting initial parameter from LogisticRegression class
self.params=self.init();

# print(...) with a single string works on both Python 2 and Python 3
print("**************************************")
print("starting with the optimization process")
print("**************************************")
print("Executing nonlinear conjugate gradient descent optimization routines ....")
res=optimize.fmin_cg(self.cost,self.params,fprime=self.gradient,maxiter=self.maxiter,callback=self.local_callback);
print("**************************************")
print("completed with the optimization process")
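
The batch selection in this listing is plain slice arithmetic. A minimal sketch on a toy array (sizes chosen arbitrarily) shows which rows each value of `index` selects:

```python
import numpy as np

x = np.arange(10).reshape(10, 1)   # 10 toy samples with one feature each
y = np.arange(10)
batch_size = 4

batches = []
n_batches = int(np.ceil(len(x) / float(batch_size)))
for index in range(n_batches):
    train_input = x[index * batch_size:(index + 1) * batch_size]
    train_output = y[index * batch_size:(index + 1) * batch_size]
    batches.append((train_input, train_output))
    print(index, train_output)

# the last batch is shorter because slicing stops at the end of the array
```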

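self.cost corresponds to the negative_log_likelihood method of the LogisticRegression class.

self.gradient corresponds to the gradients method of the LogisticRegression class.

self.callback corresponds to the callback method of the LogisticRegression class. It is invoked at each iteration to update the parameter values, compute the likelihood, and report other statistics that indicate how the optimization is progressing.

The scipy.optimize package computes the cost and gradient values by passing the parameter values to these functions, and performs the parameter updates based on the results they return. The parameter values are updated in LogisticRegression whenever the cost or gradient function is called. They are also updated on each callback iteration.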
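
Because `fmin_cg` works on a single flat parameter vector, the classifier has to convert between that vector and the pair (W, b). The article does not show `update_params` or `get_params`, so the following is only a sketch of one possible layout; the function names `pack` and `unpack` and the ordering (W first, then b) are assumptions.

```python
import numpy as np

def pack(W, b):
    # flatten W row by row and append b
    return np.concatenate([W.ravel(), b.ravel()])

def unpack(params, n_classes, n_dimensions):
    # inverse of pack: recover W (n_classes x n_dimensions) and b (n_classes)
    split = n_classes * n_dimensions
    W = params[:split].reshape(n_classes, n_dimensions)
    b = params[split:]
    return W, b

W = np.arange(6, dtype=float).reshape(2, 3)
b = np.array([10.0, 20.0])
params = pack(W, b)
W2, b2 = unpack(params, 2, 3)
print(params)
```

Whatever layout is chosen, the gradient function has to return its partial derivatives in the same order as the parameter vector.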

# the function performs training for the logistic regression classifier
def train(self,train,test,validate):
    if self.n_dimensions==0:
        self.labels=np.unique(train[1]);
        n_classes = len(self.labels)
        # the feature count comes from the training inputs
        # (the original listing read a name x that is not defined in this method)
        n_dimensions=np.shape(train[0])[1];
        self.initialize_parameters(n_dimensions,n_classes);
    # create the optimizer class
    opti=Optimizer.Optimizer(10,"CGD",1,600);
    # set the training and validation datasets
    opti.set_datasets(train,test,validate);
    # pass the cost, gradient and callback functions
    opti.set_functions(self.negative_log_likelihood,self.set_training_data,
         self.classify,self.gradients,self.get_params,self.callback);
    # run the optimizer
    opti.run();

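A class named LogisticRegression is defined; it encapsulates the methods used to train and test the multi-class logistic regression classifier.

Training Dataset

For demonstration, we use the MNIST dataset. The MNIST dataset consists of images of handwritten digits, split into 60,000 training samples and 10,000 test samples. The official training set of 60,000 is divided into 50,000 samples actually used for training and 10,000 validation samples. All digit images have been size-normalized and centered in a fixed-size image of 28 x 28 pixels. In the original dataset, each pixel of an image is represented by a value between 0 and 255, where 0 is black, 255 is white, and anything in between is a shade of gray.
The dataset can be found at http://deeplearning.net/data/mnist/mnist.pkl.gz.
The dataset is in pickle format and can be loaded with the Python pickle package.
It includes training, validation, and test sets.
The feature vectors have length 28 x 28 = 784, and there are 10 classes.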
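
A minimal loading sketch for Python 3 follows. It assumes the file unpickles to a tuple of three (inputs, labels) pairs in the order training, validation, test, which is how this file is commonly described; `load_pickle_dataset` is a hypothetical helper and is not the `LoadDataSets` class used by the article. To stay runnable without the real download, the snippet round-trips a tiny stand-in file with the same layout.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

def load_pickle_dataset(path):
    # read a gzip-compressed pickle and return whatever object it contains;
    # encoding="latin1" lets Python 3 read pickles written by Python 2
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle, encoding="latin1")

def make_split(n):
    # stand-in for one (inputs, labels) pair with 784 features per sample
    return (np.zeros((n, 784), dtype="float32"), np.zeros(n, dtype="int64"))

# write a tiny file with the assumed (train, validation, test) layout
path = os.path.join(tempfile.mkdtemp(), "toy_mnist.pkl.gz")
with gzip.open(path, "wb") as handle:
    pickle.dump((make_split(5), make_split(2), make_split(2)), handle)

train_set, valid_set, test_set = load_pickle_dataset(path)
print(train_set[0].shape, train_set[1].shape)   # (5, 784) (5,)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.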

#  MAIN FUNCTION 
if __name__ == "__main__":    


     model_name1="/home/pi19404/Documents/mnist.pkl.gz"
     data=LoadDataSets.LoadDataSets();
     [train,test,validate]=data.load_pickle_data(model_name1);
     
     x=train[0].get_value(borrow=True);
     y=train[1].eval();     
     train=[x,y];

     x=test[0].get_value(borrow=True);
     y=test[1].eval();
     test=[x,y];
     
     x=validate[0].get_value(borrow=True);
     y=validate[1].eval();
     validate=[x,y];
     
     classifier=LogisticRegression(0,0);
     classifier.Regularization=0;
     classifier.train(train,test,validate);

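Code for saving and loading model files, passing parameters to the optimization function, and so on has not been implemented yet and will be added in the future. The output of the training process is shown below. Without regularization, the conjugate gradient optimizer reaches about 79% validation accuracy (a validation error of about 0.21) in 5 iterations.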

... loading data  /home/pi19404/Documents/mnist.pkl.gz
number of dimensions :  784
number of classes  : 10
number of training samples : 50000
iteration   :  0
Loss function   :  1.98470824149
Validation error   0.39827999999999997
iteration   :  1
Loss function   :  1.35257458245
Validation error   0.32438
iteration   :  2
Loss function   :  1.0896595289
Validation error   0.27329999999999999
iteration   :  3
Loss function   :  0.928568922396
Validation error   0.23626000000000003
iteration   :  4
Loss function   :  0.811558448986
Validation error   0.20977999999999997
Warning: Maximum number of iterations has been exceeded
         Current function value: 0.811558
         Iterations: 5
         Function evaluations: 171
         Gradient evaluations: 171
(50000, 784) (50000,)

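Code

The code is also available in the GitHub repository.

The Python code for the logistic regression classifier can be found in the GitHub repository at https://github.com/pi19404/OpenVision/tree/master/ImgML/PyVision/LogisticRegression.py.
LoadDataSets.py contains the methods for loading datasets from pickle files, LogisticRegression.py is the main file that implements the multi-class logistic regression classifier, and Optimizer.py contains the methods that handle the optimization process.
You may need to install Python packages such as numpy and scipy.
You may also need to install the scikit-learn machine learning package; sklearn.metrics is used to compute the accuracy metric in the validate method of the LogisticRegression class.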