使用 C# 查找数据中的组 - 凝聚聚类

Rodrigo Costa Camargos

4.96/5 (17投票s)

2016 年 8 月 28 日

CPOL

10分钟阅读

42696

3697

在 C# 中实现著名的凝聚聚类算法。

引言

在数据集中查找数据聚类是一项具有挑战性的任务，因为算法通常取决于所采用的簇间距离以及所使用的簇直径的定义。

本文介绍了如何使用 C# 编程语言来使用和实现一种著名的层次凝聚聚类算法 AGNES。该算法以及本文中呈现的所有代码都已整合到此项目中：https://github.com/rodrigocostacamargos/SharpCluster.NET

图 1. AggloCluster

AggloCluster（图 1）是一个 Windows 窗体应用程序，使用户能够执行 SharpCluster.NET 库提供的聚类算法。我们将使用此应用程序来执行本文所述的层次凝聚聚类算法。其源代码可以在 https://github.com/rodrigocostacamargos/AggloCluster.NET 中找到。

重要提示：只有本文中描述的代码（及其依赖项）可供下载（SharpCluster.zip）。如果您想探索 AggloCluster 的所有功能，则必须从 Github 上的 AggloCluster/Cluster.NET 项目下载。

背景

机器学习 (ML) 是人工智能 (AI) 的一个子领域，主要侧重于对新形式和计算算法的研究和提出，旨在提供理论支持以及实现计算机的自动学习。在过去的几十年里，已经提出了、讨论和实现了许多关于如何实现自动学习的想法。在 ML 可以实现的众多方法中，所谓的聚类算法是一类特殊的（无监督）算法，其目的是将数据点集组织成群组，以数据探索过程为目标。在文献中，人们可以找到许多提出新聚类算法以及不同聚类分类法的工作。尽管分类法旨在将这些算法组织成类别，但由于它们通常采用不同的分组标准，因此它们并不一定相互兼容。

有效的聚类技术被认为是一项挑战，这主要是因为不存在外部监督，这意味着对点集的内部结构（如空间分布、体积、密度、簇的几何形状等）一无所知。在这种情况下，自动学习成为一项探索性任务，旨在识别哪些数据点集在统计上是可分离的（或不可分离的），哪些是最明显的簇，以及它们与旨在区分的内容有何关联，试图仅根据它们的数据描述（通常是属性值向量）来揭示数据的底层结构。

简单地说，聚类算法试图找到数据点群组，使得群组中的数据点彼此相似（或相关），而与其他群组中的数据点不同（或不相关）。本文所述聚类算法采用的相似性概念是欧几里得距离在两个数据点 (Pi 和 Pj) 之间定义的邻近度量。

因此，聚类算法的主要目标是最小化簇内距离并最大化簇间距离，如图 2 所示。

图 2. 同一簇内数据点之间的距离以及不同簇数据点之间的距离。图片摘自 Tan, Steinbach 和 Kumar 的幻灯片。

为了更好地直观理解聚类真正意味着什么以及聚类算法的作用，使用代码部分将解释如何使用 AggloCluster 和 SharpCluster.NET 库执行聚类任务。您将能够获得一些聚类结果，如图 3 和图 4 所示，同时理解其背后的代码。

图 3 - 聚合数据集上的聚类结果（平均链接 UPGMA 聚类策略）。

图 4 - Smile 数据集上的聚类结果（单链接聚类策略）。

图 3 和图 4 分别显示了 Aggregation.arff 和 Smile.arff 数据集上的聚类结果。这些数据集文件包含在 AggloCluster.zip 文件中。层次凝聚算法用来产生该结果的聚类策略（单链接、全链接和平均链接）将在下一节中描述。

AGNES (AGglomerative NESting) 算法

对于包含 N 个待聚类数据点的给定数据集，层次凝聚聚类算法通常从 N 个簇开始（每个单独的数据点都是一个簇）；算法通过合并两个单个簇形成一个更大的簇继续进行，直到获得包含所有 N 个数据点的单个簇。显然，该算法可以有其他停止标准，例如当获得包含用户定义的簇数 (k) 的聚类时停止。

图 5 展示了 AGNES 算法的高层伪代码，该算法在每次迭代时，根据迄今为止形成的簇之间的最短欧几里得距离选择两个簇进行合并。文献中发现的许多层次凝聚算法可以根据簇间距离的定义方式进行组织，即使用什么定义来计算所有簇对之间的最短距离。在各种方式中，有三种特别受欢迎，并在下面定义。给定两个簇 X 和 Y，令 d(x,y) 表示两个数据点之间的距离。

(1) 单链接，两个簇 X 和 Y 之间的距离是两个数据点 x（属于 X）和 y（属于 Y）之间的最短距离，形式上由公式 (1) 表示。

(2) 全链接，两个簇 X 和 Y 之间的距离是两个数据点 x（属于 X）和 y（属于 Y）之间的最远距离，形式上由公式 (2) 表示。

(3) 平均链接，其中两个簇之间的距离是每个簇中数据点到另一个簇中每个点的平均距离，形式上由公式 (3) 表示。

图 5. AGNES 伪代码的自定义版本

为了找到最近的簇对（即最小距离），AGNES 在不相似性矩阵中查找最低的距离值。不相似性矩阵 (DM) 始终在更新，因为算法在每次迭代时都会将最近的簇对合并成一个新簇，而该新簇是矩阵的新条目。更正式地说，在每个级别 t，当两个簇合并成一个时，不相似性矩阵 DM 的大小变为 (N - t) x (N - t)。DMt 由 DMt-1 通过 (a) 删除与合并簇对应的行和列，以及 (b) 添加包含新形成的簇与旧簇之间距离的新行和新列而得到。

前面描述的矩阵更新过程可能听起来令人困惑，但理解实现它的代码并不难，您将在下一节中看到。为了让您清楚不相似性矩阵的样子，请参见表 1 和表 2。

表 1. 数据集矩阵

为了获得表 2 中显示的值，AGNES 计算表 1 中呈现的所有数据点的欧几里得距离。

表 2 - 不相似性矩阵

查看不相似性矩阵，您会发现最低的距离值是 0.41（不考虑对角线元素），AGNES 将开始合并数据点 2 和 4，形成一个新簇。

Using the Code

在本节中，我们将使用 AggloCluster 来执行 SharpCluster.NET 库提供的 AGNES 算法实现。因此，首先下载 SharpCluster.zip，解压文件并编译源代码（它将生成新的 SharpCluster.dll 和 SharpCluster.pdb 文件）。其次下载 AggloCluster.zip，解压文件并将 SharpCluster.dll 和 SharpCluster.pdb 替换为新的文件。然后，执行 AggloCluster.exe 并执行以下步骤：

在“预处理”选项卡中，单击“打开文件”，然后选择 AggloCluster.zip 文件附带的任何数据集文件；
接下来，单击“验证”选项卡，然后单击“AGNES”选项卡；
然后，从下拉列表中选择四种聚类策略之一；
输入簇的数量（COP.arff 有三个簇，Aggregation.arff 有七个簇，Simle.arff 有四个簇）；
最后，单击“开始聚类”按钮。单击“开始聚类”按钮将执行 ExecuteClustering 方法。（如果您想调试，请在 Visual Studio 中打开 SharpCluster.csproj 并将 AggloCluster 进程附加到它）。

如前所述，AggloCluster 调用第一个方法是 SharpCluster.csproj 中 Agnes 类中的 ExecuteClustering 方法。

public Clusters ExecuteClustering(ClusterDistance.Strategy strategy, int k)
{
    // Step 1
    // build a clustering only with singleton clusters
    this._BuildSingletonCluster();
    
    // Step2
    //build the dissimilarity matrix
    this._BuildDissimilarityMatrixParallel();
    
    // Step 3
    // build the hierarchical clustering 
    this.BuildHierarchicalClustering(_clusters.Count(), strategy, k);
    
    return _clusters; //represents the clustering data structure
}

在创建由单个簇组成的初始聚类（步骤 1）后，通过并行执行所有必要的计算来构建包含所有簇对之间距离的不相似性矩阵（步骤 2）。可以顺序执行此计算；然而，我发现并发实现是可行的，以提高算法的整体性能。

private void _BuildDissimilarityMatrixParallel()
{
    double distanceBetweenTwoClusters;
    _dissimilarityMatrix = new DissimilarityMatrix();
    
    Parallel.ForEach(_ClusterPairCollection(), clusterPair =>
    {
        distanceBetweenTwoClusters = 
           ClusterDistance.ComputeDistance(clusterPair.Cluster1, clusterPair.Cluster2);
        _dissimilarityMatrix.AddClusterPairAndDistance
                             (clusterPair, distanceBetweenTwoClusters);
    });
}

_ClusterPairCollection 方法仅构建簇对集合。簇对和距离是不相似性矩阵的条目。

private IEnumerable<ClusterPair> _ClusterPairCollection()
{
    ClusterPair clusterPair;
    
    for (int i = 0; i < _clusters.Count(); i++)
    {
        for (int j = i + 1; j < _clusters.Count(); j++)
        {
            clusterPair = new ClusterPair();
            clusterPair.Cluster1 = _clusters.GetCluster(i);
            clusterPair.Cluster2 = _clusters.GetCluster(j);
            
            yield return clusterPair;
        }
    }
}

在我们深入研究 Agnes 类的主方法之前，了解 DissimilarityMatrix 类的一些方法的工作原理很重要。首先，让我们看看 GetClusterPairDistance 方法。

public double GetClusterPairDistance(ClusterPair clusterPair)
{
   double clusterPairDistance = Double.MaxValue;
   if (_distanceMatrix.ContainsKey(clusterPair))
           clusterPairDistance = _distanceMatrix[clusterPair];
        else
           clusterPairDistance = _distanceMatrix[new ClusterPair
                                 (clusterPair.Cluster2, clusterPair.Cluster1)];
           
        return clusterPairDistance;
}

_distanceMatrix 是一个 ConcurrentDictionary，而 ClusterPair 是它的键。ClusterPair 由两个 Cluster 对象（cluster1 和 cluster2）组成，每个 Cluster 对象都有自己的 ID。因此，ClusterPair 有两个 ID，所以字典有两个键。这里的一个问题是 ClusterPair 可以由 cluster1.id = 5 和 cluster2.id = 6 或 cluster1.id = 6 和 cluster2.id = 5 组成。这就是为什么我们有 if else 语句。另一个问题是如何实现一个带有两个键的字典？我找到的答案是重写 Equals 和 GetHashCode 方法。

 public class EqualityComparer : IEqualityComparer<ClusterPair>
    {
        public bool Equals(ClusterPair x, ClusterPair y)
        {
            return x._cluster1.Id == y._cluster1.Id && 
                   x._cluster2.Id == y._cluster2.Id;
        }

        public int GetHashCode(ClusterPair x)
        {
            return x._cluster1.Id ^ x._cluster2.Id;
        }
    }

DissimilarityMatrix 类的 GetClosestPair 方法代码非常直接，它只是在矩阵中查找最低的距离值。我没有将字典值复制到 List 中进行排序然后获取第一个值，或者做类似的事情，而是更喜欢用老式的方法来查找未排序数组中的 min 值。我必须老实告诉你，我代码的第一行是一个单行 lambda 表达式，忽略了它的 O(n log n) 操作（平均情况）。当然，有一个更好的方法可以做到这一点，比如在添加/删除距离值时存储 min 值，这样 min 值就可以在恒定时间 O(1) 内找到，但我只是没有这样做。

        public ClusterPair GetClosestClusterPair()
        {
            double minDistance = double.MaxValue;
            ClusterPair closestClusterPair = new ClusterPair();

            foreach (var item in _distanceMatrix)
            {
                if(item.Value < minDistance)
                {
                    minDistance = item.Value;
                    closestClusterPair = item.Key;
                }
            }

            return closestClusterPair;
        }

现在让我们看看 ClusterDistance 类的 ComputeDistance 方法。此方法棘手的部分与平均链接 UPGMA 距离有关。我们必须跟踪集群中数据点的总数量（深入整个集群层次结构），并且我必须实现一个递归方法来完成此操作（这里还有另一个提高性能的机会）。

public static double ComputeDistance
(Cluster cluster1, Cluster cluster2, 
         DissimilarityMatrix dissimilarityMatrix, Strategy strategy)
{
    double distance1, distance2, distance = 0;
    //get the distance between cluster1 and subcluster0 of the cluster2
    distance1 = dissimilarityMatrix.GetClusterPairDistance
                (new ClusterPair(cluster1, cluster2.GetSubCluster(0)));
    //get the distance between cluster1 and subcluster1 of the cluster cluster2
    distance2 = dissimilarityMatrix.GetClusterPairDistance
                (new ClusterPair(cluster1, cluster2.GetSubCluster(1)));
    
    switch (strategy)
    {
        case Strategy.SingleLinkage: distance = _MinValue(distance1, distance2); break;
        case Strategy.CompleteLinkage: distance = _MaxValue(distance1, distance2); break;
        case Strategy.AverageLinkageWPGMA: distance = (distance1 + distance2) / 2; break;
        case Strategy.AverageLinkageUPGMA:
            distance = ((cluster2.GetSubCluster(0).TotalQuantityOfPatterns * distance1) / 
            cluster2.TotalQuantityOfPatterns) + 
            ((cluster2.GetSubCluster(1).TotalQuantityOfPatterns * distance2) / 
              cluster2.TotalQuantityOfPatterns);
            break;
        default: break;
    }
    
    return distance;
}

下一个方法负责更新不相似性矩阵，如上一节末尾所述。确保删除旧的距离值以及旧的簇（即合并的簇）很重要。

private void _UpdateDissimilarityMatrix
        (Cluster newCluster, ClusterDistance.Strategy strategie)
{
    double distanceBetweenClusters;
    for (int i = 0; i < _clusters.Count(); i++)
    {
        // compute the distance between old clusters to the new cluster
        distanceBetweenClusters = ClusterDistance.ComputeDistance
        (_clusters.GetCluster(i), newCluster, _dissimilarityMatrix, strategie);
        // insert the new cluster's distance
        _dissimilarityMatrix.AddClusterPairAndDistance
        (new ClusterPair(newCluster, _clusters.GetCluster(i)), distanceBetweenClusters);
        //remove all old distance values of the old clusters 
        //(subclusters of the newcluster)
        _dissimilarityMatrix.RemoveClusterPair(new ClusterPair
                (newCluster.GetSubCluster(0), _clusters.GetCluster(i)));
        _dissimilarityMatrix.RemoveClusterPair(new ClusterPair
                (newCluster.GetSubCluster(1), _clusters.GetCluster(i)));
    }
    
    // finally, remove the distance of the old cluster pair
    _dissimilarityMatrix.RemoveClusterPair(new ClusterPair
                  (newCluster.GetSubCluster(0), newCluster.GetSubCluster(1)));
}

最后，BuildHierarchicalClustering 递归方法只是调用前面讨论的方法，并在簇的数量等于 k（即用户定义的簇数量）时停止。

private void BuildHierarchicalClustering
(int indexNewCluster, ClusterDistance.Strategy strategy, int k)
{
    ClusterPair closestClusterPair = this._GetClosestClusterPairInDissimilarityMatrix();
    Cluster newCluster = new Cluster();
    newCluster.AddSubCluster(closestClusterPair.Cluster1);
    newCluster.AddSubCluster(closestClusterPair.Cluster2);
    newCluster.Id = indexNewCluster;
    newCluster.UpdateTotalQuantityOfPatterns();
       
    _clusters.RemoveClusterPair(closestClusterPair);
    _UpdateDissimilarityMatrix(newCluster, strategy);
    _clusters.AddCluster(newCluster);
    closestClusterPair = null;
        
    if (_clusters.Count() > k)
        this.BuildHierarchicalClustering(indexNewCluster + 1, strategy, k);
}

实际上，在 AggloCluster 输出如图 3 和图 4 所示的聚类结果之前，还需要进行一个步骤；有必要将层次聚类结构（AGNES 是一种层次聚类算法）转换为具有 k 个分区的扁平化聚类。

public Cluster[] BuildFlatClustersFromHierarchicalClustering(Clusters clusters, int k)
{
    Cluster[] flatClusters = new Cluster[k]; 
    for (int i = 0; i < k; i++)
    {
        flatClusters[i] = new Cluster();
        flatClusters[i].Id = i;
        foreach (Pattern pattern in clusters.GetCluster(i).GetAllPatterns())
            flatClusters[i].AddPattern(pattern);
    }
  
    return flatClusters;
}

关注点

机器学习是一个非常有趣的学科，但是，如果您没有扎实的数学/统计学背景，可能会难以理解。为了克服这一点，我找到了一种方法来理解模式识别书籍中描述的许多花哨的数学符号和概念，方法是做一些代码并将其翻译成我能理解的语言。因此，我可以说我对某些无监督机器学习算法的工作原理有更直观的认识，我希望在阅读完本文后，您也能获得同样的直观认识。

如果您对 SharpCluster.NET/AggloCluster 项目感兴趣，请随时在 Github 上加入我。那里有很多工作要做，从提高算法性能到构建更友好的用户界面。