实现机器学习的分步指南 X - K-Means
易于实现的机器学习
引言
KMeans 是一种简单的聚类算法,它通过计算样本与质心之间的距离来生成聚类。K 是用户给定的聚类数量。在初始时,K 个聚类是随机选择的,并在每次迭代中进行调整以获得最佳结果。质心是其对应聚类中样本的均值,因此该算法被称为 K“均值”。
KMeans 模型
KMeans
标准的 KMeans 算法非常简单。将 K 个聚类表示为 ,每个聚类的样本数量表示为
。KMeans 的损失函数是
.
计算损失函数的导数
然后,使导数等于 0,我们可以得到
即,质心是其对应聚类中样本的均值。KMeans 的代码如下所示
def kmeans(self, train_data, k):
sample_num = len(train_data)
distances = np.zeros([sample_num, 2]) # (index, distance)
centers = self.createCenter(train_data)
centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
return centers, distances
其中 adjustCluster()
是在确定初始质心后调整质心的函数,旨在最小化损失函数。adjustCluster
的代码是
def adjustCluster(self, centers, distances, train_data, k):
sample_num = len(train_data)
flag = True # If True, update cluster_center
while flag:
flag = False
d = np.zeros([sample_num, len(centers)])
for i in range(len(centers)):
# calculate the distance between each sample and each cluster center
d[:, i] = self.calculateDistance(train_data, centers[i])
# find the minimum distance between each sample and each cluster center
old_label = distances[:, 0].copy()
distances[:, 0] = np.argmin(d, axis=1)
distances[:, 1] = np.min(d, axis=1)
if np.sum(old_label - distances[:, 0]) != 0:
flag = True
# update cluster_center by calculating the mean of each cluster
for j in range(k):
current_cluster =
train_data[distances[:, 0] == j] # find the samples belong
# to the j-th cluster center
if len(current_cluster) != 0:
centers[j, :] = np.mean(current_cluster, axis=0)
return centers, distances
二分 KMeans
由于 KMeans 可能会得到局部最优解,为了解决这个问题,我们引入了另一种 KMeans 算法,称为二分 KMeans。在二分 KMeans 中,所有样本最初被视为一个聚类,然后该聚类被分成两部分。然后,选择分割聚类的一部分再次进行二分,直到聚类数量为 K。我们根据最小的**平方误差和** (SSE) 来二分聚类。将当前的 *n* 个聚类表示为
我们选择聚类 在
中,并使用标准的 KMeans 将其分成两部分
。SSE 是
我们选择使 SSE 最小化的 作为要二分的聚类,即
重复上述过程,直到聚类数量为 K。
def biKmeans(self, train_data):
sample_num = len(train_data)
distances = np.zeros([sample_num, 2]) # (index, distance)
initial_center = np.mean(train_data, axis=0) # initial cluster #shape (1, feature_dim)
centers = [initial_center] # cluster list
# clustering with the initial cluster center
distances[:, 1] = np.power(self.calculateDistance(train_data, initial_center), 2)
# generate cluster centers
while len(centers) < self.k:
# print(len(centers))
min_SSE = np.inf
best_index = None
best_centers = None
best_distances = None
# find the best split
for j in range(len(centers)):
centerj_data = train_data[distances[:, 0] == j] # find the samples belonging
# to the j-th center
split_centers, split_distances = self.kmeans(centerj_data, 2)
split_SSE = np.sum(split_distances[:, 1]) ** 2 # calculate the distance
# for after clustering
other_distances = distances[distances[:, 0] != j] # the samples don't belong
# to j-th center
other_SSE = np.sum(other_distances[:, 1]) ** 2 # calculate the distance
# don't belong to j-th center
# save the best split result
if (split_SSE + other_SSE) < min_SSE:
best_index = j
best_centers = split_centers
best_distances = split_distances
min_SSE = split_SSE + other_SSE
# save the spilt data
best_distances[best_distances[:, 0] == 1, 0] = len(centers)
best_distances[best_distances[:, 0] == 0, 0] = best_index
centers[best_index] = best_centers[0, :]
centers.append(best_centers[1, :])
distances[distances[:, 0] == best_index, :] = best_distances
centers = np.array(centers) # transform form list to array
return centers, distances
KMeans++
由于初始质心对 KMeans 的性能有很大影响,为了解决这个问题,我们引入了另一种 KMeans 算法,称为 KMeans++。将当前的 *n * 个聚类表示为
当我们选择第 (n+1) 个质心时,样本与现有质心的距离越远,它被选为新质心的概率就越大。首先,我们计算每个样本与现有质心之间的最小距离
然后,计算每个样本被选为下一个质心的概率
然后,我们使用轮盘赌选择法来获得下一个质心。确定 K 个聚类后,运行 adjustCluster()
来调整结果。
def kmeansplusplus(self,train_data):
sample_num = len(train_data)
distances = np.zeros([sample_num, 2]) # (index, distance)
# randomly select a sample as the initial cluster
initial_center = train_data[np.random.randint(0, sample_num-1)]
centers = [initial_center]
while len(centers) < self.k:
d = np.zeros([sample_num, len(centers)])
for i in range(len(centers)):
# calculate the distance between each sample and each cluster center
d[:, i] = self.calculateDistance(train_data, centers[i])
# find the minimum distance between each sample and each cluster center
distances[:, 0] = np.argmin(d, axis=1)
distances[:, 1] = np.min(d, axis=1)
# Roulette Wheel Selection
prob = np.power(distances[:, 1], 2)/np.sum(np.power(distances[:, 1], 2))
index = self.rouletteWheelSelection(prob, sample_num)
new_center = train_data[index, :]
centers.append(new_center)
# adjust cluster
centers = np.array(centers) # transform form list to array
centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
return centers, distances
结论与分析
实际上,在确定如何选择参数 'K' 之后,调整聚类是必要的,存在一些算法。最后,让我们比较三种聚类算法的性能。
可以发现,KMeans++ 具有最佳性能。
本文相关的代码和数据集可以在 MachineLearning 中找到。
历史
- 2019 年 6 月 3 日:初始版本