C#.NET K-Means聚类算法实现推荐系统





5.00/5 (35投票s)
在本文中,我们将演示如何实现 K-Means 聚类算法来生成推荐。
以下文章的 PDF 版本也可在 https://www.pdf-archive.com/2016/12/13/kmeansre/ 或 http://www.filehosting.org/file/details/625473/KMeansRE.pdf 下载
“K-Means 是一种广泛应用于聚类分析的方法。我的理解是,该方法不需要任何假设,即给我一个数据集和一个预先指定的聚类数 k,然后我只需应用这个算法来最小化 SSE(簇内平方误差)。” – David Robinson,StackOverflow 数据科学家。
引言
K-Means 聚类是最可靠的无监督学习数据挖掘算法之一,用作解决之前被认为是不可计算的著名聚类问题的替代方案。以下算法最初由 Hugo Steinhaus 和 Stewart Lloyd 于 1950 年提出,并由 James McQueen 于 1967 年改进。从那时起,K-Means 聚类算法被用于解决各种计算问题,例如回归分析中的平方误差和(SSE)最小化、图形图像分割、推荐引擎开发、维护高效的软件功能以提供互联网公共安全和防盗保护等。
在本文中,我们将讨论如何使用 K-Means 聚类算法来实现一个简单的推荐引擎,该引擎执行用户到项目的协同过滤,根据用户的个人偏好向用户社区推荐在某个社交媒体网站上发布的文章。
背景
在以下文章的本节中,我们将提供对用于解决许多计算问题的经典 K-Means 聚类算法的基本表示的理解。特别是,我们将深入讨论 K-Means 聚类过程的要点以及
K-Means 聚类算法基础
通常,K-Means 算法通过将某个数据集分成 - 簇来解决简单的分类问题,每个簇包含多个最相似的数据项(或简称“观察”),这些数据项根据与最近“均值”的最小距离排列成簇,而“均值”又是该簇的“原型”。换句话说,K-Means 聚类算法从整个数据集中选择
个与特定最近均值距离最近的项目,并构建一个新簇。根据以下算法,K-Means 过程为每个最近均值选择数量最多的相似项目,其值是为每个特定的新创建簇预先计算的。
在这种情况下,对于每个最近均值的值,我们正在计算一个单独的簇,其中包含与当前最近均值最相似的数据项。聚类过程的另一个重要条件是,K-Means 过程从给定数据集中选择的每个项必须与同一簇中包含的其他项具有最近的距离(即必须最相似)。执行 K-Means 聚类的主要思想是,如果两个不同被选择的项与同一最近均值具有最小距离(即相似性),则这些项之间的实际距离也趋于最小。
给定数据集中每个项的值基本上由一组描述该项的属性表示。给定项的属性集通常具有数值表示,其中每个属性都是一个实值。因此,与数据项相关联的整个属性集被表示为 中的某个向量 -
维欧几里得空间,其中
是描述数据项的实际属性数量(
)。
反过来,为每个新簇计算的最近均值的值也表示为属性的数值实向量 ,在
- 维欧几里得空间中,与每个项的属性向量
具有特定距离。实际上,某个最近均值的属性向量
包含其值对簇中每个项最常见的属性。正如我们上面已经提到的,最近均值本身为选入簇的特定数量的项目提供了概括,并作为整个簇的原型。例如,如果我们有两个属于同一簇的项目,并且每个项目都有一个特定的属性向量
,那么与这两个项目的最近均值相关联的属性向量
将由与一个项和另一个项的两个向量中包含的属性值距离最近的值组成。
给定数据集中特定数据项与最近均值之间的距离通过使用目标函数(例如欧几里得、曼哈顿或出租车距离)计算,其值通常基于最近均值 或簇中特定项
的属性向量之间的距离来表示项与最近均值之间的相似性。根据 K-Means 聚类算法,使用欧几里得距离函数比使用其他函数来计算此距离更为可取。欧几里得距离通常计算为描述最近均值和项的特定属性之间部分距离的平方和,并可以公式化如下:
(1)
,其中 - 簇中最近均值
和项
的两个属性向量之间的距离;
- 项
的属性
的值;
- 当前最近均值的属性
的值;
- 实际属性数量;
基本上,有多种方法可以计算每个特定簇的最近均值属性向量 。在大多数情况下,最近均值属性向量被计算为表示特定项目子集质量中心的向量。这个质量中心也称为新构建簇的“质心”。特别是,质心属性向量是通过计算簇内所有项目的属性值的平均值(“均值”)获得的。为了计算平均属性向量,我们实际上需要对簇中所有项目相关联的多个属性向量进行分量加法。以下公式说明了单个质心属性值
的计算,它是给定质心向量
的一个分量:
(2)
,其中 - 给定质心属性向量中属性
的值;
- 项
的属性
的值;
- 簇中项目的实际数量;
通常,K-Means 聚类过程的步骤可以表述如下
0. 从整个项目数据集中构建一个初始簇: 在我们开始之前,我们需要将给定数据集中的所有项目分配给一个初始簇,并将其添加到簇数组中。我们还需要为初始簇选择一组多个质心。请注意,与其他后续将由 K-Means 过程生成的簇不同,为了执行有效的聚类,初始簇应该包含不止一个质心;
1. 使用 Forgy 或随机分区方法随机初始化初始簇的 个质心: 要执行初始化,我们实际上必须通过为向量
的每个属性
、
分配随机值来为每个当前质心
生成一个属性向量。此外,在此步骤中,我们将循环计数器变量
和
分配初始值;
2. 对于从簇数组中检索到的当前簇 ,找到与每个质心
距离最小的项目并生成新簇: 在此步骤中,我们从数组中检索一个簇,并继续执行后续步骤,根据属于当前簇的质心集创建一个或多个新簇;
3. 对于当前质心 ,使用欧几里得距离公式 (1) 计算与每个项目
的实际距离: 此时,我们需要计算与当前质心
相关联的属性向量
与每个项目的属性向量
之间的距离;
4. 选择 与当前质心
距离最近的项目,并通过将每个距离最小的项目分配给新创建的簇来生成新簇: 为此,我们基本上使用以下公式
来找到所有属性向量
满足到当前质心
属性向量
最小距离标准的项目;
5. 根据公式 (2) 计算新的质心值,并将其分配给在步骤 4 中生成的新簇: 在这个关键步骤中,我们将计算新的质心属性向量 ,作为以逐分量属性向量
表示的质量中心,对新构建的簇中包含的每个项目
进行求和。之后,我们将新质心分配给该簇。如果无法为新簇计算新的质心值,则转到算法的下一步,否则继续执行步骤 7;
6. 输出新构建的簇: 此时,我们假设该特定簇不再可能进行聚类。因此,我们将该簇传递给输出。类似地,我们将对每个完全符合步骤 5 中应用条件的新簇执行输出。生成输出后,转到步骤 8 并继续处理下一个质心 ;
7. 将前几步获得的新簇添加到簇数组中,并继续执行以下算法的下一步: 在此步骤中,我们将获得的新簇插入到簇数组中,该数组在执行递归聚类时基本上用作簇队列,在此期间,数组中的每个簇根据其一个或多个质心的值被分成若干个新簇。最后,我们继续执行算法的下一步;
8. 选择下一个质心 我们将对其执行实际聚类,并检查这是否是当前从簇数组中检索到的簇的最终质心 (
): 如果是,请继续执行步骤 9,否则执行步骤 3-7 以为当前簇的每个质心生成其余的新簇;
9. 从簇数组中选择下一个簇 并检查这是否是簇队列中的最终簇 (
),或者当前正在处理的簇是否无法重新计算新的质心: 如果是,则终止 K-Means 聚类过程,否则返回步骤 2;
图 1. 阐释了旨在基于 个质心生成三个新簇的 K-Means 聚类过程。每个项目由二维欧几里得空间 (
,
) 中的属性向量 (
) 表示:
使用 K-Means 聚类生成推荐
除了经典的 K-Means 聚类算法,本文还将详细解释基于一个简单推荐引擎实现 K-Means 聚类算法的示例,该引擎用于向访问社交媒体网站的用户推荐文章。
为了生成推荐,K-Means 过程基本上根据两种类型的对象执行聚类——“用户”或“项目”(即文章),它们与用户 或项目
属性的两个向量相关联。特别是,K-Means 推荐引擎实现了用户到项目的协同过滤方法,并根据用户的个人偏好向特定用户推荐一定数量的项目。通过执行聚类,K-Means 过程实际上通过大量由多个属性描述的文章执行“过滤”,这些属性的值正好与特定用户的偏好集相对应。在这种情况下,K-Means 聚类过程的主要思想是尝试找到用户和项目对象之间的相似性,它们本身具有不同的语义,但位于相同的欧几里得
- 维空间中,它们之间的相关性由用户和项目属性向量之间的实际距离表示。在这种情况下,这些用户和项目属性向量用作简单的潜在参数集,我们通过它们建立用户和项目对象之间的关系。在这种情况下,相似性度量(由用户或项目属性向量之间的欧几里得距离表示)是基于最相似的属性(例如,彼此之间距离最近的属性)计算的。换句话说,我们实际上通过计算特定属性之间部分距离的平方和来获得用户和项目之间的相似性。以下过程通常意味着在回归分析中使用普通最小二乘法 (OLS) 执行平方误差和最小化 (SSE),或在潜在语义分析中进行奇异值分解 (SVD) 计算,能够提供用户到项目的推荐。
为此,K-Means 聚类算法旨在找到与特定用户最相似的项目数量,该用户具有由属性向量 表示的自身偏好,并计算簇,其中与许多项目相关联的特定用户作为整个簇的最近“均值”(或“质心”)。在这种特定情况下,用户和项目的相似性通常由用户
和项目
属性的两个向量之间的欧几里得距离(即“度量”)表示。为此,我们使用略微不同的公式来执行相似性计算:
(3)
,其中 - 用户与项目之间相似性值的加权度量;
- 项
的属性
的值;
- 特定用户的属性
的值;
- 用户和项目属性向量中属性
之间的相关性系数;
- 实际属性数量;
与上面讨论的允许计算两个属性向量之间实际距离的传统欧几里得距离公式不同,相似性度量是作为加权距离值获得的,表示为用户或项目特定属性之间部分距离的平方和,每个部分距离乘以一个特定的权重系数 。在这种情况下,我们基本上处理的是基于距离的相似性,其值取决于多个参数。以下公式 (3) 的多个额外参数集表示为包含特定属性的这些额外参数之和的被除数:
,其中
- 是特定用户或项目的属性
的额外参数
。
在这种情况下,除了用户和项目属性之间的部分距离之外,我们还使用单个参数 ——这些属性之间的相关性。由于我们实现上下文数据模型来生成推荐,我们将采用词典相关性度量作为公式 (3) 的参数
。
(4)
, 其中 - 特定用户和项目的属性
之间的词典相关性值;
- 特定项目的属性
的字符串表示中的字符
;
- 特定用户的属性
的字符串表示中的字符
;
- 表示用户或项目属性
的两个字符串中最短字符串的实际长度;
为了计算用户或项目属性的相关性,我们通常使用一个简单的线性搜索算法,通过遍历表示属性 的两个字符串,并为每个位置执行位于相同位置
的字符比较。如果这些字符
和
完全匹配,则我们将
的值添加到相关性
的估计总值中。参数
是表示用户或项目属性
的最小字符串的实际长度。关于词典表示的用户和项目属性作为上下文数据表示模型的基本部分,将在以下文章的下一节中讨论。
此外,为了提供更灵活的基于距离的相似性计算,使用公式 (3,4),建议通过使用另一个额外参数来修改公式 (3),该参数表示用户或项目属性 中具有相关性值
的实际匹配数量。我们在公式 (3) 中使用以下参数的主要原因是,特定用户和项目的相似性(由用户或项目属性向量之间实际距离的值表示)很大程度上取决于词典上相等的属性数量。在这种情况下,公式 (3) 具有以下表示:
(5)
大量词典上匹配的属性确保了用户和项目属性的两个向量之间的欧几里得距离通常很小,并且特定用户和项目之间的相似性度量通常会增加。然而,在特殊情况下,当选择的特定用户和项目仅仅不相似(例如,用户或项目的属性向量没有任何完全匹配的属性)时,额外参数 的值。在这种情况下,我们假设参数
的值等于 0.01,这意味着用户或项目的属性向量之间的实际距离趋于较大。换句话说,我们使用参数
的值,这反过来提供了不相似用户和项目之间通常较大的距离,以避免最不相似的项目与特定用户之间的距离比其他具有更大基于距离的相似性值的项目更近的情况。
使用代码
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace KMeans
{
class Attrib : ICloneable
{
private string m_Name;
private double m_Value;
public Attrib(string name, double value)
{
Name = name; Value = value;
}
public object Clone()
{
Attrib TargetAttrib = (Attrib)this.MemberwiseClone();
TargetAttrib.Name = Name; TargetAttrib.Value = Value;
return TargetAttrib;
}
public string Name
{
get { return m_Name; }
set { m_Name = value; }
}
public double Value
{
get { return m_Value; }
set { m_Value = value; }
}
}
class IAttribList : ICloneable, IEnumerable<Attrib>
{
private List<Attrib> m_AttribList = null;
public IAttribList()
{
if (m_AttribList == null)
m_AttribList = new List<Attrib>();
}
public void Add(Attrib attrib_item)
{
m_AttribList.Add((Attrib)attrib_item.Clone());
}
public object Clone()
{
IAttribList TargetAttribList = new IAttribList();
foreach (Attrib attrib in m_AttribList)
TargetAttribList.Add(attrib);
return TargetAttribList.Count() > 0 ?
(IAttribList)TargetAttribList.Clone() : null;
}
public Attrib this[int iIndex]
{
get { return m_AttribList[iIndex]; }
set { m_AttribList[iIndex] = value; }
}
public int Count() { return m_AttribList.Count(); }
public IEnumerator GetEnumerator()
{
return m_AttribList.GetEnumerator();
}
IEnumerator<Attrib> IEnumerable<Attrib>.GetEnumerator()
{
return (IEnumerator<Attrib>)this.GetEnumerator();
}
}
class Item : ICloneable
{
private bool m_IsUser;
private bool m_Exists;
private string m_ItemText;
private double m_Distance;
private IAttribList m_AttribList = null;
public Item(string item_text, IAttribList attributes,
double distance, bool is_user, bool exists)
{
if (AttribList == null)
AttribList = (IAttribList)attributes;
IsUser = is_user; Exists = exists;
ItemText = item_text; Distance = distance;
}
public IAttribList GetAttribList()
{
return m_AttribList;
}
public object Clone()
{
Item TargetItem = (Item)this.MemberwiseClone();
IAttribList TargetAttribList = new IAttribList();
foreach (Attrib attrib in this.AttribList)
TargetAttribList.Add(attrib);
if (TargetAttribList.Count() > 0)
TargetItem.AttribList = (IAttribList)TargetAttribList.Clone();
TargetItem.IsUser = this.IsUser; TargetItem.Exists = this.Exists;
TargetItem.ItemText = this.ItemText; TargetItem.Distance = this.Distance;
return TargetItem;
}
private double Relevance(string attrib1, string attrib2)
{
double nRelevance = 0;
// Assigning the nLength variable the value of the smallest string length
int nLength = attrib1.Length < attrib2.Length ? attrib1.Length : attrib2.Length;
// Iterating through the two strings of character, comparing the pairs of items
// from either attrib1 and attrib2. If the two characters are lexicographically equal
// we're adding the value 1 / nLength to the nRelevance variable
for (int iIndex = 0; iIndex < nLength; iIndex++)
nRelevance += (attrib1[iIndex] == attrib2[iIndex]) ? (double)1 / nLength : 0;
return nRelevance;
}
public double EuclDW(Item item)
{
int nCount = 0;
int iIndex = 0; double nDistance = 0;
// Iterating through the array of attributes and for each pair of either users or items
// attributes computing the distance between those attributes. Then, each distance values
// is added to the nDistance variable. During the computation we're also obtaining the
// value of releavance between the lexicographical representations of those attributes
while (iIndex < item.AttribList.Count() && iIndex < AttribList.Count())
{
// Compute the relevance between names of the pair of attributes
double nRel = Relevance(item.AttribList[iIndex].Name, AttribList[iIndex].Name);
if (nRel == 1) nCount++;
// Computing the Eucledean distance between the pair of current attributes
nDistance+= Math.Pow(item.AttribList[iIndex].Value - AttribList[iIndex].Value, 2.0) *
((double)((nRel > 0) ? nRel : 1));
iIndex++;
}
// Returning the value of the distance between two vectors of attributes
return Math.Sqrt(nDistance) * ((double)1 / ((nCount > 0) ? nCount : 0.01));
}
public string ItemText
{
get { return m_ItemText; }
set { m_ItemText = value; }
}
public double Distance
{
get { return m_Distance; }
set { m_Distance = value; }
}
public bool IsUser
{
get { return m_IsUser; }
set { m_IsUser = value; }
}
public bool Exists
{
get { return m_Exists; }
set { m_Exists = value; }
}
public IAttribList AttribList
{
get { return m_AttribList; }
set { m_AttribList = value; }
}
}
class IItemsList : ICloneable, IEnumerable<Item>
{
private List<Item> m_ItemsList = null;
public IItemsList()
{
if (m_ItemsList == null)
m_ItemsList = new List<Item>();
}
public void Add(Item item)
{
m_ItemsList.Add(item);
}
public object Clone()
{
IItemsList TargetItems = new IItemsList();
foreach (Item item in m_ItemsList)
{
IAttribList TargetAttribList = new IAttribList();
foreach (Attrib attrib in item.GetAttribList())
TargetAttribList.Add(new Attrib(attrib.Name, attrib.Value));
if (TargetAttribList.Count() > 0)
TargetItems.Add(new Item(item.ItemText, TargetAttribList,
item.Distance, item.IsUser, item.Exists));
}
return TargetItems;
}
public Item this[int iIndex]
{
get { return m_ItemsList[iIndex]; }
set { m_ItemsList[iIndex] = value; }
}
public int Count() { return m_ItemsList.Count(); }
public void RemoveAt(int iIndex) { m_ItemsList.RemoveAt(iIndex); }
public IEnumerator GetEnumerator()
{
return m_ItemsList.GetEnumerator();
}
IEnumerator<Item> IEnumerable<Item>.GetEnumerator()
{
return (IEnumerator<Item>)this.GetEnumerator();
}
}
class ICluster : ICloneable
{
private IItemsList m_Items;
private IItemsList m_Centroids;
public ICluster(IItemsList cnts_list, IItemsList items_list)
{
Items = (IItemsList)items_list;
Centroids = (IItemsList)cnts_list;
}
public object Clone()
{
IItemsList TargetItemsList = new IItemsList();
IItemsList TargetCentroidsList = new IItemsList();
ICluster TargetCluster = (ICluster)this.MemberwiseClone();
foreach (Item centroid in Centroids)
TargetCentroidsList.Add(centroid);
foreach (Item item in Items)
TargetItemsList.Add(item);
TargetCluster.Items = TargetItemsList;
TargetCluster.Centroids = TargetCentroidsList;
return TargetCluster;
}
public IItemsList Items
{
get { return (IItemsList)m_Items; }
set { m_Items = (IItemsList)value; }
}
public IItemsList Centroids
{
get { return (IItemsList)m_Centroids; }
set { m_Centroids = (IItemsList)value; }
}
}
class IClustersList : ICloneable, IEnumerable<ICluster>
{
private List<ICluster> m_ClustersList = null;
public IClustersList()
{
if (m_ClustersList == null)
m_ClustersList = new List<ICluster>();
}
public void Add(ICluster cluster)
{
m_ClustersList.Add(cluster);
}
public object Clone()
{
IClustersList TargetClustersList = new IClustersList();
foreach (ICluster cluster in m_ClustersList)
{
IItemsList TargetCentroidsList = new IItemsList();
foreach (Item centroid in (IItemsList)cluster.Centroids.Clone())
TargetCentroidsList.Add(new Item(centroid.ItemText, (IAttribList)centroid.AttribList.Clone(),
centroid.Distance, centroid.IsUser, centroid.Exists));
IItemsList TargetItemsList = new IItemsList();
foreach (Item item in (IItemsList)cluster.Items.Clone())
TargetItemsList.Add(new Item(item.ItemText, (IAttribList)item.AttribList.Clone(),
item.Distance, item.IsUser, item.Exists));
TargetClustersList.Add(new ICluster((IItemsList)TargetCentroidsList.Clone(),
(IItemsList)TargetItemsList.Clone()));
}
return TargetClustersList;
}
public ICluster this[int iIndex]
{
get { return m_ClustersList[iIndex]; }
}
public int Count() { return m_ClustersList.Count(); }
public IEnumerator GetEnumerator()
{
return m_ClustersList.GetEnumerator();
}
IEnumerator<ICluster> IEnumerable<ICluster>.GetEnumerator()
{
return (IEnumerator<ICluster>)this.GetEnumerator();
}
}
class KMeans
{
private IItemsList m_Items;
private IItemsList m_Users;
private IClustersList m_Clusters;
private IClustersList m_TargetClusters;
private readonly System.Random rnd = new System.Random();
private double m_MinVal = 0;
private double m_MaxVal = 0;
private void Swap(ref IAttribList attribs, int indexA, int indexB)
{
Attrib temp = attribs[indexA];
attribs[indexA] = attribs[indexB];
attribs[indexB] = temp;
}
private void Normalize(IItemsList items_list, int n_attribs,
bool is_users, ref double min_val, ref double max_val)
{
// Performing a check if the minimum and maximum value are equal to 0
if (min_val == 0 && max_val == 0)
{
// Assigning the initial values to min_val and max_val variable,
// which represent the boundaries at which the value of each attribute is normalized
min_val = (double)1 / n_attribs;
max_val = (double)n_attribs / (n_attribs + 1);
}
// Iterating through the array of items and for each item items_list[iItem]
// performing normalization by distributing the value of each attribute in
// the range of [0;1] using local extremum formula
for (int iItem = 0; iItem < items_list.Count(); iItem++)
{
// For the current item items_list[iItem].AttribList retriving the array of attributes
IAttribList AttribsTarget = items_list[iItem].AttribList;
// Iterating through the array of attributes and for each attribute perform normalization
// by converting its value into the value from the range [0;1] using the following formula
for (int iAttrib = 0; iAttrib < AttribsTarget.Count(); iAttrib++)
// Performing a check if the value of the current attribute AttribsTarget[iAttrib].Value
// exceeding the [0;1] range and this is not the user's attribute
if (AttribsTarget[iAttrib].Value > 1 || is_users == false)
// If so, applying the following formula to normalize the current attribute value
AttribsTarget[iAttrib].Value = ((AttribsTarget[iAttrib].Value /
(n_attribs + 1)) - min_val) / (max_val - min_val) + 0.01;
}
}
public int LoadItemsFromFile(string filename)
{
// Intializing the file stream object and opening the file with the name being specified
using (System.IO.FileStream fsFile = new System.IO.FileStream(filename,
System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
{
// Initializing the stream reader object
using (System.IO.StreamReader fsStream = new System.IO.StreamReader(
fsFile, System.Text.Encoding.UTF8, true, 128))
{
int iItem = 0;
int nAttrib = 1; string textBuf = "\0";
// Retrieving each line from the file until we reach the end-of-file
while ((textBuf = fsStream.ReadLine()) != null)
{
// Performing if the line is not empty and contains the data on a specific item
if (!String.IsNullOrEmpty(textBuf))
{
// If so, initializing the array of attributes TargetAttribList for the current item
IAttribList TargetAttribList = new IAttribList();
// Tokenizing the string according to the regular expression pattern assigned to the sItemPatern variable
string sItemPattern = " => "; string[] sItemTokens;
if ((sItemTokens = Regex.Split(textBuf, sItemPattern)) != null)
{
// Iterating through the array of tokens and for each string
// perform another tokenization to obtain the set of attributes name for the current item
for (int iToken = 0; iToken < 2; iToken++)
{
// For each string sItemTokens[iToken] we're performing tokenization to obtain the set of attributes names
string sPattern = " "; string[] sTokens;
if ((sTokens = Regex.Split(sItemTokens[iToken], sPattern)) != null)
{
// For each particular attribute name token we're performing encoding
// to obtain each attribute value associated with its name
foreach (string token in sTokens)
{
// At this point, we're performing a check if the attribute with similar name
// for a specific item has not been already indexed into the array of attributes
bool bExists = false; int nToken = 0;
int nIndex = iItem; double nCoeff = 0;
// Iterating the array of items to find those items that have
// the attribute with the name which is equal to the name of
// the current attribute attribs[nToken].Name.Equals(token) being processed
while (--nIndex >= 0 && bExists == false)
{
nToken = 0;
// Iterating through the array of attributes of the current item m_Items[nIndex]
// and performing a check if a certain atribute's name of the current item is not equal
// the name of the current attributed being retrieved from the file. If so, we're ending
// the loop execution by assinging the the bExists variable value of true
IAttribList attribs = m_Items[nIndex].AttribList;
while (nToken < attribs.Count() && bExists == false)
bExists = (attribs[nToken++].Name.Equals(token)) ? true : false;
}
// Computing the coefficient value for the current attribute retrieved from the file.
// If an item with the similar name of attribute has already been indexed into the array
// of items we're assigning the its attribute's value to the nCoeff variable, which is
// actually the value of the attribute for the current item fetched from the file, otherwise
// we're assigning the actual index value for the current attributes using nAttrib counter
// variable value
nCoeff = (bExists == true) ?
m_Items[nIndex + 1].AttribList[nToken - 1].Value : nAttrib;
bool bInAttribList = false; int iAttrib = 0;
// Iterating through the array of target attributes and performing a check if the
// attribute with the similar name has not yet been indexed to the following array for current item
while (iAttrib < TargetAttribList.Count() && !bInAttribList)
bInAttribList = (token.Equals(TargetAttribList[iAttrib++].Name)) ? true : false;
// If the current attribute has not yet been indexed, inserting the new attribute
// represented as a pair of two value of either token name or coefficient nCoeff into
// the array of attributes for the current item being retrieved from the file
if (bInAttribList == false)
TargetAttribList.Add(new Attrib(token, nCoeff));
nAttrib++; // Incrementing the value of the attributes loop counter variable by 1
}
}
}
}
// Inserting the current item retrieved from the file into the array of items m_Items
m_Items.Add(new Item(textBuf, TargetAttribList, 0, false, false));
iItem++; // Incrementing the value of the items loop counter variable by 1
}
}
// Performing normalization of the attributes values for each item the array of items
Normalize(m_Items, nAttrib, false, ref m_MinVal, ref m_MaxVal);
// Deallocating the stream reader object
fsStream.Close();
}
// Deallocating the file stream object
fsFile.Close();
}
// Returning the actual value of the number of items retrieved from the file
return m_Items.Count();
}
public int LoadUsersFromFile(string filename)
{
// Intializing the file stream object and opening the file with the name being specified
using (System.IO.FileStream fsFile = new System.IO.FileStream(filename,
System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
{
// Initializing the stream reader object
using (System.IO.StreamReader fsStream = new System.IO.StreamReader(
fsFile, System.Text.Encoding.UTF8, true, 128))
{
int iItem = 0;
int nAttrib = 1; string textBuf = "\0";
// Retrieving each line from the file until we reach the end-of-file
while ((textBuf = fsStream.ReadLine()) != null)
{
// Performing if the line is not empty and contains the data on a specific user
if (!String.IsNullOrEmpty(textBuf))
{
// If so, initializing the array of attributes TargetAttribList for the current user
string sPattern = " => "; string[] sTokens;
IAttribList TargetAttribList = new IAttribList();
// Tokenizing the string according to the regular expression pattern assigned to the sItemPatern variable
if ((sTokens = Regex.Split(textBuf, sPattern)) != null)
{
// For each particular attribute name token we're performing encoding
// to obtain each attribute value associated with its name
foreach (string token in sTokens[1].Split(new char[] { ' ' }))
{
// At this point, we're performing a check if the attribute with similar name
// for a specific user has not been already indexed into the array of attributes
bool bExists = false; int nToken = 0;
int nIndex = 0; double nCoeff = 0;
// Iterating the array of users to find those users that have
// the attribute with the name which is equal to the name of
// the current attribute attribs[nToken].Name.Equals(token) being processed
while (nIndex < m_Items.Count() && bExists == false)
{
nToken = 0;
// Iterating through the array of attributes of the current user m_Users[nIndex]
// and performing a check if a certain atribute's name of the current user is not equal
// the name of the current attributed being retrieved from the file. If so, we're ending
// the loop execution by assinging the the bExists variable value of true
while (nToken < m_Items[nIndex].AttribList.Count() && bExists == false)
bExists = (m_Items[nIndex].AttribList[nToken++].Name.Equals(token)) ? true : false;
nIndex++;
}
// If the users with hat have the attribute with the name which is equal to the name of
// the current attribute attribs[nToken].Name.Equals(token) don't exists, then we're
// iterating through the array of users performing a check if the attribute with similar
// has already been indexed for a particular user into the array m_Users.
if (bExists == false)
{
int nItem = iItem - 1;
bool bUserAttrib = false;
// Iterating through the set of previous users in the array of users m_Users
while (nItem >= 0 && bUserAttrib == false)
{
nToken = 0;
// For each user, iterating through the array of attributes, and for each attribute's
// name we're performing a check if the name of the current attribute of the current user
// is not equal to the name of the current token retrieved from the file.
while (nToken < m_Users[nItem].AttribList.Count() && !bUserAttrib)
bUserAttrib = (m_Users[nItem].AttribList[nToken++].Name.Equals(token)) ? true : false;
nItem--;
}
// Computing the coefficient value for the current attribute retrieved from the file.
// If a user with the similar name of attribute has already been indexed into the array
// of users we're assigning the its attribute's value to the nCoeff variable, which is
// actually the value of the attribute for the current user fetched from the file, otherwise
// we're assigning the actual index value for the current attributes using nAttrib counter
// variable value
nCoeff = (bUserAttrib == true) ? m_Users[nItem + 1].AttribList[nToken - 1].Value : nAttrib;
}
// Otherwise, assigning the nCoeff variable to the value of the attribute of a specific user that
// has already been indexed into the array of users
else nCoeff = m_Items[nIndex - 1].AttribList[nToken - 1].Value;
// Inserting the new attribute represented as a pair of two value of either token name
// or coefficient nCoeff into the array of attributes for the current item being retrieved from the file
TargetAttribList.Add(new Attrib(token, nCoeff));
nAttrib++; // Incrementing the value of the attributes loop counter variable by 1
}
// Inserting the current user retrieved from the file into the array of users m_Users
m_Users.Add(new Item(textBuf, TargetAttribList, 0, true, false));
iItem++; // Incrementing the value of the users loop counter variable by 1
}
}
}
// Performing normalization of the attributes values for each user the array of users
Normalize(m_Users, nAttrib, true, ref m_MinVal, ref m_MaxVal);
// Deallocating the stream reader object
fsStream.Close();
}
// Deallocating the file stream object
fsFile.Close();
}
// Returning the actual value of the number of users retrieved from the file
return m_Users.Count();
}
public KMeans()
{
m_Items = new IItemsList();
m_Users = new IItemsList();
}
public void Compute(int nInitialCentroids, int nItemsPerCluster)
{
// Initializing the array of target clusters for which we'll produce the new clusters
m_TargetClusters = new IClustersList();
// Performing the iteration until we've obtained the array of target clusters
while (m_TargetClusters.Count() < m_Users.Count())
{
// Initializing the array of clusters
m_Clusters = new IClustersList();
// Performing a check if the number of centroids within the initial cluster is not equal to the number of users
if (nInitialCentroids != m_Users.Count())
{
// Obtaining the array of initial centroids based on the values
// retrieved from the array of users by performing k-means++ procedure
// Initializing the array of centroid indexes
List<int> CentroidIndexes = new List<int>();
// Randomly generate the index of the first intial centroid
int nInitialCentroid = rnd.Next(0, m_Users.Count());
// Performing iteration until we've obtained the n-initial centroids
while (CentroidIndexes.Count() < nInitialCentroids)
{
double nDistance = 0, nDistanceSum = 0;
double nDistanceMin = 0; int nCntMin = -1;
// Iterating through the array of users and for each user compute the distance
// to the initial centroid being previously selected
for (int nItem = 0; nItem < m_Users.Count(); nItem++)
{
// Performing a check if the index of the current user is not equal to
// the index of the intial centroid (i.e. user) in the array of users
if (nItem != nInitialCentroid)
{
// If so, computing the actual distance between the two vectors
// of either the current user m_Users[nItem] or initial centroid's m_Users[nInitialCentroid] attributes.
if ((nDistance = Math.Pow(m_Users[nItem].EuclDW(m_Users[nInitialCentroid]), 2.0)) >= 0)
{
// If the following distance is less than the smallest distance to the initial centroid
// m_Users[nInitialCentroid], then we're performing a check if the index of the current
// user has not yet been inserted into the array of the centroids indexes.
if (nDistance < nDistanceMin || nCntMin == -1)
{
bool bFound = false; int iCntIndex = 0;
// Iterating through the array of centroids indexes and for each index CentroidIndexes[iCntIndex]
// in the array we're performing a check if it's not equal to the index of the current user nItem,
// if so, we're ending the loop execution by assigning the value true to the variable bFound.
while (iCntIndex < CentroidIndexes.Count() && bFound == false)
bFound = (CentroidIndexes[iCntIndex++] == nItem) ? true : false;
// If the current user's index is not in the array of the centroids indexes, then
// we're assigning the variable nDistanceMin the value of the previously computed
// distance nDistance. Also, we're assigning the index value of the current user to nCntMin variable
if (bFound == false)
{
nDistanceMin = nDistance; nCntMin = nItem;
}
}
// Computing the sum of the distances to the initial centroid for each user
nDistanceSum += nDistance;
}
}
}
// Modify the value of nDistanceSum variable multiplying it by the randomly generate number
nDistanceSum = rnd.NextDouble() * nDistanceSum;
int nIndex = 0; double nSum = 0;
// Iterating through the array of users until the sum of distances between the vectors of attributes of
// each user and the initial centroid doesn't exceed the total value on the sum of distances for all users
while (nIndex < m_Users.Count() && nSum < nDistanceSum)
{
int iTargetIndex = 0; bool bFound = false;
// For the current user m_Users[nIndex] computing the distance
// to the users that has been previously selected as an initial centroid
double nDist = Math.Pow(m_Users[nIndex++].EuclDW(m_Users[nCntMin]), 2.0);
// Performing a check if the index of the current user m_Users[nIndex] is not in the array CentroidIndexes.
while (iTargetIndex < CentroidIndexes.Count() && !bFound)
bFound = (CentroidIndexes[iTargetIndex++] == nIndex) ? true : false;
// If not, summing the distance value for the current user nDist with the nSum variable
if (bFound == false)
nSum += nDist;
}
// Performing a check if the value of the nCntMin variable representing the actual index
// of the user with the smallest distance to initial centroid is not equal to -1
if (nCntMin != -1)
// If not, inserting the index nCntMin to the array of centroids indexes
CentroidIndexes.Add(nCntMin);
}
// Initializing the array of initial centroids
IItemsList CentroidItems = new IItemsList();
// Iterating through the array of users and inserting each user
// m_Users[CentroidIndexes[iIndex] with index CentroidIndexes[iIndex] to the array of centroids
for (int iIndex = 0; iIndex < CentroidIndexes.Count(); iIndex++)
CentroidItems.Add(m_Users[CentroidIndexes[iIndex]]);
// Inserting the new current initial cluster to the array of clusters.
m_Clusters.Add(new ICluster(CentroidItems, m_Items));
}
// Inserting the initial cluster into the array of clusters
else m_Clusters.Add(new ICluster(m_Users, m_Items));
// Iterating through the array of clusters, retrieving each cluster
// to obtain the new clusters by performing k-means procedure
for (int iCluster = 0; iCluster < m_Clusters.Count(); iCluster++)
{
// Clonning the array of items belonging to the current
// cluster m_Clusters[iCluster] by copying them into array Items
IItemsList Items = (IItemsList)m_Clusters[iCluster].Items.Clone();
// Clonning the array of centroids belonging to the current
// cluster m_Clusters[iCluster] by copying them into array Centroids
IItemsList Centroids = (IItemsList)m_Clusters[iCluster].Centroids.Clone();
// Iterating through the array of centroids (i.e. users) of the current cluster m_Clusters[iCluster]
for (int iCentroid = 0; iCentroid < Centroids.Count(); iCentroid++)
{
// For each centroid Centroids[iCentroid] of the current cluster
// m_Clusters[iCluster] retriving the set of attributes and copy it into array attribsA
IAttribList attribsA = Centroids[iCentroid].AttribList;
// Normalizing the set of attributes of the current centroid Centroids[iCentroid]
// Iterating through the array of attributes of the current centroid Centroids[iCentroid]
for (int iAttrib = 0; iAttrib < attribsA.Count(); iAttrib++)
// For each attribute retrieved from the set of attributes of the current centroid
// Centroids[iCentroid], we're performing a linear search in the array of items Items
// to find those items that have one or more attributes with similar name (e.g. which name
// is lexicographically equal to the name of the current attribute attribsA[iAttrib] of the
// centroid Centroids[iCentroid]).
// Iterating through the array of items and for each item Items[iItem] retrieving a set of attributes
for (int iItem = 0; iItem < Items.Count(); iItem++)
{
// Copying the array of attributes for the current item Items[iItem] into the array attribsB
IAttribList attribsB = Items[iItem].AttribList;
// Iterating through the array of attributes attribB of the current item Items[iItem] and perform a check
// if the name of the current attribute attribsB[nAttrib] is lexicographically equal to the name of the
// current attribute attribsA[iAttrib] of the current centroid (e.g. user)
for (int nAttrib = 0; nAttrib < attribsB.Count(); nAttrib++)
// If the name of current item's attribute attribsB[nAttrib] is lexicographically equal to
// the name of the current centroid's attribute attribsA[iAttrib], then we're performing a
// swap to ensure that the particular attributes of either the centroid (user) or item are
// located at the same position in the arrays representing the vectors of either user or
// items attributes, relative to the position of the centroid's (user) of a particular attributes
// in the array of attributes of the current centroid Centroids[iCentroid]
if (attribsB[nAttrib].Name.Equals(attribsA[iAttrib].Name))
if (iAttrib < attribsB.Count()) Swap(ref attribsB, iAttrib, nAttrib);
}
// Initializing the variable nDistanceAvg to store the value of the average distance
// between the current centroid Centroids[iCentroid] and each item within the current cluster m_Clusters[iCluster]
double nDistanceAvg = 0;
// Initializing the list of indexes of those items that have the smallest distance
// to the current centroid Centroids[iCentroid]
List<int> ItemsIndexes = new List<int>();
int nNeighbors = Items.Count() < nItemsPerCluster ? Items.Count() : nItemsPerCluster;
bool bDone = false;
// Performing the linear search to find all items in the array Items that have the smallest
// distance (e.g. the most similar items) to the current centroid Centroids[iCentroid] until
// we've exactly found nNeightbors items that exactly meet the following creteria
while (ItemsIndexes.Count() < nNeighbors && !bDone)
{
// Initializing the variable nDistanceAvg to store the average distance within a cluster
nDistanceAvg = 0;
// Initializing the nDistanceMin variable used to store the value of the smallest distance
// between an item and the current centroid Centroids[iCentroid]
// Initializing the nItemMin variable to store the index value of the item from the array Items
// having the smallest distance to the current centroid Centroids[iCentroid]
double nDistanceMin = 0; int nItemMin = -1;
// Iterating through the array of items of the current cluster m_Clusters[iCluster]
for (int iItem = 0; iItem < Items.Count(); iItem++)
{
double nDistance = 0;
// For each item being retrieved we're performing a check if the distance for the current item
// has not already been computed and the current item is not the item having the smallest distance
// to the current centroid Centroids[iCentroid]
if (m_Clusters[iCluster].Items[iItem].Exists == false)
{
// If not, computing the distance between two vectors of attributes of either
// the current centroid attribA or the current item Items[iItem]
// Initializing the temporary array of the current centroid's attributes by copying
// the array attribsA to the temporary array item
Item temp = new Item(null, attribsA, 0, false, false);
// Computing the actual distance between the vector of the currentr centroid's attributes
// stored to the array of attributes item, and the vector of attributes for the current item Items[iItem]
if ((nDistance = Items[iItem].EuclDW(temp)) >= 0)
{
// If the value of distance between the either the current centroid's or item's vectors of attributes
// is less than the smallest distance to the current centroid's vector of attributes, then we're performing
// a check if the index of the current item's Items[iItem] has not already been stored into the array ItemsIndexes
if (((nDistance < nDistanceMin) ||
(nItemMin == -1)) && Items[iItem].ItemText != null && nDistance <= 1.0)
{
bool bExists = false;
int nItem = ItemsIndexes.Count() - 1;
// Performing the linear search to check if the index of the current item for which we've previously
// computed the distance value, doesn't exists in the array of indexes.
while (nItem >= 0 && bExists == false)
bExists = (ItemsIndexes[nItem--] == iItem) ? true : false;
// If not, assigning the variable nDistanceMin the distance value between the vector of attributes
// of the current centroid Centroids[iCentroid] and the vector of attributes of the current item Items[iItem]
// Also, we're assigning the index value iItem of the current item Items[iItem] to the nItemMin variable
if (bExists == false)
{
nDistanceMin = nDistance; nItemMin = iItem;
}
}
}
}
// Computing the avarage distance between the vector of attributes of each item
// Items[iItem] and the vector of attributes of the current centroid Centroids[iCentroid]
nDistanceAvg += nDistance / Items.Count();
}
// If nItemMin variable is not equal to -1 and the appropriate value of the item with the smallest distance
// has been found, inserting the value of index nItemMin into the array ItemsIndexes.
if (nItemMin > -1)
ItemsIndexes.Add(nItemMin);
// Otherwise, terminating the process of finding the items with the
// smallest distance to the current centroid Centroids[iCentroid]
else bDone = true;
}
// Iterating through the array of items of the current cluster m_Clusters[iCluster]
// and for each item Items[ItemsIndexes[iIndex] with the index value ItemsIndexes[iIndex]
// stored in the array ItemsIndexes, we're assigning the variable Exists to true, which
// means that the actual distance value for the current item has already been computed
// and the following item has already been included into the current the newly built cluster
for (int iIndex = 0; iIndex < ItemsIndexes.Count(); iIndex++)
m_Clusters[iCluster].Items[ItemsIndexes[iIndex]].Exists = true;
// Updating the value of the variable Centroids[iCentroid].Distance by assigning it
// the value of the current averaged distance nDistanceAvg being computed
Centroids[iCentroid].Distance = nDistanceAvg;
// Initializing the array of target items for the current newly built cluster
IItemsList TargetItems = new IItemsList();
// Iterating through the array of indexes ItemsIndexes being obtained
for (int iItem = 0; iItem < ItemsIndexes.Count(); iItem++)
{
// For the current cluster we're performing a check if the current cluster
// is not the initial cluster in the array of clusters (e.g. the following cluster
// has no more than one centroid).
if (m_Clusters[iCluster].Centroids.Count() <= 1)
// If not, re-compute the distance between the current item with index ItemsIndexes[iItem]
// to the only centroid of the current non-initial cluster and assign this value to the
// Items[ItemsIndexes[iItem]].Distance variable for the current item
Items[ItemsIndexes[iItem]].Distance =
Items[ItemsIndexes[iItem]].EuclDW(m_Clusters[iCluster].Centroids[0]);
// Resetting the value of the exists variable for the current item in the newly built cluster
// by assigning it the variable of false.
Items[ItemsIndexes[iItem]].Exists = false;
// Inserting the current item with index ItemsIndexes[iItem] into the array of target items for the newly built cluster
TargetItems.Add(Items[ItemsIndexes[iItem]]);
}
int nMinAttribs = -1;
// Iterating through the array of target items for the newly built cluster to
// find the item with smallest number of attributes
for (int iItem = 0; iItem < TargetItems.Count(); iItem++)
{
// Obtaining the value of the actual number of attributes count of the current item TargetItems[iItem]
int nAttribCount = TargetItems[iItem].AttribList.Count();
// If the value of attributes count nAttribCount is less than the smallest value of the attributes count for
// the entire array TargetItems, then assigning the value of nAttribCount to the nMinAttribs variable
if (nAttribCount < nMinAttribs || nMinAttribs == -1)
nMinAttribs = nAttribCount;
}
// Initializing the attribs list to store the computed values of attributes
// for the centroid of the newly built cluster
IAttribList attribs = new IAttribList();
// Computing the new value for each attribute of the centroid for the newly build cluster by iterating
// through each attrbutes in each item, inserting each new value of attribute into the array attrib
for (int nAttrib = 0; nAttrib < nMinAttribs &&
nAttrib < Centroids[iCentroid].AttribList.Count(); nAttrib++)
{
// For each attribute of the current centroid Centroids[iCentroid] of the existing cluster
// obtaining the new value by computing the sum of each attribute AttribList[nAttrib] at the
// position nAttrib of the vector of attributes AttribList[nAttrib] for each target item being previously obtained
double nAttribAvg = 0; int nCount = 0;
// Iterating through the array of target items TargetItems
for (int iItem = 0; iItem < TargetItems.Count(); iItem++)
// For each item performing a check if the value of the attribute
// located at the current position nAttrib is greater than zero
if (TargetItems[iItem].AttribList[nAttrib].Value > 0)
{
// If so, adding up the value of the attribute located at the position nAttrib
// with the value of the nAttribAvg variable. Also computing the count of the target
// items that exactly meet the following condition by incrementing the count variable by 1
nAttribAvg += TargetItems[iItem].AttribList[nAttrib].Value;
nCount++;
}
// Since we've obtained the new value of attribute AttribList[nAttrib].Value with index nAttrib,
// for the current centroid Centroids[iCentroid] we're computing the average value by performing
// the division of the nAttribAvg variable's value by the actual number of target items that
// satisfied the criteria commented above. After that, we're compacting the following value by
// performing normalization. Finally we're inserting the average value into the array of
// atributes attrib along with the name of the new attribute which value has been obtained
attribs.Add(new Attrib(Centroids[iCentroid].AttribList[nAttrib].Name,
((nAttribAvg / (double)(nCount + 1)) - m_MinVal) / (m_MaxVal - m_MinVal) + 0.01));
}
bool bDiff = false; int nIndex = 0;
// Iterating through the new vector of attributes attribs to determine if the following vector
// of attributes being obtained has the different values of attributes than the vector of attributes attribsA
// of the current centroid Centroids[iCentroid]
while (nIndex < attribs.Count() && nIndex < attribsA.Count() && bDiff == false)
bDiff = (attribs[nIndex].Value != attribsA[nIndex++].Value) ? true : false;
if (bDiff == true)
{
// If so, initializing the array of new centroids
IItemsList TargetCentroids = new IItemsList();
// Inserting the new centroid with vector of attributes
// attribs into the array of centroids for the new cluster
TargetCentroids.Add(new Item(Centroids[iCentroid].ItemText,
attribs, nDistanceAvg, Centroids[iCentroid].IsUser, false));
// Inserting newly built cluster represented as a pair
// of sets of either TargetCentroids or TargetItems to the array of clusters
m_Clusters.Add(new ICluster(TargetCentroids, TargetItems));
}
else
{
// If not, iterating through the array of newly computed target clusters
// and for each cluster perfoming a check if there's a cluster with centroid which name
// value is equal to the name value of the current centroid Centroids[iCentroid]
bool bExists = false; int iTargetCluster = 0;
while (iTargetCluster < m_TargetClusters.Count() && !bExists)
bExists = m_TargetClusters[iTargetCluster++].Centroids[0].
ItemText.Equals(Centroids[iCentroid].ItemText) ? true : false;
if (bExists == false)
{
// If so, initializing the array of centroids TargetUsers and insert the
// new centroid into the following array.
IItemsList TargetUsers = new IItemsList();
TargetUsers.Add(new Item(Centroids[iCentroid].ItemText,
Centroids[iCentroid].AttribList, Centroids[iCentroid].Distance, true, true));
// Inserting the cluster into the array of target clusters for which we'll further
// be producing the new clusters during the next iteration of the outermost loop
// This fragment of code actually performs filtering to avoid the existance
// of the clusters with similar centroid's vectors of attributes in the array of clusters.
m_TargetClusters.Add(new ICluster((IItemsList)TargetUsers.Clone(),
(IItemsList)TargetItems.Clone()));
}
}
}
}
}
double nDiAvg = 0;
// Computing the sum of the distances values for each target cluster
for (int iCluster = 1; iCluster < m_TargetClusters.Count(); iCluster++)
nDiAvg += m_TargetClusters[iCluster].Centroids[0].Distance / (m_TargetClusters.Count() - 1);
double nD0Avg = 0; int nClustersCount = 0;
// Computing the average distance value between each newly built cluster
for (int iCluster = 1; iCluster < m_TargetClusters.Count(); iCluster++)
for (int nCluster = iCluster + 1; nCluster < m_TargetClusters.Count(); nCluster++)
{
nD0Avg += m_TargetClusters[iCluster].Centroids[0].EuclDW(
m_TargetClusters[nCluster].Centroids[0]);
nClustersCount++;
}
nD0Avg /= nClustersCount; // Computing the average distance between clusters
// Computing the prediction quality value of the array of the newly built clusters.
double nQ = ((m_TargetClusters.Count() - 1) * nDiAvg) / nD0Avg;
// Performing the k-means clustering results verbose output
for (int iCluster = 0; iCluster < m_TargetClusters.Count(); iCluster++)
{
IItemsList ItemsList = m_TargetClusters[iCluster].Items;
IItemsList Centroids = m_TargetClusters[iCluster].Centroids;
Console.WriteLine("\nCluster={0}, Centroid=\"{1}\"", iCluster, Centroids[0].ItemText);
Console.WriteLine("-----------------------------------------------------------");
for (int iAttrib = 0; iAttrib < Centroids[0].AttribList.Count(); iAttrib++)
Console.WriteLine("\"{0,-30}\" => u({1},{2}) = {3}", Centroids[0].AttribList[iAttrib].Name,
iCluster, iAttrib, Centroids[0].AttribList[iAttrib].Value);
Console.WriteLine("-----------------------------------------------------------");
for (int iItem = 0; iItem < ItemsList.Count(); iItem++)
{
Console.WriteLine("\n(cluster={0}, item={1}, distance={2})\n{3}",
iCluster, iItem, ItemsList[iItem].Distance, ItemsList[iItem].ItemText);
Console.WriteLine("-----------------------------------------------------------");
for (int iAttrib = 0; iAttrib < ItemsList[iItem].AttribList.Count(); iAttrib++)
Console.WriteLine("\"{0,-30}\" => i({1},{2}) = {3}", ItemsList[iItem].AttribList[iAttrib].Name,
iCluster, iAttrib, ItemsList[iItem].AttribList[iAttrib].Value);
Console.WriteLine("-----------------------------------------------------------");
}
}
Console.WriteLine("\n===========================================================");
Console.WriteLine("Recommendations:");
Console.WriteLine("===========================================================\n");
for (int iCluster = 0; iCluster < m_TargetClusters.Count(); iCluster++)
{
IItemsList ItemsList = m_TargetClusters[iCluster].Items;
Console.WriteLine("{0}", m_TargetClusters[iCluster].Centroids[0].ItemText);
Console.WriteLine("======================================================");
for (int iItem = 0; iItem < ItemsList.Count(); iItem++)
Console.WriteLine("{0}", ItemsList[iItem].ItemText);
Console.WriteLine();
}
Console.WriteLine("KMeans Statistics:");
Console.WriteLine("===========================================================");
Console.WriteLine("The total number of clusters nClustersCount = {0}\n", m_TargetClusters.Count());
Console.WriteLine("Average distance between clusters nDiAvg = {0}", nDiAvg);
Console.WriteLine("Average distance within a cluster nD0Avg = {0}\n", nD0Avg);
Console.WriteLine("Average quality of KMeans clustering nQ = {0}\n", nQ);
}
}
class Program
{
static void Main(string[] args)
{
string[] filenames = new string[2] { @"ARTICLES.TXT", @"USERS.TXT" };
string[] sBanner = new string[2] { "K-Means++ Recommender Engine v.1.2.65535 (Euclidix) by Arthur V. Ratz",
"=====================================================================\n" };
for (int iBanner = 0; iBanner < 2; iBanner++)
Console.WriteLine(sBanner[iBanner]);
KMeans km = new KMeans();
int nItemsCount = km.LoadItemsFromFile(filenames[0]);
int nUsersCount = km.LoadUsersFromFile(filenames[1]);
int nInitialUsers = 0;
int nItemsPerCluster = 0, nItemsPerClusterMax = 0;
if (nItemsCount > 0 && nUsersCount > 0)
{
do
{
nItemsPerClusterMax = nItemsCount / nUsersCount;
Console.Write("Enter the number of articles per user [2-{0}]: ", nItemsPerClusterMax);
nItemsPerCluster = int.Parse(Console.ReadLine());
} while (nItemsPerCluster < 2 || nItemsPerCluster > nItemsPerClusterMax);
do
{
Console.Write("\nEnter the number of users (initial centroids) [1-{0}]: ", nUsersCount);
nInitialUsers = int.Parse(Console.ReadLine());
} while (nInitialUsers < 1 || nInitialUsers > nUsersCount);
}
km.Compute(nInitialUsers, nItemsPerCluster);
Console.Read();
}
}
}
输出
K-Means++ Recommender Engine v.1.2.65535 (Euclidix) by Arthur V. Ratz ===================================================================== Enter the number of articles per user [2-9]: 5 Enter the number of users (initial centroids) [1-3]: 3 Cluster=0, Centroid="User0 => C++ Linux" ----------------------------------------------------------- "C++ " => u(0,0) = 0,0641355115153619 "Linux " => u(0,1) = 0,135503210492002 ----------------------------------------------------------- (cluster=0, item=0, distance=0,0181865932348965) ShaneMcDonald. Microsoft Visual C++ Static and Dynamic Libraries => C++ Visual-Studio Dev virtual-machine Virtualization ----------------------------------------------------------- "C++ " => i(0,0) = 0,0676382057531689 "Microsoft " => i(0,1) = 0,175408304563312 "Visual " => i(0,2) = 0,17791458593099 "ShaneMcDonald. " => i(0,3) = 0,172902023195634 "Static " => i(0,4) = 0,182927148666345 "and " => i(0,5) = 0,15535805362189 "Dynamic " => i(0,6) = 0,187939711401701 "Libraries " => i(0,7) = 0,190445992769378 "Visual-Studio " => i(0,8) = 0,122776395842079 "Dev " => i(0,9) = 0,0400691107087137 "virtual-machine " => i(0,10) = 0,200471118240089 "Virtualization " => i(0,11) = 0,202977399607767 ----------------------------------------------------------- (cluster=0, item=1, distance=0,0216683871819057) Nish Nishant. Using lambdas - C++ vs. C# vs. C++/CX vs. C++/CLI => C# C++ Windows Visual-Studio Dev QA Architect ----------------------------------------------------------- "C++ " => i(0,0) = 0,0676382057531689 "Nishant. " => i(0,1) = 0,0876884566945908 "Using " => i(0,2) = 0,0901947380622685 "lambdas " => i(0,3) = 0,0927010194299463 "- " => i(0,4) = 0,095207300797624 "Nish " => i(0,5) = 0,0851821753269131 "vs. " => i(0,6) = 0,10021986353298 "C# " => i(0,7) = 0,102726144900657 "C++/CX " => i(0,8) = 0,107738707636013 "C++/CLI " => i(0,9) = 0,112751270371368 "Windows " => i(0,10) = 0,0726507684885244 "Visual-Studio " => i(0,11) = 0,122776395842079 "Dev " => i(0,12) = 0,0400691107087137 "QA " => i(0,13) = 0,127788958577435 "Architect " => i(0,14) = 0,130295239945112 ----------------------------------------------------------- (cluster=0, item=2, distance=0,0216683871819057) Nish Nishant. 7 reasons C++ devs will love the VS 14 CTP => Visual C++ C++11 C++14 VC14.0 C++ Windows Dev Architect ----------------------------------------------------------- "C++ " => i(0,0) = 0,0676382057531689 "Nishant. " => i(0,1) = 0,0876884566945908 "7 " => i(0,2) = 0,243077901490611 "reasons " => i(0,3) = 0,245584182858289 "Nish " => i(0,4) = 0,0851821753269131 "devs " => i(0,5) = 0,250596745593644 "will " => i(0,6) = 0,253103026961322 "love " => i(0,7) = 0,255609308329 "the " => i(0,8) = 0,0576130802824579 "VS " => i(0,9) = 0,260621871064355 "14 " => i(0,10) = 0,263128152432033 "CTP " => i(0,11) = 0,26563443379971 "Visual " => i(0,12) = 0,17791458593099 "C++11 " => i(0,13) = 0,273153277902744 "C++14 " => i(0,14) = 0,275659559270421 "VC14.0 " => i(0,15) = 0,278165840638099 "Windows " => i(0,16) = 0,0726507684885244 "Dev " => i(0,17) = 0,0400691107087137 "Architect " => i(0,18) = 0,130295239945112 ----------------------------------------------------------- (cluster=0, item=3, distance=0,032726256995075) Pablo Aliskevicius. 4350% Performance Improvement with the StringBuilder for C++! => C++ Linux Windows STL Eclipse MFC Dev ----------------------------------------------------------- "C++ " => i(0,0) = 0,0676382057531689 "Linux " => i(0,1) = 0,0701444871208466 "4350% " => i(0,2) = 0,0475879548117469 "Performance " => i(0,3) = 0,0500942361794247 "Improvement " => i(0,4) = 0,0526005175471024 "with " => i(0,5) = 0,0551067989147802 "the " => i(0,6) = 0,0576130802824579 "StringBuilder " => i(0,7) = 0,0601193616501357 "for " => i(0,8) = 0,0626256430178134 "C++! " => i(0,9) = 0,0651319243854911 "Pablo " => i(0,10) = 0,0425753920763915 "Aliskevicius. " => i(0,11) = 0,0450816734440692 "Windows " => i(0,12) = 0,0726507684885244 "STL " => i(0,13) = 0,0751570498562021 "Eclipse " => i(0,14) = 0,0776633312238798 "MFC " => i(0,15) = 0,0801696125915576 "Dev " => i(0,16) = 0,0400691107087137 ----------------------------------------------------------- (cluster=0, item=4, distance=0,0930085673936484) Michael Chourdakis. One line template call for a dynamically loaded DLL function - Extensions to GetProcAddress() and Invoke() => C++ Win32 ----------------------------------------------------------- "C++ " => i(0,0) = 0,0676382057531689 "Chourdakis. " => i(0,1) = 0,343329156197721 "One " => i(0,2) = 0,345835437565398 "line " => i(0,3) = 0,348341718933076 "template " => i(0,4) = 0,350848000300754 "call " => i(0,5) = 0,353354281668431 "for " => i(0,6) = 0,0626256430178134 "a " => i(0,7) = 0,358366844403787 "dynamically " => i(0,8) = 0,360873125771465 "loaded " => i(0,9) = 0,363379407139142 "DLL " => i(0,10) = 0,36588568850682 "function " => i(0,11) = 0,368391969874498 "- " => i(0,12) = 0,095207300797624 "Extensions " => i(0,13) = 0,373404532609853 "to " => i(0,14) = 0,375910813977531 "GetProcAddress() " => i(0,15) = 0,378417095345209 "and " => i(0,16) = 0,15535805362189 "Invoke() " => i(0,17) = 0,383429658080564 "Michael " => i(0,18) = 0,340822874830043 "Win32 " => i(0,19) = 0,38844222081592 ----------------------------------------------------------- Cluster=1, Centroid="User1 => C# .NET" ----------------------------------------------------------- "C# " => u(1,0) = 0,0935222110939783 ".NET " => u(1,1) = 0,15019656028131 ----------------------------------------------------------- (cluster=1, item=0, distance=0,0110986410505081) Rahul Rajat Singh. A Beginner's Tutorial - Type Casting and Type Conversion in C# => C# .NET ----------------------------------------------------------- "C# " => i(1,0) = 0,102726144900657 ".NET " => i(1,1) = 0,170395741827956 "Singh. " => i(1,2) = 0,137814084048146 "A " => i(1,3) = 0,140320365415823 "Beginner's " => i(1,4) = 0,142826646783501 "Tutorial " => i(1,5) = 0,145332928151179 "- " => i(1,6) = 0,095207300797624 "Type " => i(1,7) = 0,150345490886534 "Casting " => i(1,8) = 0,152851772254212 "and " => i(1,9) = 0,15535805362189 "Conversion " => i(1,10) = 0,160370616357245 "in " => i(1,11) = 0,162876897724923 "Rahul " => i(1,12) = 0,13280152131279 "Rajat " => i(1,13) = 0,135307802680468 ----------------------------------------------------------- (cluster=1, item=1, distance=0,0110986410505081) Faisal(mfrony). Access Modifiers for Beginners => C# .NET1.1 .NET2.0 .NET .NET1.0 ----------------------------------------------------------- "C# " => i(1,0) = 0,102726144900657 ".NET " => i(1,1) = 0,170395741827956 "Modifiers " => i(1,2) = 0,295709810211843 "for " => i(1,3) = 0,0626256430178134 "Beginners " => i(1,4) = 0,300722372947199 "Faisal(mfrony). " => i(1,5) = 0,290697247476488 ".NET1.1 " => i(1,6) = 0,305734935682554 ".NET2.0 " => i(1,7) = 0,308241217050232 "Access " => i(1,8) = 0,293203528844166 ".NET1.0 " => i(1,9) = 0,313253779785588 ----------------------------------------------------------- (cluster=1, item=2, distance=0,0110986410505081) V.Lorz. Adding JavaScript scripting support to one application: One easy way => C# .NET ----------------------------------------------------------- "C# " => i(1,0) = 0,102726144900657 ".NET " => i(1,1) = 0,170395741827956 "JavaScript " => i(1,2) = 0,428542722698764 "scripting " => i(1,3) = 0,431049004066441 "support " => i(1,4) = 0,433555285434119 "to " => i(1,5) = 0,375910813977531 "one " => i(1,6) = 0,438567848169475 "application: " => i(1,7) = 0,441074129537152 "One " => i(1,8) = 0,345835437565398 "easy " => i(1,9) = 0,446086692272508 "way " => i(1,10) = 0,448592973640186 "V.Lorz. " => i(1,11) = 0,423530159963408 "Adding " => i(1,12) = 0,426036441331086 ----------------------------------------------------------- (cluster=1, item=3, distance=0,0110986410505081) Florian Rappl. An advanced introduction to C# - Lecture Notes Part 1 of 4 => C# .NET ----------------------------------------------------------- "C# " => i(1,0) = 0,102726144900657 ".NET " => i(1,1) = 0,170395741827956 "An " => i(1,2) = 0,719271361349382 "advanced " => i(1,3) = 0,72177764271706 "introduction " => i(1,4) = 0,724283924084737 "to " => i(1,5) = 0,375910813977531 "Florian " => i(1,6) = 0,714258798614026 "- " => i(1,7) = 0,095207300797624 "Lecture " => i(1,8) = 0,734309049555448 "Notes " => i(1,9) = 0,736815330923126 "Part " => i(1,10) = 0,501224882361418 "1 " => i(1,11) = 0,741827893658482 "of " => i(1,12) = 0,744334175026159 "4 " => i(1,13) = 0,746840456393837 "Rappl. " => i(1,14) = 0,716765079981704 ----------------------------------------------------------- (cluster=1, item=4, distance=0,0110986410505081) B. Ahmed (KU-00). Asynchronous programming and Threading in C# (.NET 4.5) => C# .NET4.5 .NET ----------------------------------------------------------- "C# " => i(1,0) = 0,102726144900657 ".NET " => i(1,1) = 0,170395741827956 "(KU-00). " => i(1,2) = 0,852104273836302 "Asynchronous " => i(1,3) = 0,85461055520398 "programming " => i(1,4) = 0,857116836571658 "and " => i(1,5) = 0,15535805362189 "Threading " => i(1,6) = 0,862129399307013 "in " => i(1,7) = 0,162876897724923 "B. " => i(1,8) = 0,847091711100947 "(.NET " => i(1,9) = 0,869648243410046 "4.5) " => i(1,10) = 0,872154524777724 ".NET4.5 " => i(1,11) = 0,87716708751308 "Ahmed " => i(1,12) = 0,849597992468624 ----------------------------------------------------------- Cluster=2, Centroid="User2 => Java Javascript" ----------------------------------------------------------- "Java " => u(2,0) = 0,030550711996943 "Javascript " => u(2,1) = 0,5603509244 ----------------------------------------------------------- (cluster=2, item=0, distance=0,00499811776007409) Dr. Song Li. A Simple Method to Upload Files by jQuery Ajax Calls => Java Javascript jQuery Ajax Dev Architect ----------------------------------------------------------- "Java " => i(2,0) = 0,027537703870325 "Javascript " => i(2,1) = 0,556363072450329 "Li. " => i(2,2) = 0,639070357583694 "A " => i(2,3) = 0,140320365415823 "Simple " => i(2,4) = 0,64408292031905 "Method " => i(2,5) = 0,646589201686727 "to " => i(2,6) = 0,375910813977531 "Upload " => i(2,7) = 0,651601764422083 "Files " => i(2,8) = 0,654108045789761 "by " => i(2,9) = 0,656614327157438 "jQuery " => i(2,10) = 0,659120608525116 "Ajax " => i(2,11) = 0,661626889892794 "Calls " => i(2,12) = 0,664133171260472 "Dr. " => i(2,13) = 0,634057794848339 "Song " => i(2,14) = 0,636564076216016 "Dev " => i(2,15) = 0,0400691107087137 "Architect " => i(2,16) = 0,130295239945112 ----------------------------------------------------------- (cluster=2, item=1, distance=0,00803725509399426) Sergey Zubarev. MPSC Lock Free Intrusive Linked Queue with state => Java Architect ----------------------------------------------------------- "Java " => i(2,0) = 0,027537703870325 "Zubarev. " => i(2,1) = 0,581425886127106 "MPSC " => i(2,2) = 0,583932167494784 "Lock " => i(2,3) = 0,586438448862462 "Free " => i(2,4) = 0,588944730230139 "Intrusive " => i(2,5) = 0,591451011597817 "Linked " => i(2,6) = 0,593957292965495 "Queue " => i(2,7) = 0,596463574333173 "with " => i(2,8) = 0,0551067989147802 "state " => i(2,9) = 0,601476137068528 "Sergey " => i(2,10) = 0,578919604759428 "Architect " => i(2,11) = 0,130295239945112 ----------------------------------------------------------- (cluster=2, item=2, distance=0,0558027646300975) U. Calakovic. Swing Based GUIs in JRuby => Ruby Java Windows Swing Dev ----------------------------------------------------------- "Java " => i(2,0) = 0,027537703870325 "Calakovic. " => i(2,1) = 0,458618099110897 "Swing " => i(2,2) = 0,0350565479733582 "Based " => i(2,3) = 0,463630661846252 "GUIs " => i(2,4) = 0,46613694321393 "in " => i(2,5) = 0,162876897724923 "JRuby " => i(2,6) = 0,471149505949285 "Ruby " => i(2,7) = 0,473655787316963 "U. " => i(2,8) = 0,456111817743219 "Windows " => i(2,9) = 0,0726507684885244 "Dev " => i(2,10) = 0,0400691107087137 ----------------------------------------------------------- (cluster=2, item=3, distance=0,133988415037227) Salar Hafezi. Understanding Recursion in Java through Basic Number Theory => Java Windows Design Dev ----------------------------------------------------------- "Java " => i(2,0) = 0,027537703870325 "Hafezi. " => i(2,1) = 0,914761308028246 "Understanding " => i(2,2) = 0,917267589395924 "Recursion " => i(2,3) = 0,919773870763601 "in " => i(2,4) = 0,162876897724923 "Salar " => i(2,5) = 0,912255026660568 "through " => i(2,6) = 0,927292714866634 "Basic " => i(2,7) = 0,929798996234312 "Number " => i(2,8) = 0,93230527760199 "Theory " => i(2,9) = 0,934811558969668 "Windows " => i(2,10) = 0,0726507684885244 "Design " => i(2,11) = 0,629045232112983 "Dev " => i(2,12) = 0,0400691107087137 ----------------------------------------------------------- (cluster=2, item=4, distance=0,22911612749109) Christopher Swiderski. JAX-WS: Using Apache CXF to Create a Bottom-Up Web Service, Web Service Client, and Securing the Web Service => Java Windows Apache Dev ----------------------------------------------------------- "Java " => i(2,0) = 0,027537703870325 "Swiderski. " => i(2,1) = 0,789447239644359 "JAX-WS: " => i(2,2) = 0,791953521012036 "Using " => i(2,3) = 0,0901947380622685 "Apache " => i(2,4) = 0,796966083747392 "CXF " => i(2,5) = 0,79947236511507 "to " => i(2,6) = 0,375910813977531 "Create " => i(2,7) = 0,804484927850425 "a " => i(2,8) = 0,358366844403787 "Bottom-Up " => i(2,9) = 0,809497490585781 "Web " => i(2,10) = 0,812003771953458 "Service, " => i(2,11) = 0,814510053321136 "Service " => i(2,12) = 0,819522616056492 "Client, " => i(2,13) = 0,822028897424169 "and " => i(2,14) = 0,15535805362189 "Securing " => i(2,15) = 0,827041460159525 "the " => i(2,16) = 0,0576130802824579 "Christopher " => i(2,17) = 0,786940958276681 "Windows " => i(2,18) = 0,0726507684885244 "Dev " => i(2,19) = 0,0400691107087137 ----------------------------------------------------------- =========================================================== Recommendations: =========================================================== User0 => C++ Linux ====================================================== ShaneMcDonald. Microsoft Visual C++ Static and Dynamic Libraries => C++ Visual-Studio Dev virtual-machine Virtualization Nish Nishant. Using lambdas - C++ vs. C# vs. C++/CX vs. C++/CLI => C# C++ Windows Visual-Studio Dev QA Architect Nish Nishant. 7 reasons C++ devs will love the VS 14 CTP => Visual C++ C++11 C++14 VC14.0 C++ Windows Dev Architect Pablo Aliskevicius. 4350% Performance Improvement with the StringBuilder for C++! => C++ Linux Windows STL Eclipse MFC Dev Michael Chourdakis. One line template call for a dynamically loaded DLL function - Extensions to GetProcAddress() and Invoke() => C++ Win32 User1 => C# .NET ====================================================== Rahul Rajat Singh. A Beginner's Tutorial - Type Casting and Type Conversion in C# => C# .NET Faisal(mfrony). Access Modifiers for Beginners => C# .NET1.1 .NET2.0 .NET .NET1.0 V.Lorz. Adding JavaScript scripting support to one application: One easy way => C# .NET Florian Rappl. An advanced introduction to C# - Lecture Notes Part 1 of 4 => C# .NET B. Ahmed (KU-00). Asynchronous programming and Threading in C# (.NET 4.5) => C# .NET4.5 .NET User2 => Java Javascript ====================================================== Dr. Song Li. A Simple Method to Upload Files by jQuery Ajax Calls => Java Javascript jQuery Ajax Dev Architect Sergey Zubarev. MPSC Lock Free Intrusive Linked Queue with state => Java Architect U. Calakovic. Swing Based GUIs in JRuby => Ruby Java Windows Swing Dev Salar Hafezi. Understanding Recursion in Java through Basic Number Theory => Java Windows Design Dev Christopher Swiderski. JAX-WS: Using Apache CXF to Create a Bottom-Up Web Service, Web Service Client, and Securing the Web Service => Java Windows Apache Dev KMeans Statistics: =========================================================== The total number of clusters nClustersCount = 3 Average distance between clusters nDiAvg = 0,0487435885265023 Average distance within a cluster nD0Avg = 41,496025364381 Average quality of KMeans clustering nQ = 0,0023493136076759
关注点
以下 C#.NET 代码专门用于 ASP.NET Web 应用程序项目,以维护在线网店和其他通过互联网推广其商业分发产品或数字内容的网站的推荐引擎。
历史
- 2016年12月11日 - 文章第一版发布