
Data Mining — 1. Clustering


1. Introduction

We are drowning in data, but starving for knowledge. (John Naisbitt, 1982)

Data, Information, Knowledge, and Wisdom

Data mining borrows ideas from the fields of machine learning, statistics, and database systems.

Descriptive methods = unsupervised (no target attribute): Clustering, Association Rule Mining, Text Mining, Anomaly Detection, Sequential Pattern Mining.
Predictive methods = supervised (with a target/class attribute): Classification, Regression, Text Mining, Time Series Prediction.

Data mining as such does not require a computer. Computers, however, provide scalability and help avoid human bias.

Basic process: first, apply a data mining method to extract potential patterns; then evaluate the resulting model or patterns; iterate: try different parameter settings, try alternative methods, improve the preprocessing and feature generation, combine multiple methods.

2. Clustering

Intra-cluster distances are minimized: data points within the same cluster exhibit high similarity.
Inter-cluster distances are maximized: data points from distinct clusters show significant dissimilarity.
The application areas include market segmentation and document clustering.
The types of clustering algorithms include hierarchical clustering, k-means, and DBSCAN.

  1. Partitional Clustering: The partitioning of the dataset into non-overlapping subsets (clusters) ensures that every data point belongs exclusively to one cluster.
  2. Hierarchical Clustering: The clusters are arranged hierarchically, forming a nested structure where each level encompasses broader groupings.

Clustering algorithms: partition-based, hierarchy-based, and density-based algorithms.
Proximity (similarity or distance) measures: Euclidean distance, cosine similarity, domain-specific similarity measures.
Application areas include: product grouping, social network analysis, grouping of search engine results, image recognition.

2.1 K-Means Clustering

  • A partition-based clustering algorithm
  • Each cluster is associated with a centroid
  • Each data point is assigned to the cluster whose centroid is closest
  • The number of clusters must be specified manually
Algorithm of K-Means Clustering
Convergence conditions:
  • No (or minimal) relocation of data points between clusters,
  • no (or minimal) change of centroid positions,
  • minimal decrease in the sum of squared errors (SSE),
  • or simply: terminate after a fixed number of X iterations.
The SSE (sum of squared errors) is defined as
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2$$
where $C_i$ is the $i$-th cluster and $c_i$ its centroid.
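
To make the algorithm concrete, here is a minimal NumPy sketch (not the canonical implementation; the function name, parameters, and the SSE-based stopping rule simply follow the convergence conditions above):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-Means: alternate assignment and centroid update
    until the SSE stops improving (one of the convergence conditions)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random seeds
    prev_sse = np.inf
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
        # Convergence check: minimal decrease of the SSE.
        sse = ((X - centroids[labels]) ** 2).sum()
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centroids, sse
```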

Weakness 1: Initial Seeds
Results can vary significantly depending on the initial choice of seeds (number and position).
Possible improvements:

  • Re-run multiple times using distinct random seeds while keeping the number of clusters k fixed.
  • Run K-Means with varying numbers of clusters and select the k beyond which the improvement in the sum of squared errors (SSE) diminishes, the 'knee point' of the SSE curve.
  • Use X-Means: a variant of K-Means that determines the number of clusters autonomously by iteratively splitting larger clusters until further splits no longer improve the result.
  • Run K-Means with diverse cluster counts and compute the silhouette coefficient for each configuration; select the setup with the highest silhouette coefficient, as it indicates the most cohesive and well-separated clusters. (A sketch of the SSE-knee and silhouette strategies follows below.)
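
As an illustration of the SSE-knee and silhouette strategies, here is a small scikit-learn sketch (the synthetic make_blobs data and all parameter values are my own choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Record SSE (inertia_) and the silhouette coefficient for a range of k values.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  SSE={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
# Choose k at the 'knee' of the SSE curve, or where the silhouette is maximal.
```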

Weakness 2: Outlier Handling
Remedies:

  1. Remove data points that are far away from the centroids.
  2. Random sampling: choose a small subset of the data points as a sample set. If the dataset is large enough, the chance of sampling an outlier is very low. Determine the centroids from the sample, then assign the remaining data points to them. This also reduces the computational cost considerably (see the sketch below).
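
A minimal sketch of the sampling remedy, assuming scikit-learn is available (the dataset and all parameter values are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 2))  # stand-in for a large dataset

# Fit centroids on a small random sample; outliers are unlikely to be drawn.
sample = X[rng.choice(len(X), size=2_000, replace=False)]
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample)

# Assign the full dataset (including the remaining points) to those centroids.
labels = km.predict(X)
```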

Evaluation
Goal: maximize cohesion within clusters and separation between clusters.

  • Cohesion a(x) measures the mean distance between data point x and all other points within its own cluster (intra-cluster mean distance).
  • Separation b(x) is the mean distance between data point x and all points of the nearest other cluster, i.e., the minimum over all other clusters of these mean distances.
The silhouette coefficient combines both:
$$s(x) = \frac{b(x) - a(x)}{\max(a(x),\, b(x))} \in [-1, 1]$$
Because it is normalized, silhouette coefficients can be compared across clusterings with different numbers of clusters.
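
The definition translates directly into code; a small NumPy sketch for a single point (assuming at least two clusters; all names are my own):

```python
import numpy as np

def silhouette(i, X, labels):
    """Silhouette coefficient of point X[i], straight from the definition
    s(x) = (b(x) - a(x)) / max(a(x), b(x)). Assumes at least two clusters."""
    own = labels[i]
    d = np.linalg.norm(X - X[i], axis=1)          # distances from X[i]
    # a(x): mean distance to the *other* members of x's own cluster.
    a = d[labels == own].sum() / max((labels == own).sum() - 1, 1)
    # b(x): mean distance to the nearest other cluster.
    b = min(d[labels == c].mean() for c in set(labels) - {own})
    return (b - a) / max(a, b)
```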

Summary
Advantages: simple; efficient, with time complexity O(t k n) [n: number of data points, k: number of clusters, t: number of iterations].
Disadvantages: the number of clusters must be picked beforehand; all items are forced into a cluster; sensitive to outliers; sensitive to initial seeds.

2.2 K-Medoids

The K-Medoids algorithm is a variant of K-Means that uses medoids instead of means as cluster centers. A medoid is an actual data point that is most centrally located within its cluster and thus serves as its representative. K-Medoids is more robust to outliers than K-Means, because a medoid is far less influenced by extreme values than a mean.
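
A minimal NumPy sketch of the K-Medoids idea (a simplified alternating scheme, not the full PAM algorithm; names and parameters are my own):

```python
import numpy as np

def k_medoids(X, k, max_iter=50, seed=0):
    """Simplified K-Medoids: cluster centers are actual data points (medoids),
    each chosen to minimize the total distance within its cluster."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)      # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                new_medoids[j] = members[D[np.ix_(members, members)]
                                         .sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):   # no change: converged
            break
        medoids = new_medoids
    return D[:, medoids].argmin(axis=1), medoids
```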

2.3 DBSCAN

DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
The algorithm clusters data points based on density.
Density: the number of data points within a specified radius (Eps) around a point.

  • Core point: has more than a specified number of points (MinPts) within its Eps-neighborhood.
  • Border point: has fewer than MinPts points within Eps, but lies in the Eps-neighborhood of a core point.
  • Noise point: any point that is neither a core point nor a border point.

The algorithm forms clusters from the core and border points and discards the noise points.
Strengths: resistant to noise; can handle clusters of different shapes and sizes.
Weaknesses: struggles with clusters of varying densities and performs poorly in high-dimensional spaces.
How to determine the parameters Eps and MinPts?

  • Points inside a cluster have roughly the same distance to their k-th nearest neighbor.
  • Outliers have their k-th nearest neighbor at a considerably larger distance.
  • Heuristic: plot the sorted distances of every point to its k-th nearest neighbor and choose Eps at the 'knee' of this curve, with MinPts = k (see the sketch below). Computing these distances can become expensive for large datasets, and choosing suitable parameters is crucial, as poor choices over- or under-segment the data.
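
A small scikit-learn sketch of the k-distance heuristic and DBSCAN itself (the make_moons data, k = 4, and eps = 0.1 are illustrative choices, not prescribed values):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)

# k-distance heuristic: sort each point's distance to its k-th nearest
# neighbor; Eps is read off at the 'knee' of this curve, with MinPts = k.
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is returned
dist, _ = nn.kneighbors(X)                       # as its own nearest neighbor
k_dist = np.sort(dist[:, -1])                    # the k-distance curve

labels = DBSCAN(eps=0.1, min_samples=k).fit_predict(X)  # -1 marks noise
print("clusters:", len(set(labels) - {-1}),
      "noise points:", (labels == -1).sum())
```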

2.4 Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram (a tree-like diagram that records the sequences of merges or splits; the y-axis displays the distance at which clusters were merged).
Advantages: We do not have to assume any particular number of clusters. May be used to look for meaningful taxonomies.
Steps:
Starting situation: start with clusters of individual points and a proximity matrix.
Intermediate situation: after some merging steps, we have a number of clusters; the two closest clusters are merged and the proximity matrix is updated, until only one cluster remains.
How to Define Inter-Cluster Similarity?

  1. Single Link (MIN): the similarity of two clusters is based on the two most similar (closest) points in the different clusters, i.e., the minimum pairwise distance. Can handle non-elliptical shapes, but is sensitive to noise and outliers.
  2. Complete Link (MAX): the similarity of two clusters is based on the two least similar (most distant) points in the different clusters, i.e., the maximum pairwise distance. Less sensitive to noise and outliers, but biased towards globular clusters and tends to break large clusters.
  3. Group Average: the average pairwise proximity between the points of the two clusters. Average (rather than total) proximity is used for scalability, since total proximity favors large clusters. A compromise between single and complete link: less susceptible to noise and outliers, but biased towards globular clusters.
  4. Distance Between Centroids
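
The linkage criteria above map directly onto SciPy's hierarchical clustering; a short sketch (the dataset and the cut level are illustrative):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# method='single' is MIN, 'complete' is MAX, 'average' is group average,
# and 'centroid' uses the distance between cluster centroids.
Z = linkage(X, method="average")

# Cut the dendrogram so that exactly 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

dendrogram(Z)  # y-axis: the distance at which clusters were merged
plt.show()
```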

Limitations

  • Greedy algorithm: Decisions made cannot be rolled back
  • Different schemes suffer from one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of varying sizes and non-convex shapes, a tendency to break large clusters
  • High computational complexity: O(N^2) space and O(N^3) time

2.5 Proximity Measures

Single attribute: similarity lies in the range [0, 1], while dissimilarity ranges from 0 up to some upper limit.
Multiple attributes: combine the attributes with a distance measure such as the Euclidean distance.

Euclidean Distance
$$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where $n$ is the number of attributes and $p_k$, $q_k$ are the $k$-th attribute values of data points $p$ and $q$.

Caution
Comparing apples and oranges easily leads to misleading conclusions.
Changing the unit of measurement can significantly change the clustering results.
Therefore, apply normalization before clustering (this generally holds for all data mining algorithms involving distances).
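
A minimal sketch of z-score normalization in NumPy (the two-attribute example values are invented; scikit-learn's StandardScaler performs the same step):

```python
import numpy as np

# Z-score normalization: rescale every attribute to mean 0 and std 1 so that
# attributes with large units cannot dominate the distance computation.
X = np.array([[1.70, 70000.0],   # e.g. height in metres, salary in EUR
              [1.80, 50000.0],
              [1.65, 90000.0]])
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```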

Similarity of Binary Attributes
A common situation is that two objects, p and q, have only binary attributes.

1. Symmetric binary attributes -> hobbies, favorite bands, favorite movies.
A binary attribute is symmetric if both of its states (0 and 1) have equal importance and carry the same weight.
Similarity measure: Simple Matching Coefficient
$$\mathrm{SMC}(p, q) = \frac{M_{11} + M_{00}}{M_{00} + M_{01} + M_{10} + M_{11}}$$
where $M_{ij}$ is the number of attributes for which $p$ has value $i$ and $q$ has value $j$.

2. Asymmetric binary attributes -> agreement or disagreement with political statements, voting advice.
Asymmetric: the two states differ in importance or value.
By convention, state 1 is designated as the more important one;
typically, state 1 is the rarer or less frequent state.
Example: shopping baskets, word/document vectors.

Jaccard Coefficient
$$J(p, q) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$$
Matches on 0/0 are ignored, since shared absences carry little information for asymmetric attributes.
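
A small NumPy sketch contrasting the two coefficients (the vectors p and q are illustrative):

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))    # attributes where both are 1
m00 = np.sum((p == 0) & (q == 0))    # attributes where both are 0
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m00 + m01 + m10 + m11)   # counts the 0/0 matches
jaccard = m11 / (m01 + m10 + m11)             # ignores the 0/0 matches
print(smc, jaccard)  # 0.7 vs 0.0: SMC is inflated by the many shared zeros
```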
