
Data Mining — 1. Clustering


1. Introduction

We are drowning in data, but starving for knowledge. (John Naisbitt, 1982)

Data, Information, Knowledge, and Wisdom

Data mining borrows ideas from the fields of machine learning, statistics, and database systems.

Descriptive methods = unsupervised (no target attribute): Clustering, Association Rule Mining, Text Mining, Anomaly Detection, Sequential Pattern Mining.
Predictive methods = supervised (with a target/class attribute): Classification, Regression, Text Mining, Time Series Prediction.

Data mining as such does not require a computer. Computers, however, provide scalability and help avoid human bias.

Basic process: first, apply a data mining method to extract potential patterns; then evaluate the resulting model or patterns; iterate: try different parameter settings, try alternative methods, improve the preprocessing and feature generation, combine multiple methods.

2. Clustering

Intra-cluster distances are minimized: data points within the same cluster exhibit high similarity.
Inter-cluster distances are maximized: data points from distinct clusters show significant dissimilarity.
The application areas include market segmentation and document clustering.
The types of clustering algorithms include hierarchical clustering, k-means, and DBSCAN.

  1. Partitional Clustering: The partitioning of the dataset into non-overlapping subsets (clusters) ensures that every data point belongs exclusively to one cluster.
  2. Hierarchical Clustering: The clusters are arranged hierarchically, forming a nested structure where each level encompasses broader groupings.

Clustering algorithms: partition-based, hierarchy-based, and density-based algorithms.
Proximity (similarity or distance) measures: Euclidean distance, cosine similarity, domain-specific similarity measures.
Application areas include: product grouping, social network analysis, grouping of search engine results, image recognition.

2.1 K-Means Clustering

  • A partition-based clustering algorithm
  • Each cluster is associated with a centroid
  • Each data point is assigned to the cluster whose centroid is closest
  • The number of clusters must be specified manually
Algorithm of K-Means Clustering
Convergence conditions:
  • No (or minimal) relocation of data points between clusters,
  • no (or minimal) change of centroid positions,
  • minimal decrease in the sum of squared errors (SSE),
  • or simply: terminate after a fixed number of X iterations.
The SSE (sum of squared errors) is defined as
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2$$
where $C_i$ is the $i$-th cluster and $c_i$ its centroid.
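
To make the algorithm concrete, here is a minimal NumPy sketch (not the canonical implementation; the function name, parameters, and the SSE-based stopping rule simply follow the convergence conditions above):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-Means: alternate assignment and centroid update
    until the SSE stops improving (one of the convergence conditions)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random seeds
    prev_sse = np.inf
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
        # Convergence check: minimal decrease of the SSE.
        sse = ((X - centroids[labels]) ** 2).sum()
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centroids, sse
```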

Weakness 1: Initial Seeds
Results can vary significantly depending on the initial choice of seeds (number and position).
Possible improvements:

  • Re-run multiple times using distinct random seeds while keeping the number of clusters k fixed.
  • Run K-Means with varying numbers of clusters and select the k beyond which the improvement in the sum of squared errors (SSE) diminishes, the 'knee point' of the SSE curve.
  • Use X-Means: a variant of K-Means that determines the number of clusters autonomously by iteratively splitting larger clusters until further splits no longer improve the result.
  • Run K-Means with diverse cluster counts and compute the silhouette coefficient for each configuration; select the setup with the highest silhouette coefficient, as it indicates the most cohesive and well-separated clusters. (A sketch of the SSE-knee and silhouette strategies follows below.)
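
As an illustration of the SSE-knee and silhouette strategies, here is a small scikit-learn sketch (the synthetic make_blobs data and all parameter values are my own choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Record SSE (inertia_) and the silhouette coefficient for a range of k values.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  SSE={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
# Choose k at the 'knee' of the SSE curve, or where the silhouette is maximal.
```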

Weakness 2: Outlier Handling
Remedies:

  1. Remove data points that are far away from the centroids.
  2. Random sampling: choose a small subset of the data points as a sample set. If the dataset is large enough, the chance of sampling an outlier is very low. Determine the centroids from the sample, then assign the remaining data points to them. This also reduces the computational cost considerably (see the sketch below).
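
A minimal sketch of the sampling remedy, assuming scikit-learn is available (the dataset and all parameter values are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 2))  # stand-in for a large dataset

# Fit centroids on a small random sample; outliers are unlikely to be drawn.
sample = X[rng.choice(len(X), size=2_000, replace=False)]
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample)

# Assign the full dataset (including the remaining points) to those centroids.
labels = km.predict(X)
```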

Evaluation
Goal: maximize cohesion within clusters and separation between clusters.

  • Cohesion a(x) measures the mean distance between data point x and all other points within its own cluster (intra-cluster mean distance).
  • Separation b(x) is the mean distance between data point x and all points of the nearest other cluster, i.e., the minimum over all other clusters of these mean distances.
The silhouette coefficient combines both:
$$s(x) = \frac{b(x) - a(x)}{\max(a(x),\, b(x))} \in [-1, 1]$$
Because it is normalized, silhouette coefficients can be compared across clusterings with different numbers of clusters.
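
The definition translates directly into code; a small NumPy sketch for a single point (assuming at least two clusters; all names are my own):

```python
import numpy as np

def silhouette(i, X, labels):
    """Silhouette coefficient of point X[i], straight from the definition
    s(x) = (b(x) - a(x)) / max(a(x), b(x)). Assumes at least two clusters."""
    own = labels[i]
    d = np.linalg.norm(X - X[i], axis=1)          # distances from X[i]
    # a(x): mean distance to the *other* members of x's own cluster.
    a = d[labels == own].sum() / max((labels == own).sum() - 1, 1)
    # b(x): mean distance to the nearest other cluster.
    b = min(d[labels == c].mean() for c in set(labels) - {own})
    return (b - a) / max(a, b)
```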

Summary
Advantages: simple; efficient, with time complexity O(t k n) [n: number of data points, k: number of clusters, t: number of iterations].
Disadvantages: the number of clusters must be picked beforehand; all items are forced into a cluster; sensitive to outliers; sensitive to initial seeds.

2.2 K-Medoids

The K-Medoids algorithm is a variant of K-Means that uses medoids instead of means as cluster centers. A medoid is an actual data point that is most centrally located within its cluster and thus serves as its representative. K-Medoids is more robust to outliers than K-Means, because a medoid is far less influenced by extreme values than a mean.
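
A minimal NumPy sketch of the K-Medoids idea (a simplified alternating scheme, not the full PAM algorithm; names and parameters are my own):

```python
import numpy as np

def k_medoids(X, k, max_iter=50, seed=0):
    """Simplified K-Medoids: cluster centers are actual data points (medoids),
    each chosen to minimize the total distance within its cluster."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)      # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                new_medoids[j] = members[D[np.ix_(members, members)]
                                         .sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):   # no change: converged
            break
        medoids = new_medoids
    return D[:, medoids].argmin(axis=1), medoids
```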

2.3 DBSCAN

DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
The algorithm clusters data points based on density.
Density: the number of data points within a specified radius (Eps) around a point.

  • Core point: has more than a specified number of points (MinPts) within its Eps-neighborhood.
  • Border point: has fewer than MinPts points within Eps, but lies in the Eps-neighborhood of a core point.
  • Noise point: any point that is neither a core point nor a border point.

The algorithm forms clusters from the core and border points and discards the noise points.
Strengths: resistant to noise; can handle clusters of different shapes and sizes.
Weaknesses: struggles with clusters of varying densities and performs poorly in high-dimensional spaces.
How to determine the parameters Eps and MinPts?

  • Points inside a cluster have roughly the same distance to their k-th nearest neighbor.
  • Outliers have their k-th nearest neighbor at a considerably larger distance.
  • Heuristic: plot the sorted distances of every point to its k-th nearest neighbor and choose Eps at the 'knee' of this curve, with MinPts = k (see the sketch below). Computing these distances can become expensive for large datasets, and choosing suitable parameters is crucial, as poor choices over- or under-segment the data.
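
A small scikit-learn sketch of the k-distance heuristic and DBSCAN itself (the make_moons data, k = 4, and eps = 0.1 are illustrative choices, not prescribed values):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)

# k-distance heuristic: sort each point's distance to its k-th nearest
# neighbor; Eps is read off at the 'knee' of this curve, with MinPts = k.
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is returned
dist, _ = nn.kneighbors(X)                       # as its own nearest neighbor
k_dist = np.sort(dist[:, -1])                    # the k-distance curve

labels = DBSCAN(eps=0.1, min_samples=k).fit_predict(X)  # -1 marks noise
print("clusters:", len(set(labels) - {-1}),
      "noise points:", (labels == -1).sum())
```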

2.4 Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram (a tree-like diagram that records the sequences of merges or splits; the y-axis displays the distance at which clusters were merged).
Advantages: We do not have to assume any particular number of clusters. May be used to look for meaningful taxonomies.
Steps:
Starting situation: start with clusters of individual points and a proximity matrix.
Intermediate situation: after some merging steps, we have a number of clusters; the two closest clusters are merged and the proximity matrix is updated, until only one cluster remains.
How to Define Inter-Cluster Similarity?

  1. Single Link (MIN): the similarity of two clusters is based on the two most similar (closest) points in the different clusters, i.e., the minimum pairwise distance. Can handle non-elliptical shapes, but is sensitive to noise and outliers.
  2. Complete Link (MAX): the similarity of two clusters is based on the two least similar (most distant) points in the different clusters, i.e., the maximum pairwise distance. Less sensitive to noise and outliers, but biased towards globular clusters and tends to break large clusters.
  3. Group Average: the average pairwise proximity between the points of the two clusters. Average (rather than total) proximity is used for scalability, since total proximity favors large clusters. A compromise between single and complete link: less susceptible to noise and outliers, but biased towards globular clusters.
  4. Distance Between Centroids
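
The linkage criteria above map directly onto SciPy's hierarchical clustering; a short sketch (the dataset and the cut level are illustrative):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# method='single' is MIN, 'complete' is MAX, 'average' is group average,
# and 'centroid' uses the distance between cluster centroids.
Z = linkage(X, method="average")

# Cut the dendrogram so that exactly 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

dendrogram(Z)  # y-axis: the distance at which clusters were merged
plt.show()
```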

Limitations

  • Greedy algorithm: Decisions made cannot be rolled back
  • Different schemes suffer from one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of varying sizes and non-convex shapes, a tendency to break large clusters
  • High computational complexity: O(N^2) space and O(N^3) time

2.5 Proximity Measures

Single attribute: similarity lies in the range [0, 1], while dissimilarity ranges from 0 up to some upper limit.
Multiple attributes: combine the attributes with a distance measure such as the Euclidean distance.

Euclidean Distance
$$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where $n$ is the number of attributes and $p_k$, $q_k$ are the $k$-th attribute values of data points $p$ and $q$.

Caution
Comparing apples and oranges easily leads to misleading conclusions.
Changing the unit of measurement can significantly change the clustering results.
Therefore, apply normalization before clustering (this generally holds for all data mining algorithms involving distances).
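
A minimal sketch of z-score normalization in NumPy (the two-attribute example values are invented; scikit-learn's StandardScaler performs the same step):

```python
import numpy as np

# Z-score normalization: rescale every attribute to mean 0 and std 1 so that
# attributes with large units cannot dominate the distance computation.
X = np.array([[1.70, 70000.0],   # e.g. height in metres, salary in EUR
              [1.80, 50000.0],
              [1.65, 90000.0]])
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```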

Similarity of Binary Attributes
A common situation is that two objects, p and q, have only binary attributes.

1. Symmetric binary attributes -> hobbies, favorite bands, favorite movies.
A binary attribute is symmetric if both of its states (0 and 1) have equal importance and carry the same weight.
Similarity measure: Simple Matching Coefficient
$$\mathrm{SMC}(p, q) = \frac{M_{11} + M_{00}}{M_{00} + M_{01} + M_{10} + M_{11}}$$
where $M_{ij}$ is the number of attributes for which $p$ has value $i$ and $q$ has value $j$.

2. Asymmetric binary attributes -> agreement or disagreement with political statements, voting advice.
Asymmetric: the two states differ in importance or value.
By convention, state 1 is designated as the more important one;
typically, state 1 is the rarer or less frequent state.
Example: shopping baskets, word/document vectors.

Jaccard Coefficient
$$J(p, q) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$$
Matches on 0/0 are ignored, since shared absences carry little information for asymmetric attributes.
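
A small NumPy sketch contrasting the two coefficients (the vectors p and q are illustrative):

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))    # attributes where both are 1
m00 = np.sum((p == 0) & (q == 0))    # attributes where both are 0
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m00 + m01 + m10 + m11)   # counts the 0/0 matches
jaccard = m11 / (m01 + m10 + m11)             # ignores the 0/0 matches
print(smc, jaccard)  # 0.7 vs 0.0: SMC is inflated by the many shared zeros
```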
