Cluster Analysis in Data Mining
Notes on Cluster Analysis in Data Mining, a Coursera course taught by Professor Jiawei Han, one of the leading researchers in data mining.
Week 1
-Considerations for Cluster Analysis
- partitioning criteria (single level vs. hierarchical partitioning)
- separation of clusters (exclusive vs. non-exclusive [e.g.: one doc may belong to more than one class])
- similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
- clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])
Four issues:
-Quality
- deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
- clusters with arbitrary shape
- deal with noisy data
-Scalability
- clustering all the data instead of only on samples
- high dimensionality
- incremental or stream clustering and insensitivity to input order
-Constraint-based clustering
- user-given preferences or constraints
-Interpretability and usability
Cluster Analysis Categorization:
-Technique-centered
- distance-based
- density-based and grid-based methods
- probabilistic and generative models
- leveraging dimensionality reduction methods
- high-dimensional clustering
- scalable techniques for cluster analysis
-Data type-centered
- clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.
-Additional insight-centered
- visual insights, semi-supervised, ensemble-based, validation-based.
Typical Clustering Methods:
-Distance-based
- partitioning algo.: k-means, k-medians, k-medoids
- hierarchical algo.: agglomerative vs. divisive method
-Density-based and grid-based
- density-based: identify dense regions at a fine level of granularity, then post-process to merge dense regions into clusters of arbitrary shape.
- grid-based: individual regions are formed into a grid-like structure
-Probabilistic and generative models
- Assume a specific form of the generative model (e.g., a mixture of Gaussians)
- Model parameters are estimated with EM algo.
- Then estimate the generative probability of the underlying data points.
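As a quick illustration of this workflow, the sketch below fits a two-component Gaussian mixture with scikit-learn's GaussianMixture (which runs EM internally); the data and parameter choices are made up for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data drawn from two Gaussian blobs (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# Assume the generative model is a mixture of 2 Gaussians;
# the parameters are estimated with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict_proba(X[:3]))  # generative (posterior) probabilities per component
print(gmm.predict(X[:3]))        # hard assignment to the most probable component
```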
-High-dimensional clustering
- subspace clustering (bottom-up, top-down, correlation-based method vs. \delta-cluster method)
- dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)
Lecture 2:
Good clustering:
- High intra-class similarity (Cohesive)
- Low inter-class similarity (Distinctive between clusters)
Proximity: similarity or dissimilarity
-Dissimilarity Matrix
- triangle matrix (symmetric)
- distance functions are usually different for different types of data
-Distance on numeric data: Minkowski Distance
- A popular distance measure:
d(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p+|x_{i2}-x_{j2}|^p+\dotsb+|x_{il}-x_{jl}|^p},
where i=(x_{i1},x_{i2},\dots,x_{il}) and j=(x_{j1},x_{j2},\dots,x_{jl}) are l-dimensional data points and p is the order (this distance is also commonly called the L-p norm).
- Properties: positivity; symmetry; triangle inequality.
- p=1: Manhattan (or city block) distance
- p=2: Euclidean distance
- p\to\infty: "supremum" distance. In this case,
d(i,j)=\lim_{p\to\infty} \sqrt[p]{|x_{i1}-x_{j1}|^p+|x_{i2}-x_{j2}|^p+\dotsb+|x_{il}-x_{jl}|^p}=\max_{f=1,\dots,l} |x_{if}-x_{jf}|.
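A small NumPy sketch of these L-p norms (the function name minkowski_distance and the sample points are just for illustration):

```python
import numpy as np

def minkowski_distance(x, y, p):
    """L-p norm between two l-dimensional points; p = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x, y = [1, 2, 3], [4, 0, 3]
print(minkowski_distance(x, y, 1))       # Manhattan distance: 5.0
print(minkowski_distance(x, y, 2))       # Euclidean distance: ~3.61
print(minkowski_distance(x, y, np.inf))  # supremum distance: 3.0
```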
-Proximity measure for binary attribute
Draw a contingency table for binary data.
- for symmetric binary variables: d(i,j)=\frac{r+s}{q+r+s+t}
- for asymmetric binary variables: d(i,j)=\frac{r+s}{q+r+s}
- Jaccard coefficient (similarity measure): sim_{Jaccard}(i,j)=\frac{q}{q+r+s}, which is computed the same way as coherence(i,j).
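A sketch of these binary measures, assuming two 0/1 vectors and the q/r/s/t counts from the contingency table (the helper name is illustrative):

```python
import numpy as np

def binary_contingency(i, j):
    """Contingency counts for two 0/1 vectors: q (1,1), r (1,0), s (0,1), t (0,0)."""
    i, j = np.asarray(i), np.asarray(j)
    q = int(np.sum((i == 1) & (j == 1)))
    r = int(np.sum((i == 1) & (j == 0)))
    s = int(np.sum((i == 0) & (j == 1)))
    t = int(np.sum((i == 0) & (j == 0)))
    return q, r, s, t

q, r, s, t = binary_contingency([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print((r + s) / (q + r + s + t))  # symmetric binary distance: 0.4
print((r + s) / (q + r + s))      # asymmetric binary distance: 0.5
print(q / (q + r + s))            # Jaccard similarity: 0.5
```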
-Proximity measure for categorical attribute
- simple matching: d(i,j)=\frac{p-m}{p}, where m is the number of matching attributes and p is the total number of attributes
- alternatively, use a large number of binary variables (one per categorical state)
-Proximity measure for ordinal attribute
compute the rank r_{if} of each value, map it to z_{if}=\frac{r_{if}-1}{M_f-1}, and treat z_{if} as interval-scaled (M_f is the number of ordered states of attribute f).
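A tiny sketch of this rank normalization, assuming the ordered states of a hypothetical ordinal attribute are given as a list:

```python
# Ordered states of a hypothetical ordinal attribute f, so M_f = 3.
states = ["bronze", "silver", "gold"]
rank = {s: k + 1 for k, s in enumerate(states)}              # r_if in {1, ..., M_f}
z = {s: (rank[s] - 1) / (len(states) - 1) for s in states}   # z_if in [0, 1]
print(z)  # {'bronze': 0.0, 'silver': 0.5, 'gold': 1.0}
```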
-Attributes of mixed type
- a dataset may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal;
- use a weighted formula to combine their effects:
d(i,j)=\frac{\sum_{f=1}^{p} w_{ij}^{(f)}d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}}.
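A sketch of this weighted combination, assuming the per-attribute distances d_{ij}^{(f)} and indicator weights w_{ij}^{(f)} have already been computed (the values below are made up):

```python
import numpy as np

def mixed_distance(d_f, w_f):
    """Weighted combination of per-attribute distances d_ij^(f) with weights w_ij^(f)."""
    d_f, w_f = np.asarray(d_f, dtype=float), np.asarray(w_f, dtype=float)
    return (w_f * d_f).sum() / w_f.sum()

# Three attributes (values are illustrative); the zero weight drops the third one,
# e.g., an asymmetric binary attribute with a 0-0 match.
print(mixed_distance(d_f=[0.25, 0.5, 1.0], w_f=[1, 1, 0]))  # 0.375
```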
-Covariance for two variables
Covariance between two variables X_1 and X_2:
\sigma_{12}=E[X_1X_2]-E[X_1]E[X_2]
-Correlation coefficient
\rho_{12}={\sigma_{12}\over {\sqrt{\sigma_1^2\sigma_2^2}}}
-Covariance matrix
\Sigma = E[(\mathbf{X}-\mu)(\mathbf{X}-\mu)^\mathrm{T}]
=\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix} (for the two-variable case)
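A quick NumPy check of the two-variable formulas above (the data values are made up for the example):

```python
import numpy as np

x1 = np.array([2.0, 3.0, 5.0, 7.0])   # illustrative observations of X_1
x2 = np.array([1.0, 4.0, 4.0, 9.0])   # illustrative observations of X_2

cov12 = (x1 * x2).mean() - x1.mean() * x2.mean()   # sigma_12 = E[X1 X2] - E[X1]E[X2]
rho12 = cov12 / np.sqrt(x1.var() * x2.var())       # correlation coefficient

print(cov12, rho12)
print(np.cov(np.vstack([x1, x2]), bias=True))      # population covariance matrix; off-diagonal equals sigma_12
```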
Lecture 3:
Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions.
K-partitioning method:
Partitioning a dataset D of n objects into a set of clusters so that an objective function is optimized (e.g., the sum of squared distances is minimized.)
A typical objective function: Sum of Squared Errors (SSE)
SSE(C)=\sum_{k=1}^{K}\sum_{x_i\in C_k} ||x_i-c_k||^2
Problem Definition:
Given K, find a partition of K clusters that optimizes the chosen partitioning criterion.
Global optimum: needs to exhaustively enumerate all partitions.
Heuristic methods (i.e., greedy algo.):
- K-means, K-medians, K-medoids, etc.
K-Means:
- Each cluster is represented by the center of the cluster
- Efficiency: O(tKn), where normally K,t\ll n
- often terminates at a local optimum
- Need to specify K, the number of clusters
- applicable only to objects in a continuous n-dimensional space
- use K-modes for categorical data
- sensitive to noisy data and outliers
- variations: use K-medians, K-medoids, etc.
- not suitable for discovering clusters with non-convex shapes
- use density-based clustering, kernel K-means, etc. (a minimal sketch of the basic loop follows below)
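A minimal NumPy sketch of the basic K-means loop and its SSE objective (the function name and data are illustrative, not from the course; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd's iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]     # random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                        # assignment step
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):            # converged
            break
        centroids = new_centroids                            # update step
    sse = ((X - centroids[labels]) ** 2).sum()               # SSE(C) objective
    return labels, centroids, sse

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])
labels, centroids, sse = kmeans(X, K=2)
print(centroids, sse)
```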
Variations of K-Means:
- choose better initial centroid estimates: K-means++, Intelligent K-means, Genetic K-means
- choose different representative prototypes for the clusters: K-medoids, K-medians, K-modes
- apply feature transformation techniques: weighted K-means, kernel K-means
Initialization of K-Means:
-K-means++
- The first centroid is selected at random
- The next centroid selected is the one farthest from the currently selected centroids (selection is based on a weighted probability score)
- The selection continues until K centroids are obtained
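A sketch of the K-means++ seeding step described above, using squared-distance-weighted sampling (the function name and data are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ style seeding: pick the first centroid at random, then sample each
    further centroid with probability proportional to its squared distance to the
    nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < K:
        d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2)
        d2 = d.min(axis=1) ** 2                  # squared distance to nearest chosen centroid
        probs = d2 / d2.sum()                    # weighted probability score
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
print(kmeans_pp_init(X, K=3))
```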
The K-medoids clustering method:
K-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
PAM (Partitioning Around Medoids)
- starts from an initial set of medoids
- iteratively replaces one of the medoids with one of the non-medoids if it improves the total sum of squared errors (SSE) of the resulting clustering
- works effectively for small data sets but does not scale well to large data sets (due to the computational complexity)
- O(K(n-K)^2) (quite expensive!)
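A naive sketch of the PAM swap loop (the function name and data are illustrative; no attempt is made at the usual efficiency tricks):

```python
import numpy as np

def pam(X, K, seed=0):
    """Naive PAM: start from random medoids, then keep swapping a medoid with a
    non-medoid whenever the swap lowers the total cost (sum of squared distances
    of every point to its nearest medoid)."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), K, replace=False))

    def cost(idx):
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
        return (d.min(axis=1) ** 2).sum()

    best = cost(medoids)
    improved = True
    while improved:                               # exhaustive medoid/non-medoid swap search
        improved = False
        for m in range(K):
            for h in range(len(X)):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = h
                c = cost(candidate)
                if c < best:                      # accept swaps that reduce the cost
                    medoids, best = candidate, c
                    improved = True
    return medoids, best

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(5, 1, size=(30, 2))])
print(pam(X, K=2))
```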
Efficiency improvements on PAM:
- CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples; O(Ks^2+K(n-K)), where s is the sample size.
- CLARANS (Ng & Han, 1994): randomized re-sampling, ensuring efficiency + quality.
K-modes:
- An extension to K-means that replaces the means of clusters with modes; designed for categorical data.
- Dissimilarity measure between object X and the cluster center Z:
\Phi(x_i,z_j)=1-\frac{n_j^r}{n_l},
where n_l is the number of objects in the cluster and n_j^r is the number of objects whose attribute value is r. In other words, the dissimilarity is frequency-based.
- For a mixture of categorical and numerical data: use the K-prototype method.
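A sketch of this frequency-based dissimilarity, assuming a cluster is given as a list of categorical records (the helper name is illustrative; the notes state the per-attribute term, and the sketch sums it over all attributes):

```python
def kmodes_dissimilarity(x, cluster):
    """Frequency-based dissimilarity: sum over attributes of 1 - n_j^r / n_l, where
    n_l is the cluster size and n_j^r counts cluster members sharing x's value r on
    attribute j."""
    n_l = len(cluster)
    total = 0.0
    for j, value in enumerate(x):
        n_jr = sum(1 for record in cluster if record[j] == value)
        total += 1.0 - n_jr / n_l
    return total

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(kmodes_dissimilarity(("red", "small"), cluster))  # (1 - 2/3) + (1 - 2/3) ≈ 0.667
```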
