Cluster Analysis in Data Mining
Notes on Cluster Analysis in Data Mining, a Coursera course taught by Professor Jiawei Han, one of the leading researchers in data mining.
Week 1
-Considerations for Cluster Analysis
- partitioning criteria (single level vs. hierarchical partitioning)
- separation of clusters (exclusive vs. non-exclusive [e.g.: one doc may belong to more than one class])
- similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
- clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])
Four issues:
-Quality
- deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
- clusters with arbitrary shape
- deal with noisy data
-Scalability
- clustering all the data instead of only on samples
- high dimensionality
- incremental or stream clustering and insensitivity to input order
-Constraint-based clustering
- user-given preferences or constraints
-Interpretability and usability
Cluster Analysis Categorization:
-Technique-centered
- distance-based
- density-based and grid-based methods
- probabilistic and generative models
- leveraging dimensionality reduction methods
- high-dimensional clustering
- scalable techniques for cluster analysis
-Data type-centered
- clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.
-Additional insight-centered
- visual insights, semi-supervised, ensemble-based, validation-based.
Typical Clustering Methods:
-Distance-based
- partitioning algo.: k-means, k-medians, k-medoids
- hierarchical algo.: agglomerative vs. divisive method
-Density-based and grid-based
- density-based: identify dense regions at a fine level of granularity, then post-process to merge dense regions into clusters of arbitrary shape.
- grid-based: individual regions are formed into a grid-like structure
-Probabilistic and generative models
- Assume a specific form of the generative model (e.g., a mixture of Gaussians)
- Model parameters are estimated with EM algo.
- Then estimate the generative probability of the underlying data points.
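As a quick illustration of this workflow, the sketch below fits a two-component Gaussian mixture with scikit-learn's GaussianMixture (which runs EM internally); the data and parameter choices are made up for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data drawn from two Gaussian blobs (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# Assume the generative model is a mixture of 2 Gaussians;
# the parameters are estimated with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict_proba(X[:3]))  # generative (posterior) probabilities per component
print(gmm.predict(X[:3]))        # hard assignment to the most probable component
```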
-High-dimensional clustering
- subspace clustering (bottom-up, top-down, correlation-based method vs. \delta-cluster method)
- dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)
Lecture 2:
Good clustering:
- High intra-class similarity (Cohesive)
- Low inter-class similarity (Distinctive between clusters)
Proximity: similarity or dissimilarity
-Dissimilarity Matrix
- triangle matrix (symmetric)
- distance functions are usually different for different types of data
-Distance on numeric data: Minkowski Distance
- A popular distance measure:
d(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p+|x_{i2}-x_{j2}|^p+\dotsb+|x_{il}-x_{jl}|^p},
where i=(x_{i1},x_{i2},\dots,x_{il}) and j=(x_{j1},x_{j2},\dots,x_{jl}) are l-dimensional data points and p is the order (this distance is also commonly called the L-p norm).
- Properties: positivity; symmetry; triangle inequality.
- p=1: Manhattan (or city block) distance
- p=2: Euclidean distance
- p\to\infty: "supremum" distance. In this case,
d(i,j)=\lim_{p\to\infty} \sqrt[p]{|x_{i1}-x_{j1}|^p+|x_{i2}-x_{j2}|^p+\dotsb+|x_{il}-x_{jl}|^p}=\max_{f=1,\dots,l} |x_{if}-x_{jf}|.
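A small NumPy sketch of these L-p norms (the function name minkowski_distance and the sample points are just for illustration):

```python
import numpy as np

def minkowski_distance(x, y, p):
    """L-p norm between two l-dimensional points; p = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x, y = [1, 2, 3], [4, 0, 3]
print(minkowski_distance(x, y, 1))       # Manhattan distance: 5.0
print(minkowski_distance(x, y, 2))       # Euclidean distance: ~3.61
print(minkowski_distance(x, y, np.inf))  # supremum distance: 3.0
```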
-Proximity measure for binary attribute
Draw a contingency table for binary data.
- for symmetric binary variables: d(i,j)=\frac{r+s}{q+r+s+t}
- for asymmetric binary variables: d(i,j)=\frac{r+s}{q+r+s}
- Jaccard coefficient (similarity measure): sim_{Jaccard}(i,j)=\frac{q}{q+r+s}, which is computed the same way as coherence(i,j).
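A sketch of these binary measures, assuming two 0/1 vectors and the q/r/s/t counts from the contingency table (the helper name is illustrative):

```python
import numpy as np

def binary_contingency(i, j):
    """Contingency counts for two 0/1 vectors: q (1,1), r (1,0), s (0,1), t (0,0)."""
    i, j = np.asarray(i), np.asarray(j)
    q = int(np.sum((i == 1) & (j == 1)))
    r = int(np.sum((i == 1) & (j == 0)))
    s = int(np.sum((i == 0) & (j == 1)))
    t = int(np.sum((i == 0) & (j == 0)))
    return q, r, s, t

q, r, s, t = binary_contingency([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print((r + s) / (q + r + s + t))  # symmetric binary distance: 0.4
print((r + s) / (q + r + s))      # asymmetric binary distance: 0.5
print(q / (q + r + s))            # Jaccard similarity: 0.5
```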
-Proximity measure for categorical attribute
- simple matching: d(i,j)=\frac{p-m}{p}, where m is the number of matching attributes and p is the total number of attributes
- alternatively, use a large number of binary variables (one per categorical state)
-Proximity measure for ordinal attribute
compute the rank r_{if} of each value, map it to z_{if}=\frac{r_{if}-1}{M_f-1}, and treat z_{if} as interval-scaled (M_f is the number of ordered states of attribute f).
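A tiny sketch of this rank normalization, assuming the ordered states of a hypothetical ordinal attribute are given as a list:

```python
# Ordered states of a hypothetical ordinal attribute f, so M_f = 3.
states = ["bronze", "silver", "gold"]
rank = {s: k + 1 for k, s in enumerate(states)}              # r_if in {1, ..., M_f}
z = {s: (rank[s] - 1) / (len(states) - 1) for s in states}   # z_if in [0, 1]
print(z)  # {'bronze': 0.0, 'silver': 0.5, 'gold': 1.0}
```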
-Attributes of mixed type
- a dataset may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal;
- use a weighted formula to combine their effects:
d(i,j)=\frac{\sum_{f=1}^{p} w_{ij}^{(f)}d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}}.
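A sketch of this weighted combination, assuming the per-attribute distances d_{ij}^{(f)} and indicator weights w_{ij}^{(f)} have already been computed (the values below are made up):

```python
import numpy as np

def mixed_distance(d_f, w_f):
    """Weighted combination of per-attribute distances d_ij^(f) with weights w_ij^(f)."""
    d_f, w_f = np.asarray(d_f, dtype=float), np.asarray(w_f, dtype=float)
    return (w_f * d_f).sum() / w_f.sum()

# Three attributes (values are illustrative); the zero weight drops the third one,
# e.g., an asymmetric binary attribute with a 0-0 match.
print(mixed_distance(d_f=[0.25, 0.5, 1.0], w_f=[1, 1, 0]))  # 0.375
```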
-Covariance for two variables
Covariance between two variables X_1 and X_2:
\sigma_{12}=E[X_1X_2]-E[X_1]E[X_2]
-Correlation coefficient
\rho_{12}={\sigma_{12}\over {\sqrt{\sigma_1^2\sigma_2^2}}}
-Covariance matrix
\Sigma = E[(\mathbf{X}-\mu)(\mathbf{X}-\mu)^\mathrm{T}]
=\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix} (for the two-variable case)
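A quick NumPy check of the two-variable formulas above (the data values are made up for the example):

```python
import numpy as np

x1 = np.array([2.0, 3.0, 5.0, 7.0])   # illustrative observations of X_1
x2 = np.array([1.0, 4.0, 4.0, 9.0])   # illustrative observations of X_2

cov12 = (x1 * x2).mean() - x1.mean() * x2.mean()   # sigma_12 = E[X1 X2] - E[X1]E[X2]
rho12 = cov12 / np.sqrt(x1.var() * x2.var())       # correlation coefficient

print(cov12, rho12)
print(np.cov(np.vstack([x1, x2]), bias=True))      # population covariance matrix; off-diagonal equals sigma_12
```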
Lecture 3:
Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions.
K-partitioning method:
Partitioning a dataset D of n objects into a set of clusters so that an objective function is optimized (e.g., the sum of squared distances is minimized.)
A typical objective function: Sum of Squared Errors (SSE)
SSE(C)=\sum_{k=1}^{K}\sum_{x_i\in C_k} ||x_i-c_k||^2
Problem Definition:
Given K, find a partition of K clusters that optimizes the chosen partitioning criterion.
Global optimum: needs to exhaustively enumerate all partitions.
Heuristic methods (i.e., greedy algo.):
- K-means, K-medians, K-medoids, etc.
K-Means:
- Each cluster is represented by the center of the cluster
- Efficiency: O(tKn), where normally K,t\ll n
- often terminates at a local optimum
- Need to specify K, the number of clusters
- applicable only to objects in a continuous n-dimensional space
- use K-modes for categorical data
- sensitive to noisy data and outliers
- variations: use K-medians, K-medoids, etc.
- not suitable for discovering clusters with non-convex shapes
- use density-based clustering, kernel K-means, etc. (a minimal sketch of the basic loop follows below)
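A minimal NumPy sketch of the basic K-means loop and its SSE objective (the function name and data are illustrative, not from the course; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd's iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]     # random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                        # assignment step
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):            # converged
            break
        centroids = new_centroids                            # update step
    sse = ((X - centroids[labels]) ** 2).sum()               # SSE(C) objective
    return labels, centroids, sse

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])
labels, centroids, sse = kmeans(X, K=2)
print(centroids, sse)
```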
Variations of K-Means:
- choose better initial centroid estimates: K-means++, Intelligent K-means, Genetic K-means
- choose different representative prototypes for the clusters: K-medoids, K-medians, K-modes
- apply feature transformation techniques: weighted K-means, kernel K-means
Initialization of K-Means:
-K-means++
- The first centroid is selected at random
- The next centroid selected is the one farthest from the currently selected centroids (selection is based on a weighted probability score)
- The selection continues until K centroids are obtained
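A sketch of the K-means++ seeding step described above, using squared-distance-weighted sampling (the function name and data are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ style seeding: pick the first centroid at random, then sample each
    further centroid with probability proportional to its squared distance to the
    nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < K:
        d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2)
        d2 = d.min(axis=1) ** 2                  # squared distance to nearest chosen centroid
        probs = d2 / d2.sum()                    # weighted probability score
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
print(kmeans_pp_init(X, K=3))
```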
The K-medoids clustering method:
K-medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
PAM (Partitioning Around Medoids)
- starts from an initial set of medoids
- iteratively replaces one of the medoids with one of the non-medoids if it improves the total sum of squared errors (SSE) of the resulting clustering
- works effectively for small data sets but does not scale well to large data sets (due to the computational complexity)
- O(K(n-K)^2) (quite expensive!)
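A naive sketch of the PAM swap loop (the function name and data are illustrative; no attempt is made at the usual efficiency tricks):

```python
import numpy as np

def pam(X, K, seed=0):
    """Naive PAM: start from random medoids, then keep swapping a medoid with a
    non-medoid whenever the swap lowers the total cost (sum of squared distances
    of every point to its nearest medoid)."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), K, replace=False))

    def cost(idx):
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
        return (d.min(axis=1) ** 2).sum()

    best = cost(medoids)
    improved = True
    while improved:                               # exhaustive medoid/non-medoid swap search
        improved = False
        for m in range(K):
            for h in range(len(X)):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = h
                c = cost(candidate)
                if c < best:                      # accept swaps that reduce the cost
                    medoids, best = candidate, c
                    improved = True
    return medoids, best

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(5, 1, size=(30, 2))])
print(pam(X, K=2))
```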
Efficiency improvements on PAM:
- CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples; O(Ks^2+K(n-K)), where s is the sample size.
- CLARANS (Ng & Han, 1994): randomized re-sampling, ensuring efficiency + quality.
K-modes:
- An extension to K-means that replaces the means of clusters with modes; designed for categorical data.
- Dissimilarity measure between object X and the cluster center Z:
\Phi(x_i,z_j)=1-\frac{n_j^r}{n_l},
where n_l is the number of objects in the cluster and n_j^r is the number of objects whose attribute value is r. In other words, the dissimilarity is frequency-based.
- For a mixture of categorical and numerical data: use the K-prototype method.
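A sketch of this frequency-based dissimilarity, assuming a cluster is given as a list of categorical records (the helper name is illustrative; the notes state the per-attribute term, and the sketch sums it over all attributes):

```python
def kmodes_dissimilarity(x, cluster):
    """Frequency-based dissimilarity: sum over attributes of 1 - n_j^r / n_l, where
    n_l is the cluster size and n_j^r counts cluster members sharing x's value r on
    attribute j."""
    n_l = len(cluster)
    total = 0.0
    for j, value in enumerate(x):
        n_jr = sum(1 for record in cluster if record[j] == value)
        total += 1.0 - n_jr / n_l
    return total

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(kmodes_dissimilarity(("red", "small"), cluster))  # (1 - 2/3) + (1 - 2/3) ≈ 0.667
```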
