
Cluster Analysis in Data Mining

Notes on the Coursera course Cluster Analysis in Data Mining, taught by the data mining authority Professor Jiawei Han.

Week 1

-Considerations for Cluster Analysis

  • partitioning criteria (single level vs. hierarchical partitioning)
  • separation of clusters (exclusive vs. non-exclusive [e.g.: one doc may belong to more than one class])
  • similarity measure (distance-based vs. connectivity-based [e.g., density or contiguity])
  • clustering space (full space [e.g., often when low dimensional] vs. subspace [e.g., often in high-dimensional clustering])

Four issues:
-Quality

  • deal with different types of attributes: numerical, categorical, text, multimedia, networks, and mixture of multiple types
  • clusters with arbitrary shape
  • deal with noisy data

-Scalability

  • clustering all the data instead of only on samples
  • high dimensionality
  • incremental or stream clustering and insensitivity to input order

-Constraint-based clustering

  • user-given preferences or constraints

-Interpretability and usability

Cluster Analysis Categorization:
-Technique-centered

  • distance-based
  • density-based and grid-based methods
  • probabilistic and generative models
  • leveraging dimensionality reduction methods
  • high-dimensional clustering
  • scalable techniques for cluster analysis

-Data type-centered

  • clustering numerical data, categorical data, text, multimedia, time-series data, sequences, stream data, networked data, uncertain data.

-Additional insight-centered

  • visual insights, semi-supervised, ensemble-based, validation-based.

Typical Clustering Methods:
-Distance-based

  • partitioning algo.: k-means, k-medians, k-medoids
  • hierarchical algo.: agglomerative vs. divisive method

-Density-based and grid-based

  • density-based: cluster the data at a high level of granularity, then post-process to merge dense regions into clusters of arbitrary shape.
  • grid-based: individual regions are formed into a grid-like structure

-Probabilistic and generative models

  • Assume a specific form of the generative model (e.g., a mixture of Gaussians)
  • Model parameters are estimated with EM algo.
  • Then estimate the generative probability of the underlying data points.
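
The lecture only outlines this workflow; as a concrete illustration, here is a minimal sketch using scikit-learn's GaussianMixture (the library choice and the toy data are assumptions, not part of the course):

```python
# A minimal Gaussian-mixture sketch; scikit-learn is an assumed dependency.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs in 2-D.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gm.fit(X)                          # parameters are estimated with the EM algorithm
labels = gm.predict(X)             # hard assignment to the most likely component
log_density = gm.score_samples(X)  # generative (log-)probability of each data point
print(gm.weights_, gm.means_)
```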

-High-dimensional clustering

  • subspace clustering (bottom-up, top-down, correlation-based method vs. \delta-cluster method)
  • dimensionality reduction (co-clustering [column reduction]: PLSI, LDA, NMF, spectral clustering)

Lecture 2:
Good clustering:

  • High intra-class similarity (Cohesive)
  • Low inter-class similarity (Distinctive between clusters)

Proximity: similarity or dissimilarity

-Dissimilarity Matrix

  • triangular matrix (symmetric)
  • distance functions are usually different for different types of data

-Distance on numeric data: Minkowski Distance

  • A popular distance measure:
    d(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p+|x_{i2}-x_{j2}|^p+\dotsb+|x_{il}-x_{jl}|^p}
    where i=(x_{i1},x_{i2},\dots,x_{il}) and j=(x_{j1},x_{j2},\dots,x_{jl}) are two l-dimensional data objects and p is the order (this distance is also commonly called the L_p norm); a small code sketch follows this list.

  • Property:
    positivity; symmetry; triangle inequality.

  • p=1: Manhanttan (or city block) distance

  • p=2: Euclidean distance

  • p\to\infty: “supremum” distance
    In this case,
    d(i,j)=\lim_{p\to\infty} \sqrt[p]{|x_{i1}-x_{j1}|^p+|x_{i2}-x_{j2}|^p+\dotsb+|x_{il}-x_{jl}|^p}=\underset{f=1,\dots,l}{\mathrm{max}}~|x_{if}-x_{jf}|
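
A small numeric sketch of the three special cases (illustrative only; the helper name is mine):

```python
# Minkowski (L_p) distance for p = 1, 2, and the supremum case.
import numpy as np

def minkowski(x, y, p):
    """L_p distance between two numeric vectors; p = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x, y = [1, 2, 3], [4, 0, 3]
print(minkowski(x, y, 1))       # Manhattan: 3 + 2 + 0 = 5
print(minkowski(x, y, 2))       # Euclidean: sqrt(9 + 4 + 0) ≈ 3.61
print(minkowski(x, y, np.inf))  # supremum:  max(3, 2, 0) = 3
```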

-Proximity measure for binary attribute
Draw a contingency table for binary data.

  • for symmetric binary variables: d(i,j)=\frac{r+s}{q+r+s+t}
  • for asymmetric binary variables: d(i,j)=\frac{r+s}{q+r+s}
  • Jaccard coefficient (similarity measure): sim_{Jaccard}(i,j)=\frac{q}{q+r+s}, which is computed the same way as coherence(i,j)
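
A minimal sketch of these binary measures built directly from the contingency-table counts q, r, s, t (function names are mine):

```python
# q: both 1; r: i=1, j=0; s: i=0, j=1; t: both 0 — as in the contingency table above.
def binary_counts(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

def d_symmetric(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):              # negative matches t are ignored
    q, r, s, _ = binary_counts(i, j)
    return (r + s) / (q + r + s)

def sim_jaccard(i, j):
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)

i, j = [0, 1, 1, 0, 0, 0], [0, 1, 0, 0, 1, 0]
print(d_symmetric(i, j), d_asymmetric(i, j), sim_jaccard(i, j))  # 0.333 0.667 0.333
```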

-Proximity measure for categorical attribute

  • simple matching: d(i,j)=\frac{p-m}{p}, where m is the number of matching attributes and p the total number of attributes
  • alternatively, use a large number of binary variables (one per nominal state)
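
A tiny illustration of simple matching (the attribute values are hypothetical):

```python
# Simple-matching dissimilarity: m matches out of p nominal attributes.
def d_simple_matching(i, j):
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

print(d_simple_matching(["red", "S", "cotton"], ["red", "M", "cotton"]))  # 1/3 ≈ 0.333
```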

-Proximity measure for ordinal attribute
compute the rank r_{if} and treat z_{if} as interval-scaled, where z_{if}=\frac{r_{if}-1}{M_f-1}.
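
A small sketch of this rank mapping, assuming the ordering of the ordinal levels is given:

```python
# Map an ordinal value to z in [0, 1] via its rank r (1-based) among M_f levels.
levels = ["freshman", "sophomore", "junior", "senior"]   # assumed ordering, ranks 1..M_f
M_f = len(levels)

def z_score(value):
    r = levels.index(value) + 1          # rank of the value
    return (r - 1) / (M_f - 1)

print(z_score("freshman"), z_score("junior"), z_score("senior"))  # 0.0, 0.667 (approx.), 1.0
```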

-Attributes of mixed type

  • a dataset may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal;
  • use a weighted formula to combine their effects:
    d(i,j)=\frac{\sum_{f=1}^{p} w_{ij}^{(f)}d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}}.
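
A sketch of this weighted combination, assuming each per-attribute dissimilarity d_{ij}^{(f)} has already been computed and normalized to [0, 1]:

```python
# Weighted average of per-attribute dissimilarities (the indicator weights w_ij^(f)
# handle missing values or asymmetric-binary conventions; here they are given directly).
def d_mixed(d_f, w_f):
    return sum(w * d for w, d in zip(w_f, d_f)) / sum(w_f)

# e.g. one numeric, one nominal, and one asymmetric-binary attribute
print(d_mixed(d_f=[0.45, 1.0, 0.0], w_f=[1, 1, 1]))   # ≈ 0.483
```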

-Covariance for two variables
Covariance between two variables X_1 and X_2:
\sigma_{12}=E[X_1X_2]-E[X_1]E[X_2]

-Correlation coefficient
\rho_{12}=\frac{\sigma_{12}}{\sqrt{\sigma_1^2\sigma_2^2}}

-Covariance matrix
\Sigma = E[(\mathbf{X}-\mu)(\mathbf{X}-\mu)^\mathrm{T}]
=\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}
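
A quick NumPy check of these quantities (population covariance, i.e., dividing by n, to match the expectation form above):

```python
import numpy as np

X = np.array([[2., 1.], [3., 3.], [5., 4.], [6., 6.]])   # rows = observations of (X_1, X_2)
Sigma = np.cov(X, rowvar=False, bias=True)                # E[(X - mu)(X - mu)^T]
rho_12 = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(Sigma)
print(rho_12, np.corrcoef(X, rowvar=False)[0, 1])         # the two values agree
```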

Lecture 3:

Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions.

K-partitioning method:

Partitioning a dataset D of n objects into a set of clusters so that an objective function is optimized (e.g., the sum of squared distances is minimized.)


A typical objective function:
Sum of Squared Errors (SSE)
SSE(C)=\sum_{k=1}^{K}\sum_{x_i\in C_k} ||x_i-c_k||^2
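
A direct sketch of this objective, assuming the labels and centers are already given:

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared distances of each point x_i to its cluster center c_k."""
    return sum(np.sum((X[labels == k] - centers[k]) ** 2)
               for k in range(len(centers)))

X = np.array([[1., 1.], [1.5, 2.], [8., 8.], [9., 9.]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.25, 1.5], [8.5, 8.5]])
print(sse(X, labels, centers))   # 0.625 + 1.0 = 1.625
```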

Problem Definition:

Given K, find a partition of K clusters that optimizes the chosen partitioning criterion.

Global optimal: Needs to exhaustively enumerate all partitions.

Heuristic methods (i.e., greedy algo.):
- K-means, K-medians, K-Medoids, etc.

K-Means:

  • Each cluster is represented by the center (centroid) of the cluster
  • Efficiency: O(tKn), where t is the number of iterations; normally K, t \ll n
  • Often terminates at a local optimum
  • Need to specify K, the number of clusters, in advance
  • Applicable only to objects in a continuous n-dimensional space
    • use k-modes for categorical data
  • Sensitive to noisy data and outliers
    • variations: use k-medians, k-medoids, etc.
  • Not suitable to discover clusters with non-convex shapes
    • use density-based clustering, kernel k-means, etc.
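
For reference, a minimal Lloyd-style k-means sketch consistent with the O(tKn) cost noted above (illustrative only; it does not handle empty clusters or other practical details):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid moves to the mean of the points assigned to it.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):   # converged (usually a local optimum)
            break
        centers = new_centers
    return labels, centers
```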


Variations of K-Means

  • Choose better initial centroid estimates
    • k-means++, Intelligent k-means, Genetic k-means
  • Choose different representative prototypes for the clusters
    • k-medoids, k-medians, k-modes
  • Apply feature transformation techniques
    • weighted k-means, kernel k-means

Initialization of K-Means

  • K-means++
    • The first centroid is selected at random
    • The next centroid selected is the one that is farthest from the currently selected centroids (selection is based on a weighted probability score)
    • The selection continues until K centroids are obtained
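
A sketch of this seeding rule, drawing each new centroid with probability proportional to its squared distance to the nearest centroid chosen so far (one common reading of the “weighted probability score”):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # first centroid chosen at random
    for _ in range(1, K):
        c = np.array(centers)
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                     # far-away points are more likely picks
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```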

The K-Medoids Clustering Method

K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

PAM (Partitioning Around Medoids)

  • starts from an initial set of medoids
  • iteratively replaces one of the medoids with one of the non-medoids if it improves the total sum of squared errors (SSE) of the resulting clustering
  • works effectively for small data sets but does not scale well to large data sets (due to the computational complexity)
  • each iteration costs O(K(n-K)^2) (quite expensive!)
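
A sketch of one PAM-style improvement pass over all medoid/non-medoid swaps, using the squared-error cost mentioned above (illustrative, not optimized):

```python
import numpy as np

def total_cost(X, medoid_idx):
    """SSE of assigning every point to its nearest medoid."""
    d2 = ((X[:, None, :] - X[medoid_idx][None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def pam_one_pass(X, medoid_idx):
    medoid_idx = list(medoid_idx)
    best = total_cost(X, medoid_idx)
    for pos in range(len(medoid_idx)):            # try replacing each medoid...
        for h in range(len(X)):                   # ...with each non-medoid
            if h in medoid_idx:
                continue
            cand = medoid_idx.copy()
            cand[pos] = h
            cost = total_cost(X, cand)
            if cost < best:                       # keep the swap only if it improves SSE
                best, medoid_idx = cost, cand
    return medoid_idx, best
```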


Efficiency improvements on PAM

  • CLARA (Kaufmann & Rousseeuw, 1990):
    • PAM on samples: O(Ks^2+K(n-K)), where s is the sample size
  • CLARANS (Ng & Han, 1994): randomized re-sampling, ensuring efficiency + quality

K-Modes:

An extension of k-means that replaces the means of clusters with modes, for categorical data.

Dissimilarity measure between an object X and the center Z of a cluster:
\Phi(x_i,z_j)=1-\frac{n_j^r}{n_l},
where n_l is the number of objects in the cluster and n_j^r is the number of objects whose attribute value is r. In other words, the dissimilarity is frequency-based.

A mixture of categorical and numerical data: use the k-prototypes method.
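
One possible reading of this frequency-based measure in code, summing the per-attribute terms (the summation over attributes and the names are my own):

```python
def kmodes_dissimilarity(x, cluster):
    """x: tuple of categorical values; cluster: list of such tuples (the cluster's members)."""
    n_l = len(cluster)
    total = 0.0
    for f, value in enumerate(x):
        n_r = sum(1 for obj in cluster if obj[f] == value)   # members sharing this value
        total += 1.0 - n_r / n_l                             # rarer values -> larger dissimilarity
    return total

cluster = [("red", "S"), ("red", "M"), ("blue", "M")]
print(kmodes_dissimilarity(("red", "M"), cluster))   # (1 - 2/3) + (1 - 2/3) ≈ 0.667
```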



