Cluster Analysis

Used to identify clusters in the data
Clusters are generally identified by
    • Relative distance between points
    • Relative homogeneity of each cluster
    • Degree of grouping separation
Three commonly used clustering types
    • Agglomerative hierarchical methods
    • $k$-Means methods
    • Classification maximum likelihood methods
Agglomerative Hierarchical Methods
Classification into a hierarchy of partitions, from n clusters of one feature each down to 1 cluster of all n features
Number of clusters not predetermined; chosen afterward by cutting the hierarchy at a given level
Partition operation
    1. Place each feature in its own cluster, giving n clusters
    2. Merge the nearest pair of distinct clusters into one cluster, which replaces the two merged clusters
    3. If the number of clusters = 1, stop, else go to step 2
Each stage merges the nearest pair of clusters
Uses Euclidean distance $d_{ij}$ between points to locate nearest pairs; distance $d_{AB}$ between clusters $A$ and $B$ depends on linkage
    • Single linkage: $d_{AB} = \min\limits_{i \in A, \ j \in B} (d_{ij})$
    • Complete linkage: $d_{AB} = \max\limits_{i \in A, \ j \in B} (d_{ij})$
    • Average linkage: $d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$
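
The partition operation and the three linkage rules above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration using only the standard library; the function names (`euclidean`, `linkage_distance`, `agglomerate`) and the four sample points are made up for this example and do not come from any particular package.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def linkage_distance(A, B, points, method="single"):
    """Distance d_AB between clusters A and B (lists of point indices)."""
    d = [euclidean(points[i], points[j]) for i in A for j in B]
    if method == "single":
        return min(d)
    if method == "complete":
        return max(d)
    if method == "average":
        return sum(d) / (len(A) * len(B))
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, method="single"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # step 1: n singleton clusters
    history = []
    while len(clusters) > 1:                      # step 3: stop at one cluster
        # step 2: locate the nearest pair of distinct clusters
        def pair_distance(ab):
            return linkage_distance(clusters[ab[0]], clusters[ab[1]], points, method)
        a, b = min(combinations(range(len(clusters)), 2), key=pair_distance)
        history.append((clusters[a], clusters[b], pair_distance((a, b))))
        # the merged cluster replaces the two clusters that formed it
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Hypothetical data: two tight pairs of points, far apart from each other
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
history = agglomerate(points, method="single")
```

Each entry of `history` records one stage of the hierarchy, so a partition into any number of clusters can be read off by stopping after the corresponding number of merges. Swapping `method` changes only how $d_{AB}$ is computed, not the merge loop.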
 $\boldsymbol{k}$-Means Clustering
Partition into $k$ groups by minimizing a numerical criterion
Often, minimize the within-group sum of squares (SS) over all variables
In principle, consider all possible partitions and choose the one with the lowest SS
Enumerating all possible partitions is generally not tractable, so search iteratively
    1. Form an initial partition (often with a hierarchical method)
    2. Move each observation to a different partition, obtain SS
    3. Keep new clusters if SS reduced, else use the old
    4. Repeat steps 2 and 3 until no move reduces SS
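
The four steps above can be sketched as a simple exchange search. This is an illustrative, brute-force version (it recomputes SS from scratch for every trial move) rather than any library's implementation; `within_ss`, `exchange_kmeans`, the sample points, and the deliberately poor starting labels are all assumptions made for the example.

```python
def within_ss(points, labels, k):
    """Within-group sum of squares, summed over all groups and variables."""
    total = 0.0
    for g in range(k):
        members = [p for p, lab in zip(points, labels) if lab == g]
        if not members:
            continue  # an empty group contributes nothing
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return total

def exchange_kmeans(points, labels, k):
    """Move single observations between groups while any move lowers SS."""
    labels = list(labels)                 # step 1: initial partition is supplied
    best = within_ss(points, labels, k)
    improved = True
    while improved:                       # step 4: repeat until no move reduces SS
        improved = False
        for i in range(len(points)):
            for g in range(k):
                if g == labels[i]:
                    continue
                old = labels[i]
                labels[i] = g             # step 2: trial move, obtain SS
                ss = within_ss(points, labels, k)
                if ss < best - 1e-12:     # step 3: keep the move only if SS is reduced
                    best, improved = ss, True
                else:
                    labels[i] = old
    return labels, best

# Hypothetical data and a deliberately poor starting partition
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels, ss = exchange_kmeans(points, [0, 1, 0, 1], k=2)
```

Like the procedure it illustrates, this search stops at a local minimum of SS, so the result can depend on the initial partition.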
Standardize variables if scales are very different
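
One common way to standardize is to rescale each variable to mean 0 and standard deviation 1, so that no variable dominates SS purely because of its units. A minimal sketch follows; the `standardize` helper and the sample rows are hypothetical, and the sample standard deviation (divisor n - 1) is an assumed choice.

```python
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1.

    Assumes every column has nonzero spread; a constant column would
    cause a division by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # divisor n - 1
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, sds))
        for row in rows
    ]

# Hypothetical data: second variable is on a much larger scale than the first
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
z = standardize(rows)
```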
Number of groups chosen from a plot of SS vs. number of groups

 

Model-Based Classification (e.g., MLE)
Model-based clustering assumes $c$ subpopulations combine into the entire population
Each subpopulation constitutes a cluster
The $j$th subpopulation has a $q$-dimensional observation density $f_j(\boldsymbol{x},\boldsymbol{\theta}_j)$ with an unknown parameter vector $\boldsymbol{\theta}_j$
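Each observation $\boldsymbol{x}_i$ is tagged to a cluster by one element of a label vector $\boldsymbol{\gamma} = [\gamma_1,\gamma_2,\ldots,\gamma_n]$, where $\gamma_i = j$ if $\boldsymbol{x}_i$ belongs to the $j$th subpopulation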
Clusters derived by choosing the $\boldsymbol{\theta} = (\boldsymbol{\theta}_1,\boldsymbol{\theta}_2,\ldots,\boldsymbol{\theta}_c)$ and the $\boldsymbol{\gamma}$ which maximize the likelihood built from the $f_j(\boldsymbol{x},\boldsymbol{\theta}_j)$
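
Written out, the quantity being maximized is usually the classification likelihood. The form below assumes $n$ observations $\boldsymbol{x}_1,\ldots,\boldsymbol{x}_n$ and the label convention $\gamma_i = j$ when $\boldsymbol{x}_i$ belongs to subpopulation $j$; the indices $i$ and $n$ are notation introduced here for illustration.

$$L(\boldsymbol{\theta}, \boldsymbol{\gamma}) = \prod_{i=1}^{n} f_{\gamma_i}(\boldsymbol{x}_i, \boldsymbol{\theta}_{\gamma_i})$$

Each observation contributes only the density of the cluster it is assigned to, so changing a single label $\gamma_i$ changes exactly one factor of the product.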
Model types
        • spherical
          • equal volume (EII)
          • unequal volume (VII)
        • diagonal
          • equal volume and shape (EEI)
          • varying volume, equal shape (VEI)
          • equal volume, varying shape (EVI)
          • varying volume and shape (VVI)
        • ellipsoidal
          • equal volume, shape, and orientation (EEE)
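          • varying volume, equal shape and orientation (VEE)
          • equal volume and orientation, varying shape (EVE)
          • varying volume and shape, equal orientation (VVE)
          • equal volume and shape, varying orientation (EEV)
          • varying volume and orientation, equal shape (VEV)
          • equal volume, varying shape and orientation (EVV)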
          • varying volume, shape, and orientation (VVV)
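
The three-letter codes can be read against an eigenvalue decomposition of each cluster's covariance matrix. Assuming Gaussian subpopulation densities (an assumption; the list above does not name the density), the covariance of cluster $j$ can be written as

$$\boldsymbol{\Sigma}_j = \lambda_j \boldsymbol{D}_j \boldsymbol{A}_j \boldsymbol{D}_j^{\top}$$

where the scalar $\lambda_j$ sets the volume, the diagonal matrix $\boldsymbol{A}_j$ (scaled so that $\det \boldsymbol{A}_j = 1$) sets the shape, and the orthogonal matrix of eigenvectors $\boldsymbol{D}_j$ sets the orientation. The symbols $\lambda_j$, $\boldsymbol{A}_j$, and $\boldsymbol{D}_j$ are introduced here for illustration. The letters give volume, shape, and orientation in that order: E means equal across clusters, V means varying, and I means the identity matrix. For example, EII corresponds to $\boldsymbol{\Sigma}_j = \lambda \boldsymbol{I}$ for every cluster, while VVV leaves each $\boldsymbol{\Sigma}_j$ unconstrained.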