Cluster? I 'ardly know 'er!

2013.08.15   3:54PM   machine learning, numerical analysis

Cluster analysis is some powerful voodoo, and has been applied to a ton of different areas. Market analysis? Cluster it. Social network analysis? Clustering. Medical diagnostics? More clusters.

In a nutshell, cluster analysis is a tool for classification. It involves sorting observations into "natural" groupings. For small data sets in 2 or 3 dimensions, clustering can easily be performed without a computer. However, large data sets with many dimensions necessitate automated means.

I've recently had the opportunity to explore cluster analysis as part of my research. Here are a few concepts that I've found particularly awesome.

  1. Spectral Clustering

    Spectral clustering is easily implemented, quickly solved, and usually outperforms methods like traditional k-means. For datasets like that below, spectral clustering much better than k-means.

    Spectral clustering is really more of a pre-processing step than a clustering algorithm. It consists of computing the eigenvalues of the Laplacian of the similarity matrix, and then reducing the dimensionality of the data according to the resulting eigenvectors. The transformed data is then clustered with a traditional algorithm (usually k-means).

  2. NERF c-means Clustering

    This clustering method has the coolest name ever. The authors even point it out in their paper their paper:

    While this acronym does not emphasize the important duality between NERF and OFCM, it is just too tempting to resist!
    NERF stands for Non-Euclidean Relational Fuzzy. Relational means that the algorithm doesn't even need the full data, just a distance matrix. Even cooler, Non-Euclidean means that the distance matrix doesn't even need to correspond to a typical Euclidean space. The distance metric can even be something odd, like the Hamming Distance or the Damerau-Levenshtein Distance.

  3. How many clusters?

    How many clusters should the clustering algorithm make? This is not the trivial issue that it might seem. For large, complicated data sets this question can be approached semi-quantitatively by observing the plot of the within-cluster sum squared error (WSSE) as a function of the number of clusters (k). For k = 1 the value of WCSS is equal to the SSE of the data set. We then expect the WCSS to decrease rapidly until the appropriate number of clusters is reached. After that, additional clusters should not decrease the WCSS very much. Therefore, the appropriate number of clusters occurs at the "elbow" on the WCSS v. k plot. For instance, the plot might look like this:

    Using the heuristic described above, this plot indicates that approximately 4 clusters exist. One method that tries to numerically determine this elbow is presented in this paper.

  4. Rand Index

    Different clustering algorithms can produce different clustering patterns. For that reason, its sometimes beneficial to compare two partitions. Luckily, the Rand Index allows us to do just that. The Rand Index varies between a value of 0 when the partitions are very different, to 1 when they are identical. However, the Rand Index tends to increase with the number of clusters. Hubert et al remedied this shortcoming, proposing the Adjusted Rand Index. The adjusted Rand Index also varies between 0 and 1 (in expectation).

Cluster analysis is a broad field, and is very much alive. To learn more, check out this wikipedia article, this toolbox, or this textbook.

by Chris McComb







RSS Feed

© Chris McComb 2013