*Engineering, Tech, TeX, and Numbers. Lots of Numbers.*


Cluster analysis is some powerful voodoo, and has been applied to a *ton* of different areas. Market analysis? Cluster it. Social network analysis? Clustering. Medical diagnostics? More clusters.

In a nutshell, cluster analysis is a tool for classification. It involves sorting observations into "natural" groupings. For small data sets in 2 or 3 dimensions, clustering can easily be performed without a computer. However, large data sets with many dimensions necessitate automated means.
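As a quick illustration of what "automated means" looks like, here is a minimal sketch using Python with scikit-learn (an assumption about tooling; the post doesn't prescribe a library):

```python
# Minimal clustering sketch: k-means on two well-separated groups.
# scikit-learn is an assumption here; the post names no library.
import numpy as np
from sklearn.cluster import KMeans

# Two "natural" groupings of 2-D observations
rng = np.random.default_rng(0)
group_a = rng.normal(loc=(0, 0), scale=0.1, size=(20, 2))
group_b = rng.normal(loc=(5, 5), scale=0.1, size=(20, 2))
X = np.vstack([group_a, group_b])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Each natural group should receive one consistent label
print(sorted(set(labels[:20])), sorted(set(labels[20:])))
```

For a toy set like this you could eyeball the answer, but the same three lines of fitting code scale to thousands of observations in dozens of dimensions.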

I've recently had the opportunity to explore cluster analysis as part of my research. Here are a few concepts that I've found particularly awesome.

**Spectral Clustering**

Spectral clustering is easily implemented, quickly solved, and usually outperforms traditional methods like *k*-means. For datasets like the one below, spectral clustering performs much better than *k*-means.

Spectral clustering is really more of a pre-processing step than a clustering algorithm. It consists of computing the eigenvectors of the Laplacian of the similarity matrix, and then reducing the dimensionality of the data according to those eigenvectors. The transformed data is then clustered with a traditional algorithm (usually *k*-means).

**NERF *c*-means Clustering**

This clustering method has the coolest name ever. The authors even point it out in their paper:

NERF stands for Non-Euclidean Relational Fuzzy. *"While this acronym does not emphasize the important duality between NERF and OFCM, it is just too tempting to resist!"*

*Relational* means that the algorithm doesn't even need the full data, just a distance matrix. Even cooler, *Non-Euclidean* means that the distance matrix doesn't even need to correspond to a typical Euclidean space. The distance metric can even be something odd, like the Hamming distance or the Damerau-Levenshtein distance.

**How many clusters?**

How many clusters should the clustering algorithm make? This is not the trivial issue that it might seem. For large, complicated data sets the question can be approached semi-quantitatively by observing the plot of the within-cluster sum of squared errors (*WCSS*) as a function of the number of clusters (*k*). For *k = 1*, the *WCSS* is equal to the *SSE* of the entire data set. We then expect the *WCSS* to decrease rapidly until the appropriate number of clusters is reached. After that, additional clusters should not decrease the *WCSS* very much. Therefore, the appropriate number of clusters occurs at the "elbow" of the *WCSS vs. k* plot. For instance, the plot might look like this:

Using the heuristic described above, this plot indicates that approximately 4 clusters exist. One method that tries to numerically determine this elbow is presented in this paper.
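The elbow heuristic is easy to sketch in code. Here is a hypothetical example in Python with scikit-learn, whose `inertia_` attribute is exactly the WCSS (the library choice is my assumption, not the post's):

```python
# Elbow heuristic: compute WCSS as a function of k.
# Assumes Python + scikit-learn; KMeans.inertia_ is the WCSS.
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with 4 well-separated clusters
rng = np.random.default_rng(0)
centers = np.array([(0, 0), (10, 0), (0, 10), (10, 10)])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# WCSS drops steeply up to k = 4, then flattens -- the "elbow"
for k, w in zip(range(1, 9), wcss):
    print(k, w)
```

Plotting `wcss` against `k` for this data reproduces the elbow shape described above, with the bend at `k = 4`.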

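To make the spectral-clustering recipe above concrete, here is a sketch on two concentric rings, a classic case where plain *k*-means fails. scikit-learn's `SpectralClustering` bundles the whole pipeline (Laplacian eigenvectors, then *k*-means); using it is my assumption, not the post's code:

```python
# Spectral clustering on two concentric rings.
# The rings are not linearly separable, so plain k-means mixes them;
# the spectral embedding separates them cleanly.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[np.cos(theta[:100]), np.sin(theta[:100])]      # radius 1
outer = 5 * np.c_[np.cos(theta[100:]), np.sin(theta[100:])]  # radius 5
X = np.vstack([inner, outer])

labels = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",  # similarity graph from k nearest neighbors
    n_neighbors=10,
    random_state=0,
).fit_predict(X)

# Each ring should receive a single, consistent label
print(sorted(set(labels[:100])), sorted(set(labels[100:])))
```

The `nearest_neighbors` affinity builds the similarity matrix from a k-nearest-neighbor graph, so the two rings end up in separate graph components and the Laplacian eigenvectors identify them directly.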
**Rand Index**

Different clustering algorithms can produce different clustering patterns. For that reason, it's sometimes beneficial to compare two partitions. Luckily, the Rand Index allows us to do just that. The Rand Index varies from 0, when the partitions are completely different, to 1, when they are identical. However, the Rand Index tends to increase with the number of clusters. Hubert et al. remedied this shortcoming by proposing the Adjusted Rand Index, which has an expected value of 0 for random partitions and a maximum of 1 for identical ones.
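Both indices are available in scikit-learn (again, an assumption about tooling on my part):

```python
# Comparing two partitions with the (Adjusted) Rand Index.
# scikit-learn is an assumption; the post doesn't prescribe a library.
from sklearn.metrics import rand_score, adjusted_rand_score

a = [0, 0, 1, 1, 2, 2]   # one clustering of six observations
b = [1, 1, 0, 0, 2, 2]   # the same partition, labels permuted
c = [0, 1, 2, 0, 1, 2]   # an unrelated partition

print(rand_score(a, b))           # 1.0 -- identical partitions
print(adjusted_rand_score(a, b))  # 1.0
print(adjusted_rand_score(a, c))  # -0.25 -- the ARI can dip below 0
```

Note that both indices are invariant to label permutation: `a` and `b` describe the same grouping, so they score a perfect 1.0 even though the label numbers differ.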

Cluster analysis is a broad field, and is very much alive. To learn more, check out this Wikipedia article, this toolbox, or this textbook.

by Chris McComb