Bridging Clustering Theory and Practice: Overview and Recommended Readings
Clustering is one of the most popular data mining tools, used in a wide range of applications, from astronomy to zoology. However, despite its popularity, there is a substantial gap between theory and practice in clustering. Many basic questions, such as `what is clustering?' and 'which clustering algorithm should I use for my application?' remain difficult to answer. The emphasis of my work is to bridge this gap by developing theoretical foundations of clustering that are genuinely useful in practice.
The following papers contribute to bridging the gap between theory and practice in clustering, and will help students and researchers get to the frontier of research in this relatively new area. There is an abundance of important open problems in this line of work. If you are a student who is interested in working with me, please have a look at this overview and some of the papers it mentions before contacting me.
It should be noted that there are many other papers on the theory of clustering (such as work concerned with computational issues and algorithms under certain generative assumptions). The focus here is on bridging theory and practice.
This overview is intentionally brief, and in particular does not include all of my contributions to this line of work, which you can find in my
complete list of my publications. Check out my research page for information on my current projects.
In addition to this overview, you may also want to check out my talk on formal foundations of clustering.
The User's Dilemma
Perhaps the most important branch of research aiming to bridge theory and practice is concerned with solving "the user's dilemma," which is the problem of finding an appropriate clustering algorithm for a specific task. This is a particularly difficult problem because clustering algorithms vary radically in their input-output behaviour. That is, different clustering methods often output substantially different partitions given the same input. The following papers study differences in the input-output behaviour of clustering techniques in order to allow users to select a suitable clustering algorithm for their application.
For a high level description of this branch of research, see the following paper, which presents an approach for solving the user's dilemma by finding succinct, user-friendly properties that capture differences in the input-output of clustering techniques.
Some of the most important properties in this framework were discovered in the following paper:
A related set of properties, capturing differences in how algorithms respond to small adversarial sets, is discussed here:
Following the framework for algorithm selection discussed above, there have been several property-based characterizations of clustering algorithms and paradigms.
Here is a characterization of Linkage-based clustering:
The following paper characterizes the single-linkage algorithm:
The most fundamental problem in clustering is concerned with what is it. Unfortunately, there is no consensus on the definition of clustering, beyond the vague notion that is should group similar items together. In particular, there are currently no axioms that define clustering.
In 2002, Klienberg published the following highly influential paper, which argued that there are inherent limitations to axiomatizing clustering. In particular, he showed that no clustering function can satisfy a specific set of three natural properties. This paper was instrumental in reigniting interest in this area of research.
In 2008, Ben-David and I published a paper that what has since been referred to as a "rebuttal" to the above impossibility result, showing that Klienberg's axioms are consistent in the closely related framework of clustering quality measures, which are functions that quantify how good or 'cogent' is a given clustering.
- Jon Kleinberg. An Impossibility Theorem for Clustering. Advances in Neural Information Processing Systems (NIPS 2002).
Clusterability is a fundamental concept in clustering as it captures the degree of inherent cluster structure in data. For instance, random data sets typically have no inherent clustering structure. As we cannot expect algorithms to perform well when data is not clusterable, clusterability becomes an important consideration when evaluating the both the qualitative and quantitative performance of an algorithm.
For an in-depth discussion of this concept see:
There have been many notions of clusterability proposed in previous work. Some of the most interesting notions were proposed by Balcan, Blum, and colleagues. Their line of work focuses on notions that allow for efficient discovery of the desired clustering. Here are two papers in this line of work that I often refer to:
As the amounts of data avaliable for analysis reach unprecedented levels, users are often forced to make the transition from batch to incremental clustering methods, where data arrives one element at a time and there is not enough resources to store the entire set. New algorithms for this crucial settings are being proposed, but theoretical analysis has only began very recently.
This new direction of research is explored in the following recent paper, where it is shown (among other things) that incremental clustering techniques are strickly weaker than batch methods in their ability to discover cluster structure. That is, there are clusterings that are readily detectable using classical, batch techniques, that provably cannot be found by any incremental clustering method.
Other Recommended Readings
- One of the main objectives in bridging the gap between theory and practice is to foster interactions between research communities that apply and study clustering. For instance, Phylogeny is a prime application of hierarchical clustering, and yet, the Phylogeny community has very little interaction with the clustering community. My recent paper with Brown and Ben-David starts to bridge the gap between these two fields:
Margareta Ackerman, Dan Brown, and David Loker. Effects of Rooting via Outgroups on Ingroup Topology in Phylogeny. International Journal of Bioinformatics Research and Application (IJBRA), 2014.
- Meila has some interesting work on properties of functions that measure the distance between clusterings. See, for example:
Marina Meila. Comparing clusterings -- an axiomatic view. International Conference on Machine Learning (ICML 2005).
- Another interesting line of work explores the theoretical implications of stability. The following is an influencial paper in this direction:
Shai Ben-David, Urlike von Luxburg, and David Pal. A Sober Look at Stability of Clustering. Conference on Learning Theory (COLT 2006).