Cluster Analysis - 101

The current Wikipedia page on Cluster Analysis, excerpted below, is correct, detailed and makes absolute sense.  Then again, if you do not have a background in statistical modeling, I'm guessing these two paragraphs leave you no wiser.

 
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
 
Wikipedia 4/2012 

In this post I hope to provide a workable introduction for people who need to be educated consumers of cluster analysis.
(If you want more technical detail, I suggest you go back to the Wikipedia link above, and follow up on the range of hyperlinks embedded in the document -  it's really very good.)

Let's put this in context with an example.  Assume we are working with a retailer that has 1000 stores and they want to decide which dairy products to put in each store to maximize sales.

One option would be to treat all stores as being the same, come up with one assortment list and put it everywhere.  This has the singular advantage of being easy to execute, but it leaves all non-mainstream products off the shelf.

At the other extreme, we could try to tailor product assortment individually by store.  Did I mention there are 1000 of them?  Apart from the work involved in building 1000 individual analyses, do we have the discipline to execute such analyses consistently across 1000 stores? Would we have sufficient organization to execute these unique selections in 1000 stores?

Most teams will end up working with groups of stores that they consider "similar" as a compromise.  These groupings may be based on single store features (e.g. stores with more than 30% of sales from premium products) or maybe geographical features (e.g. everything in the South East).  Bear in mind that, if you are trying to do this without statistical help, these do need to be very simple groupings.

For assortment selection, we really want to group together stores where people buy similar products.  In this case we want to find groups of stores that have similar sales patterns.  For dairy, these sales patterns could be related to the % of sales associated with various product characteristics:

  • premium vs. value
  • single-serve vs. multi-serve
  • cheese vs. yogurt vs. creamer vs. drinks vs. milk
  • yogurt styles

(Note: I do eat a lot of dairy but I haven't worked in that category yet, so forgive me if I missed something big.)
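To make "% of sales" concrete, here is a minimal sketch of turning raw sales records into per-store shares.  The store IDs, tier names and dollar figures are all made up for illustration; real data would cover many more characteristics than premium vs. value.

```python
from collections import defaultdict

# Hypothetical sales records: (store_id, product_tier, sales_dollars).
sales = [
    ("store_001", "premium", 300.0),
    ("store_001", "value", 700.0),
    ("store_002", "premium", 450.0),
    ("store_002", "value", 150.0),
]

def sales_shares(records):
    """Return {store: {tier: share of that store's total sales}}."""
    totals = defaultdict(float)
    by_tier = defaultdict(lambda: defaultdict(float))
    for store, tier, dollars in records:
        totals[store] += dollars
        by_tier[store][tier] += dollars
    # Stores with no sales at all are skipped to avoid dividing by zero.
    return {
        store: {tier: amount / totals[store] for tier, amount in tiers.items()}
        for store, tiers in by_tier.items()
        if totals[store] > 0
    }

shares = sales_shares(sales)
```

Working with shares rather than raw dollars means a big store and a small store with the same mix of products look alike, which is what we want for assortment decisions.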

Cluster analysis (actually a family of statistical algorithms, not just one) is used to scan across multiple features that you think are important, to find groups (clusters) so that:

  • stores within a cluster are similar to each other
  • stores in different clusters are dissimilar
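For the curious, here is a rough sketch of k-means, one common member of the clustering family.  It is only one of many possible algorithms and not necessarily the one you would pick for real store data.  The six stores and their two features (premium share, single-serve share) are invented, and the function name and defaults are my own choices.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical stores described by (premium share, single-serve share).
stores = [
    (0.10, 0.20), (0.12, 0.25), (0.15, 0.22),  # value-leaning stores
    (0.60, 0.70), (0.65, 0.75), (0.62, 0.68),  # premium-leaning stores
]
labels, centroids = kmeans(stores, k=2)
```

On data this cleanly separated, the first three stores end up in one cluster and the last three in the other.  Which cluster gets called 0 and which gets called 1 is arbitrary.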

It sounds rather like magic doesn't it?  You just throw data at the algorithm and (big fanfare) it finds clusters!  Well perhaps it's not quite that easy.

  • It does take some care to prepare the data, ensuring it's clean, accurate and in a form that works for this process (see Data Cleansing: boring, painful, tedious and very, very important).  
  • In reviewing the results you may decide to drop some features and split out others (e.g. "Premium" is split into "Premium" and "Super Premium").
  • You need to determine how many clusters are correct for your data.
  • You may want to bring in some additional data to help describe clusters after they are created
    • demographics of people living near the store (ethnicity, income, household size etc.)
    • geography (maps work well)
    • local competition
  • This is really an extension of basic clustering, but you could build predictive models to explain why, for example, super-premium Greek yogurt is so very popular in Cluster 4.  If you can tie high sales of this product group to specific demographics, you may find other stores with similar demographics that have not previously sold it.  (That could be a big opportunity.)
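As a small illustration of describing clusters with additional data, the sketch below averages one demographic measure within each cluster.  The cluster assignments and income figures are invented, and a real profile would look at several measures side by side.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inputs: a cluster label and one demographic value per store.
cluster_of = {"store_001": 1, "store_002": 1, "store_003": 2, "store_004": 2}
median_income = {
    "store_001": 42000,
    "store_002": 46000,
    "store_003": 81000,
    "store_004": 79000,
}

def profile_clusters(cluster_of, attribute):
    """Return {cluster: average of the attribute across that cluster's stores}."""
    grouped = defaultdict(list)
    for store, cluster in cluster_of.items():
        grouped[cluster].append(attribute[store])
    return {cluster: mean(values) for cluster, values in grouped.items()}

profile = profile_clusters(cluster_of, median_income)
```

A table like this is often the first thing people ask for once they see the clusters: it turns "Cluster 2" into something closer to "the higher-income stores".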

I'll return to this topic in future posts, but for today, your takeaways are simple:

  1. Cluster Analysis finds better groups (clusters) of similar things.  
  2. Clusters help you target your offering without dying under the weight of work.