Data Cleansing: boring, painful, tedious and very, very important

I've been working recently on a category management project and I'm reminded of just how essential clean, well-organized data is.  We are working to group stores into "clusters" of similar stores; later we will see which geographic and demographic data best help us to predict cluster membership and optimize product assortment by cluster.

As a first pass, and under a severe time crunch, we took the available data and ran it through the model.  It processed, but I was unhappy with the predictive power we found.  Of course, this approach was ridiculously optimistic, so it was back to look at the product characteristics we were using.  While the data were cleaner than expected, they still suffered from a range of problems visible even to someone who does not know the products that well:

  • missing, invalid and inconsistent values
  • inconsistency across related products (flavor variations with different weights and pricing)
  • product characteristics that should really be split into multiple characteristics (because the options are not mutually exclusive)
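
To make these three kinds of problem concrete, here is a minimal sketch in plain Python.  It is not the tooling we used on the project: the records, the field names (sku, family, weight_g, tags) and the helper functions are all hypothetical, and real checks would need rules specific to your own product data.

```python
from collections import defaultdict

# Hypothetical product records; every field name and value is made up.
products = [
    {"sku": "A1", "family": "cola", "flavor": "original",
     "weight_g": 330, "tags": "diet;caffeine-free"},
    {"sku": "A2", "family": "cola", "flavor": "cherry",
     "weight_g": 355, "tags": "diet"},
    {"sku": "B1", "family": "chips", "flavor": "salted",
     "weight_g": None, "tags": ""},
    {"sku": "B2", "family": "chips", "flavor": "paprika",
     "weight_g": -150, "tags": "gluten-free"},
]


def missing_or_invalid(rows, field):
    """Return SKUs whose numeric field is missing or not a positive number."""
    return [r["sku"] for r in rows
            if not isinstance(r[field], (int, float)) or r[field] <= 0]


def inconsistent_families(rows, field):
    """Return product families whose members disagree on a given field."""
    seen = defaultdict(set)
    for r in rows:
        if r[field] is not None:
            seen[r["family"]].add(r[field])
    return sorted(f for f, values in seen.items() if len(values) > 1)


def split_tags(rows, field="tags", sep=";"):
    """Split one multi-valued field into separate yes/no characteristics."""
    options = sorted({t for r in rows for t in r[field].split(sep) if t})
    return [
        {"sku": r["sku"], **{o: o in r[field].split(sep) for o in options}}
        for r in rows
    ]


bad_weights = missing_or_invalid(products, "weight_g")
mixed_families = inconsistent_families(products, "weight_g")
flags = split_tags(products)
```

Here bad_weights lists the SKUs to fix or exclude, mixed_families lists the product families whose flavor variants carry different weights and so deserve a second look, and flags holds one true/false column per option in place of the single combined field.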

It's taken a few iterations and a few days to get this cleaned up and to embed some of the new product characteristics in the system, but it's worth every minute.  Even simple analysis now shows more meaningful results.  Predictive models will benefit even more.

Clustering is a relatively simple statistical process.  Once it is set up, I can teach someone with limited predictive modeling skills to re-run models with sensible defaults and to interpret the outputs.  Cleaning the data and presenting it correctly to the modeling tools (so you get useful answers) takes more skill.
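
One common example of "presenting it correctly" (my illustration here, not a step described above) is putting numeric features on a comparable scale, since distance-based clustering methods such as k-means otherwise let the largest-valued feature dominate.  A minimal sketch, assuming two hypothetical store features:

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical store features on very different scales.
floor_area_sqft = [12000, 45000, 30000, 8000]
checkout_lanes = [4, 12, 9, 3]

scaled_area = standardize(floor_area_sqft)
scaled_lanes = standardize(checkout_lanes)
```

After scaling, a difference of one standard deviation counts the same in either feature, so floor area no longer swamps the lane count simply because its raw numbers are bigger.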

So, if you are knee-deep in a modeling project and have not paused to check your data quality, perhaps now is the time...