Data handling - the right tool for the job

Microsoft Excel must be the most widely used analytic tool available today but, it’s not the only tool in my tool-bag and it should not be the only one in yours either.  It’s perfect for building small models, has broad embedded functionality, good charting capabilities, pivot-tables (a superb tool worthy of its own post), comes with a programming language so you can extend its capability even further and almost everybody has a copy already.  It’s awesome and I use it every day.

But...the data we analyze is getting much bigger and a lot more complex.  Even with newer versions of Excel allowing over 1 million records in a sheet, what can you do usefully in Excel with 1 million records?  Certainly, you don’t want to print it, it’s near impossible to chart or model against, bottom line you are using the wrong tool for the job.  To do the job well you need to find, learn and use the right tool.  Using Excel on large or complex datasets is like chopping down a tree with a hammer!

It may be tempting to aggregate your data to reduce data size and continue using Excel.  Beware!  When you aggregate (Sum, Average, etc.) you lose detail you might need.  As an example, let's look at retailer point-of-sale data: the record of every purchase in every store:  

  • Aggregate individual shopper transactions into daily summaries (by product and store) and you can no longer see product associations in the basket or product substitution  from basket to basket;  
  • Aggregate daily to weekly data and you will be blind to monthly sales patterns (which can be extreme).  
  • Aggregate store to national data and you will be unable to see regional patterns or relate sales profiles to demographics by store.  

If you throw away the detail you need (just to fit in a smaller space) then to quote a friend of mine "everything just looks vanilla".

At this point, most business analysts will open up Microsoft Access.  This can handle more data (up to about 2 Gigabytes) and the complexity of multiple tables better than Excel does it in multiple worksheets.  It also uses the same programming language so there is less to learn.   It’s a very good tool when used appropriately and I have a lot of experience with it, even designing custom data architectures and programming extensions to drive faster performance from it.

It seems easy to go beyond 2 Gigabytes these days but even before you reach 2GB you really should move to a more industrial-strength database (SQL Server, Oracle, DB2, MySQL…) just to handle the data volume effectively.  As data heads towards Terabytes and the data complexity increases you will need much more knowledge about database design and programming to do it well.

Analysis tools are not nearly so easy to apply in database environments as in Excel, requiring access to more high powered and analytical, data mining and optimization tools (R, SAS, SPSS, ILOG) or, for simpler analytics and my occasional preference, a capable programmer that can embed the analytics directly into your database applications so they are usable on demand.

I have recently been working on a 3 Terabyte database, around 1500 times larger than the biggest Access database you can build and substantially larger than most SQL Server databases.  At this point, we are talking ‘Big Data’.  In the analytical world you’ll see this phrase ‘Big data’ a lot right now.  If, for example, you’re collecting high frequency sensor data for weather modeling, or collecting data on for a web search engine, 3TB may not seem that big, but by current CPG standards it certainly is and importantly, it's big enough that it becomes challenging to use regular, even industrial-strength, database packages. 

In this case (for Point of Sale data analysis) we use a novel and application-specific, data-architecture and associated query engine developed by Shiloh Technologies (www.shilohtech.com) on a SQL Server database.  The Shiloh application, enhanced now by a variety of analytic tools handles the huge data volumes for us, runs analysis, filters, summarizes and presents the data back to the user in a volume that a mere human can manage (certainly less than 100,000 records, 10,000 would be better).  And guess where we put that data?  Back into Excel of course – it’s a phenomenal tool!