The right tools for (structured) BIG DATA handling (update)

A couple of weeks ago, I ran a somewhat rough benchmark to show just how much faster large database queries can run if you use better tools.
The right tools for (structured) BIG DATA handling  Here's the scenario: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal.  Sounds familiar?
I looked at the value of upgrading hard-drives (to make sure the CPU is actually busy) and the benefit of using columnar storage which let's the database pull back data in larger chunks and with fewer trips to the hard-drive.  The results were ..staggering.  A combined 4100% increase in processing speed so that I could read and aggregate 10 facts from a base table with over 40 million records on my laptop in just 37 seconds.

At the time I promised an update on a significantly larger data-set to see whether the original results scaled well.  I also wanted to see whether query times scaled well to fewer facts.  Ideally querying against 5 facts should take about 50% of the original 10 fact aggregation queries.



Test environment

My test environment remains the same,  a mid-range laptop, quad-core AMD CPU, with 8 GB of RAM running Windows 7 (64 bit) and with a relatively cheap (<$400) fast solid-state drive.

This time though I increased the data-quantity 10-fold to 416 million records

Then I ran the same aggregation SQL to pull back summaries of 10 facts from this table.
SELECT Item.Category, Period.Year, SUM(POSFacts.Fact1) AS Fact1, SUM(POSFacts.Fact2) AS Fact2, SUM(POSFacts.Fact3) AS Fact3, SUM(POSFacts.Fact4) AS Fact4, SUM(POSFacts.Fact5) AS Fact5, SUM(POSFacts.Fact6) AS Fact6, SUM(POSFacts.Fact7) AS Fact7, SUM(POSFacts.Fact8) AS Fact8, SUM(POSFacts.Fact9) AS Fact9, SUM(POSFacts.Fact10) AS Fact10 FROM Item INNER JOIN POSFacts ON Item.ItemID = POSFacts.ItemID INNER JOIN Period ON POSFacts.PeriodID = Period.PeriodID GROUP BY Item.Category, Period.Year

I repeated this timed exercise 5 times for:
  • standard (row-based) SQL Server 2012
  • SQL Server 2012 with the ColumnStoreIndex applied
  • InfiniDb (a purpose built column-store database)

Finally I ran it again on each configuration but just summarizing for 1 fact:
SELECT Item.Category, Period.Year, SUM(POSFacts.Fact1) FROM Item INNER JOIN POSFacts ON Item.ItemID = POSFacts.ItemID INNER JOIN Period ON POSFacts.PeriodID = Period.PeriodID GROUP BY Item.Category, Period.Year

Results

Before we get to the query timing let's look at what was happening to my machine while queries were running.  

The first screen-shot below (click to enlarge) was taken while running queries with base SQL-Server (no column store indexes).  You can see that the CPU is just not busy.  In fact it's averaging only 30%  and that's with the solid-state disk installed.  The drive is busy, but only serving up about 50MB/s.  (I say "only" but of course that's much better than the old hard-drive.)
System resources running base SQL Server query

The next screenshot shows system resources while running a query with the ColumnStore Index applied.  The CPU is now busy on average 77% of the time and peaking at 100% on occasion.  The disk utilization chart may be misleading because it's now plotted on a much larger scale but the same disk is now hitting 200MB/s.  I think we can expect great things !
System resources running SQL Server with the ColumnStore Index


So, on to the timed results.  I ran each scenario 5 times and all results were very consistent, within +/-10% of the average.
Test results against 416 million records

Again, SQL Server 2012 with the Columnstore Index is the clear  winner.  Just 217 seconds to aggregate all 10 facts and, amazingly, just 21 seconds to aggregate 1 fact across the same 416 million records.  InfiniDb takes over twice as long against 10 facts and does not scale nearly as well with the single fact query.


Now compare with the results we got last time to see how well each database scaled with the increase in data volume.



Data volume increase by a factor of 10 and:

  • InfiniDb and base SQL server both increased query time by about a factor of about 10, roughly in proportion.
  • SQL server with the ColumnStore index only increased by a factor of 5.9 !   


To be fair I am comparing the (free) community edition of InfiniDB against (decidedly not free) SQL Server and neither tool is really intended to be run on a laptop.  But if you need rapid aggregation of data and do not have access to a cluster of commodity servers - it is clear that columnar storage helps you get that data out fast.

The other thing you may want to consider is that it took me substantially less time to load the data into InfiniDB (sorry I did not time it but we're talking minutes not seconds), and building that ColumnStore index in SQL actually took longer than the base query ~ 4500 seconds.  You may not want to go to this trouble if you just need a couple of quick aggregations.

Remember also that the table with the ColumnStore index is read-only after the index is applied. Want to make some updates?  That would be easier in InfiniDB.

Conclusions

Ultimately I'm not trying to sell you on either option, but if you have a lot of structured data to feed your analytic project, a columnar database may well be the way to go right now.