The right tools for (structured) BIG DATA handling

Here's the scenario: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal.  Sounds familiar?

Let's use Point of Sale data as an example as POS data can easily  generates more data-volume than the ERP system.  The data is simple and easily organized in conventional relational database tables -  you have a number of "facts" (sales-revenue, sales-units, inventory,  etc.) defined by product, store and day going back a few years and then some additional information about products, stores and time stored in master ("dimension") tables,

The problem is that you have thousands of stores, thousands of products and hundreds (if not thousands) of days - this can very quickly feel like "big data".    Use the right tools and my rough benchmarks suggests you can not only handle the data but see a huge increase in speed.


Let's see just how big this data could be:
If on each day, you collect 10 facts for 1,000 products at 1,000 stores that would be 10 million facts every day (10 x 1000 x 1000) .  Look at it annually , that's 3.65 billion facts every year.  
Is it big compared to an index of the world-wide-web? No it's tiny, but in comparison to the data a business analyst normally encounters it's not just "big" its "enormous".  Just handling basic data manipulation (joins, filters, aggregation etc,) is a problem.  Trying to handle this in desktop tools like Excel, or Access is completely impossible.

As usual, there are better tools and worse tools - you must use a database, but even with a conventional server-based database like Microsoft's SQL*Server, you may have problems with speed.   I wanted to see how speed is impacted, firstly by upgrading the hard-drive and second by using two varieties of column-store databases.  

A couple of relatively simple changes and bench-marking shows a 4100% increase in speed.  If a 4100% increase does not indicate to you that there may be a better tool for the job, I don't know what will.

Running analytics against this data (once delivered from a tool that has joined, filtered and aggregated) appropriately is another challenge that we will get to in a later post.

First a little disclosure: I am first and foremost an analyst: my technologies of choice are statistics, mathematics, data-mining, predictive-modeling, operations-research,... NOT databases and NOT hardware-engineering. To feed my need for data I have become adept in a number of programming languages and relational database systems. I'm most comfortable in SQL Server just because I'm more familiar with that tool though I have used other databases too. Bottom line, I'm a lot better than "competent" but I am not "expert".

Test environment

I built a test database in SQL Server 2012 with 4 tables in a simple "star schema": 1 "fact" table with 10 facts per record and 3 associated "dimension" tables as follows:


Inline image 1

The data itself is junk I generated randomly in SQLServer with appropriate keys and indexes defined.  

This represents approximately 8 GB of data.  Not enormous (and as you will see later) perhaps not big enough to test one of the options fully, but big enough to get started and much bigger than many analysts ever see.

I'm testing this on a mid-range laptop, quad-core AMD CPU, with 8 GB of RAM running Windows 7 (64 bit) that cost substantially less than $1000 new. You probably have something very like it sat on your desk.

I then wanted to see how long it would take to take to perform a simple aggregation. My test SQL (below) joins the fact table to both the product and period dimension tables then adds each fact (1 thru 10) for each year and brand. Not very exciting perhaps but a very common question "what did I sell by brand by year".

SELECT Item.Category, Period.Year, SUM(POSFacts.Fact1) AS Fact1, SUM(POSFacts.Fact2) AS Fact2, SUM(POSFacts.Fact3) AS Fact3, SUM(POSFacts.Fact4) AS Fact4, SUM(POSFacts.Fact5) AS Fact5, SUM(POSFacts.Fact6) AS Fact6, SUM(POSFacts.Fact7) AS Fact7, SUM(POSFacts.Fact8) AS Fact8, SUM(POSFacts.Fact9) AS Fact9, SUM(POSFacts.Fact10) AS Fact10 FROM Item INNER JOIN POSFacts ON Item.ItemID = POSFacts.ItemID INNER JOIN Period ON POSFacts.PeriodID = Period.PeriodID GROUP BY Item.Category, Period.Year

Each run was repeated 5 times and the elapsed time for each run averaged to get the results shown below.  While there was variation in run times, this was typically within about 10% of the average for each test.

Baseline

This is my starting point: SQL Server 2012 in its regular row storage mode.   


Faster Storage

This database query is going to need a lot of data from the hard-disk; actually almost all the data in the database.  The hard-drive that came with my laptop was not especially slow but it was clear that it was a bottleneck on my system.  While running this query the standard disk could not deliver data fast enough to keep the CPU busy - in fact the CPU was rarely operating at even 50% capacity.   An option to upgrade the hard-drive seemed to be in order.  ($350 for a 480 GB Solid State Disk).

SQL Server 2012 ColumnStore Indexes

SQL 2012 has a new feature called a Columnstore Index.  Per the Microsoft website:
 "An xVelocity memory optimized columnstore index, groups and stores data for each column and then joins all the columns to complete the whole index. This differs from traditional indexes which group and store data for each row and then join all the rows to complete the whole index. For some types of queries, the SQL Server query processor can take advantage of the columnstore layout to significantly improve query execution times... Columnstore indexes can transform the data warehousing experience for users by enabling faster performance for common data warehousing queries such as filtering, aggregating, grouping, and star-join queries."
To put that in plainer English - for data warehousing applications (like reporting and analytics) a columnar database structure can pull its data with fewer trips to the disk - and that's faster, potentially a LOT faster.  (By the way if you want your database to support a transactional system where you will repeatedly be hitting it with a handful of new records or record changes - this could be an excellent way to slow it down  )

Now adding a ColumnStoreIndex does take a while but it's not exactly difficult.  It's just a SQL statement that you run once:
CREATE NONCLUSTERED COLUMNSTORE INDEX [ColIndex_POSFacts] ON [dbo].[POSFacts] ([Fact1],[Fact2],[Fact3],[Fact4],[Fact5],[Fact6],[Fact7],[Fact8],[Fact9],[Fact10])WITH (DROP_EXISTING = OFF) ON [PRIMARY].
Note: Once the ColumnStoreIndex is applied the SQL Server table is effectively read-only unless you do some clever things with partitioning.  For one-off projects this doesn't matter at all of course.  For routine reporting projects you may need a DBA to help out.

InfiniDB columnar database

Columnar databases are not really "new" of course, just new to SQL Server so I also wanted to test against a "best of breed", purpose-built columnar database.  

Why Infinidb?  From my minimal research it seems to test very well against other columnar databases, it's open source (based on MySQL), will run on Windows and comes with a free community edition.  I actually found the learning curve relatively simple, in fact, as Infinidb handles it's own indexing needs it's perhaps even simpler than SQL Server .  Frankly, the hardest part was remembering how to export 40 million records neatly from SQL Server so they could easily be read into InfiniDB using their (very fast) data importer.

The Results

Here are the (average) elapsed times to run this query under each disk and database configuration.  



So the basic SQL Server 2012 configuration on a regular hard-drive took... 1,535 seconds to run my query.  That's over 25 minutes.  I can drink a lot of coffee in 25 minutes.

Upgrade to the Solid State Disk (SSD) and it runs 460% faster in 5 minutes and 32 seconds.  Now understand that my laptop does not use the fastest connection to this SSD, it's spec says it can handle 2.5 Gb per second.  I believe newer laptops run at 6 Gbps.  That being said at least now the quad-core CPU was being kept busy.

If instead of upgrading the disk we add a ColumnStoreIndex to the fact table, we do even better reducing from 1,535 seconds to 126 - that's over 1200% faster !

So which option should we use?   Both of course!  I can now run a query that used to take 25 minutes in 37 seconds.  That's 4100 % faster than when I started.

Now let's take a look at that InfiniDb number.  (I did not test with InfiniDB before swapping out the hard-drive so I only have data for it on the SSD.)   Surprisingly it was not quite as good as the SQL Server speed with the Columnstore Index .  I talked to the folks at Calpont that develop InfiniDB and they kindly explained that a key part of their optimization splits large chunks of data into smaller ones for processing.  Sadly my 41 million record table was not even big enough to be worth splitting into 2 "small" chunks so this particular feature never engaged in the test.  Still it's almost 3000% faster than base SQL even on this "small" dataset and the community edition is free.  

Based on the success of this test I think it's time to scale up the test data by a factor of 10 - watch this space.
(Check out the following update post for more details.)

Conclusions

My test-bed for this  benchmark was a mid-range laptop with a few nice extras  (more RAM, solid state disk and 64 bit OS) but certainly not an expensive piece of equipment and it managed to handle an enormous amount of data with very little effort.  This opens up possibilities for analyzing and reporting on much more data than was possible previously on your desk.

The implications are not just relevant to desktop tools though or to tools we think of as databases.  Numerous other tools now claim to handle data storage in columnar form (see Tableau and PowerPivot for Excel).

Is this the best tool for the job?  Perhaps, perhaps not: there is an enormous amount of activity and innovation in the database space right now and many. many other software providers.  It's certainly a lot faster for this specific purpose and a major step forward over more traditional approaches.

Look hard at columnar databases to speed up your raw data processing and don't spend any longer waiting on slow hard-drives.