The right tools for (structured) BIG DATA handling - columnar, mpp and cloud - AWS Redshift

Today, I'm coming back a little closer to the series of promised posts on the Next Generation DSR  to look at some benchmark results for the Amazon Redshift database.   Some time ago I wrote a couple of quite popular posts on using columnar databases and faster (solid state) storage to dramatically (4100%) improve the speed of aggregation queries against large data sets.  As data volumes even for ad-hoc analyses continue to grow though, I'm looking at other options.

Here's the scenario I've been working with: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal.

Previously, (here)   I looked at the value of upgrading hard-drives (to make sure the CPU is actually busy) and the benefit of using columnar storage which let's the database pull back data in larger chunks and with fewer trips to the hard-drive. The results were ..staggering. A combined 4100% increase in processing speed so that I could read and aggregate 10 facts from a base table with over 40 million records on laptop in just 37 seconds.  (I'm using simulated Point of Sale data at item-store-week level just because it's an environment I'm used to and it's normal to have hundreds of millions or even billions of records to work with)

I then increased the data volume by a factor of 10 (here), repeated the tests and got very similar results without further changing the hardware.   The column-storage databases were much faster, scaling well to both extra records (the SQL 2012 column-store aggregating 10x the data volume in less than 6x the elapsed time) and to more facts (see below).

400 million records (the test set I used) is not enormous but it's certainly big enough to cause 99.2% of business analysts to come to a screeching halt and to beg for help.    It's also enough to tax the limits of local storage on my test equipment when I have the same data replicated across multiple databases.

I've been considering Amazon Redshift for some time - it's cloud-based, columnar, simple to set up, uses standard SQL and it enables parallel execution and storage across multiple computers (nodes) in the cloud.

First let's look at a simple test - the same data as before but now on Redshift.  I tested 2 configurations using their smallest available "dw1.xlarge" nodes currently costing $0.85 per hour per node.  These nodes each have 2 processor cores, 2TB of (non SSD) storage and 15GB of RAM.    I'm going to drop the "SQL 2012 Base" setup that I used previously from the ongoing comparison - it's just not in the race.

SQL Server 2012 (with the ColumnStore Index) was the clear winner in the previous test and for a single fact query it still does very well indeed.  The 2-node Redshift setup takes almost twice as long for a single fact, but, remember that these AWS nodes are not using fast SSD storage (and together cost just $1.70 per hour) so 41 seconds is a respectable result.  Note, also, that it scales to summarizing 10 facts very well indeed, taking about 50% of the time that SQL Server did on my local machine.

How performance scales to more records and more facts is key and, ideally, I want something that scales linearly (or better): 10x the data volume should result in no more than 10x the time.  Redshift here is doing substantially better than that - is that suggesting a better than linear scaling ? Let's take a closer look.  

For this test I extended the base table to include 40 fact fields against the same 3 key fields (Item, Store and week).  I then ran test aggregation queries against the full database for 1, 5, 10, 20 and 30 facts

The blue dots show elapsed time (on the vertical axis) against the number of facts summarized in each query for the 2 node setup.

The red dots show the same data but for the 4 node setup.

For both series, I have included a linear model fit and they are very definitely linear.  (R-squared values of 0.99 normally tell you that you did something wrong, it's just too good, but this data is real.)  However, there appears to be a substantial "setup" time for query processing:- 31.943 seconds in the case of the 2 node system and 10.391 seconds for the 4 node system.  These constants are the same whether you pull 1 fact, 5 facts or 30 on this basic aggregation query.  Now, as all these queries join to the same  item, and period master tables and aggregate on the same category and year attributes from those tables that should not be a big surprise.  Change that scope and this setup time will change too.  (more on that later)

Note also that as the number of nodes was doubled,  processing speed (roughly) doubled too.

Redshift is a definite contender for large scale ad-hoc work  It's easy to setup, scales well to additional data and when you need extra speed you can add extra nodes directly from the AWS web console.  (It took about 30 minutes to resize my 2 node cluster to 4 nodes.)  

When the work is done, shut down the cluster, stop paying the hourly rate and take a snapshot of the system to cheap AWS S3 storage.   You can then restore that snapshot to a new cluster whenever you need it.

Is it the only option?  Certainly not, but it is fast, easy  to use and to scale out.  That may be hard to beat for my needs, but I will also be looking at some SQL on Hadoop options soon.