Analytics are for everyone!

Analytics are for everyone! Well, not building analytics, no. That needs a high level of expertise in statistics, machine-learning, optimization, programming and database skills, a healthy dose of domain knowledge for the problem being addressed, and a pretty wide masochistic streak too. Using analytics, now that is for everyone, or at least it should be.

We all use analytics, and I think the best examples are the ones we use without thinking about just how complex they are.

Is there anyone out there that hasn't used an electronic mapping service (GPS) for directions? Even ignoring the electronics, these are remarkable pieces of engineering! An extensive, detailed database of road systems and advanced routing analytics to help you find the best route from A to B without sending you backwards down one-way roads or across half-finished bridges.

Perhaps you're thinking it's not that hard? Could you build it? What if I got the data for you? No? But you can use it right? They are not perfect, mostly I think because of data cleanliness problems, but they are close enough that I don't travel far from home without one.

More examples. Anyone used a search engine? How about an on-line weather forecast? How about web-sites that predict house-values? Recommendation engines like those used by Amazon and Netflix? All heavy analytic cores wrapped in an easy to consume, highly usable front-end.

These are, I think, among the exceptions in analytic applications - good analytics AND good delivery.

I talked about pseudo-analytics in a recent post: shams with no basis in science wrapped in a User Interface with the hope that nobody asks too many questions about what's under the hood. This is not good analytics.

Unfortunately, even good analytic tools get under-used if they have not been made accessible to the poor people that have to use them. Spreadsheet tools probably top the list for unusable analytic applications: unusable, that is, by anyone except the person who wrote them. Sadly though, I have seen many examples, both in reporting and applications, where so little effort was put into User Experience that any good analytics is almost completely obscured.

Building new analytic capability is a highly skilled job. Delivering analytic results in an easy to consume format so that they get used is also a highly skilled and, frankly, often forgotten step in the process. After all, we build analytic tools so that they get used. Don't we? Sometimes I wonder.

Averages work! (At least for ensemble methods)

After an early start, I was sitting at breakfast downtown enjoying a burrito and an excellent book on "ensemble methods".  (Yes, I do that sometimes... don't judge)

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data...), by Giovanni Seni, John Elder and Robert Grossman (Feb 24, 2010)

For those who have built a few predictive models: regression, neural-nets, decision trees, ... I think this is an excellent read, outlining an approach that can deliver big improvements on hard-to-predict problems.  The introduction provides a very good overview:

Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges...

Ensemble models use teams of models.  Each model uses a different modeling approach or different samples of the available data, or emphasizes different features of your data-set, and each is built to be as good as it can be.  Then we combine ("average") the prediction results and, typically, get a better prediction than any of the component team members.
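
To make the idea concrete, here is a minimal sketch in R (a tool I recommend later in this blog) using invented data. It is not the book's code, just the "team of models" idea in its simplest form: two very different models are fit to the same training data and their predictions are simply averaged.

library(rpart)   # regression trees; ships with R as a recommended package

set.seed(42)
n       <- 500
train   <- data.frame(x1 = runif(n), x2 = runif(n))
train$y <- sin(4 * train$x1) + train$x2^2 + rnorm(n, sd = 0.2)
test    <- data.frame(x1 = runif(200), x2 = runif(200))
test$y  <- sin(4 * test$x1) + test$x2^2 + rnorm(200, sd = 0.2)

m_lm   <- lm(y ~ x1 + x2, data = train)      # team member 1: linear regression
m_tree <- rpart(y ~ x1 + x2, data = train)   # team member 2: regression tree

p_lm   <- predict(m_lm,   newdata = test)
p_tree <- predict(m_tree, newdata = test)
p_ens  <- (p_lm + p_tree) / 2                # the ensemble: a plain average

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(lm = rmse(test$y, p_lm), tree = rmse(test$y, p_tree), ensemble = rmse(test$y, p_ens))

The last line compares the three errors; with more (and more varied) team members the averaged prediction tends to pull ahead.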

When I was first learning predictive modeling as an under-graduate the emphasis was on finding the best model from a group of potential candidates.  Embracing ensemble methods, initially, just felt wrong, but the proof is in the performance.

It sounds easy but, clearly, this is more complex than building a single model, and if you can get a good-enough result using simple approaches you should.  You'll know when it's worth trying something more high-powered.

With thanks to my friend Matt for this simplification, this may be one of the few contexts where we can say "Averages work!!"

As a reminder that working with averages (or aggregations of any kind) is generally dangerous to your insight, take another look at this post on why you should be using daily point-of-sale data.

Or, consider this...

Business Analytics - finding the balance between complexity and readability

In this blog I try to present analytic material for a non-analytic audience.  I focus on point of sale and supply chain analytics: it's a complex area and, frankly, it's far too easy, whether writing for a blog or presenting to a management team, to slip into the same language I would use with an expert.

So, I was inspired by a recent post on Nathan Yau's excellent blog FlowingData to look at the "readability" of my own posts and apply some simple analytics to the results.

I've followed Nathan's blog for a couple of years now for the many and varied examples of data-visualization he builds and gathers from other sources. One that particularly caught my eye was this one published by the Guardian just before the recent State of the Union address in the United States.

The Guardian plotted the Flesch-Kincaid grade levels for past addresses. Each circle represents a State of the Union address and is sized by the number of words used. Color is used to provide separation between presidents. For example, Obama's State of the Union last year was around the eighth-grade level; in contrast, James Madison's 1815 address had a reading level of 25.3.

Neither the original post nor Nathan's goes into much detail around why the linguistic standard has declined.  Within this period, the nature of the address and the intended audience have certainly changed.  Frankly, having scanned a few of the earlier addresses, I think we can all be grateful not to be on the receiving end of one of them.

So, I was inspired to find out the reading level of my own blog.  It's intended to present analytic concepts to a non-analytic audience.  I can probably go a little higher than recent presidential addresses (8th-10th grades, roughly ages 13-15) but I don't want to be writing college-level material either.

All the books my kids read are graded in this (or a very similar) way, but I had never thought about how such a grading system could be constructed.  The Flesch-Kincaid grade level estimate is based on a simple formula:


0.39 \left ( \frac{\mbox{total words}}{\mbox{total sentences}} \right ) + 11.8 \left ( \frac{\mbox{total syllables}}{\mbox{total words}} \right ) - 15.59

That's just a linear combination of:

  • average words per sentence;
  • average syllables per word;
  • a constant term.
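
Since the inputs are just counts, the grade level is easy to compute yourself. Here is a rough R sketch; the syllable count is approximated by counting vowel groups per word, a crude stand-in for a proper syllable counter, so expect small differences from the online calculators.

# Flesch-Kincaid grade level from raw text.
# Syllables are approximated as vowel groups per word - a crude heuristic.
fk_grade <- function(text) {
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[grepl("[[:alpha:]]", sentences)]       # drop empty splits
  words     <- unlist(strsplit(tolower(text), "[^[:alpha:]']+"))
  words     <- words[nchar(words) > 0]
  syllables <- sapply(gregexpr("[aeiouy]+", words),
                      function(m) max(1, sum(m > 0)))
  0.39 * (length(words) / length(sentences)) +
    11.8 * (sum(syllables) / length(words)) - 15.59
}

fk_grade("Analytics are for everyone. Using analytics, that is, not building them.")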

In fact (though I have not yet  found details of how it was constructed) it looks to be the result of a regression model.  (Simple) data science in action from the 1970's.

Note that Flesch-Kincaid says nothing about the length of the book or the nature of the vocabulary; it's all down to long sentences and the presence of multi-syllabic words.

(BTW - the preceding sentence has a Flesch-Kincaid grade score of 13.63, calculated with this online utility.)  Now that's pretty high, worthy of an early 1900's president and (supposedly) understandable by young college students.  The sentence is longer than typical, 31 words vs. my average of 18 (see below), and words like "vocabulary", "sentences" and "multi-syllabic" are not helping me either.

Approach

I could have copied and pasted each post into the online utility I used above, recorded the results in a spreadsheet and pulled some stats from that. That would work, but if I ever want to repeat the exercise or modify it, perhaps to use a different readability index, I must do all that work again.  At the time of writing there are 44 published posts on this blog - there must be a better way.

Actually there are probably many better ways, but as I also wanted to flex some R-programming muscle I built a web-scraper in R to do the work for me and analyze the results (more on this in a later post).
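
The scraper itself isn't the point of this post, but the idea is simple enough to sketch in base R. The URLs below are placeholders (a real run would collect them from the blog's archive pages first), the tag-stripping is deliberately crude, and fk_grade() is the function sketched earlier.

# Crude base-R scraping sketch: fetch each post, strip the HTML, score it.
post_urls <- c("http://example.com/post-1",          # placeholder URLs
               "http://example.com/post-2")

get_post_text <- function(url) {
  html <- paste(readLines(url, warn = FALSE), collapse = " ")
  body <- gsub("<script[^>]*>.*?</script>", " ", html, perl = TRUE)  # drop scripts
  body <- gsub("<[^>]+>", " ", body)                                  # strip tags
  gsub("\\s+", " ", body)                                             # tidy whitespace
}

posts <- data.frame(url   = post_urls,
                    grade = sapply(post_urls,
                                   function(u) fk_grade(get_post_text(u))),
                    stringsAsFactors = FALSE)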

Results

Let's start with some simple summaries of the results I collected.

Histograms showing the % of posts from this blog (prior to 2/14/13), with the average (mean) value shown in red.

There is some variety in the grade reading level indicated by Flesch-Kincaid for my blog posts, averaging around 10 but ranging from 7 through 14.  I average about 750 words, but occasionally go much longer and have a number of very short "announcement" style posts.  My average words per sentence is 18.

OK, so now I know, but is that good?  I don't know that I have a definitive source, but according to at least one source the target range on Flesch-Kincaid for Technical or Industry readers is 7-12, so I'm feeling pretty good about that.

I did wonder whether there was any other, hidden, structure to the data though.  I know the equation is based on words per sentence and syllables per word, so there is no point looking at those; obviously I'll find a relationship.  But is my writing style influenced by anything else?

Flesch-Kincaid grade level vs. the number of words by post on this blog.

Other than a handful of long posts that rate lower, in the range 8-10, I don't see much going on here.

Flesch-Kincaid grade level vs. the publication date by post on this blog. 

The size of each post (in words) is shown by the area of each point; color is used purely to help visually differentiate the points.  Apart from a couple of recent "complex" posts, this does seem to be showing a trend, so I added a regression line and labeled the more extreme posts.  Point (b) is a very short "announcement" style post (you can hardly see the point at all) and I could probably ignore it completely.  Point (e) is a more fun piece I did around using pie-charts that's probably not very representative of the general material either.
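
For the curious, the trend check itself is only a few lines of R once the scraped results are assembled. I'm assuming here a data frame called posts with date, words and grade columns (names invented for the sketch, extending the scraper above).

# Trend of grade level over time; point area scaled by post length.
# Assumes a data frame "posts" with columns date (Date), words and grade.
fit <- lm(grade ~ date, data = posts)                 # simple linear trend
summary(fit)$coefficients                             # slope and its p-value

plot(posts$date, posts$grade,
     cex  = 3 * sqrt(posts$words / max(posts$words)), # area ~ word count
     xlab = "Publication date", ylab = "Flesch-Kincaid grade level")
abline(fit, col = "red")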

If you want to compare readability for yourself, here are the top (and bottom) posts ranked by Flesch-Kincaid grade level:

Rank | Post | Flesch-Kincaid grade level | Words | Sentences
1 | Analytic tools "so easy a 10 year-old can use it" | 13.3 | 784 | 33
2 | Point of Sale Analytics - newsletter released | 13.1 | 82 | 4
3 | Point of Sale Data – Category Analytics | 12.8 | 676 | 29
4 | How to save real money in truckload freight (Part I) | 12.8 | 723 | 31
5 | The Primary Analytics Practitioner | 12.7 | 541 | 29
6 | Reporting is NOT Analytics | 12.4 | 891 | 43
7 | Point of Sale Data – Sales Analytics | 12.1 | 478 | 24
8 | Data handling - the right tool for the job | 11.9 | 762 | 38
9 | Data Cleansing: boring, painful, tedious and very, very important | 11.8 | 297 | 16
10 | Point of Sale Data – Supply Chain Analytics | 11.6 | 958 | 41
… (ranks 11-34 omitted) …
35 | The right tools for (structured) BIG DATA handling | 9.0 | 1878 | 114
36 | Better Point of Sale Reports with "Variance Analysis": Velocity... | 8.9 | 1264 | 78
37 | Better Point of Sale Reports with Variance Analysis (update) | 8.5 | 177 | 10
38 | Better Business Reporting in Excel - XLReportGrids 1.0 released | 8.4 | 70 | 5
39 | What's driving your Sales? SNAP? | 8.3 | 651 | 42
40 | Do you need daily Point of Sale data?... | 8.2 | 1395 | 83
41 | SNAP Analytics (1) - Funding and spikes. | 8.1 | 531 | 32
42 | SNAP Analytics (2) - Purchase Patterns | 7.9 | 773 | 44
43 | Business Analytics - The Right Tool For The Job | 7.6 | 483 | 36
44 | Are pie charts truly evil or just misunderstood? | 7.1 | 1097 | 70

Conclusions

It appears that my material is (largely) written at a level that should be accessible to the reader. And I am using more readable language in recent blogs, which sounds like a good thing.

But there remains a key question for me that these stats can't really answer: am I getting better at explaining the complex (my goal) or just explaining simpler things? What do you think?

In case you are wondering, this post has a Flesch-Kincaid grade level of about 8.  So if you can follow the "State of the Union" address you should have been just fine with this.

Better Business Analytics - 2013 New Year's Resolutions

10 resolutions for Better Business Analytics

 

Firstly - thank you, Santa, for reading my Christmas list. I love the T-shirt - "Statistics means never having to say you're certain".  With the holiday season coming to a close, my thoughts are turning to the New Year and even a certain excitement about getting back to work.  Time for some New Year's resolutions!

 

1. Remember that reporting is NOT analytics

One of the key misunderstandings of analytics in the business world, and perhaps part of the reason good analytics is not well recognized for the value it creates, is that managers think reporting IS analytics.

Per my previous post "Reporting is about what happened; Analytics is about answering your questions ... well-built Analytics or Predictive Models can find insights and opportunities that you will never find by any other means."

Do good analytics, shout about your success, spread the word.

 

 

2. Never, ever build to a manager's spec.

If, when you visit your doctor, he simply writes the prescriptions you ask for, I suggest you go find a new one, quickly. Unless you possess this expertise yourself, you must rely on your physician to diagnose your symptoms and prescribe action. To do otherwise would be very foolish.

When you (the analytic expert) are asked to build reports or models without explanation as to what they are for - don't do it. Business managers can help you enormously to understand the business context and the killer questions that need to be answered, but unless they are also skilled in analytics you must bring that expertise to decide how to provide an effective solution.

 

 

3. Make sure your project is worth doing

This is really easy and yet so often overlooked. If you are looking for a million-dollar savings opportunity, you are unlikely to find it in a part of your operation that totals $2 million in cost. However analytically interesting a project may be, business analysts are paid because they generate a good return on investment. Do some simple estimation to see how big an opportunity could be before unpacking the big guns. See some examples here.

 

4. Use the right tools for the job (or everything looks like a nail)

To recount the old adage: if all you have is a hammer, everything looks like a nail. You must have met people like this, an expert in one particular technological or analytic approach, every single project they undertake is somehow ideally suited for that approach. I've known a few: Excel-experts, database-divas, simulation-specialists, statistics-sages, optimization-??,...

Do you build everything in Excel or in SQL or your favorite BI tool? Are you writing database code in your favorite (non-database) programming language? Has it been a while since you acquired the skill to wield a new tool effectively? Perhaps it's time to extend your skill-set.

 

 

5. Learn your craft (or know when to call for an expert)

The difference in skill-level between an analytic expert and an amateur is huge. In my experience that skill-gap does not result in a +10% incremental return on investment; it's the difference between a successful project and a relative failure.

Know when you are getting beyond your skill-level and either set yourself to learning quickly or call in the cavalry.

 

 

6. Remember that you are first and foremost in sales/change-management

In all honesty, the "analytics" may be the easy part of business-analytics. Even the most technically-adept analytics fail from a business standpoint unless action is enabled. Knowing that implementation of your project could save 20% in cost may give you a nice warm-fuzzy but unless it is implemented you wasted your time and the company's money. Do that repeatedly and you should brush-up your resume.

To be implemented, a result must be repeatedly and effectively sold-in to an organization. Take the time to present your results as a simple compelling argument for change and deliver that message consistently and often.

 

 

Analytics is a lot of fun, but leave a little space for other things in life

7.     Lose that 20 lbs
8.     Exercise more
9.     Eat healthy
10.   Have fun!!


What do you think should be on the list?


Wishing you an Analytic New Year
 

Better Business Analytics - Christmas list

It's that time of year again: my kids have written, re-written and re-re-written their Christmas lists now so we all hope Santa will read them carefully and take notice.

With just a few days left before the holiday season hits I wanted to do something a little more fun, so I've pulled together a list of things that I think every Business Analyst should want.  Some are free to acquire, just costing your time to learn, others you may wish to ask Santa (or your CFO) to provide.

64 bit Operating system, applications and more memory

32 bit operating systems cannot recognize more than 4 GB of computer memory (RAM), regardless of how much you load onto your hardware.  Forget "Big Data" for a moment - you can fill 4GB relatively easily with desktop tools - and if you want to do any serious data processing on your laptop/desktop environment you will need more than that.  RAM is cheap, easy to upgrade and most modern laptops will take up to 8GB without any problem.

Max out your RAM

Note: 8GB of RAM for the laptop I am writing this on is currently $41.99 at www.crucial.com

Hard-drive upgrade

Solid State Disks (SSD) provide huge speed improvements over their spinning, hard-drive counterparts.  If you are crunching numbers from a file or a local database, the hard-drive may well be slowing you down.  A modest investment (the 480 GB Sandisk SSD is currently available on Amazon for $360) can save you a lot of time.

Serious computing power

If you have access to a high powered server this may be of less use, but I was surprised recently to find out just how much computing power you can now get in a desktop workstation.  This beast with 2 CPUs (32 cores), 128 GB of RAM and 2TB of super-fast SSD hard-drive costs a little over $6000.  That's an awful lot of power (more than most servers I've worked with)  for the price of 2-3 good laptops.

Excel 2010 with PowerPivot.

Excel is a superb tool for interacting with your data for prototyping and occasionally for delivering models and results.  It can't do everything and for many problems you will need to turn to more specialized tools but any analyst that tells you they don't use it at all is probably not a very effective analyst.  With the ability to write VBA, to customize and embed your own calculations, it can be a very powerful modeling tool indeed.

With Excel 2010 and the free PowerPivot add-in, Excel now embeds powerful BI capabilities too.  Integrate data from different sources, load enormous amounts of data (vastly more than the worksheet limit of about 1 million records) and define more complex measures and functions to apply to this data with DAX (the Data Analysis Expressions language).

If you are not already there, upgrade now to 2010 - and make sure it's the 64 bit version to blow past the 4GB memory limitation.

Note: Office 2013 may be just around the corner, but with Microsoft's latest offer if you buy now (Oct'12 thru April '13), you get Office 2013 for free when it's released.  Microsoft's Office Pre-Launch Offer

Time to learn R for statistical computing

OK - I know - those of you who are heavily invested in another package (SAS, SPSS, Statistica, ...) do not get this.  Those are great tools and why should you change?  Well, I'm not suggesting you swap one package for another - I'm suggesting that there is room in your head for more than one tool and R has a lot going for it.

Pros

  • According to Revolution Analytics, R is now available as an embedded, in-database analytics engine for Oracle, SAP HANA, Netezza, Teradata and others.  This is a very exciting development, allowing advanced statistical capabilities to be called from within the database from SQL. Handle routine analytics in-line and on-demand.
  • It's free - really - free.  R is an open source tool you can download from http://www.r-project.org/
  • It has packages to do an enormous variety of statistical analyses, you will probably not need or use even 90% of them.
  • It has great graphical capabilities for visualization.
  • It's callable from .NET languages too via RDotNet.  The interface is not well documented, but it does work.

Cons

  • The user interface is "ugly" unless you love command line applications - I don't.  RStudio does help with this.
  • If you are not used to vector based programming languages it may take you a while to grasp.  (try "The Art of R Programming" for help.)
  • There are a lot of commands to remember, I use my cheat sheet a lot.

Time to learn SQL

In reality the time to learn SQL was probably about 20 years ago, but if you have not yet done so, catch up quickly.  SQL (Structured Query Language) is the programming language for relational databases.  Whenever you interact with a database, it is some dialect of SQL that interacts with the database to select, update or insert your data efficiently.

Most of your data will now come to you from a database of some form.  It is a common requirement to integrate data from multiple sources (without losing any), filter, aggregate and sort as a precursor to more advanced analytic routines.  This is what a database does superbly well - if you're doing this in Excel, or indeed any non-SQL environment, you are working too hard.

While you are at it, get a good grasp of database design principles and normalization too.

2013 - the year for Column-Oriented Databases?

Column storage is not a new idea in relational databases, but perhaps it's an idea that is about to mature.  Regular, relational databases physically store rows of data together which is very efficient for transactional databases that update and retrieve a few rows at a time.

A column-oriented database stores entire columns of data together.  This is very efficient for systems where aggregates are computed over large numbers of similar rows - such as in data warehouse reporting and analytics.  Notably, SQL Server 2012 released its first direct support for column storage this year.  It's not perfect, particularly as the implementation limits how you can write to the database, but it is fast.  My own testing on a large database showed a 10-fold increase in speed on large aggregation queries and almost linear scaling as the number of fields chosen (and hence the number of columns that must be accessed) changed.

I've also had excellent performance from InfiniDB, available as a free "community edition", though I suspect I have not fully tested its capability with tables that only have a few hundred million records :-)

Column storage is exceptionally fast and could allow for significant changes in system architecture for reporting and analytic tools.  How about POS analytics?  Straight from database to reports without the need for any custom architecture, pre-aggregation, cubes or waiting on batch processing?  Want to update a few item attributes and re-run your reports now?  No problem!  (Embed in-database analytic capability so you can do more than add/subtract/multiply/divide and you have a serious number-crunching platform.)

The #1 item for your list - a (much) bigger dry erase board.

At times, it seems I can't think clearly without a dry-erase board and pen.  My own workhorse is an 8'x4' board that sees heavy daily use, but bigger ideas need more space.  Amazon has a dry-erase paint on offer that I may have to try - enough for about 30 linear feet of wall for $500.

Jumbo Whiteboard Dry Erase Paint Clear 240 Sq Ft

What do you think?

So what do you think should be on every good Business Analyst's Christmas list?   Am I missing this year's top gift?  What should a "naughty" analyst receive - more bad data?  Let me know in the comments section.

P.S.  Santa - if you're reading this, I rather like this T-shirt from CafePress.com too.

Analytic tools "so easy a 10 year-old can use it"

When you search the web you'll find lots of analytic tools to support your business: tools to help with forecasting, inventory-optimization, risk-analysis, simulation for production lines and warehouses, production-scheduling, supply-chain network design, vehicle-loading, price-sensitivity modeling and planogram building - and that is very, very far from being an exhaustive list.

Some of these tools are bought as a service that includes expertise to prepare your data, do the modeling work for you and configure the system to meet your needs. These tools will be much more expensive than the "you're on your own" variety, and the more frequently that expertise is required, the more you will pay.

If the business problem you are working on is relatively stable (meaning that the same issue comes up repeatedly and your business rules around how to solve it stay the same) such a system can have a long and fruitful life once set up. Vehicle loading applications are a good example - once configured to your needs they should need infrequent updates to work well.

Many analytic applications are more one-off in nature, and if you use an expert to help you with every use it will cost you a substantial amount of money. So, do you really need all that expertise or is it simple enough that you can pick it up? There are desktop analytic tools that could help you deliver millions in savings and can be bought for a few thousand dollars. Many of them are marketed as being simple to use, and imply that you'll be able to "pick it up" as you go with, perhaps, a little training.

Let's look at an analogy - driving a car. Cars are very complicated things and yet almost every adult can drive one. Why would business analytic tools be any different? Cars are certainly complicated but I see no reason a 10 year old could not master driving one because they do not need to understand or master most of that complexity.

  • Cars have been around for a long time and have seen billions, perhaps trillions of dollars invested to make them robust and hide almost all the complexity from the driver.
  • Cars have relatively few controls that you need to understand to operate them: ignition, steering wheel, brake, accelerator, turn-signals, lights, mirrors, windscreen-wipers and gear selector. I know there are more gadgets, but master these few (and I hope I missed nothing important) and you can drive a car.
  • You get instant feedback when you use these controls badly and that really helps you learn
  • Most of us get a lot of practice driving a car and some of us are still quite bad at it :-)

In contrast most business users are ill-equipped to take on analytic tools where the complexity remains very apparent: there can be hundreds of options to choose from in building and executing your models and these must be understood to use the tool effectively.

Let's make a distinction here between getting a tool to work and getting it to work effectively. I've seen a lot of spreadsheets using Excel's Solver add-in, which solves optimization models: you build a spreadsheet that models some aspect of your business then ask it to find the combination of input values that maximize your output (profit, sales or some such value). If you built a model with no errors and avoided a handful of functions that this optimizer can't handle, it will run and most likely return an answer. It may even be able to guarantee (depending on which algorithm it used) that this is the absolutely best answer available - the optimal solution to your model. Unfortunately, many of the models bear little resemblance to the real-world issue you are trying to optimize and consequently the optimal solution for the model is not particularly good (and certainly not optimal) in the real world.

There's one good example of an analytic application that really is "so easy a 10 year-old can use it" and it's not a business application at all. It came to prominence in the last few years and is now available in cars, on low-cost specialized devices, even on your phone - GPS navigation. Behind the scenes there is a large, accurate database of road networks and locations along these roads and an efficient routing algorithm (the analytic part) to find the shortest route between any 2 points. The whole thing is wrapped into a well-designed user interface so that all this complexity is hidden from the user, and we don't care because it just works, all of the time. Well, almost all of the time :-)

In my opinion, there are very, very few analytic tools that are truly accessible to business users without some configuration, packaging or guidance from an expert. If you are serious about solving a business issue that needs some analytic work, you need the tool and the mechanic to wield it.

The Primary Analytics Practitioner

The field of Business Analytics can be very complex. Top level analysts are experts; just like medical specialists, they have undergone years of additional training and know their area of specialty (perhaps price sensitivity, multivariate statistical modeling, survey analysis or mathematical optimization) backwards. Keeping with this analogy, most business managers are as well informed as to what business analytics can do for them as a patient heading in to see their primary physician; perhaps less so.

In the medical world there is a wealth of information specifically written for non-experts available on the web. We see legions of ads for pharmaceuticals that inform us about new conditions and their chemical solution. Many of us will routinely see a primary care physician (that’s ‘General Practitioner’ for those of you in the U.K.) and at least be familiar with our own complaints. When we have an issue with our body, we head to the doctor and let him direct us.

In the business analytic world we have done less well at making it accessible to managers. Most of the published material, easily accessible via the web, is not intended for general consumption. It's very useful to those already initiated into the craft, but if you do not have a very strong mathematical/statistical background, good experience applying these skills in a business environment and some idea what to search for, it may as well not exist at all. As an excellent example, there is no equivalent of WebMD's SymptomChecker. Perhaps that is also why I have seen no analytical version of a hypochondriac :-).

As a business manager, when you encounter a tough problem you often just try and ‘work it out’. After all, that's what managers are paid for, right - to "manage"? You self-medicate, potentially causing harm or, at least, not addressing the real issue.

Well, the web is a relatively new phenomenon, whereas people going to see a Primary Care Physician is certainly not. What we are missing most, I think, is the role of Primary Analytics Practitioner. Someone a manager turns to who can:

  • Help you identify the real issue/issues and frame potential solutions
  • Personally handle 90% of the issues that arise       
  • Source technically adept specialists when the need arises

The business analytics world is new – it has no certification process, no minimum educational requirements and no clear terminology to identify practitioners by what they do best. In my experience there are many good analytical providers out there, most of whom have expertise in specific areas. Sadly, you cannot open the Yellow Pages or search online and find a Primary Analytics Practitioner. (I just Googled the term and found nothing useful to me.)  In finding one you're on your own for now, but take heart, these people are out there; it's not just me.

When you do stumble across someone with a breadth of analytical capability, an understanding of your business problem, the ability to directly handle many of your analytical needs and the willingness to say “I don’t know this well enough, but I can find someone who does” use them wisely and stop self-medicating.

Bringing your analytical guns to bear on Big Data – in-database analytics

I've blogged before about the need to use the right tools to hold and manipulate data as data quantity increases (Data Handling the Right Tool for the Job).  But, I really want to get to some value-enhancing analytics and as data grows it becomes increasingly hard to apply analytical tools.

Let’s assume that we have a few Terabytes of data and that it's sat in an industrial-strength database (Oracle, SQL*Server, MySQL, DB2, …)  - one that can handle the data volume without choking.  Each of these databases has its own dialect of the querying language (SQL) and while you can do a lot of sophisticated data manipulation, even a simple analytical routine like calculating correlations is a chore.
Here's an example:

SELECT (COUNT(*)*SUM(x.Sales*y.Sales) - SUM(x.Sales)*SUM(y.Sales)) /
       (SQRT(COUNT(*)*SUM(SQUARE(x.Sales)) - SQUARE(SUM(x.Sales))) *
        SQRT(COUNT(*)*SUM(SQUARE(y.Sales)) - SQUARE(SUM(y.Sales)))) correlation
FROM BulbSales x
JOIN BulbSales y ON x.month = y.month
WHERE x.Year = 1997 AND y.Year = 1998

extracted from the O'Reilly Transact SQL Cookbook

This calculates just one correlation coefficient between 2 years of sales.  If you want to calculate a correlogram showing correlation coefficients across all pairs of fields in a table this could take some time to code as you are re-coding the math every time you use it with the distinct possibility of human error.  It can be done, but it’s neither pretty nor simple.  Something slightly more complex like regression analysis is seriously beyond the capability of SQL.

Currently, we would pull the data we need into an analytic package (like SAS or R) to run analysis with the help of a statistician.  As the data gets bigger the overhead/delay in moving it across into another package becomes a more significant part of your project, particularly if you do not want to do that much with it when it gets there.   It also limits what you can do on-demand with your end user reporting tools. 
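
For contrast, once the data has been pulled across, the correlogram that was so painful in SQL collapses to a single call in R. The bulb_sales frame below is an invented stand-in for the table used in the SQL example.

# All pairwise correlations in one call once the data is in R.
# "bulb_sales" is invented: one column of monthly sales per year.
set.seed(1)
bulb_sales <- data.frame(y1997 = rnorm(12, mean = 100, sd = 10),
                         y1998 = rnorm(12, mean = 105, sd = 10),
                         y1999 = rnorm(12, mean = 110, sd = 10))

cor(bulb_sales)                           # the full correlogram
cor(bulb_sales$y1997, bulb_sales$y1998)   # the single coefficient from the SQL above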

So, how can you bring better analytics to bear on your data in-situ?  This is the developing area of in-database analytics: extending the analytical capability of the SQL language so that analytics can be executed, quickly, within the database.  I think it fair to say that it's still early days, but with some exciting opportunities:

  • SAS, the gold standard for analytical software, has developed some capability but, so far, only for databases I'm not using (Teradata, Netezza, Greenplum, DB2): SAS in-database processing
  • Oracle recently announced new capability to embed R (an open source tool with a broad range of statistical capability) which sounds interesting but I have yet to see it. Oracle in database announcement
  • It’s possible to build some capability into Microsoft’s SQL Server using .NET/CLR   and I have had some direct (and quite successful) experience doing this for simpler analytics.  Some companies seem to be pushing it further still and I look forward to testing out their offerings.  (Fuzzy Logix, XLeratorDB).

No doubt there are other options that I have not yet encountered, let me know in the feedback section below.  

 

For complex modeling tasks, I am certain we will need dedicated, offline analytic tools for a very long time.  For big data that will mean similarly large application servers for your statistical tools and fast connections to your data mart.

For simpler analysis, in-database analytics appears to be a great step forward, but I’m wondering what this means in terms of the skills you need in your analysts: when the analysis is done in a sophisticated statistics package, it tends to get done by trained statisticians who should know what they are doing and make good choices around which tools to deploy and how.

Making it easier to apply analytical tools to your data is very definitely a good thing.  Applying these tools badly because you do not have the skills or knowledge to apply them effectively could be a developing problem.

Cluster Analysis - 101

The current Wikipedia page on Cluster Analysis, excerpted below, is correct, detailed and makes absolute sense.  Then again, if you do not have a background in statistical modeling, I'm guessing these two paragraphs leave you no wiser.

 
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
 
Wikipedia 4/2012 

In this post I hope to provide a workable introduction for people that need to be educated consumers of cluster analysis.
(If you want more technical detail, I suggest you go back to the Wikipedia link above, and follow up on the range of hyperlinks embedded in the document -  it's really very good.)

Let's put this in context with an example.  Assume we are working with a retailer that has 1000 stores and they want to decide which dairy products to put in each store to maximize sales.

One option would be to treat all stores as being the same, come up with one assortment list and put it everywhere.  This has the singular advantage of being easy to execute while leaving all non-mainstream products off the shelf.

At the other extreme, we could try to tailor product assortment individually by store.  Did I mention there are 1000 of them?  Apart from the work involved in building 1000 individual analyses, do we have the discipline to execute such analyses consistently across 1000 stores? Would we have sufficient organization to execute these unique selections in 1000 stores?

Most teams will end up working with groups of stores that they consider "similar" as a compromise.  These groupings may be based off single store features (e.g. stores with more than 30% of sales from premium products) or maybe geographical features (e.g. everything in the South East).  Bear in mind that, if you are trying to do this without statistical help, they do need to be very simple groupings.

For assortment selection, we really want to group together stores where people buy similar products.  In this case we want to find groups of stores that have similar sales patterns.  For dairy, these sales patterns could be related to the % of sales associated with various product characteristics:

  • premium vs. value, 
  • single-serve vs. multi-serve, 
  • cheese vs. yogurt vs. creamer vs. drinks vs. milk
  • yogurt styles

(Note: I do eat a lot of dairy but I haven't worked that category yet so forgive me if I missed something big)

Cluster analysis (actually a family of statistical algorithms, not just one) is used to scan across multiple features that you think are important, to find groups (clusters) so that:

  • stores within a cluster are similar to each other
  • stores in different clusters are dissimilar
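
As a concrete (if simplified) illustration, here is roughly what that looks like in R using k-means, one member of that family. The store features below are invented; in practice they would be the share-of-sales measures listed above.

# K-means sketch on invented store-level sales-mix features.
set.seed(1)
stores <- data.frame(premium_share      = runif(1000, 0.05, 0.60),
                     single_serve_share = runif(1000, 0.10, 0.70),
                     yogurt_share       = runif(1000, 0.10, 0.50))

z <- scale(stores)                          # put features on a comparable scale

# Try a few cluster counts and compare total within-cluster sum of squares.
wss <- sapply(2:8, function(k) kmeans(z, centers = k, nstart = 25)$tot.withinss)
plot(2:8, wss, type = "b", xlab = "Number of clusters", ylab = "Within-cluster SS")

fit <- kmeans(z, centers = 4, nstart = 25)  # pick k after inspecting the plot
table(fit$cluster)                          # store count per cluster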

It sounds rather like magic doesn't it?  You just throw data at the algorithm and (big fanfare) it finds clusters!  Well perhaps it's not quite that easy.

  • It does take some care to prepare the data, ensuring it's clean, accurate and in a form that works for this process (see Data Cleansing: boring, painful, tedious and very, very important).  
  • In reviewing the results you may decide to drop some features and split out others (e.g. "Premium" is split into "Premium" and "Super Premium").
  • You need to determine how many clusters are correct for your data.
  • You may want to bring in some additional data to help describe clusters after they are created
    • demographics of people living near the store (ethnicity, income, household size etc. )
    • geography (maps work well)
    • local competition
  • Really an extension from basic clustering, but you could build predictive models to explain why, for example, super-premium Greek yogurt is so very popular in Cluster 4.  If you can tie high sales of this product group to specific demographics, you may find other stores with similar demographics that have not previously sold it. (Could be a big opportunity.)

I'll return to this topic in future posts, but for today, your takeaways are simple:

  1. Cluster Analysis finds better groups (clusters) of similar things.  
  2. Clusters help you target your offering without dying under the weight of work.

Reporting is NOT Analytics

Reporting is about what happened; Analytics is about answering "what if" and "what's best" questions.  Most of the materials that land on a VP/Director’s desk (or inbox) are just reports with no analytical value added.

Reporting tells us what has happened: sales; orders; production; system downtime; labor utilization; forecast accuracy. Reports leave it up to the reader to digest this information and, based on their experience and expertise about the world around them, construct a ‘story’ as to why it may have happened that way.

Really good reports will look for known causes of routine issues. For example, if I know a store is low on inventory of a specific product, I could just report that. If I flag it as an exception the person receiving the report may even see it among the sea of other facts. To go the extra mile, it would be wise to see whether I can (automatically) find any disruptions back in the supply chain (either with inventory or flow of goods) and include that information to answer the routine questions that will be raised by my report. But, the report builder must anticipate these needs at the point they are writing the report and for more complex issues that’s just not realistic.

Analytics is all about finding a better story or, if you prefer, insight from data. We’ll talk about tools for finding insights in a moment, but much of this is about approach: develop a working theory about what may be happening and test it out with the data you have available. Revise your theory if needed, rinse and repeat: this is an iterative and interactive process.

At the simplest level, a lot of really good analytics is enabled by being able to interact with the data: filter it to see specific features; sort it to find your exceptions; drill down into more detail to see (for example) which stores are causing the issue in your region; chart it to see trends across time or perhaps even see relationships between variables (like temperature and sales of ice-cream by region). Generally available tools like Excel can get you a long way to intuitively understanding your data and finding some insight.

A further step (and one I fully understand most analysts cannot take - see  What is ‘analysis’ and why do most ‘analysts’ not do it? ) would be to run some descriptive statistics around the data.

  • Measures of ‘average’ (mean, median, mode)
  • Measures of ‘spread’ (Standard deviation, percentile ranges, Min/Max)
  • Frequency histograms and boxplots to visually show the distribution of data
  • Scatter plots to view interaction
  • Correlation matrices to spot high level interactions
  • Outlier detection
  • Dealing with missing values
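
If you have R to hand, most of that list is a one-liner. A quick sketch with an invented sales vector:

# Descriptive statistics in R on an invented, skewed sales vector with gaps.
set.seed(7)
sales <- c(rlnorm(500, meanlog = 4), NA, NA)

mean(sales, na.rm = TRUE)                      # 'average': mean
median(sales, na.rm = TRUE)                    # 'average': median
sd(sales, na.rm = TRUE)                        # 'spread': standard deviation
quantile(sales, c(0.05, 0.95), na.rm = TRUE)   # 'spread': percentile range
hist(sales)                                    # distribution: frequency histogram
boxplot(sales)                                 # distribution: boxplot
sum(is.na(sales))                              # how many missing values
boxplot.stats(sales)$out                       # a simple outlier flag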

If these options seem strange, perhaps even archaic and of little relevance to the business world, you may need to trust me when I say that these are exceptionally valuable capabilities that increase understanding of the data, uncover insights and prepare you to step into the world of (predictive) modeling. Thinking back over the last 2 weeks of work (store clustering, system diagnostics, algorithm development and even some "reporting") I can confirm that I have used every one of these multiple times and to good effect.

In Predictive Modeling we build a mathematical model around the problem you are trying to solve.  Once the model is built and (importantly) validated to confirm that it really does behave as you expect the world to, you can start asking questions like:

  • what happens if I add another warehouse to this network?
  • what is the best combination of price and promotion to maximize profitability?
  • how much inventory do I need in each warehouse?
  • what is the best way to load this truck?
  • what is the best assortment for each of my stores?
  • why are sales dropping in the Western region?

Everyone is familiar with the phrase “Garbage In, Garbage Out” relating to how computers will happily calculate garbage when you throw bad inputs at them. With modeling, the structure of the model is one of those inputs and many of you may have experienced the complete junk that comes out of a bad model even when you put good data into it. Picking the right modeling tools for the right job and applying them correctly is a very skilled task. Predictive modeling covers an extraordinarily broad field of statistics, mathematics and operations research. Just as for data handling (see The Right Tool for the Job), this is not something you are likely to do well without appropriate training: understanding the field you are trying to apply these tools to helps enormously too.

So why go to the trouble and expense of ‘Analytics’ rather than ‘Reporting’? Reporting is essential, but well-built Analytics or Predictive Models can find insights and opportunities that you will never find by any other means.

A really good analyst will do this work and, while being willing, ready and able to take you to whatever depth of complexity you wish, will furnish you with a simple (perhaps even one-page) report that answers your questions and makes the end result look … obvious.