Balancing safety-stocks across DCs

Earlier today I saw and responded to a question posted on the IBF (Institute of Business Forecasting & Planning) LinkedIn Group.  It's a question I come across often so I thought I would repost it here (with a few edits).

How do I go about preparing an aged inventory analysis? I need to show fast-moving and slow-moving items, then I want to transfer slow-moving product between DCs so that the amount of safety stock (SS) is balanced.

I would be wary of moving stock between DCs ("re-deployment") as a means of balancing safety-stocks: it can be very expensive while providing no real gain.  Balancing stock is typically best done by adjusting inventory levels when you next acquire production rather than through re-deployment.  As a general rule, don't move anything until you have to: if you do, you will certainly incur freight costs and you may not fix a real problem.

If your inventory levels are very low you may risk cutting orders.  So how do you know if your inventory levels are too low?  In my experience, most businesses operate based on rule-of-thumb guidelines and in doing so risk both excessive stocks and cut orders (across a range of products and locations).  Do you know what your safety stock really needs to be?  I'm talking about a valid, statistical inventory model that takes into account demand uncertainty (related to your forecast accuracy), supply uncertainty and lead times.  Or is it more of an educated guess?

(FYI - the lead-time to move product between DCs is typically very much less than the lead-time for acquiring new production, so if you are willing to include re-deployment as part of your ongoing solution you can manage with even less safety stock than a basic model may suggest.)

If you're projecting inventory to drop below this safety stock level within your lead-time, it's time to move your inventory.

At the other end of the scale, you may want to move product because otherwise it may expire through age or otherwise become obsolete.  Again though, don't move it unless you have to.  A very similar calculation to that used for safety stock, taking into account your demand uncertainty, can tell you when you are in danger of being over-stocked.  Assuming you will lose more money by throwing it away than you will by moving it across country, when inventory exceeds this maximum target, load up a truck and ship it to the closest/cheapest place that will sell through it.
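The two triggers just described (a projected breach of safety stock within the transfer lead time, and stock above a maximum target) can be sketched in a few lines.  All of the numbers, and the flat-demand projection, are illustrative assumptions, not a production inventory model:

```python
# Sketch of the "don't move it until you have to" trigger logic described
# above. All numbers and the projection method are illustrative assumptions.

def projected_inventory(on_hand, daily_demand, days):
    """Project inventory forward assuming constant average daily demand."""
    return on_hand - daily_demand * days

def redeploy_action(on_hand, daily_demand, lead_time_days,
                    safety_stock, max_target):
    """Decide whether stock should move, per the two triggers in the text."""
    projected = projected_inventory(on_hand, daily_demand, lead_time_days)
    if projected < safety_stock:
        return "pull stock in"      # projected to breach safety stock
    if on_hand > max_target:
        return "push stock out"     # over-stocked: risk of aging/obsolescence
    return "do nothing"             # default: avoid the freight cost

# Example: 800 units on hand, 50/day demand, 5-day transfer lead time
print(redeploy_action(800, 50, 5, safety_stock=600, max_target=2000))
```

The point of the "do nothing" default is exactly the advice above: re-deployment is the exception, triggered only when a projection says you must.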

Clustering with a destination in mind

I've posted before on Cluster Analysis, in an attempt to demystify one of the more accessible and useful analytical approaches for CPG/Retail teams (see Cluster Analysis - 101).

Finding groups of similar stores (for example) can be a very effective way to manage the complexity of offering each store group what they really need, without having to deal with each store individually (a mammoth task).  Whether you are looking to find groups of stores, shoppers, regions, products or even sales patterns, a very similar approach can work for you.

Clustering is part of the journey; it's not a destination.  If you don't know and understand what decisions your analytic work should enable (your destination), how can you build a good model?

Let's take a real-life example again around store clustering.  Much of this work is done to help define the assortment offering at each store.  Get the right products in each store and sales should increase: good news for the retailer and good news for the manufacturer.

I've seen a number of approaches to this work and many start out with "we need a cluster analysis" - good idea.  Sadly, many then forget why they needed that cluster analysis, pull all the data they have on stores, throw it at a clustering model and hope something useful falls out the other end.  What sort of data would that be?  Well, assuming you have access to Point of Sale data, it's easy enough to get a measure of overall sales for your category (to split the big stores from the small ones).  Many of you will also have access to demographic data (ethnicity, income levels, household size, age etc.) that you can map to store locations.  Maybe you also have geographic data like latitude/longitude by store, and you could throw that in too.

As long as your data is (largely) error-free and complete the cluster analysis will run, but the output may or may not support your decision process.  Remember that we started with a goal of wanting to understand what products to put on the shelf.  Is this supported by knowing the demographic mix?  Only indirectly - if we know that we have a group of stores serving an, on average, older population, we can make assumptions about the sorts of products they would prefer and guess at how much more of these products to put on the shelf.  Analytically speaking, "assumption" is a rather bad word: "guess" is a curse-word.

(BTW - there is some very good work being done to more accurately map local demographic data to stores.  It gets complicated very quickly, feels cutting-edge and even "cool", but, if it does not support you in reaching your particular destination, it's not useful to you right now.)

We can do so much better than this by keeping the destination in mind throughout the process.  We want to improve assortment by store so let's start by ignoring demographics, ignoring geography and let's look at sales.  We want to know what sells well and what sells badly by store.  Even for one category there can be hundreds of products in the mix so we really can't analyze this by product.  We can get sales profiles though based on key product characteristics.  Here's one such profile (mocked up for this example) for prepared food products.

Just visually looking at these 8 stores, you can see that:
  • some stores sell relatively less "American" food (3 and 8) and a higher proportion of Italian and/or Asian foods.
  • some stores sell relatively more chicken and less beef (4, 5, 7).
  • some stores are heavily value oriented (1, 4, 7), some mid-range (3, 5) and some heavy on premium (2, 8).
Deciding which product characteristics are most relevant to your category is still a challenge, and you may go through a few cycles of exploration in determining what really characterizes the difference between stores, but I'm hoping you can see that this is already a much better approach than the one we started with, and it's tied directly to the decisions we need to make on product assortment.

Getting the sales profile data may be a bit more challenging: you still need Point-of-Sale data, and potentially a lot of it, as you will need to start with at least item:store level detail.  You will most likely need a DSR, or a very friendly and flexible database administrator with time on their hands, to get it into this format.  (If you do not know what a DSR is, check out my posts Point of Sale Data – Basic Analytics and Data Handling - the right tool for the job.)

Based on sales-profile data and a few hours' work, we can have clusters that are built on the product preferences of shoppers in these stores - what better to drive your assortment decisions?
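As a rough illustration of that workflow, here is a minimal sketch that clusters stores on two sales-profile shares using a tiny hand-rolled k-means.  The store profiles and the choice of just two dimensions are invented for this example; in practice you would use more product characteristics and a proper statistics package:

```python
import random

# Hypothetical sales-profile shares per store: (share of premium-tier sales,
# share of "American" cuisine sales). Values are mocked up for illustration.
profiles = {
    1: (0.10, 0.70), 2: (0.60, 0.65), 3: (0.25, 0.40), 4: (0.12, 0.72),
    5: (0.28, 0.42), 6: (0.55, 0.68), 7: (0.11, 0.69), 8: (0.58, 0.35),
}

def kmeans(points, k, iters=50, seed=0):
    """Tiny k-means: assign each point to its nearest centroid, re-average."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(list(profiles.values()), k=2)
for c, members in zip(centroids, clusters):
    print(f"centroid {tuple(round(x, 2) for x in c)}: {len(members)} stores")
```

Because the features are sales-based, the resulting groups map directly to the assortment decision, which is the whole point of clustering with a destination in mind.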

Of course, even if clustering is the right analytic tool, if you have a different destination in mind, this may not be how you want to get there.

A last thought: this does not need to be the end of our journey.  We can use geographic and demographic data not to build the clusters but to help us describe them or to find out which geo/demographic features are related to specific sales profiles.   More on this in future posts.

Better Business Reporting in Excel - XLReportGrids beta released

In my last blog entry, I talked about reporting in Excel, some of the reasons I choose to make heavy use of pivot-tables, and some of the issues that stop me using them even more (see Better Business Reporting in Excel):
  • Pivot-tables can only show you data (although lots of it)
  • Pivot-charts show you a chart, but only 1 per pivot-table
  • If you want the same report for multiple groupings (e.g. months, years, brands or locations) you add these groups to your pivot-table and select them one at a time to print.
XLReportGrids is a free Excel add-in, now available in beta test, that builds grids of reports with multiple copies of a template sized to fit the page.   Templates are just a range of cells in a worksheet that are driven by a pivot-table.  Build templates with: charts, formulas, images, pivot-tables, text boxes, anything that can be added to a worksheet.


  • Lightweight, Microsoft-Excel add-in (2007, 2010) 
  • Easy to install and easy to use.
  • Templates are just a range of cells in a worksheet.  
  • Build templates with:  charts, formulas, images, pivot-tables, text boxes, anything that can be added to a worksheet.
  • Layout tools make it easy to build clean, professional templates.
  • Select the number of rows and columns per page in your report
  • Automatically scale to fit the page-size and orientation.
  • Generate multiple grids in the same workbook
  • Work with multiple page-fields so the rows and columns have meaning
  • Optionally add outlines, border gaps and shadows
  • Refreshes in seconds, on demand.
  • Write the chart legend to a header section, printed on each page
  • Lock value axes so they all show the same range
  • Easy to automate

Download the beta test version now from our site and help us build a better tool.  The beta version is intended for testing: you will find bugs and usability issues that we have not.  Please report them to us in as much detail as you can so we can fix them.
The beta version will be active through the end of October.  We will probably make frequent updates to it during the next 3 months, so be sure to check back here often.

Have fun and when you build something you like, please send us a copy to include in our gallery.

Better Business Reporting in Excel

What do you think is the most-used reporting tool in the world?  There are a lot of them available, at times it can be hard to move without running into a new Business Intelligence (BI) tool, but I'm going to hazard a guess with a high degree of confidence that the most used tool is still Excel.

Excel may not be where the data originated, and there may well have been a database involved to crunch numbers and aggregate to the point that the data would fit comfortably in Excel, but I do believe that it's still the most widely used tool for final 'analysis' and presentation of data.

This is not because Excel is the best tool for the job as almost every feature of Excel is handled better by another tool.  Joining separate tables of data with Lookup functions is a very pale shadow of using a properly defined database and the SQL language.  There are substantially better graphics, charting and visualization tools available.  The analysis tools (Correlation, Regression, ANOVA, Linear-programming) are the poor relation to more industrial strength packages.

So why do we (almost) all use Excel?  Perhaps because it is so very accessible?   It's cheap enough that everyone can have a copy.  Simple enough that everyone can make at least some use of it and the tools are good enough for most of the uses that it's put to.

I worked recently with a Demand Signal Repository (DSR) that is used to analyse and report on 3 Terabytes of Point of Sale data.  (FYI - 3TB is equivalent to around 6 million average-sized spreadsheets.)  The data is held and crunched in Microsoft's SQL Server database using a highly unusual (but very effective) data structure, combined with an application-specific querying engine that rapidly dumps out the data you need for reporting into ... Excel!  Once in Excel, I have routinely turned to pivot-tables for 90% of the reporting:

  • They are very fast to build
  • They can be very flexible to change.  A good pivot-table report built for one brand or store or manager will run just as effectively for any other, just change the data and re-run.
  • They typically require no coding at all.  Reports that need custom code are always going to be more fragile than ones that use well-designed, heavily tested and heavily used code.
  • They are immediately drill-able to the lowest level of detail that you pulled into the spreadsheet.
  • With a few tricks up your sleeve to deal with some of the limitations of pivot-tables, you can get a very long way, fast.
Pivot-tables are not perfect - they have a number of foibles - but there are 2 things that pivot-tables absolutely will not do that cause me the most problems:
  • Pivot-tables do not implement so-called "semi-additive" measures.  I know that sounds nerdy, but it's a real problem when you are dealing with inventory.  A semi-additive measure is one that is additive in some dimensions but not all.  If I am working with inventory data, I can reasonably add across locations and products and get a sensible answer: if I have $100 of inventory for each of 10 products at each of 5 stores, I do have 100*10*5 = $5,000 worth of inventory.  You cannot do this across time.  If I have $5,000 of inventory at the start of the week and $6,000 at the end of the week, I do not have $11,000 of inventory.  What this means for pivot-table reporting is that I cannot aggregate data across time directly in the pivot-table, nor can I drill down on that dimension.
  • The second issue is with graphical display.  Pivot-charts seem so weak in comparison to the pivot-tables used to create them.
There is not much I can do about the first one other than be aware of it, await a better (though probably much more complicated) tool, and warn others.  (I've just checked that off my list.)
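To make the semi-additive point concrete, here is a minimal sketch with a hand-built inventory table (values invented): inventory sums sensibly across stores, but the only sensible aggregate across time is the closing (last) value:

```python
# (store, week) -> inventory value in $; toy data for illustration.
inventory = {
    ("A", 1): 5000, ("A", 2): 6000,
    ("B", 1): 3000, ("B", 2): 2500,
}

# Additive across stores: total inventory position in week 2.
week2_total = sum(v for (store, week), v in inventory.items() if week == 2)
print(week2_total)  # 6000 + 2500 = 8500

# NOT additive across time: summing store A's weeks gives a meaningless $11,000.
meaningless = sum(v for (store, week), v in inventory.items() if store == "A")

# The semi-additive aggregate over time is "last value": closing inventory.
last_week = max(week for (_, week) in inventory)
store_a_closing = inventory[("A", last_week)]
print(store_a_closing)  # 6000, not 11000
```

A pivot-table will happily compute the meaningless $11,000 for you, which is exactly why the time dimension has to be handled outside the pivot.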

The graphical display issue I can do something about.  The main problem is this: once I get a pivot-chart set up as I need it for one group of products or locations, to see the next group I must select it from a Page Field filter.  Look at the example below (the Page Field filter is in cell B1).  I've built a pivot-chart on top of the pivot-table (not one of my best - this was just for testing), embellished it with a few formulas that update with the pivot-table, and formatted the two into one area that should print well.

If there are 10 of these groups I must change the filter 10 times to see them all.  If I want to compare and contrast between them I must flip-flop back and forth or print them all, one to a page.  Not very satisfactory and severely limiting for reports that people expect to just print.

What I've wanted for some time now (and finally built) was an add-in that would take care of this.  An Excel add-in:
  • that builds grids of reports with multiple copies of a template sized to fit the page.
  • where report templates are just a range of cells in a worksheet, so that you can build templates with: charts, formulas, images, pivot-tables, text boxes, anything that can be added to the worksheet.
  • where you specify the template and output in a dialog box (without coding).
The original example now looks more like this and, in Excel, it's ready to print, sized just right for the paper size (letter) and orientation (landscape) I selected.

Refresh the pivot-table with new data and, with 1 click of the mouse (or 1 line of code in a subroutine if you prefer to automate), it will rebuild, adding or removing copies of the template as required to fit the data that's now there.

We've actually added rather more functionality than I originally planned; the full feature list is shown below:
  • Lightweight, Microsoft-Excel add-in (2007, 2010) 
  • Easy to install and easy to use.
  • Templates are just a range of cells in a worksheet. 
  • Build templates with: charts, formulas, images, pivot-tables, text boxes, anything that can be added to a worksheet.
  • Layout tools make it easy to build clean, professional templates.
  • Select the number of rows and columns per page in your report
  • Automatically scale to fit the page-size and orientation.
  • Generate multiple grids in the same workbook
  • Work with multiple page-fields so the rows and columns have meaning
  • Optionally add outlines, border gaps and shadows
  • Refreshes in seconds, on demand.
  • Write the chart legend to a header section, printed on each page
  • Lock value axes so they all show the same range
  • Easy to automate
This tool will be released soon as a free add-in that you can download from our site.  For the next few weeks, it's going through beta testing.

Here are a few more examples of the output:

Bar Charts

Scatter Plots with regression lines and a data driven text box

New product tracker, Distribution, Units per Store per Week and Total Sales
3 charts per cell, all from 1 pivot-table

Histograms with summary statistics

WARNING: Bad business analytics may be hazardous to your wealth!

You paid handsomely for the software, perhaps for consulting too, and have had some bright sparks working on it for months: the results of your analytics project are in and the answer is ... useless without some understanding of how good the models it's built on are.  If the analyst cannot give you detail on how "good" the model is for its purpose, all results should come with a wealth warning.


Let's take a few real-life examples:

  • A project to improve sales forecasting where the accuracy of the forecast was not measured either before or after the project.
  • A project to maximize trailer loading (get more tonnage into freight trailers) with such a bad optimization model that it missed most of the opportunity.
  • A system to improve On-Shelf-Availability (the % of product actually on shelf in grocery stores) built entirely from arbitrary rules with no measurement, at all, of...On Shelf Availability.  Check out my post Point of Sale Data – Supply Chain Analytics for more details on On Shelf Availability.
  • Statistical inventory models to identify how much inventory you really need built entirely without  statistics. (Managing hundreds of $millions in inventory value)
  • Countless Excel models that calculate nothing of value.
I could go on..

In many cases the issue is that the people assigned to the task do not have the skills to wield the tools they need.  The trailer-loading project listed above was developed without a real understanding of how to build an optimization model.  The developer had found an extended version of Excel's "Solver" tool on the internet (a good small-to-medium-scale optimizer from Frontline Systems).  Unfortunately, the Excel model was bad enough that Solver could only find the optimal solution to the wrong question: the model ran without throwing an error; it was a small improvement on what went before; the results were implemented; and the opportunity to do it right (worth $millions) was lost for a few years.
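For contrast, a correctly framed loading model, stripped to a toy, looks like a 0/1 knapsack.  The pallet values, weights and trailer capacity below are invented, and a real trailer-loading model would add volume, axle-weight and stacking constraints and use a proper solver, but it shows the shape of the right question:

```python
def knapsack(items, capacity):
    """items: list of (value, weight); returns best total value within capacity.
    Classic 0/1 knapsack by dynamic programming."""
    best = [0] * (capacity + 1)           # best[c] = max value using capacity c
    for value, weight in items:
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]

# Toy pallets: (revenue contribution, weight in 100s of lbs); trailer cap 100.
pallets = [(60, 10), (100, 20), (120, 30), (240, 50), (90, 40)]
print(knapsack(pallets, 100))  # -> 460
```

The lesson from the project above is that getting the model formulation right matters more than the solver: a good optimizer pointed at the wrong question will cheerfully return the "optimal" wrong answer.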

In other cases, and I saw a new one just this last week, the software tools leave out the diagnostics you need to tell whether the model is good.  Predictive analytics tools packaged for business use (like price/promotion modeling packages and sales forecasting tools) tend to do this.  I can only assume that this is to prevent confusing the user.

Before you use any tool's output to make critical decisions, someone with good modeling skills (perhaps your Primary Analytical Practitioner)  needs to check that your models are sound.

As my father taught me: “If a job is worth doing, it's worth doing well.”  How can that not be true when your financial results depend on getting it right?

Point of Sale Data – Supply Chain Analytics

I’ve spent a large part of my career working in Analytics for Supply Chain.  It’s an area blessed with a lot of data, and I’ve been able to use predictive analytics and optimization very successfully to drive cost out of the system.  Much of what I learned in managing CPG supply chains translates directly to retailer supply chains; it’s just that there is much more data to deal with.

If you have not read the previous posts in this series, please check out:
  • [Point of Sale Data – Basic Analytics] to see why you need to set up a DSR; and
  • [Point of Sale Data – Sales Analytics] and [Point of Sale Data – Category Analytics] for some examples of what else you can do with predictive analytics (and relatively basic POS data).

There is a lot that can be done with POS data from a supply-chain perspective.  Let’s start with a few examples:  
  • Use POS data to enhance your forecasting process.  (At least recognize when the retailer’s inventory is out of position, as it will need a correction, thereby impacting what you sell to them.)
  • Measure store-level and warehouse-level demand uncertainty to calculate accurate safety stock and re-order point levels that make the retailer’s replenishment system flow more effectively.  (see [How much inventory do you really need?])
  • Calculate optimal order quantity minimums and multiples to reduce case picking (for you) and case sorting/segregation (for the retailer).
  • Build optimal store orders to correct imbalances in the supply chain and/or correctly place inventory in advance of events. 
  • Where the retailer provides their forecast to you along with POS data:
    • monitor the accuracy and bias of the retailer’s forecasting process and help fix issues before they drive inappropriate ordering. 
    • build your own forecast (a “reference” model) from the POS data and look at where the 2 forecasts diverge strongly. Chances are that one of them is very wrong: if it’s the retailer’s forecast that’s wrong, remember that this is what drives your orders.
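As a sketch of the statistical safety-stock calculation mentioned in the list above: the textbook model combines demand variability with lead-time variability.  The 98% service level, demand and lead-time figures below are illustrative assumptions:

```python
import math
from statistics import NormalDist

def safety_stock(service_level, avg_daily_demand, sd_daily_demand,
                 avg_lead_time, sd_lead_time):
    """Textbook safety stock with uncertain demand AND uncertain lead time:
    SS = z * sqrt(LT * sigma_d^2 + d^2 * sigma_LT^2)."""
    z = NormalDist().inv_cdf(service_level)   # z-score for the service level
    return z * math.sqrt(avg_lead_time * sd_daily_demand ** 2
                         + avg_daily_demand ** 2 * sd_lead_time ** 2)

# 98% service, 200 units/day demand (sd 60), 7-day lead time (sd 1.5 days)
ss = safety_stock(0.98, 200, 60, 7, 1.5)
reorder_point = 200 * 7 + ss     # expected lead-time demand plus safety stock
print(round(ss), round(reorder_point))
```

Note how the answer moves with measured forecast accuracy (`sd_daily_demand`) and supplier reliability (`sd_lead_time`) - exactly the inputs a rule-of-thumb approach ignores.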
All of this is good and I certainly don’t discourage you from working on any of them, but it seems to me that it may be missing the biggest problem.  I may upset a few supply chain folks here (feel free to tell me so in the feedback section), but I’m going to say it as I see it: most retailer supply chain measurements do not help; in fact, they distract from the one thing that is crucial – is the product available on the shelf?

Unless the product is on the shelf where the shopper can find it, when they want to buy it, nothing else matters.  If you delivered to the retailer’s warehouse as ordered, in full and on time, overall inventory levels at the retailer are within acceptable bounds, the forecast was reasonably accurate and the store shows that they were in stock, that’s great – but it’s not enough.  The product MUST be on the shelf where the shopper can find it, or you wasted all that effort.

And yet, it is very rare to see any form of systematic measurement of On-Shelf Availability.  The few I have seen are so obviously biased as to be largely useless.  If you have the resources to do so and are interested in setting up a credible off-shelf measurement system, please give me a call.  Otherwise, how can you know whether your hard work in product supply is really making a difference at this one moment of truth?  You need a good model to flag likely off-shelf situations.  Do this well and you can make effective use of field sales to correct the issues, work with your retailer’s operations team to help fix problem areas, identify planogram problems that are contributing to the issue, or even examine whether some products or pack types are more frequently associated with off-shelf events and could be changed.

It is possible to look at your sales and inventory history and try to “spot” periods where it looks as though the product was not on shelf.  Imagine a product that typically sells in multiple quantities every day in Store A but has not sold now for 10 days: the store reports having inventory all through this period, but no sales.  It seems very likely, then, that this product is not getting to the shelf.  For products that sell in smaller quantities, it gets harder to guess how many days of zero sales would be unusual.

You can build your own Off Shelf Alerts tool by (educated) guesswork and many people do.  Using some statistics you can get much “better” guesses.  Better guesses mean that you miss fewer real alerts, your alerts are correct more often and you find those issues with the biggest value to you.
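Here is one simple version of those “better” guesses: if daily sales are roughly Poisson with rate λ, a run of n independent zero-sales days has probability e^(-λn), so you can set an alert threshold per product instead of using one arbitrary day count for everything.  The 1-in-1,000 threshold is an illustrative assumption:

```python
import math

def zero_run_probability(daily_rate, n_days):
    """P(n consecutive zero-sales days) if daily sales ~ Poisson(daily_rate)
    and days are independent: exp(-rate) ** n == exp(-rate * n)."""
    return math.exp(-daily_rate * n_days)

def days_to_alert(daily_rate, threshold=0.001):
    """Smallest zero-sales run rare enough (below threshold) to flag."""
    n = 1
    while zero_run_probability(daily_rate, n) > threshold:
        n += 1
    return n

# A 3-units/day product: even 3 zero days is a strong off-shelf signal.
print(days_to_alert(3.0))   # exp(-9) ~ 1.2e-4 < 0.001
# A 0.2-units/day slow mover needs a much longer quiet spell.
print(days_to_alert(0.2))
```

The fast mover alerts after 3 quiet days; the slow mover only after more than a month - which is exactly why a single rule-of-thumb cutoff misses real alerts on one product and cries wolf on another.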

Industry studies typically report that on-shelf positions range between the high 80’s to low 90’s in percentage terms.  Fixing this is probably worth 1%-3% in extra revenue.  What is that worth to you?  

Point of Sale Data – Category Analytics

If you haven’t already read the previous entries in this series, you may want to go back and check out [Point of Sale Data – Basic Analytics] to see why you really need a DSR to handle this data, and [Point of Sale Data – Sales Analytics] for some thoughts on analyzing sales drivers that are equally relevant to Category Management.

As Category Manager, you’re working with the retailer to help drive sales for the entire category.  You hopefully have access to the full data for your category (which could be substantially more than your account manager colleagues see).  Let’s see how predictive analytics and modeling could help address some of your challenges:  How well are current planograms performing?  What is the best product assortment for each store?  How can you best balance customization of assortment by store with the work required to create that detail?

  • Understanding which stores are truly similar in terms of what sells and why in terms of demographics, geographies and local competition helps you manage your store list as a small set of groups (“clusters”) rather than trying to deal with each store individually or abandoning store-customization as too time consuming. 
  • Cluster analysis is a relatively simple statistical process (see Cluster Analysis - 101 ) and there are tools available from high-end statistical modeling packages to Excel Add-ins that can handle this process. Preparing the data appropriately, interpreting the results, and creating presentation ready output quickly is more of a challenge. I’m working on something to help with this…watch this space. 
  • Tools that help you find individual products that are not currently listed in a store but sell well in local competitor stores (from syndicated data) or in the retailers own “similar stores” help correct the assortment list.
    It’s relatively easy to find such “missing stars” when you know where to look. For example, stores in areas with lots of Hispanic households should probably stock every one of your top 10 Hispanic-oriented products. But this approach relies on you knowing these associations up front: the analytic approach scans your database to find all possible associations. 
  • Understanding product substitution and the shopper’s decision tree ensures that you add the right products to the assortment without excessive cannibalization of existing products. Tools that handle assortment optimization are becoming more common now. Quite how well they work will depend heavily on how well you clustered stores into “like” stores and how accurate your decision tree is.
  • Automating the assortment-selection and planogram-build processes can allow you to work at lower levels of detail and provide more finely tuned customization.  In many ways, this is not the area likely to generate the best return: the next best option is simply to bring in more temporary labor, and it just does not cost that much to do so. However, what I see in reality is that category managers work extraordinary hours to try to get it done. They can’t work 10x normal hours, so the work cannot get done to the depth that is possible. Neither can they handle changes late in the project with the same diligence and structure they brought to it initially – there is simply not enough time. There is at least one tool on the market for building planograms now, and I may be tempted to build another.
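The “missing stars” scan from the list above can be sketched with simple set logic: for each store, find products that sell well in every similar store but are absent from this one.  The stores, cluster assignments and sales figures are all invented for illustration:

```python
from collections import defaultdict

# Hypothetical inputs: cluster membership and unit sales by (store, product).
cluster_of = {"S1": "A", "S2": "A", "S3": "A"}
sales = {
    ("S1", "salsa_verde"): 900, ("S1", "tortillas"): 700,
    ("S2", "salsa_verde"): 850, ("S2", "tortillas"): 640,
    ("S3", "tortillas"): 610,   # S3 does not stock salsa_verde at all
}

def missing_stars(cluster_of, sales, min_units=500):
    """For each store, list products selling >= min_units in every other
    store of its cluster but not listed in this store."""
    stocked = defaultdict(set)
    for (store, product), units in sales.items():
        if units > 0:
            stocked[store].add(product)
    result = {}
    for store, cluster in cluster_of.items():
        peers = [s for s, c in cluster_of.items() if c == cluster and s != store]
        candidates = set()
        for product in {p for s in peers for p in stocked[s]}:
            if product in stocked[store]:
                continue
            if all(sales.get((s, product), 0) >= min_units for s in peers):
                candidates.add(product)
        result[store] = sorted(candidates)
    return result

print(missing_stars(cluster_of, sales))
```

This is the analytic version of the "Hispanic-oriented top 10" example: rather than relying on associations you already know, the scan surfaces every gap the data supports.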

Predictive analytics like these may seem complex and may only drive a few percentage points of incremental sales.  But then, what’s even 1% of incremental sales worth to you?  Worth handling a little complexity?

Point of Sale Data – Sales Analytics

I’m assuming that you now have a DSR (see [Point Of Sale Data - Basic Analytics] ) so you can manipulate the large quantities of data necessary to do this work, you have your routine reports automated and use the DSR for ad-hoc queries against the POS data. 

The DSR provides a great foundation for analytic work: use it to integrate multiple data sources, clean the data, and handle very large data volumes as though they were all on your desktop, and it will help you build reports that summarize that history with ease. Typically, though, the DSR does not provide much help with predictive analytics.

Let’s look at an example related to what really drives sales.   Do you know?  Can you quantify it?  Knowing these answers with quantified detail can help you better explain your sales history and plan for the future.  Better promotions, better pricing, supply chains that anticipate peaks in demand and make sure the product is on the shelf when it’s needed.  Here are some of the things that could drive your sales:
  • Pricing and promotions
  • Competitor price gaps
  • Competitor promotional activity
  • On-shelf availability
  • Activity in competitor stores
  • Weather: rainfall, temperatures, hurricane warnings
  • Seasonal events
  • Day of the week, day of the month
  • Availability of disposable income (government assistance programs, pay cycles)
  • Placement in store, placement on shelf, product adjacency
  • Store-area demographics

If you are fairly sure that just one thing drives the majority of your sales you can get a decent estimation of its impact visually with charts and tables.  Here’s an example:

In the USA, money for the government assistance program SNAP (previously known as Food Stamps) is given out on specific days of the month.  Some states even make the entire month’s allowance available on just one day every month.  If your product is heavily impacted by the availability of SNAP dollars and you plot average sales by day of the month for a state, you can clearly see the impact that this has on those days, and the residual effects 2-3 days later.
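That day-of-month profile is easy to sketch.  The toy data below fakes a SNAP-style bump on the 1st-3rd of each month; in a real analysis the daily sales would come straight out of the DSR by state:

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Toy daily sales: baseline 100 units, with a bump on the 1st-3rd of each
# month to mimic a SNAP disbursement effect (invented numbers).
sales = []
day = date(2011, 1, 1)
while day <= date(2011, 6, 30):
    bump = 80 if day.day <= 3 else 0
    sales.append((day, 100 + bump))
    day += timedelta(days=1)

# Average sales by day of month: the disbursement days stand out clearly.
by_dom = defaultdict(list)
for d, units in sales:
    by_dom[d.day].append(units)
profile = {dom: mean(units) for dom, units in sorted(by_dom.items())}
print(profile[1], profile[2], profile[15])  # 180 180 100
```

Plot `profile` as a bar chart and the disbursement-day spike is unmistakable - which is exactly the "one dominant driver" case where a visual estimate is good enough.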

If your sales are driven by multiple factors, trying to tease apart the impact of each is going to need more complex analysis, but in many, many cases it can be done: either by moving the data across to external analytic routines or by embedding predictive analytics directly into your DSR.
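Teasing apart multiple drivers usually means some form of regression.  Here is a minimal ordinary-least-squares sketch on synthetic data where sales are built from a known discount effect and a known display effect, so the fit should recover those coefficients exactly; real data is noisier and needs proper diagnostics:

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(xs, ys):
    """Fit ys ~ intercept + coefficients . xs via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# weeks: (discount depth 0-0.3, display flag 0/1) -> unit sales, generated
# from a known model (100 baseline + 300*discount + 40*display) so the
# regression should hand the coefficients straight back.
weeks = [(0.0, 0), (0.1, 0), (0.2, 1), (0.3, 1), (0.0, 1), (0.2, 0)]
sales = [100 + 300 * d + 40 * disp for d, disp in weeks]
intercept, per_discount, per_display = ols(weeks, sales)
print(round(intercept), round(per_discount), round(per_display))
```

With quantified coefficients in hand, "what drives sales" stops being an opinion and becomes an input to pricing, promotion and supply planning.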

Predictive Analytics like this may seem overly complex and will probably drive just a few percentage points of incremental sales.  But then, what’s 1% of incremental sales worth to you?  Enough to cope with a little complexity?

All posts in this series

Point of Sale Data – Basic Analytics

You've got access to Point of Sale Data…now, what are you going to do with it?

For the purpose of this blog entry, I’m assuming that we have daily aggregated data by product and by store.  We will certainly get measures of sales (both units sold and currency received).  We may also get other useful measures like inventory on-hand, inventory in-transit, store-receipts, mark-downs taken at the store and perhaps some data around warehouse activity too.

[Note: Aggregated data is not as potentially useful to us as individual transaction records but it’s more readily available so we’ll start with that.]

Now, this can be a lot of data (2 years of daily data for 10 measures, 100 products and 2,000 stores is almost 1.5 billion data points) - you are not going to handle this in Excel or Access.  If you forget about daily data and pull weekly aggregations, and forget about wanting data by store, you can reduce this a lot, to a little more than 100,000 data points, BUT you have thrown away the opportunity to do much of the more value-added activity I will get to later.
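The arithmetic behind those data-point counts, for anyone who wants to check it:

```python
# Back-of-envelope check on the data volumes quoted above.
measures, products, stores = 10, 100, 2000
daily_points = 2 * 365 * measures * products * stores   # 2 years, daily, by store
weekly_no_store = 2 * 52 * measures * products          # weekly, no store detail
print(f"{daily_points:,}")      # 1,460,000,000 - almost 1.5 billion points
print(f"{weekly_no_store:,}")   # 104,000 - fits comfortably in a spreadsheet
```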

Check out my blog post [Data Handling - the right tool for the job] and then set up your own Demand Signal Repository (DSR).    The DSR is designed to handle these data quantities and should enable standard reporting (as outlined above) straight out of the box. 

Now, you can figure out what you sold last week… or the last 4 weeks, or last 13 weeks, or Year to Date or the same periods for the prior year and I’m sure you will.  Aggregate this data with product hierarchies defined by the Retailer or with your own corporate hierarchies for discussion with your head-office.  You can calculate growth (or decline) and perhaps pricing and distribution.  You can even integrate some of your own corporate data (like shipment details) or externally audited data to combine into reporting.  

All very necessary, but rather basic and a long, long way from the value you could drive from the same data.  If you are looking to POS data to provide you with some real competitive advantage you will have to try a little harder.  The DSR is your cost of entry into this space and provides a solid foundation for deeper analysis, but ownership of a DSR does not, by itself, provide any competitive advantage.  What you do with this tool defines your competitive advantage.  Use it to automate a few reports that you used to pull manually and you save a few hours a week; nobody, other than you, is going to notice.  You do need to do this, but it's not an end-point; it just frees up enough time for you to consider taking bigger steps.

Use your DSR to find opportunities to reduce off-shelf situations, drive incremental sales or reduce cost to supply, and you are talking real money.  What is an incremental 1% of sales worth to you?  A lot more than the labor saved in report automation, I bet.

In the following series of blog entries, I’ll suggest a few thoughts on more value-added activities.  Now, this will need incremental investment and additional skills, but then you didn’t think competitive advantage would be free, did you?

Is the juice worth the squeeze?

I have heard this phrase a lot in recent months in a business context.  It’s so visual, I love it! 

It’s not quite enough though.  It’s pretty simple to understand that every project must be able to pay for itself and deliver a return.  Is the juice worth the squeeze?

It’s also true though that no organization has infinite resources of time or money.  If you have 10 projects that you could do but only enough resources to handle 3, you must prioritize those projects that help you meet your objectives (growth, profitability, market share).  What has the most bang for the buck? 

So with these 2 phrases in mind, it should now be easy…right?  (Can you hear the sarcasm?)

In the past, I have been guilty of figuring out that something can be done and then wanting to rush ahead to do it.  I have been there, I have done it and I have been surprised when a manager along the way was uninterested in my project or was actively trying to shut it down. 

My enthusiasm was high, I had the capability to deliver but my ability to see the bigger picture from a business perspective was lacking.  Perhaps it’s a gap in my formal educational process that focused on business math and statistics rather than finance or managerial common sense?  I’m afraid I am not alone however; many analysts and process improvement folks seem to suffer from the same condition.

I think it’s fair to say that my working experience has largely corrected this: a spell in finance working on investment appraisal helped; running a logistics development team that generated more good ideas than we had capacity to work on helped too.  Being responsible for delivering financial results may have been the clincher. 

Analysts can be blinkered, I admit it, but this inability to see the big picture is not restricted to technical folks.  Managers see the complexity of analytics and have their own knee-jerk reactions.  Some perceive high-risk and immediately, perhaps unconsciously, discount the net benefit they are likely to receive.  In other cases I have seen an almost cult-like view that is without justification and disconnected from the results that could have been predicted  (“do this and good stuff will happen”).

In both cases managers are limited by their inability to estimate cost and/or benefit.  This is where a good analyst can really help. 

Note that for prioritizing most projects we do not need extreme precision in cost or benefit.  Really good projects do not have to scrape a return; the poor ones are usually struggling to hit whatever hurdles your finance team has put in place (e.g. NPV, IRR or payback).  What you need is a reasonable estimate backed up by sound analytics and whatever benchmarks you can lay your hands on.  Let’s take a few examples:

Project A:  Implementing an upgrade to the warehouse management system that converts all current paper-based processes to run on the existing computers. 
  • Based off a time-study, this is likely to save 10 minutes per order
  • The network of distribution centers processes approximately 3000 orders per day
  • Cost to implement is estimated at 13 weeks of development time
  • Warehouse labor costs about $25/worked-hour including all benefits
  • Development labor costs around $100/worked-hour including all benefits

Annual Savings:
10 x 3000 x 365 = 10,950,000 minutes/year
= 182,500 hours/year
= $4,562,500/year
One-off Costs:
13 x 40 x 100 = $52,000
Without calculating NPV or IRR or Payback, I think we can clearly see that this would be a very, very good project.  Focus only on the savings per person (forgetting that there are a lot of them) and you could easily miss this opportunity.

Project B:  Report Automation: in our sales office, our analysts currently spend around 10 hours each, every week, preparing standard reports from Point of Sale (POS) data.  Automating these reports would be very popular, removing a tedious, repetitive part of the work.  Automation of each report takes about 1 week of developer time. 
  • We have 10 sales analysts producing 40 reports
  • Sales analysts typically cost about $70/worked-hour including all benefits
  • Development labor costs around $100/worked-hour including all benefits
Annual Savings:
10 x 10 x 52 x 70 = $364,000
One-off Costs:
40 x 40 x 100 = $160,000
Our annual savings do outweigh the costs… or do they?  For the savings to be real we have to stop paying for these hours (the equivalent of 2.5 people) or be able to reinvest them in other work that also generates a return.  Will we?  Frankly, report automation is more reasonably justified by the eradication of error and the consistency of output that make it easier to manage the thing you are reporting on – perhaps $billions in sales.
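The arithmetic for both projects is simple enough to verify in a few lines (Python used here purely as a calculator):

```python
# Project A: paper-to-digital warehouse upgrade
minutes_saved = 10 * 3000 * 365        # min/order x orders/day x days/year
hours_saved = minutes_saved / 60
project_a_saving = hours_saved * 25    # warehouse labor at $25/hour
project_a_cost = 13 * 40 * 100         # 13 weeks x 40 hrs/week x $100/hour

# Project B: report automation
project_b_saving = 10 * 10 * 52 * 70   # analysts x hrs/week x weeks x $/hour
project_b_cost = 40 * 40 * 100         # reports x hrs/report x $/hour

print(project_a_saving, project_a_cost)  # 4562500.0 52000
print(project_b_saving, project_b_cost)  # 364000 160000
```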

Project C:  Use Point of Sale (POS) data to improve the accuracy of the forecast we build for manufacturing planning.
  • Our current Forecast Accuracy is 75% for 1 month out.
  • We believe that incorporating POS data into the forecasting process could improve forecast accuracy.

(This is where your analyst should help, because you really need a lot more information, knowledge of how inventory buffers uncertainty, a decent model, a pilot and good benchmarks to figure out what this is worth)

A small pilot project using POS data and shipment history from the last 3 years to predict sales for last year suggests we could improve forecast accuracy by 3 – 7 percentage points.
Finished goods inventory is what buffers the manufacturing plant from uncertainty in demand.  With a better forecast you need less safety stock (see [How much inventory do you really need ?] for more details and a handy inventory model).  Using the inventory model:
  • The safety stock portion of our overall inventory is currently 1.8 weeks of supply.
  • A 5 percentage point improvement in forecast accuracy (from 75% to 80%) is worth about 0.4 weeks of supply.
  • From Finance we understand that 1 week of supply is worth approx. $12 million at cost.
  • Our weighted average cost of capital is 12%, so projected working capital savings are ~$1.4 million.
  • We may be able to save on storage costs too (assuming they are variable, not fixed).  Converting the inventory saving into pallet positions, we estimate freeing about 20,000 pallet positions at a current cost of $5 per pallet per month: (20,000 x 5 x 12) = $1.2 million.
  • Note: there are no ongoing savings in handling costs as we have reduced inventory, not throughput (or sales would have dropped too).  A one-off saving in handling while inventory levels fall could be included but would be relatively immaterial.

Our pilot project has also helped us understand exactly how we can enhance the forecasting process with POS data and allowed us to cost the necessary changes to the forecasting system at approx. $1 million in one-off cost.   So we end up as follows:
Annual Savings:
$1.4 million in cost of working capital
$1.2 million in variable storage costs
One-off Costs:
$1 million
This juice is (probably) worth the squeeze.  With a payback around 2.5 years it should be on our list of viable candidates.  Remember that the pilot said the accuracy improvement was in a range of 3 to 7 percentage points; we evaluated the average here.  At 3 points the costs stay the same but the savings would be only 60% of those calculated (not so good).
Does it give the best bang for the buck?  Well we will have to line it up against all other projects competing for our resources to know that.  My guess… probably not.
By the way, the inventory modeling exercise also said that you have 0.5 weeks of unnecessary inventory in the system.  Perhaps it would be better to start by trying to eradicate that.

The bottom line for you is that you should consider using this sort of Analytical Estimation in deciding which of your projects make the cut.  A good estimate is a lot better than a bad guess.

BTW - from recent experience, I can confirm that beetroot juice is most definitely not worth the squeeze :-)

How much inventory do you really need?

If you are following lean methodologies you will have encountered the concept of inventory as waste.   It’s something you have because you cannot instantly manufacture and deliver your product to a shopper when they want it, but not something that the shopper sees any value in.

I find that a very interesting idea as it challenges the reasons you need inventory, and that’s definitely worthwhile.  However, removing many of these causes of inventory requires more substantial changes to your supply chain (additional production capacity, shorter set-up times, multiple production locations), so as a first step, I suggest you figure out what inventory your supply chain really needs and why.  Take out the truly wasted, unnecessary stock and then see what structural changes make sense.

Typically you can remove at least 10% of inventory while improving product availability. What’s that worth to you?  If that sounds a little aggressive, I can only say “been there, done that, got the coffee-mug”. (We didn’t do t-shirts).

I’m going to look at this from a manufacturer’s perspective and try to visualize for you why they need inventory and how to quantify the separate components.  Why does a manufacturer have inventory?  Let’s start with an easy one:

Cycle Stock is related to how often you add new inventory to the system.  Manufacturing lines typically make a number of different products, cycling through them on a reasonably consistent schedule.  “We make the blue-widgets around once a month”.  It may not be exactly a month apart and that’s not really important to us.  Once a month, in this case, a batch of blue widgets is made and added to inventory.  Over the course of the next month (or so) that inventory is consumed and inventory drops until we make another batch.  Over the course of 12 months it would look something like the example below.

Cycle Stock across time

Hopefully it’s not too hard to see that while “Cycle Stock” varies from 0 to about 30 days worth of demand, it will average out to about half-way between the peak and trough – roughly 15 days.  If you manufacture your product less frequently, say once every 2 months, Cycle Stock will peak at 60 days of demand and, on average, adds 30 days of inventory to your overall stock position.  If you manufacture your product once a week, Cycle Stock will peak at 7 days of demand and, on average, adds 3.5 days of inventory to your overall stock position.

If you want to reduce Cycle Stock you need to make your product more often.  That probably means reducing the time and lost production associated with line change-overs so changing more frequently is less painful.

Pipeline stock is slightly harder to explain but really easy to calculate.  Pipeline stock is inventory in your possession that is not available for immediate sale.  Good examples would be inventory that is in-transit, or awaiting release from quality testing: you own it but you can’t sell it yet.  Let’s say that from the point of manufacture it takes 3 days to move the product to your warehouse where it can be combined with other products to fulfill customer orders.  This has the effect of increasing your inventory by exactly … 3 days.  It really is that simple.  If you know how long inventory is yours but unavailable to meet demand, you know your Pipeline Stock.

If you want to reduce pipeline stock you need to reduce testing time post-production, get your product to market faster, or even consider adding production capability nearer to your markets to reduce transportation time.

Inventory Build.  This is easy to describe but very difficult to model.  For products with large variations in sales volume across time (typically but not always due to seasonality) there may not be enough production capacity to manufacture everything you need just prior to the demand.  As long as the product can be stock-piled, the manufacturer just makes it earlier and holds it until it’s ready for sale.  If you want an example, think of Halloween candy: it hasn’t really just been made in early October. 

So why is it so hard to calculate?  Well, inventory models are typically built one product at a time, but to know your production capacity availability you need to look at all products using shared resources and production-lines simultaneously and build a production plan that understands all your constraints and your planning policies.  Essentially, you need to build an entire (workable) production plan and that’s typically beyond the scope of an inventory modeling exercise.  Often the best place to get this is from your production planner.

If you want to reduce Inventory Build you may be able to do so by more effective production-planning (optimization models may be able to help here).  Alternatively you will need to add production capacity.

Safety Stock is the most complex part of the calculation but thankfully the math is not new and you can buy tools that do this for you.  You can’t make whatever you want whenever you want it (or you would have little need for any inventory).  If I were to tell you now that we need another batch of “Red Doodas” it’s going to take some time to organize that.  Apart from purchasing raw and packaging materials you may need to break into the production schedule, reorganize line labor, perhaps even organize overtime shifts.  You may say that you could get that done in about 7 days by expediting, but you probably do not want to plan on having to expedite very much of your production.  So, think of something more reasonable, an estimate not too conservative but one that you could stick to most of the time… 21 days?  Let’s work with that and call it the “Replenishment Lead Time”.

Now, I want to set my safety stock so that it buffers me from most of the uncertainty I could encounter during the Replenishment Lead Time.  It seems highly unlikely that I will sell exactly what was forecast in the next 21 days.  If I sell less I am safe, if unhappy.  If I sell more I need a little extra stock to help cover that possibility.  Similarly, even though I asked for 1000 “Red Doodas”, production does not always deliver what I asked for and sometimes it takes a little longer than it should too.  By measuring (or estimating) each of these sources of uncertainty and then combining them together we can get a picture of the total uncertainty you will face over the replenishment lead-time.  If we also know what level of uncertainty you want the safety stock to cover, we can calculate a safety stock level.

Typically the amount of uncertainty you wish it to cover is expressed in terms of the % of total demand that would be covered.  So, 99% means that safety stock would target fulfilling 99% of all product ordered.  The other 1% would, sadly, go to back-orders, be pushed to future orders, or possibly be lost completely.  As CPG case-fill rates (a measure of the proportion of cases fulfilled as ordered) are typically closer to 98%, 99% is actually rather high.

[Note: Don’t go asking for the safety stock to cover 100% of all uncertainty as this (theoretically at least) requires an infinite amount of safety stock]
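The combining step can be sketched with the standard textbook safety-stock formula: a normal-distribution model of demand over an uncertain lead-time.  The inputs below are illustrative, not from this post, and strictly speaking the z-value here sets a cycle-service level rather than the fill-rate discussed above, but it shows how the pieces fit together:

```python
from math import sqrt
from statistics import NormalDist

# Illustrative inputs - not taken from the post
lead_time_days = 21      # replenishment lead time
daily_demand_mean = 100  # units/day
daily_demand_sd = 30     # demand uncertainty (std dev of daily demand)
lead_time_sd = 3         # supply uncertainty (std dev of lead time, days)
service_level = 0.99     # share of uncertainty you want covered

# Std dev of demand over the (uncertain) lead time: the classic formula
# combining demand uncertainty and supply uncertainty.
sigma_lt = sqrt(lead_time_days * daily_demand_sd ** 2
                + (daily_demand_mean * lead_time_sd) ** 2)

z = NormalDist().inv_cdf(service_level)  # ~2.33 for 99%
safety_stock = z * sigma_lt
print(round(safety_stock))               # ~768 units for these inputs
```

Note how sensitive the answer is to each input: halve the demand uncertainty or the lead-time and the safety stock falls sharply, which is exactly the point made below.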

Here’s our previous example with some additional variation (uncertainty) added in demand.  Safety stock has been set so that you should meet 99% of all demand from stock and production kicks off when we project inventory will drop below the safety stock level 30 days ahead.

Cycle and Safety Stock across time

If there was no uncertainty the inventory would have a low point at exactly the safety stock level with production immediately afterwards.  Clearly actual sales did not turn out exactly as forecast.  Sometimes we sell less (and inventory is a little high when production kicks in).  Sometimes we sell more and sales start to use up the safety stock.

The safety stock level is intended to buffer most of this uncertainty, but as you can see, inventory does occasionally drop to 0 and (for very short periods of time) you would not have enough inventory to meet all orders.  On the days when this happens you will short a lot more than 1% of the ordered quantity but over time this would average out to about 1%.

If you want to reduce Safety Stock you have a few options.  Remember that the key inputs are:
  • Replenishment Lead-Time
  • Demand Uncertainty
  • Supply Uncertainty
  • % of uncertainty you want to cover.
If you can reduce any of these, your safety stock will come down.
I've embedded a simple inventory model below that you can use to experiment with the various inputs that drive your need for inventory.

Once you have set up the inputs appropriately for your business take a look at what a change to any of these inputs would do for total inventory.  What if you can:
  • improve Forecast Accuracy by 5 points;
  • reduce Replenishment Lead-Time by 1 week;
  • reduce your Cycle Time by 50%;
  • reduce Pipeline Length by 25%?
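As a rough sketch of how the components described above combine (all figures in days of supply; the inputs are illustrative, not from the embedded model):

```python
def total_inventory_days(cycle_days, pipeline_days, build_days, safety_days):
    """Average inventory, in days of supply: half the production cycle,
    plus pipeline stock, plus any pre-build, plus safety stock."""
    return cycle_days / 2 + pipeline_days + build_days + safety_days

# Monthly production, 3-day pipeline, no build, ~1.8 weeks of safety stock
base = total_inventory_days(30, 3, 0, 12.6)
print(base)  # 30.6 days

# What-if: make the product weekly instead of monthly
print(total_inventory_days(7, 3, 0, 12.6))  # 19.1 days
```

Even this toy version makes the trade-offs visible: the cycle-stock term only ever contributes half the production interval, so shortening the cycle has a predictable, bounded payoff, while the safety-stock term depends on all the uncertainty inputs discussed above.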


Uncertainty of demand is typically measured by “forecast accuracy”.  There are some variations on the calculation of forecast accuracy but here I am using it as (1 – [Mean Absolute Percentage Error]) measured in monthly buckets.  [Mean Absolute Percentage Error] may seem a little scary, but it actually does exactly what it says: it’s the average, absolute error as a % of the forecast.  (Absolute errors treat negative values as positive.)
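A minimal sketch of that calculation (dividing by the forecast, as described above; be aware that some definitions divide by actuals instead):

```python
def forecast_accuracy(forecast, actual):
    """1 - MAPE, with each absolute error taken as a % of the forecast."""
    errors = [abs(f - a) / f for f, a in zip(forecast, actual)]
    mape = sum(errors) / len(errors)
    return 1 - mape

# Three monthly buckets: forecast vs actual units
print(forecast_accuracy([100, 200, 100], [90, 220, 105]))  # ~0.917, i.e. ~92%
```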

Forecast accuracy is typically measured in fixed periods that are relevant to you.  These may be close to, but are typically not the same as, your Replenishment Lead-Time, so the model will try to estimate the value it needs from the standard metric.

If you are not already measuring your own forecast accuracy, you really do need to start.  A forecast with no sense of how accurate it is, is (relatively) useless.

Disclaimer:  This tool is a reasonable guide and should give you a good sense of what is driving your need for inventory and what you might do to reduce it.  Ultimately though, it’s limited by the complexity I wanted to include in the Excel model it’s based on and, of course, it can only handle one product at a time.  Don't use it to build your inventory policies - invest in the real thing.

Are pie-charts evil or just misunderstood ?

OK - I admit it: when I see a pie-chart in business analytics, my blood pressure rises and, yes, I am apt to tell the analyst exactly what I think of the monstrous, hard to read, waste of space and ink they created.

I am most definitely NOT the first person to suggest that pie-charts are overused and close to useless.  Google 'Pie Charts are Evil' and see for yourself.  This is an area where career analysts tend to agree, yet, in practice, pie-charts are very, very heavily (ab)used.  In the hope that I can influence even a handful of people to create fewer of these eyesores I'm adding my voice to the argument.

Let's look at an example and see how it might develop.  I want to display share of sales for my company vs. the rest for the last month.  Something like this:
example 1
It's not very pretty, is it?  Perhaps if we made it 3D?  (Just 3 clicks in Excel)

example 2
And "explode" it - that even sounds cool (4 more clicks).

example 3
Cool-er, but I want to see 'our' portion at the front so rotate it. (Another 4 clicks)

example 4

Now let's play with the 3D settings, surfaces and add a shadow (about 12 more clicks).

example 5

This is starting to look very cool!  It is a little hard, though, to see what the share is for 'Vendor G'.  A bit more than 25%?  Less than 33%?  But we can fix this!  With just 2 more clicks we have data labels.

example 6
Now that looks really cool and it's a... monstrous, hard to read, waste of space and ink .

Let me list the ways I find it offensive:
  1. Graphical display is supposed to help us visualize data.  This display manages to visualize "Our share of sales last month was 28.8%".  Just how much visualization do you need for that? 
  2. It turns out that we (humans, not just analysts) are quite bad at estimating proportions from area.  This was an issue in the very first example.  And there is nothing to help us - pie charts have no scale to reference, so you're stuck estimating based on a few proportions you know well (halves, thirds, quarters).
  3. Guessing the actual share got harder and harder as the chart got 'cooler'.  Going 3D, exploding the section and rotating so that we no longer had at least one part of our segment at 12 o'clock all made it harder to read.
  4. Adding shadow, highlights and texture probably did not do much harm, but it certainly did not help visualize our one and only fact. (Yes, I do know there are 2 numbers in the chart but as one is easily calculated from the other, essentially we have one useful piece of information)
 Expand this to multiple values and we can examine another issue.

example 7
Is it at all obvious to you which pie segment relates to which Vendor?  Note that I did not devise this color scheme for the example; it's the default that Excel provided when I added more data to the previous example.  If we are trying to spot categories by color they have got to be very high contrast, and when we get to 8 different colors it's very hard to tell some of them apart.  (The color-blind could give up long before 8.)  In this case, I can figure it out from the thin orange segment which is clear to me.  The others are then charted (and labeled) in sequence so I can figure each one out.  But does this not seem like an awful lot of work for your reader?  

What if we add category labels to the pie segment too ? 

example 8
If you think this is an improvement, you should probably stop reading now and let's just agree to disagree.

By the way, those folks that did stop reading at the last sentence are so wrong !

Most of the blogs and articles I have read on this subject (quite a few, but probably less than 1% of the voluminous published material) spend a lot of time telling you why pie charts are wrong but less time suggesting alternatives.  Let's try and address that.

How about this as an alternative to example 8?
example 9
This just works.  A simple bar chart immediately fixes 2 of our issues.  
  • We are no longer trying to estimate proportion from area (which we are bad at) but from length (where we are substantially better).  Even without the data labels you could make an intelligent guess as to actual values, more so if I added a few more grid-lines and scale labels.
  • The category label is directly (and much more neatly) tied to its visualization.  No need for multiple colors, no trying to figure out which number relates to which category.
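You don't even need a charting tool to get a length-based display; text alone will do.  A sketch with hypothetical vendor shares - note how each label sits directly beside its bar, just as in the bar chart above:

```python
# Hypothetical vendor shares (percentages, summing to 100)
shares = {"Vendor A": 28.8, "Vendor B": 22.1, "Vendor C": 17.5,
          "Vendor D": 12.0, "Vendor E": 9.6, "Other": 10.0}

# Length, not area: one '#' per percentage point, largest first.
for vendor, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{vendor:<10} {pct:5.1f}%  {'#' * round(pct)}")
```

This is essentially what Excel's in-cell data bars (example 12 below) do with nicer formatting.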
Perhaps it's not as "cool-looking" and if you really wanted to, you could go 3D, switch out to pipes rather than bars and apply a texture or shadows.

example 10

Personally I don't think this adds any value for people who actually want to read and use it.  If you are publishing to people who don't want to read your material but may be impressed by the look of the graphic, perhaps it has a place?

How about this as an alternative: a simple table.  Yes, it's primitive, I know, but it really does work; try reading it.
example 11
  Or how about this, a "cooler" version using Excel's in-cell data bars?
example 12
Let's wrap this up by going back to our very first example.  We wanted to show share of sales for 'our' company.  On my screen (yours may well be different) this uses roughly a 2" x 3" space (5cm x 7.5cm) to display only one useful number: 28.8%.  If that's all the data you have to display and you have such a space to fill, well, maybe, it's OK to use a pie chart.  But can't you display something a little more useful? 

example 13
Same space, much more data and, very importantly, much more useful.

So, back to the question at the top of this post.  Are pie charts evil or just misunderstood?  To my mind, they are 99.9% evil and misunderstood - if in doubt, don't use them.

Analytic tools "so easy a 10 year-old can use it"

When you search the web you'll find lots of analytic tools to support your business: tools to help with forecasting, inventory-optimization, risk-analysis, simulation for production lines and warehouses, production-scheduling, supply-chain network design, vehicle-loading, price-sensitivity modeling and planogram building - and that is very, very far from being an exhaustive list.

Some of these tools are bought as a service that includes expertise to prepare your data, do the modeling work for you and configure the system to meet your needs. These tools will be much more expensive than the "you're on your own" variety and the more frequently that expertise is required, the more you will pay.

If the business problem you are working on is relatively stable (meaning that the same issue comes up repeatedly and your business rules around how to solve it stay the same) such a system can have a long and fruitful life once set up. Vehicle loading applications are a good example - once configured to your needs they should need only infrequent updates to work well.

Many analytic applications are more one-off in nature and if you use an expert to help you with every use it will cost you a substantial amount of money. So, do you really need all that expertise or is it simple enough that you can pick it up? Desktop analytic tools that could help you deliver millions in savings can be bought for a few thousand dollars. Many of them are marketed as being simple to use, and imply that you'll be able to "pick it up" as you go with, perhaps, a little training.

Let's look at an analogy - driving a car. Cars are very complicated things and yet almost every adult can drive one. Why would business analytic tools be any different? Cars are certainly complicated, but I see no reason a 10 year-old could not master driving one, because the driver does not need to understand or master most of that complexity.

  • Cars have been around for a long time and have seen billions, perhaps trillions of dollars invested to make them robust and hide almost all the complexity from the driver.
  • Cars have relatively few controls that you need to understand to operate them: ignition, steering wheel, brake, accelerator, turn-signals, lights, mirrors, windscreen-wipers and gear selector. I know there are more gadgets, but master these few (and I hope I missed nothing important) and you can drive a car.
  • You get instant feedback when you use these controls badly and that really helps you learn.
  • Most of us get a lot of practice driving a car and some of us are still quite bad at it :-)

In contrast most business users are ill-equipped to take on analytic tools where the complexity remains very apparent: there can be hundreds of options to choose from in building and executing your models and these must be understood to use the tool effectively.

Let's make a distinction here between getting a tool to work and getting it to work effectively. I've seen a lot of spreadsheets using Excel's Solver add-in, which solves optimization models: you build a spreadsheet that models some aspect of your business then ask it to find the combination of input values that maximizes your output (profit, sales or some such value). If you built a model with no errors and avoided a handful of functions that this optimizer can't handle, it will run and most likely return an answer. It may even be able to guarantee (depending which algorithm it used) that this is the absolutely best answer available - the optimal solution to your model. Unfortunately, many of these models bear little resemblance to the real-world issue you are trying to optimize and consequently the optimal solution for the model is not particularly good (and certainly not optimal) in the real world.
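To make that distinction concrete, here is a toy "production mix" model of the kind people build around Solver (the data is invented, and it's solved here by brute-force search rather than by Solver itself).  The answer it returns is optimal - but only for the model:

```python
# Toy model: maximize profit from two products sharing one machine.
profit = {"widgets": 3.0, "gadgets": 5.0}  # $/unit
hours = {"widgets": 1.0, "gadgets": 2.0}   # machine-hours/unit
capacity = 100                             # machine-hours available

# Brute-force search over whole-unit production plans.
best = max(
    ((w, g, w * profit["widgets"] + g * profit["gadgets"])
     for w in range(0, 101)
     for g in range(0, 51)
     if w * hours["widgets"] + g * hours["gadgets"] <= capacity),
    key=lambda plan: plan[2],
)
print(best)  # (100, 0, 300.0) - optimal for THIS model
```

The model says "make only widgets", and for these four numbers that is provably the best plan.  Whether it is any good for the real plant depends entirely on whether those four numbers capture the real plant - which is exactly where many Solver spreadsheets fall down.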

There's one good example of an analytic application that really is "so easy a 10 year-old can use it" and it's not a business application at all. It came to prominence in the last few years and is now available in cars, on low-cost specialized devices, even on your phone - GPS navigation. Behind the scenes there is a large, accurate database of road networks and locations along these roads and an efficient routing algorithm (the analytic part) to find the shortest route between any 2 points. The whole thing is wrapped into a well-designed user interface so that all this complexity is hidden from the user, and we don't care because it just works, all of the time. Well, almost all of the time :-)

In my opinion, there are very, very few analytic tools that are truly accessible to business users without some configuration, packaging or guidance from an expert. If you are serious about solving a business issue that needs some analytic work, you need the tool and the mechanic to wield it.

The Primary Analytics Practitioner

The field of Business Analytics can be very complex. Top level analysts are experts; just like medical specialists, they have undergone years of additional training and know their area of specialty (perhaps price sensitivity, multivariate statistical modeling, survey analysis or mathematical optimization) backwards. Keeping with this analogy, most business managers are as well informed as to what business analytics can do for them as a patient heading in to see their primary physician; perhaps less so.

In the medical world there is a wealth of information specifically written for non-experts available on the web. We see legions of ads for pharmaceuticals that inform us about new conditions and their chemical solution. Many of us will routinely see a primary care physician (that’s ‘General Practitioner’ for those of you in the U.K.) and at least be familiar with our own complaints. When we have an issue with our body, we head to the doctor and let him direct us.

In the business analytic world we have done less well at making it accessible to managers. Most of the published material, easily accessible via the web, is not intended for general consumption. It’s very useful to those already initiated into the craft, but if you do not have a very strong mathematical/statistical background, good experience applying these skills in a business environment and some idea what to search for, it may as well not exist at all. An excellent example: there is no equivalent of WebMD’s SymptomChecker. Perhaps that is also why I have seen no analytical version of a hypochondriac :-).

As a business manager, when you encounter a tough problem you often just try to ‘work it out’. After all, that's what managers are paid for, right - to "manage"? You self-medicate, potentially causing harm or, at least, not addressing the real issue.

Well, the web is a relatively new phenomenon, whereas people going to see a Primary Care Physician is certainly not. What we are missing most, I think, is the role of Primary Analytics Practitioner. Someone a manager turns to who can:

  • Help you identify the real issue(s) and frame potential solutions
  • Personally handle 90% of the issues that arise       
  • Source technically adept specialists when the need arises

The business analytics world is new – it has no certification process, no minimum educational requirements and no clear terminology to identify practitioners by what they do best. In my experience there are many good analytical providers out there, most of whom have expertise in specific areas. Sadly, you cannot open the Yellow Pages or search online and find a Primary Analytics Practitioner. (I just Googled the term and found nothing useful to me.) In finding one, you're on your own for now, but take heart: these people are out there, it's not just me.

When you do stumble across someone with a breadth of analytical capability, an understanding of your business problem, the ability to handle many of your analytical needs directly and the willingness to say “I don’t know this well enough, but I can find someone who does”, use them wisely and stop self-medicating.

Bringing your analytical guns to bear on Big Data – in-database analytics

I've blogged before about the need to use the right tools to hold and manipulate data as data quantity increases (Data Handling - the Right Tool for the Job).  But I really want to get to some value-enhancing analytics, and as data grows it becomes increasingly hard to apply analytical tools.

Let’s assume that we have a few Terabytes of data and that it sits in an industrial-strength database (Oracle, SQL Server, MySQL, DB2, …) - one that can handle the data volume without choking.  Each of these databases has its own dialect of the query language (SQL), and while you can do a lot of sophisticated data manipulation, even a simple analytical routine like calculating correlations is a chore.
Here's an example:

SELECT (COUNT(*)*SUM(x.Sales*y.Sales) - SUM(x.Sales)*SUM(y.Sales)) /
       (SQRT(COUNT(*)*SUM(SQUARE(x.Sales)) - SQUARE(SUM(x.Sales))) *
        SQRT(COUNT(*)*SUM(SQUARE(y.Sales)) - SQUARE(SUM(y.Sales)))) AS correlation
FROM BulbSales x JOIN BulbSales y ON x.month = y.month
WHERE x.Year = 1997 AND y.Year = 1998

extracted from the O'Reilly Transact SQL Cookbook

This calculates just one correlation coefficient between two years of sales.  If you want to calculate a correlogram showing correlation coefficients across all pairs of fields in a table, this could take some time to code, as you are re-coding the math every time you use it, with the distinct possibility of human error.  It can be done, but it’s neither pretty nor simple.  Something slightly more complex, like regression analysis, is seriously beyond the capability of SQL.
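For comparison, the same Pearson formula takes a few transparent lines in a general-purpose language, and the correlogram becomes a short loop rather than a fresh block of hand-coded SQL per pair of fields. A minimal sketch in Python, hand-coding the math term for term as the SQL above does:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient, mirroring the SQL formula term by term."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy))

def correlogram(columns):
    """Correlation of every pair of named series - the part that is painful in SQL."""
    return {(a, b): pearson(columns[a], columns[b])
            for a in columns for b in columns}
```

Once `pearson` exists, the math is written (and debugged) exactly once, no matter how many field pairs you run it over.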

Currently, we would pull the data we need into an analytic package (like SAS or R) to run analysis with the help of a statistician.  As the data gets bigger the overhead/delay in moving it across into another package becomes a more significant part of your project, particularly if you do not want to do that much with it when it gets there.   It also limits what you can do on-demand with your end user reporting tools. 

So, how can you bring better analytics to bear on your data in situ?   This is the developing area of in-database analytics: extending the analytical capability of the SQL language so that analytics can be executed, quickly, within the database.  I think it’s fair to say that it’s still early days, but there are some exciting opportunities:

  • SAS, the gold standard for analytical software, has developed some capability but, so far, only for databases I'm not using (Teradata, Netezza, Greenplum, DB2): SAS in-database processing
  • Oracle recently announced new capability to embed R (an open-source tool with a broad range of statistical capability), which sounds interesting, but I have yet to see it: Oracle in-database announcement
  • It’s possible to build some capability into Microsoft’s SQL Server using .NET/CLR, and I have had some direct (and quite successful) experience doing this for simpler analytics.  Some companies seem to be pushing it further still and I look forward to testing out their offerings (Fuzzy Logix, XLeratorDB).

No doubt there are other options that I have not yet encountered, let me know in the feedback section below.  


For complex modeling tasks, I am certain we will need dedicated, offline analytic tools for a very long time.  For big data that will mean similarly large application servers for your statistical tools and fast connections to your data mart.

For simpler analysis, in-database analytics appears to be a great step forward, but I’m wondering what this means in terms of the skills you need in your analysts: when the analysis is done in a sophisticated statistics package, it tends to get done by trained statisticians who should know what they are doing and make good choices around which tools to deploy and how.

Making it easier to apply analytical tools to your data is very definitely a good thing.  Applying these tools badly because you do not have the skills or knowledge to apply them effectively could be a developing problem.

Cluster Analysis - 101

The current Wikipedia page on Cluster Analysis, excerpted below, is correct, detailed and makes absolute sense.  Then again, if you do not have a background in statistical modeling, I'm guessing these two paragraphs leave you no wiser.

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Wikipedia 4/2012 

In this post I hope to provide a workable introduction for people who need to be educated consumers of cluster analysis.
(If you want more technical detail, I suggest you go back to the Wikipedia link above, and follow up on the range of hyperlinks embedded in the document -  it's really very good.)

Let's put this in context with an example.  Assume we are working with a retailer that has 1000 stores and they want to decide which dairy products to put in each store to maximize sales.

One option would be to treat all stores as being the same, come up with one assortment list and put it everywhere.  This has the singular advantage of being easy to execute while leaving all non-mainstream products off the shelf.

At the other extreme, we could try to tailor product assortment individually by store.  Did I mention there are 1000 of them?  Apart from the work involved in building 1000 individual analyses, do we have the discipline to execute those analyses consistently, and the organization to execute 1000 unique selections in-store?

Most teams will end up working with groups of stores that they consider "similar" as a compromise.   These groupings may be based on single store features (e.g. stores with more than 30% of sales from premium products) or on geographical features (e.g. everything in the South East).  Bear in mind that, if you are trying to do this without statistical help, they do need to be very simple groupings.

For assortment selection, we really want to group together stores where people buy similar products.  In this case we want to find groups of stores that have similar sales patterns.  For dairy, these sales patterns could be related to the % of sales associated with various product characteristics:

  • premium vs. value, 
  • single-serve vs. multi-serve, 
  • cheese vs. yogurt vs. creamer vs. drinks vs. milk
  • yogurt styles

(Note: I do eat a lot of dairy but I haven't worked that category yet so forgive me if I missed something big)

Cluster analysis (actually a family of statistical algorithms, not just one) is used to scan across multiple features that you think are important, to find groups (clusters) so that:

  • stores within a cluster are similar to each other
  • stores in different clusters are dissimilar

It sounds rather like magic doesn't it?  You just throw data at the algorithm and (big fanfare) it finds clusters!  Well perhaps it's not quite that easy.

  • It does take some care to prepare the data, ensuring it's clean, accurate and in a form that works for this process (see Data Cleansing: boring, painful, tedious and very, very important).  
  • In reviewing the results you may decide to drop some features and split out others (e.g. "Premium" is split into "Premium" and "Super Premium").
  • You need to determine how many clusters are correct for your data.
  • You may want to bring in some additional data to help describe clusters after they are created
    • demographics of people living near the store (ethnicity, income, household size etc. )
    • geography (maps work well)
    • local competition
  • Really an extension from basic clustering, but you could build predictive models to explain why, for example, super-premium Greek yogurt is so very popular in Cluster 4.  If you can tie high sales of this product group to specific demographics, you may find other stores with similar demographics that have not previously sold it. (Could be a big opportunity.)
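To make the mechanics a little more concrete, here is a toy sketch of one member of that family of algorithms (k-means), clustering hypothetical stores described by just two made-up features. Real work would use a statistics package, many more features and careful data preparation; this only shows the core idea of "similar within, dissimilar between":

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: assign each store to its nearest centroid, re-average, repeat."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# each store described by (% premium sales, % single-serve sales) - hypothetical values
stores = [(0.70, 0.20), (0.68, 0.25), (0.10, 0.80), (0.15, 0.75)]
clusters = kmeans(stores, k=2)
```

On this toy data the two premium-heavy stores land in one cluster and the two value-heavy stores in the other; the caveats above (cleaning the data, choosing the number of clusters, describing what each cluster means) are where the real effort goes.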

I'll return to this topic in future posts, but for today, your takeaways are simple:

  1. Cluster Analysis finds better groups (clusters) of similar things.  
  2. Clusters help you target your offering without dying under the weight of work.

Reporting is NOT Analytics

Reporting is about what happened; Analytics is about answering "what if" and "what's best" questions.  Most of the materials that land on a VP/Director’s desk (or inbox) are just reports with no analytical value added.

Reporting tells us what has happened: sales; orders; production; system downtime; labor utilization; forecast accuracy. Reports leave it to the reader to digest this information and, based on their experience and expertise about the world around them, construct a ‘story’ as to why it may have happened that way.

Really good reports will look for known causes of routine issues. For example, if I know a store is low on inventory of a specific product, I could just report that. If I flag it as an exception, the person receiving the report may even see it among the sea of other facts. To go the extra mile, it would be wise to see whether I can (automatically) find any disruptions back in the supply chain (either with inventory or flow of goods) and include that information to answer the routine questions my report will raise. But the report builder must anticipate these needs at the point they are writing the report, and for more complex issues that’s just not realistic.

Analytics is all about finding a better story or, if you prefer, insight from data. We’ll talk about tools for finding insights in a moment, but much of this is about approach: develop a working theory about what may be happening and test it out with the data you have available. Revise your theory if needed, rinse and repeat: this is an iterative and interactive process.

At the simplest level, a lot of really good analytics is enabled by being able to interact with the data: filter it to see specific features; sort it to find your exceptions; drill down into more detail to see (for example) which stores are causing the issue in your region; chart it to see trends across time or perhaps even relationships between variables (like temperature and sales of ice-cream by region). Generally available tools like Excel can get you a long way to intuitively understanding your data and finding some insight.

A further step (and one I fully understand most analysts cannot take - see What is ‘analysis’ and why do most ‘analysts’ not do it?) would be to run some descriptive statistics on the data.

  • Measures of ‘average’ (mean, median, mode)
  • Measures of ‘spread’ (Standard deviation, percentile ranges, Min/Max)
  • Frequency histograms and boxplots to visually show the distribution of data
  • Scatter plots to view interaction
  • Correlation matrices to spot high level interactions
  • Outlier detection
  • Dealing with missing values
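
A minimal sketch of a few of these measures, using Python's standard statistics module on made-up weekly sales figures (the 110 is planted as the kind of value outlier screening should catch):

```python
import statistics as st

sales = [12, 15, 14, 110, 13, 16, 15, 14]   # made-up weekly sales for one store

mean, median = st.mean(sales), st.median(sales)
spread = st.stdev(sales)                     # standard deviation
p10, p90 = st.quantiles(sales, n=10)[0], st.quantiles(sales, n=10)[-1]

# crude outlier screen: anything more than two standard deviations from the mean
outliers = [x for x in sales if abs(x - mean) > 2 * spread]
```

Even here the gap between the mean (about 26) and the median (14.5) is itself a tell: the distribution is badly skewed by that one value, which is exactly the kind of thing these measures exist to surface.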

If these options seem strange, perhaps even archaic, and of little relevance to the business world, you may need to trust me when I say that these are exceptionally valuable capabilities that increase understanding of the data, uncover insights and prepare you to step into the world of (predictive) modeling. Thinking back over the last two weeks of work (store clustering, system diagnostics, algorithm development and even some "reporting"), I can confirm that I have used every one of these multiple times and to good effect.

In Predictive Modeling we build a mathematical model around the problem you are trying to solve.  Once the model is built and (importantly) validated to confirm that it really does behave as you expect the world to, you can start asking questions like:

  • what happens if I add another warehouse to this network?
  • what is the best combination of price and promotion to maximize profitability?
  • how much inventory do I need in each warehouse?
  • what is the best way to load this truck?
  • what is the best assortment for each of my stores?
  • why are sales dropping in the Western region?

Everyone is familiar with the phrase “Garbage In, Garbage Out” relating to how computers will happily calculate garbage when you throw bad inputs at them. With modeling, the structure of the model is one of those inputs and many of you may have experienced the complete junk that comes out of a bad model even when you put good data into it. Picking the right modeling tools for the right job and applying them correctly is a very skilled task. Predictive modeling covers an extraordinarily broad field of statistics, mathematics and operations research. Just as for data handling (see The Right Tool for the Job), this is not something you are likely to do well without appropriate training: understanding the field you are trying to apply these tools to helps enormously too.

So why go to the trouble and expense of ‘Analytics’ rather than ‘Reporting’? Reporting is essential, but well-built Analytics or Predictive Models can find insights and opportunities that you will never find by any other means.

A really good analyst will do this work and while being willing, ready and able to take you to whatever depth of complexity you wish, will furnish you with a simple (perhaps even one page) report that answers your questions and makes the end result look … obvious.

Data handling - the right tool for the job

Microsoft Excel must be the most widely used analytic tool available today, but it’s not the only tool in my tool-bag and it should not be the only one in yours either.  It’s perfect for building small models, has broad embedded functionality, good charting capabilities and pivot-tables (a superb tool worthy of its own post), comes with a programming language so you can extend its capability even further, and almost everybody has a copy already.  It’s awesome and I use it every day.

But... the data we analyze is getting much bigger and a lot more complex.  Even with newer versions of Excel allowing over 1 million records in a sheet, what can you usefully do in Excel with 1 million records?  Certainly you don’t want to print it, and it’s near impossible to chart or model against; bottom line, you are using the wrong tool for the job.  To do the job well you need to find, learn and use the right tool.  Using Excel on large or complex datasets is like chopping down a tree with a hammer!

It may be tempting to aggregate your data to reduce data size and continue using Excel.  Beware!  When you aggregate (Sum, Average, etc.) you lose detail you might need.  As an example, let's look at retailer point-of-sale data: the record of every purchase in every store:  

  • Aggregate individual shopper transactions into daily summaries (by product and store) and you can no longer see product associations in the basket or product substitution from basket to basket;
  • Aggregate daily to weekly data and you will be blind to day-of-week sales patterns (which can be extreme);
  • Aggregate store to national data and you will be unable to see regional patterns or relate sales profiles to demographics by store.

If you throw away the detail you need (just to fit in a smaller space) then to quote a friend of mine "everything just looks vanilla".
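A toy illustration of the first point, with four hypothetical baskets from one store-day: at transaction level you can count which products travel together; the daily summary keeps only totals, and the association is unrecoverable:

```python
from collections import Counter
from itertools import combinations

# four hypothetical baskets from a single store on a single day
baskets = [{"milk", "cereal"}, {"milk", "cereal"}, {"yogurt"}, {"milk"}]

# transaction level: count product pairs bought together in the same basket
pair_counts = Counter(frozenset(p) for b in baskets
                      for p in combinations(sorted(b), 2))

# daily aggregate: unit totals only - the milk/cereal association is gone
daily_totals = Counter(item for b in baskets for item in b)
```

From `daily_totals` alone (milk 3, cereal 2, yogurt 1) there is no way to recover that milk and cereal were bought together twice - that detail lived in the rows you summed away.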

At this point, most business analysts will open up Microsoft Access.  This can handle more data (up to about 2 Gigabytes) and handles the complexity of multiple tables better than Excel does with multiple worksheets.  It also uses the same programming language, so there is less to learn.   It’s a very good tool when used appropriately and I have a lot of experience with it, even designing custom data architectures and programming extensions to drive faster performance from it.

It seems easy to go beyond 2 Gigabytes these days, but even before you reach that limit you really should move to a more industrial-strength database (SQL Server, Oracle, DB2, MySQL…) just to handle the data volume effectively.  As data heads towards Terabytes and the data complexity increases, you will need much more knowledge about database design and programming to do it well.

Analysis tools are not nearly as easy to apply in database environments as in Excel, requiring access to more high-powered statistical, data-mining and optimization tools (R, SAS, SPSS, ILOG) or, for simpler analytics and my occasional preference, a capable programmer who can embed the analytics directly into your database applications so they are usable on demand.

I have recently been working on a 3 Terabyte database, around 1500 times larger than the biggest Access database you can build and substantially larger than most SQL Server databases.  At this point, we are talking ‘Big Data’.  In the analytical world you’ll see this phrase ‘Big Data’ a lot right now.  If, for example, you’re collecting high-frequency sensor data for weather modeling, or collecting data for a web search engine, 3TB may not seem that big, but by current CPG standards it certainly is and, importantly, it's big enough that it becomes challenging to use regular, even industrial-strength, database packages.

In this case (for Point of Sale data analysis) we use a novel, application-specific data-architecture and associated query engine developed by Shiloh Technologies on a SQL Server database.  The Shiloh application, enhanced now by a variety of analytic tools, handles the huge data volumes for us, runs analysis, filters, summarizes and presents the data back to the user in a volume that a mere human can manage (certainly less than 100,000 records; 10,000 would be better).  And guess where we put that data?  Back into Excel of course – it’s a phenomenal tool!

Data Cleansing: boring, painful, tedious and very, very important

I've been working recently on a category management project and I'm reminded of just how essential clean, well-organized data is.  We are working to group stores into "clusters" of similar stores; later we will see what geographic and demographic data best helps us to predict cluster membership and optimize product assortment by cluster.

As a first pass, and under a severe time crunch, we took the data available and ran it through the model; I was unhappy with the predictive power we found.  Of course, this approach was ridiculously optimistic, so it was back to look at the product characteristics we were using.  While the data were cleaner than expected, they still suffered from a range of problems visible even to someone who does not know the products that well:

  • missing, invalid and inconsistent values
  • inconsistency across related products (flavor variations with different weights and pricing). 
  • product characteristics that should really be split into multiple characteristics (because the options are not mutually exclusive)

It's taken a few iterations and a few days to get this cleaned up and to embed some of the new product characteristics in the system, but it's worth every minute.  Even simple analysis now shows more meaningful results.  Predictive models will benefit even more.
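The checks themselves need not be sophisticated. A sketch of the kind of screening involved, on made-up product records (all names and values hypothetical):

```python
# made-up product records; None marks a missing value
products = [
    {"sku": "A1", "flavor": "Vanilla", "weight_oz": 6.0},
    {"sku": "A2", "flavor": "Vanilla", "weight_oz": None},   # missing weight
    {"sku": "A3", "flavor": "Vanila",  "weight_oz": 6.0},    # misspelled flavor
]

# which records are missing a value?
missing = [p["sku"] for p in products if p["weight_oz"] is None]

# list the distinct values of each characteristic; more distinct flavor
# strings than real flavors is the classic sign of inconsistent data entry
flavors = sorted({p["flavor"] for p in products})
```

Simply eyeballing the distinct-value list surfaces "Vanila" immediately; at real scale the same checks run over every characteristic, and the fixes feed back into the source system, not just the model.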

Clustering is a relatively simple statistical process.  Once set up, I can teach someone with limited predictive modeling skills to re-run models with sensible defaults and to interpret the outputs.  Cleaning the data and presenting it correctly to the modeling tools (so you get useful answers) takes more skill.

So, if you are knee-deep in a modeling project and have not paused to check your data quality, perhaps now is the time...



What is "analysis" and why do most "analysts" not do it?

In the Consumer Product Goods (CPG) world, there are a lot of analysts: supply chain analysts; sales analysts; shelf analysts; category analysts; transportation analysts... you get the idea.  For many people the 'analyst' role is their first step onto the managerial ladder.  It is their role to learn the business, "crunch" numbers and, one day, with hard work and a little luck, get promoted to a non-analytic, managerial role.

Knowing this career path, their managers recruit for skills and attitudes that reflect managerial and leadership potential (results focus, initiative and collaborative team working).  Rarely do they look for strong analytical skills (math, statistics, logical reasoning, systems engineering, management science, data science, programming and operations research), and as most organizations provide limited career paths for people with such skills, perhaps it is best not to recruit them into a dead-end role.

For many analysts, the work is focused on report generation, which is manual and repetitive in nature (run/copy/paste/format) and leaves little time for investigative, analytical work to comment on the reports, let alone investigate why the report says what it does.  The analyst stays stuck in this rut because they do not possess the skills, and perhaps the time, to find a way out.

It should not come as a surprise then that while analysts can be very bright, very energetic and looking forward to a great career in your company, they may not be very strong at 'analysis': framing the business issue to be solved; extracting the maximum useful insight from available data; and making recommendations as to next steps.

Of course it also gives you an idea why some people strong in analytic skills don't always do good analysis either :-)  In my experience, many lack the interest in the business issue they are working on, being enamored with the analytic process rather than the end-result.

Finding the combination of skill-sets you need to do this work well is tough and unless you have a very large organization, maintaining a critical mass of these skills is likely to be a problem too - it may be necessary to buy it in.