Averages work! (At least for ensemble methods)

After an early start, I was sitting at breakfast downtown enjoying a burrito and an excellent book on "ensemble methods". (Yes, I do that sometimes... don't judge.)

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data... by Giovanni Seni, John Elder and Robert Grossman (Feb 24, 2010)

For those who have built a few predictive models (regression, neural nets, decision trees, ...), I think this is an excellent read, outlining an approach that can deliver big improvements on hard-to-predict problems. The introduction provides a very good overview:

Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges...

Ensemble models use teams of models. Each member uses a different modeling approach, works from a different sample of the available data, or emphasizes different features of your dataset, and each is built to be as good as it can be. Then we combine ("average") the predictions and typically get a better prediction than any single member of the team.
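To make the "team of models" idea concrete, here is a minimal sketch in plain Python. It is not taken from the book: the data (a noisy sine curve), the nearest-neighbor "model", the team size of 25, and every function name are assumptions I picked purely for illustration. Each team member is fit on a different bootstrap sample of the same training data, and the ensemble simply averages their predictions.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Noisy samples of a known curve (an assumed toy problem)."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]

def fit_nearest_neighbor(sample):
    """A deliberately simple, jumpy model: predict the y of the closest training x."""
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict

def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

train = make_data(200)
test = make_data(200)

# Each team member sees a different bootstrap sample (drawn with replacement).
members = [
    fit_nearest_neighbor(random.choices(train, k=len(train)))
    for _ in range(25)
]

def ensemble(x):
    """Combine the team by averaging the members' predictions."""
    return sum(m(x) for m in members) / len(members)

member_errors = [mse(m, test) for m in members]
average_member_error = sum(member_errors) / len(member_errors)
ensemble_error = mse(ensemble, test)

print(f"best single member MSE: {min(member_errors):.3f}")
print(f"average member MSE:     {average_member_error:.3f}")
print(f"ensemble MSE:           {ensemble_error:.3f}")
```

On squared error, the averaged prediction can never do worse than the average team member. Beating the best member is the typical outcome rather than a guarantee, so compare the printed numbers on your own data before trusting it.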

When I was first learning predictive modeling as an undergraduate, the emphasis was on finding the best model from a group of potential candidates. Embracing ensemble methods initially just felt wrong, but the proof is in the performance.

It sounds easy, but this is clearly more complex than building a single model, and if you can get a good-enough result using simple approaches, you should. You'll know when it's worth trying something more high-powered.

With thanks to my friend Matt for this simplification, this may be one of the few contexts where we can say:

"Averages work!!"

As a reminder that working with averages (or aggregations of any kind) is generally dangerous to your insight, take another look at this post on why you should be using daily point-of-sale data.

Or, consider this...