Business Analytics - finding the balance between complexity and readability

In this blog I try to present analytic material for a non-analytic audience. I focus on point of sale and supply chain analytics. It's a complex area and, frankly, whether I'm writing for a blog or presenting to a management team, it's far too easy to slip into the same language I would use with an expert.

So, I was inspired by a recent post on Nathan Yau's excellent blog FlowingData to look at the "readability" of my own posts and apply some simple analytics to the results.

I've followed Nathan's blog for a couple of years now for the many and varied examples of data visualization he builds and gathers from other sources. One that particularly caught my eye was this one, published by the Guardian just before the recent State of the Union address in the United States.

The Guardian plotted the Flesch-Kincaid grade levels for past addresses. Each circle represents one State of the Union address and is sized by the number of words used. Color is used to provide separation between presidents. For example, Obama's State of the Union last year was around the eighth-grade level; in contrast, James Madison's 1815 address had a reading level of 25.3.

Neither the original post nor Nathan's goes into much detail about why the linguistic standard has declined. Within this period, the nature of the address and the intended audience have certainly changed. Frankly, having scanned a few of the earlier addresses, I think we can all be grateful not to be on the receiving end of one of them.

So, I was inspired to find out the reading level of my own blog. It's intended to present analytic concepts to a non-analytic audience. I can probably go a little higher than recent presidential addresses (8th-10th grades, roughly ages 13-15) but I don't want to be writing college-level material either.

All the books my kids read are graded in this (or a very similar) way, but I had never thought about how such a grading system could be constructed. The Flesch-Kincaid grade level estimate is based on a simple formula:


$$0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59$$

That's just a linear combination of:

  • average words per sentence;
  • average syllables per word;
  • a constant term.

In fact (though I have not yet found details of how it was constructed) it looks to be the result of a regression model: (simple) data science in action from the 1970s.
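As a minimal sketch, the formula translates directly into a few lines of Python. The function name and the sample counts are illustrative assumptions, not taken from any particular tool.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

Doubling every count leaves the score unchanged, because only the two averages enter the formula.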

Note that Flesch-Kincaid says nothing about the length of the book or the nature of the vocabulary; it's all down to long sentences and the presence of multi-syllabic words.

(BTW - the preceding sentence has a Flesch-Kincaid grade score of 13.63, calculated with this online utility.) Now that's pretty high: worthy of an early 1900s president and (supposedly) understandable by young college students. The sentence is longer than typical, at 31 words vs. my average of 18 (see below), and words like "vocabulary", "sentences" and "multi-syllabic" are not helping me either.
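As a rough check on that score, the arithmetic can be worked by hand. The 31 words and the single sentence come from the text above (31 appears to count each half of "Flesch-Kincaid" and "multi-syllabic" as a separate word). The figure of 45 syllables is a hand count and therefore an assumption, since the utility's own tally is not shown.

```latex
\[
0.39 \left( \frac{31}{1} \right) + 11.8 \left( \frac{45}{31} \right) - 15.59
\approx 12.09 + 17.13 - 15.59 = 13.63
\]
```

Sentence length alone contributes about 12 points here, which is why a single long sentence scores so high.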

Approach

I could have copied and pasted each post into the online utility I used above, recorded the results in a spreadsheet and pulled some stats from that. That would work, but if I ever wanted to repeat the exercise or modify it, perhaps to use a different readability index, I would have to do all that work again. At the time of writing, there are 44 published posts on this blog, so there must be a better way.

Actually there are probably many better ways, but as I also wanted to flex some R-programming muscle, I built a web scraper in R to do the work for me and analyze the results (more on this in a later post).
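The scraper itself was written in R and is left for a later post. Purely as an illustration of the scoring step, a minimal Python sketch might look like the one below. The function names and the vowel-group syllable heuristic are assumptions. Real readability tools use more careful syllable rules, so scores will generally differ a little.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("nature"), but keep "-le" ("simple").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text):
    """Return (words, sentences, syllables) for a block of plain text."""
    # Sentences: text ended by ., ! or ? (a simplification).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: alphabetic runs, so each half of a hyphenated term counts.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_kincaid_grade(text):
    """Estimate the Flesch-Kincaid grade level of a block of text."""
    words, sentences, syllables = text_stats(text)
    if words == 0 or sentences == 0:
        raise ValueError("text has no words or sentences")
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


# The long sentence scored earlier in this post.
sample = (
    "Note that Flesch-Kincaid says nothing about the length of the book "
    "or the nature of the vocabulary it's all down to long sentences and "
    "the presence of multi-syllabic words."
)
print(text_stats(sample))
print(round(flesch_kincaid_grade(sample), 2))
```

With this heuristic the sample sentence comes out at 31 words and a grade of about 13.6, in line with the online utility. In a scraper, the same function would simply be applied to the text of each post in turn.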

Results

Let's start with some simple summaries of the results I collected.

Histograms showing the % of posts from this blog (prior to 2/14/13), with the average (mean) value shown in red.

There is some variety in the grade reading level indicated by Flesch-Kincaid for my blog posts, averaging around 10 but ranging from 7 through 14. I average about 750 words per post, but occasionally go much longer and have a number of very short "announcement" style posts. My average sentence length is 18 words.

OK, so now I know, but is that good? I don't know that I have a definitive source, but according to at least one source the target range on Flesch-Kincaid for technical or industry readers is 7-12, so I'm feeling pretty good about that.

I did wonder whether there was any other, hidden, structure to the data though. I know the equation is based on words per sentence and syllables per word, so there is no point looking at those: obviously I'll find a relationship. But is my writing style influenced by anything else?

Flesch-Kincaid grade level vs. the number of words by post on this blog.

Other than a handful of long posts that rate lower, in the range 8-10, I don't see much going on here.

Flesch-Kincaid grade level vs. the publication date by post on this blog. 

The size of each post (in words) is shown by the area of each point; color is used purely to help visually differentiate the points. Apart from a couple of recent "complex" posts, this does seem to be showing a trend toward lower grade levels, so I added a regression line and labeled the more extreme posts. Point (b) is a very short "announcement" style post (you can hardly see the point at all) and I could probably ignore it completely. Point (e) is a more fun piece I did around using pie charts that's probably not very representative of the general material either.
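For readers curious how such a trend line is obtained, a minimal sketch of an ordinary least-squares fit follows. The data points are invented purely for illustration (they are not this blog's actual dates or scores), and the function name is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# HYPOTHETICAL data: months since the first post vs. grade level.
months = [0, 3, 6, 9, 12, 15]
grades = [12.5, 11.8, 11.0, 10.4, 9.1, 9.6]

slope, intercept = fit_line(months, grades)
print(f"grade = {intercept:.2f} + {slope:.3f} * months")
```

A negative slope corresponds to grade levels falling over time. With only a few dozen points, one or two extreme posts can move the line noticeably, which is a good reason to label them.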

If you want to compare readability for yourself, here are the top and bottom ten posts ranked by Flesch-Kincaid grade level:

| Rank | Post | Flesch-Kincaid grade level | Words | Sentences |
|---|---|---|---|---|
| 1 | Analytic tools "so easy a 10 year-old can use it" | 13.3 | 784 | 33 |
| 2 | Point of Sale Analytics - newsletter released | 13.1 | 82 | 4 |
| 3 | Point of Sale Data – Category Analytics | 12.8 | 676 | 29 |
| 4 | How to save real money in truckload freight (Part I) | 12.8 | 723 | 31 |
| 5 | The Primary Analytics Practitioner | 12.7 | 541 | 29 |
| 6 | Reporting is NOT Analytics | 12.4 | 891 | 43 |
| 7 | Point of Sale Data – Sales Analytics | 12.1 | 478 | 24 |
| 8 | Data handling - the right tool for the job | 11.9 | 762 | 38 |
| 9 | Data Cleansing: boring, painful, tedious and very, very important | 11.8 | 297 | 16 |
| 10 | Point of Sale Data – Supply Chain Analytics | 11.6 | 958 | 41 |
| 35 | The right tools for (structured) BIG DATA handling | 9.0 | 1878 | 114 |
| 36 | Better Point of Sale Reports with "Variance Analysis": Velocity... | 8.9 | 1264 | 78 |
| 37 | Better Point of Sale Reports with Variance Analysis (update) | 8.5 | 177 | 10 |
| 38 | Better Business Reporting in Excel - XLReportGrids 1.0 released | 8.4 | 70 | 5 |
| 39 | What's driving your Sales? SNAP? | 8.3 | 651 | 42 |
| 40 | Do you need daily Point of Sale data?... | 8.2 | 1395 | 83 |
| 41 | SNAP Analytics (1) - Funding and spikes. | 8.1 | 531 | 32 |
| 42 | SNAP Analytics (2) - Purchase Patterns | 7.9 | 773 | 44 |
| 43 | Business Analytics - The Right Tool For The Job | 7.6 | 483 | 36 |
| 44 | Are pie charts truly evil or just misunderstood ? | 7.1 | 1097 | 70 |
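Since the table gives both word and sentence counts, the formula can be rearranged to back out the average syllables per word that each score implies. This is a rough sketch: the function name is an assumption, only the first and last rows of the table are used, and the published grades are rounded, so the results are approximate.

```python
def implied_syllables_per_word(grade, words, sentences):
    """Rearrange the Flesch-Kincaid formula to solve for syllables per word."""
    words_per_sentence = words / sentences
    return (grade + 15.59 - 0.39 * words_per_sentence) / 11.8


# (rank, grade, words, sentences) copied from the table above.
rows = [
    (1, 13.3, 784, 33),
    (44, 7.1, 1097, 70),
]
for rank, grade, words, sentences in rows:
    wps = words / sentences
    spw = implied_syllables_per_word(grade, words, sentences)
    print(f"rank {rank}: {wps:.1f} words/sentence, {spw:.2f} syllables/word")
```

Under these assumptions the top-ranked post works out to roughly 24 words per sentence and 1.66 syllables per word, against roughly 16 and 1.40 for the bottom-ranked one. Both terms of the formula pull in the same direction.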

Conclusions

It appears that my material is (largely) written at a level that should be accessible to the reader. And I am using more readable language in recent posts, which sounds like a good thing.

But there remains a key question for me that these stats can't really answer. Am I getting better at explaining the complex (my goal) or just explaining simpler things? What do you think?

In case you are wondering, this post has a Flesch-Kincaid grade level of about 8.  So if you can follow the "State of the Union" address you should have been just fine with this.