Average: I Do Not Think It Means What You Think It Means

Last month, I wrote a post titled “Statistics: I Do Not Think It Means What You Think It Means.” The title paid homage to one of my favorite sources of movie quotes, The Princess Bride (you can view this particular quote on YouTube). Since that article, I have read no fewer than a dozen more posts from various bloggers who try to draw conclusions from social media data by calculating averages and medians. When it comes to using these calculations in social media, all I can say is, “You keep on using that word. I do not think it means what you think it means.”

Primer (I promise this won’t hurt)

Most of the statistical jargon we use (like averages, medians, and standard deviations) is easiest to interpret for data populations that follow a normal distribution, also popularly known as a “bell curve.” This describes data that is more or less centered around an average (or “mean”) value. The median is a line of demarcation, with half of the values above it and half below; in a normal distribution it coincides with the mean. The width of the bell is defined by the standard deviation. A high standard deviation results in a short, wide bell, while a low standard deviation results in a tall, narrow bell.

Graph depicting a normal distribution and its mean.

The key to normal distributions is randomness.  Without a process that is truly random, you will not have a normal distribution and terms such as average and median can be thrown out the window.  This is a common pitfall in statistical process control, where processes controlled by machines are sometimes not as random as people think.  The caution here is that if you’re going to measure statistics like average, median, and standard deviation you must be sure that the data population forms a normal distribution.   Otherwise, the numbers will not mean what you think they mean.

Enter the power law distribution, also known as a Pareto distribution or the 80/20 rule. Unlike a normal distribution, these values are not symmetrical but heavily weighted toward one end of the graph. This pattern is typical of social and economic data, and it is especially common in social media.

Graph of a power law (or Pareto) distribution and its mean

A power law distribution can represent participation rates in social media; for example, the number of contributions (Y axis) per participant (X axis).  It can also be used to describe a demand curve; for example, the price of a given product (Y axis) versus the demand for it (X axis).
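
To make the contrast concrete, here is a minimal sketch that compares the mean and median of a bell-curve sample with those of a power law sample. The numbers are invented for illustration: the normal parameters (100 and 15), the Pareto shape of 1.16 (often quoted as roughly matching the 80/20 rule), the sample size, and the helper name `summarize` are all assumptions, not data from any real social network.

```python
import random
import statistics

def summarize(values):
    """Return (mean, median, share of values that fall below the mean)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    below = sum(1 for v in values if v < mean) / len(values)
    return mean, median, below

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical bell-curve data: mean 100, standard deviation 15.
normal_data = [rng.gauss(100, 15) for _ in range(100_000)]

# Hypothetical power law data: Pareto distribution with shape 1.16.
pareto_data = [rng.paretovariate(1.16) for _ in range(100_000)]

for name, data in [("normal", normal_data), ("pareto", pareto_data)]:
    mean, median, below = summarize(data)
    print(f"{name}: mean={mean:.2f} median={median:.2f} below_mean={below:.0%}")
```

In the normal sample the mean and median land almost on top of each other and about half of the values sit below the mean. In the Pareto sample the mean is pulled far above the median by a handful of huge values, so the large majority of "participants" are below average.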

What I Think It Means

This is very much a case of caveat emptor. I’m not saying that the word “average” is meaningless in power law distributions, just that it has a very, very different meaning than most people understand it to have. If you read, for example, an article that talks about the “average number of retweets/posts/likes,” you must first determine whether the data population from which it was calculated follows a normal distribution. This can be done pretty easily through a mathematical calculation, but many of the people touting these statistics don’t know how to do it, or even why they should.
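
One possible version of such a calculation is a skewness check, sketched below. This is an illustrative screen under stated assumptions, not necessarily the calculation I had in mind when writing the post: the function names, the 0.5 cutoff, and the sample data are all hypothetical.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: mean cubed distance from the mean, in standard-deviation units."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def looks_roughly_normal(values, max_skew=0.5):
    """Crude screen: symmetric data has skewness near 0.

    The 0.5 cutoff is an illustrative assumption, not a standard threshold.
    Passing this screen does not prove normality; failing it is a strong
    hint that the "average" will mislead.
    """
    return abs(skewness(values)) < max_skew

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical bell-curve measurements.
bell = [rng.gauss(50, 10) for _ in range(10_000)]

# Hypothetical "retweets per post" counts with a long right tail.
long_tail = [int(rng.paretovariate(1.5)) for _ in range(10_000)]

print("bell curve passes the screen:", looks_roughly_normal(bell))
print("long tail passes the screen: ", looks_roughly_normal(long_tail))
```

A quick look at a histogram, or a comparison of the mean against the median, serves the same purpose. Formal tests also exist, such as `scipy.stats.shapiro` and `scipy.stats.normaltest` in SciPy, if you want something more rigorous than a rule of thumb.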

So before you make any decisions based on what the “average” person does, make sure it means what you think it means.

Statistics: I Do Not Think It Means What You Think It Means

One of my favorite, most quotable movies of all time is The Princess Bride. While a small group of kidnappers is pursued by a masked man, the ringleader (Wallace Shawn) repeatedly exclaims, “Inconceivable.” Eventually, Inigo Montoya (Mandy Patinkin) grows weary of hearing this every time their plans are thrown off-kilter, saying, “You keep on using that word. I do not think it means what you think it means.”

Which brings me to the point of this post. I saw another marketing article this week (which shall remain anonymous) from a highly respected thought leader in the industry, which looked at a very small set of statistics and drew ridiculous conclusions from dubious data. I see this far too often, especially in the blogosphere and Twitterverse, where self-made marketing mavens abound who may be very successful and knowledgeable in many areas but also have a tendency to play “weekend warrior” with statistics in order to prove a preconceived point.

Common Mistakes

My engineering and quality control background gives me just enough knowledge to be dangerous. I would never claim to be a statistical guru, but I do know enough to understand when statistics are being abused (sometimes). I’m going to try to keep this list short and as non-mathematical as possible in the hope that it can help people be a little more suspicious of titles they read and conclusions they adopt.

Mistake #1: Assuming Normal Distribution

The vast majority of statistics that non-mathematicians use and understand are based on normal distributions. This is what people commonly understand as the “bell curve,” in which the majority of occurrences in a population are centered around a mean, and tail off more or less evenly on either side. Whenever you use a term like “average,” “mean,” or “standard deviation,” it must describe data in a random, normal distribution in order to mean what people expect. There are mathematical ways to deal with non-normal distributions, but they are rarely applied by the “weekend warriors.” Two fatal flaws are common:

  1. The data itself is not random.  In fact, data is generally much less random than we think; especially if we’re talking about marketing data because human behavior is rarely truly random.  But there are often situations in the physical world (like boundary conditions) that result in non-random distributions also.
  2. The sampling is not random or not representative.  There are numerous examples of this, including cherry-picking, small sample groups, placebos, etc…  Garbage in, garbage out.

Mistake #2:  Assuming Causality

This is probably the most common (just a guess on my part) form of statistical abuse, and it comes in several different flavors:

  1. Reverse causality: Most basketball players are tall.  Therefore, playing basketball makes you tall.
    Or, if you prefer a marketing example:  Most shared posts on Facebook contain the word “x.”  Therefore, the word “x” makes your post more sharable.
  2. Spurious relationship:  Vodka and water gets you drunk.  Scotch and water gets you drunk.  Gin and water gets you drunk.  Therefore, water causes intoxication, since it is the only common factor (alcohol is the hidden variable).
    Or, if you prefer a marketing example:   Most Twitter re-tweets occur around 4PM.  Therefore, people create better content in the afternoon.
  3. Coincidence:  Sales of ice cream increase in direct correlation with increases in drowning deaths.  Therefore, ice cream causes drowning.
    Or, if you prefer a marketing example:  The number of Facebook users has increased sharply during the same period iPod sales have exploded.  Therefore, Facebook is responsible for the success of the iPod.
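
The ice cream example can be simulated to show how two unrelated series end up correlated. This is a sketch under a loud assumption: that a third factor, daily temperature, drives both ice cream sales and drownings. The post does not name that factor, and every number and name below (`temperature`, `ice_cream_sales`, `drownings`, the coefficients) is invented for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures in degrees Celsius.
temperature = [rng.uniform(5, 35) for _ in range(2_000)]

# Both series depend on temperature plus independent noise.
# Neither one appears in the formula for the other.
ice_cream_sales = [20 * t + rng.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

print(f"all days:       r = {pearson(ice_cream_sales, drownings):.2f}")

# Hold the assumed hidden variable roughly constant: only days near 20 degrees.
mild = [(s, d) for t, s, d in zip(temperature, ice_cream_sales, drownings)
        if 19 <= t <= 21]
mild_sales, mild_drownings = zip(*mild)
print(f"days near 20 C: r = {pearson(mild_sales, mild_drownings):.2f}")
```

Across all days the two series are strongly correlated even though neither influences the other. Once the temperature is held nearly constant, the correlation largely disappears, which is the signature of a hidden variable rather than a cause.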

Mistake #3:  Gambler’s Fallacy

The gambler’s fallacy is rooted in a misunderstanding of randomness.  We feel that a streak that looks distinctly non-random increases the likelihood of a particular result.  For example, if a roulette wheel lands on black eleven times in a row, we mistakenly believe the chances are greater that the next result will be red.  The chances of a truly random outcome are always the same, regardless of previous outcomes.
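
A quick simulation makes the point. The sketch below assumes a simplified, fair European-style wheel (18 red, 18 black, 1 green pocket) and uses a streak of five blacks instead of eleven so that enough streaks occur in the sample; the function name and the number of spins are likewise hypothetical choices.

```python
import random

def red_rate_after_black_streak(spins, streak):
    """Share of spins that land red immediately after `streak` blacks in a row."""
    followers = [spins[i] for i in range(streak, len(spins))
                 if all(s == "black" for s in spins[i - streak:i])]
    return sum(1 for s in followers if s == "red") / len(followers)

rng = random.Random(2010)  # fixed seed so the run is repeatable

# Simplified wheel: 18 red, 18 black, and 1 green pocket.
pockets = ["red"] * 18 + ["black"] * 18 + ["green"]
spins = [rng.choice(pockets) for _ in range(300_000)]

overall = spins.count("red") / len(spins)
after_streak = red_rate_after_black_streak(spins, streak=5)

print(f"red overall:        {overall:.3f}")
print(f"red after 5 blacks: {after_streak:.3f}")
```

The two rates differ only by sampling noise. The wheel has no memory, so a run of blacks does nothing to make red "due."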

Risks

Examples abound of prominent marketing companies and mavens drawing spurious conclusions that rest on questionable data sets and logical fallacies.  I am concerned that businesses may be making important decisions that result in the misallocation of precious resources (time and money) based on declarative titles like, “Data Shows: X Gets Shared More on Y.”  Caveat emptor.

Comic Relief

This post was prompted in part by this brilliant and hilarious TED presentation, “Lies, damned lies and statistics (About TEDTalks) – Sebastian Wernicke.”  As is so often the case, it’s funny because it’s tragically true.

Hat tip to Garr Reynolds for once again provoking thought.

High Five for Week Ending 21-Mar

Published on March 21, 2010 by in High Five

Weekly High Five lists the most interesting, compelling, and/or useful links of each week.

This week’s High Five is about Internet advertising and metrics.

#5: Is 2010 the Year Digital Will Eclipse Print Ad Spending?

A recent Outsell study predicts that advertisers will spend 32.5 percent of their budgets on digital media versus 30.3 percent on print.  The silver lining for print is that the study predicts advertising expenditures will increase slightly.  For some time, this has been a question of when and not if, so while it comes as little surprise, it is no less momentous.

Link: Wired

#4: Why Ad Blocking is devastating to the sites you love

While print advertising is taking a beating these days, it’s not all moonlight and roses for digital advertising either.  Ars Technica decided to conduct an interesting experiment on their own site: blocking their content from visitors who were using ad blockers, since ad blocking was detrimental to their revenue stream.  After all, everybody needs to put food on the table.  While the experiment was a technical success, it was a social failure.  They determined that the backlash was far worse than the lost revenue, but more importantly they discovered that they had made a false assumption.  Their visitors, as it turns out, were not blocking the ads out of malevolence.  They simply hadn’t considered the ramifications of doing so, and the vast majority were more than happy to whitelist the site.  The takeaway here is <drumroll> communication works!

Link: Ars Technica

#3: Chart of the Week: Marketing Budgets Shifting to Digital Tactics

Another marketing survey, this one from Econsultancy and ExactTarget, confirms a shift not only away from print but radio and television as well.  In all, 66 percent of companies surveyed are increasing their investments in digital marketing.

Link: Hubspot

#2: 35 Crucial SEO, Twitter & Social Media Statistics for Business People

Given the mass exodus from traditional marketing into Internet and social media, it’s important to have data to determine which digital channel is appropriate for a given campaign.  This article posts a long list of recently gathered statistics that are helpful in that regard.

Link: SEOptimize

#1: Odds Are, It’s Wrong (Science fails to face the shortcomings of statistics)

Fair warning – this article is fairly dense with mathematics and statistics.  However, the bottom line and the reason it’s included here is that with all of these statistics and metrics, it’s important to maintain some healthy skepticism.  Almost every week, I see a marketing company either make faulty logical assumptions (here’s a bonus link: 7 Common Logical Mistakes People Make), rely on poor sampling, or flat out use the wrong statistical calculation.

Link: ScienceNews

Feel free to provide your thoughts and/or contributions…

In a recent discussion with ISA leaders regarding how to lessen the number of emails it sends members, the topic of Facebook fan pages came up.  The context of this discussion was focused on how ISA could be at least as effective at marketing its publications while reducing the number of emails it sends.  I was asked to explain specifically how a fan page compares with email marketing, and I came up with seven advantages:

1) “Opting In” vs. “Not Opting Out”
People must take an affirmative action to “become a fan,” which says a lot more than “I choose not to opt out.”  From a marketer’s perspective, these become your top shelf, number one, gold plated prospects.  And you treat them that way.

2)  Marketing Upside
When someone becomes a fan, all of their friends see it. This has tremendous marketing “up side.”  When someone doesn’t opt out of emails, nobody knows and there is zero additional up side.

3) Build a Community
Fans can interact with one another on the fan page, providing book reviews, answering questions, talking about their favorites, etc.  This is the very essence of Web 2.0.

4) Analytics
Facebook provides detailed statistics on the interactions that occur on fan pages.  This makes it very easy to quantify the value of the page over time.  Typical email marketing solutions provide counts of the number of times a message is read or a link is clicked.  However, Facebook has additional metrics that can measure interactivity and “buzz.”

5) Reach
Fan pages are open to everyone on Facebook (that’s 325 million users) – not just your email database.

6) Demographics
The fastest growing age demographic on Facebook is 35 to 45 year olds.  This is a critical demographic for many organizations.

7) Cost
Fan pages are FREE.  Enough said.

Let me know if I missed something.
