Statistics: I Do Not Think It Means What You Think It Means

One of my favorite, most quotable movies of all time is The Princess Bride.  While a small group of kidnappers is pursued by a masked man, the ringleader (Wallace Shawn) repeatedly exclaims, “Inconceivable.”  Eventually, Inigo Montoya (Mandy Patinkin) grows weary of hearing this every time their plans are thrown off kilter, saying, “You keep using that word.  I do not think it means what you think it means.”

Which brings me to the point of this post.  This week I saw another marketing article (which shall remain anonymous) from a highly respected thought leader in the industry.  It looked at a very small set of statistics and drew ridiculous conclusions from dubious data.  I see this far too often, especially in the blogosphere and Twitterverse, which abound with self-made marketing mavens.  Many are very successful and knowledgeable in other areas, but they also have a tendency to play “weekend warrior” with statistics in order to prove a preconceived point.

Common Mistakes

My engineering and quality control background gives me just enough knowledge to be dangerous.  I would never claim to be a statistical guru, but I do know enough to understand when statistics are being abused (sometimes).  I’m going to try to keep this list short and as non-mathematical as possible in the hopes that it can help people be a little more suspicious of titles they read and conclusions they adopt.

Mistake #1: Assuming Normal Distribution

The vast majority of statistics that non-mathematicians use and understand are based on normal distributions.  This is what people commonly understand as the “bell curve,” in which the majority of occurrences in a population are centered around a mean and tail off more or less evenly on either side.  You can calculate an “average,” “mean,” or “standard deviation” for any set of numbers, but the way most of us interpret those terms (a typical value, with most results falling within a predictable distance of it) is only accurate when the data follows a random, normal distribution.  There are mathematical ways to deal with non-normal distributions, but they are rarely applied by the “weekend warriors.”  Two fatal flaws are common:

  1. The data itself is not random.  In fact, data is generally much less random than we think, especially marketing data, because human behavior is rarely truly random.  But there are also situations in the physical world (like boundary conditions) that result in non-random distributions.
  2. The sampling is not random or not representative.  There are numerous examples of this, including cherry-picking, small sample groups, placebos, etc…  Garbage in, garbage out.
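
To make that concrete, here is a small Python sketch.  The share counts and variable names are made up for illustration (they are not from any real campaign).  It shows how a single viral post drags the mean far away from what a typical post actually does:

```python
import statistics

# Hypothetical share counts for ten posts (invented numbers, not real data).
# Nine ordinary posts and one that went viral.
shares = [2, 3, 3, 4, 4, 5, 5, 6, 8, 960]

mean_shares = statistics.mean(shares)      # pulled upward by the one viral post
median_shares = statistics.median(shares)  # the middle of the pack
stdev_shares = statistics.stdev(shares)    # sample standard deviation

# How many posts actually did worse than the "average" post?
below_mean = sum(1 for s in shares if s < mean_shares)

print(f"mean:   {mean_shares}")
print(f"median: {median_shares}")
print(f"stdev:  {stdev_shares:.1f}")
print(f"posts below the mean: {below_mean} of {len(shares)}")
```

The mean works out to 100 shares, yet nine of the ten posts got fewer than ten, and the median is 4.5.  The standard deviation comes out larger than the mean itself, so “the mean minus one standard deviation” would be a negative number of shares.  None of the arithmetic is wrong; the bell-curve interpretation is.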

Mistake #2:  Assuming Causality

This is probably the most common (just a guess on my part) form of statistical abuse, and it comes in several different flavors:

  1. Reverse causality: Most basketball players are tall.  Therefore, playing basketball makes you tall.
    Or, if you prefer a marketing example:  Most shared posts on Facebook contain the word “x.”  Therefore, the word “x” makes your post more sharable.
  2. Spurious relationship:  Vodka and water gets you drunk.  Scotch and water gets you drunk.  Gin and water gets you drunk.  Therefore, water causes intoxication, since it is the only common factor (alcohol is the hidden variable).
    Or, if you prefer a marketing example:   Most Twitter re-tweets occur around 4PM.  Therefore, people create better content in the afternoon.
  3. Coincidence:  Sales of ice cream increase in direct correlation with increases in drowning deaths.  Therefore, ice cream causes drowning.
    Or, if you prefer a marketing example:  The number of Facebook users has increased sharply during the same period iPod sales have exploded.  Therefore, Facebook is responsible for the success of the iPod.
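
The hidden-variable trap is easy to reproduce in a few lines of Python.  The sketch below is a simulation, not real data.  It assumes, purely for illustration, that daily temperature drives both ice cream sales and drownings, and that neither has any direct effect on the other.  All of the coefficients and names are invented.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def residuals(xs, ys):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed model: temperature is the hidden driver of both series.
temperature = [rng.uniform(0, 35) for _ in range(1000)]
ice_cream = [10 * t + rng.gauss(0, 30) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

# Naive comparison: ice cream sales against drownings.
r_raw = pearson(ice_cream, drownings)

# Same comparison after removing the temperature effect from both series.
r_controlled = pearson(residuals(temperature, ice_cream),
                       residuals(temperature, drownings))

print(f"raw correlation:             {r_raw:.2f}")
print(f"after removing temperature:  {r_controlled:.2f}")
```

In a setup like this, the raw correlation comes out strongly positive even though ice cream never touches the drowning numbers.  Once the temperature effect is removed from both series, what remains should hover near zero.  A correlation on its own cannot tell you which of these worlds you are living in.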

Mistake #3:  Gambler’s Fallacy

The gambler’s fallacy is rooted in a misunderstanding of randomness.  We feel that a streak that looks distinctly non-random increases the likelihood of a particular result.  For example, if a roulette wheel lands on black eleven times in a row, we mistakenly believe the chances are greater that the next result will be red.  When each outcome is truly random and independent, as with roulette spins, the chances are the same every time, regardless of previous outcomes.
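
If that still feels wrong, a quick simulation can help.  This Python sketch assumes a simplified wheel with only red and black pockets, each equally likely (a real wheel also has green pockets).  It looks at streaks of three blacks rather than eleven so that there are plenty of cases to count.  The seed and spin count are arbitrary choices.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable
STREAK = 3              # length of the black streak we wait for

# Simplified wheel: "B" (black) and "R" (red) are equally likely on every spin.
spins = [rng.choice("BR") for _ in range(200_000)]

# Collect the result of every spin that immediately follows STREAK blacks.
after_streak = [
    spins[i]
    for i in range(STREAK, len(spins))
    if all(s == "B" for s in spins[i - STREAK:i])
]

red_overall = spins.count("R") / len(spins)
red_after_streak = after_streak.count("R") / len(after_streak)

print(f"red on any spin:             {red_overall:.3f}")
print(f"red right after {STREAK} blacks:    {red_after_streak:.3f}")
print(f"streaks examined:            {len(after_streak)}")
```

Both fractions should land very close to one half.  The wheel has no memory, so a run of blacks does not make red “due.”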

Risks

Examples abound of prominent marketing companies and mavens drawing spurious conclusions from questionable data sets and fallacious logic.  I am concerned that businesses may be making important decisions, and misallocating precious resources (time and money), based on declarative titles like, “Data Shows: X Gets Shared More on Y.”  Caveat emptor.

Comic Relief

This post was prompted in part by this brilliant and hilarious TED presentation, “Lies, damned lies and statistics (About TEDTalks) – Sebastian Wernicke.”  As is so often the case, it’s funny because it’s tragically true.

Hat tip to Garr Reynolds for once again provoking thought.
