Last month, I wrote a post titled “Statistics: I Do Not Think It Means What You Think It Means.” The title paid homage to one of my favorite sources of movie quotes, The Princess Bride (you can view this particular quote on YouTube). Since that article, I have read no fewer than a dozen more posts from various bloggers who try to draw conclusions from social media data by calculating averages and medians. When it comes to using these calculations in social media, all I can say is, “You keep using that word. I do not think it means what you think it means.”
Primer (I promise this won’t hurt)
Most of the statistical jargon we use (like averages, medians, and standard deviations) is easiest to interpret when the data follows a normal distribution, popularly known as a “bell curve.” This describes data that is more or less symmetrically centered around an average (or “mean”) value. The median is a line of demarcation: half of the values fall above it and half below, and in a normal distribution it coincides with the mean. The width of the bell is set by the standard deviation. A high standard deviation produces a short, wide bell, while a low standard deviation produces a tall, narrow one.
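To make the primer concrete, here is a minimal sketch using only Python's standard library. The data is made up: the mean of 50, the standard deviation of 10, and the sample size are all assumptions chosen purely for illustration. On bell-curve data like this, the mean and median should land very close together, and roughly two thirds of the values should fall within one standard deviation of the mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: 100,000 values drawn from a normal distribution
# with mean 50 and standard deviation 10 (both chosen for illustration).
values = [random.gauss(50, 10) for _ in range(100_000)]

mean = statistics.fmean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)

# Share of values that lie within one standard deviation of the mean.
within_one_sd = sum(abs(v - mean) <= stdev for v in values) / len(values)

print(f"mean={mean:.2f}  median={median:.2f}  stdev={stdev:.2f}")
print(f"share within one standard deviation: {within_one_sd:.1%}")
```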
The key to normal distributions is randomness. Without a process that is truly random, you will not have a normal distribution, and terms such as average and median lose the meaning most people attach to them. This is a common pitfall in statistical process control, where processes controlled by machines are sometimes not as random as people think. The caution here is that if you’re going to measure statistics like average, median, and standard deviation, you must be sure that the data population forms a normal distribution. Otherwise, the numbers will not mean what you think they mean.
Enter the power law distribution, also known as a Pareto distribution or the 80/20 rule. Unlike a normal distribution, it is not symmetrical: the values are heavily weighted toward one end of the graph. This pattern is typical of social and economic data, and especially of social media.
A power law distribution can represent participation rates in social media; for example, the number of contributions (Y axis) per participant (X axis). It can also be used to describe a demand curve; for example, the price of a given product (Y axis) versus the demand for it (X axis).
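The participation example can be sketched the same way. The numbers below are simulated, not real social media data, and the shape parameter `ALPHA = 1.16` is an assumption (it is the value commonly associated with an 80/20 split). The point of the sketch is only to show how far the “average” participant drifts from the typical one when the data follows a power law.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Assumed Pareto shape parameter; a smaller value means a heavier tail.
ALPHA = 1.16

# Hypothetical "contributions per participant" for 100,000 participants.
contributions = [random.paretovariate(ALPHA) for _ in range(100_000)]

mean = statistics.fmean(contributions)
median = statistics.median(contributions)

# Fraction of participants who contribute less than the "average".
below_mean = sum(c < mean for c in contributions) / len(contributions)

# Share of all contributions made by the most active fifth of participants.
ranked = sorted(contributions, reverse=True)
top_fifth = ranked[: len(ranked) // 5]
top_share = sum(top_fifth) / sum(ranked)

print(f"mean={mean:.2f}  median={median:.2f}")
print(f"participants below the mean: {below_mean:.1%}")
print(f"share contributed by the top 20%: {top_share:.1%}")
```

In a run like this, the mean sits well above the median, so most participants fall below the “average,” and a small group accounts for most of the activity.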
What I Think It Means
This is very much a case of caveat emptor. I’m not saying that the word “average” is meaningless in power law distributions, just that it means something very different from what most people take it to mean. If you read, for example, an article that talks about the “average number of retweets/posts/likes,” then you must first determine whether the data from which it was calculated follows a normal distribution. This can be checked fairly easily with a mathematical calculation, but many of the people touting these statistics don’t know how to do it, or even why they should.
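One rough way to run that kind of check is to measure how lopsided the data is. The sketch below is a heuristic, not a formal normality test: the function names `skewness` and `looks_roughly_normal` and the thresholds `max_skew` and `max_gap` are all assumptions made up for this illustration. It flags data whose skewness is far from zero or whose mean and median sit far apart relative to the standard deviation.

```python
import random
import statistics


def skewness(data):
    """Sample skewness: roughly 0 for symmetric data, large for a long tail."""
    m = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum((x - m) ** 3 for x in data) / (len(data) * sd ** 3)


def looks_roughly_normal(data, max_skew=0.5, max_gap=0.1):
    """Heuristic only: thresholds are illustrative assumptions, not standards."""
    sd = statistics.pstdev(data)
    # Distance between mean and median, measured in standard deviations.
    gap = abs(statistics.fmean(data) - statistics.median(data)) / sd
    return abs(skewness(data)) <= max_skew and gap <= max_gap


random.seed(1)  # fixed seed so the sketch is repeatable

# Two hypothetical samples: one bell-shaped, one power-law shaped.
bell_sample = [random.gauss(50, 10) for _ in range(10_000)]
power_sample = [random.paretovariate(1.16) for _ in range(10_000)]

print("bell-shaped sample looks normal:", looks_roughly_normal(bell_sample))
print("power-law sample looks normal:", looks_roughly_normal(power_sample))
```

A check like this only says the data is not obviously skewed. Plotting a histogram, or running a proper statistical test, is a safer basis for any real decision.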
So before you make any decisions based on what the “average” person does, make sure it means what you think it means.