It’s hilarious (and hilariously easy) to find pairs of data series that have extremely high correlations—they’re mathematically related to each other—but have no real-world or economic relationship. (Think annual deaths-by-swimming-pool and Nicolas Cage movies, or margarine consumption and divorces in Maine.)
That’s why my data-loving heart grew three sizes the day I saw law student Tyler Vigen’s comical visual explanation of “correlation doesn’t equal causation,” which has since gone viral online.
Finally! I thought to myself. An impossible-to-misunderstand explanation of every data researcher’s favorite caveat: correlation is not causation.
But Vigen’s light-hearted project is actually quite important.
Many articles referencing Vigen have almost, but not quite, gotten to the same conclusion. They dance around the issue, even as they nail the basic idea: two series of data moving in a similar fashion doesn’t mean anyone has uncovered an actual relationship of the form “X causes Y.”
A story on Vox.com, a site dedicated to “explaining the news,” got closest to the big idea: it compares Vigen’s ridiculous relationships to other, good social science that finds high correlations between two series that also make good intuitive sense.
It implies that informed readers should lend most credence to correlations that actually pass the smell test: does that relationship make sense? Fair enough.
But I fear that even Vox’s article misses the real point. Vigen’s comparisons are funny because they’re so ridiculous: of course cheese consumption and death-by-bedsheet aren’t related! But what about the many examples in social science of high correlations that make great sense but in reality reflect no real-world relationship?
As my pithy colleague John Roman likes to say, “for every question you can think of, there is an answer that is simple, intuitive, and wrong.” When you find a high statistical correlation between two things that seems sensible, how do you know if it really is?
What if, for example, you found that places with many immigrants have lower literacy rates? Should you (incorrectly) conclude that immigrants tend to be less literate? Or do they simply tend to move to places with low average literacy?
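To make the immigrant example concrete, here is a minimal sketch in Python. Every number in it is invented for illustration (five hypothetical places, an assumed 90 percent literacy rate among immigrants everywhere), so treat it as a thought experiment rather than evidence about real immigrants.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented figures: immigrants are assumed to be 90% literate in every place.
IMMIGRANT_LITERACY = 0.90

# Five hypothetical, equal-sized places:
# (immigrant share of the population, literacy rate among natives).
# Immigrants are assumed to settle mostly where native literacy is low.
places = [(0.05, 0.95), (0.10, 0.85), (0.20, 0.75), (0.30, 0.65), (0.40, 0.55)]

shares = [share for share, _ in places]

# Overall literacy in each place is a population-weighted blend of both groups.
overall_literacy = [
    share * IMMIGRANT_LITERACY + (1 - share) * native_rate
    for share, native_rate in places
]

# Place-level correlation between immigrant share and overall literacy.
r = pearson(shares, overall_literacy)

# Literacy among all natives, pooled across the five places.
native_literacy = (
    sum((1 - share) * rate for share, rate in places)
    / sum(1 - share for share, _ in places)
)

print(f"place-level correlation: {r:.2f}")
print(f"immigrant literacy: {IMMIGRANT_LITERACY:.2f}")
print(f"native literacy: {native_literacy:.2f}")
```

In this made-up world the place-level correlation is strongly negative, yet the immigrants are more literate than the natives. The correlation describes places, not people.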
What if you found that low-income families with cars tend to move to better neighborhoods than families without cars? You would need to do more to find out if it’s just coincidence or really causation. If it’s causation, does it mean cars help families get to better neighborhoods, or is it that families that can afford cars can also afford to live in good neighborhoods?
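Here is a small simulation of the second possibility, sketched under loudly stated assumptions: a hypothetical "resources" score drives both car ownership and neighborhood quality, and the car itself has no effect at all. None of the numbers come from real data.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical families. "resources" is an invented 0-to-1 score.
families = []
for _ in range(20_000):
    resources = random.random()
    # Families with more resources are more likely to own a car.
    has_car = random.random() < resources
    # Neighborhood quality depends on resources plus noise, NOT on the car.
    neighborhood = resources + random.gauss(0, 0.1)
    families.append((resources, has_car, neighborhood))

def car_gap(rows):
    """Mean neighborhood quality of car owners minus that of non-owners."""
    with_car = [quality for _, car, quality in rows if car]
    without_car = [quality for _, car, quality in rows if not car]
    return mean(with_car) - mean(without_car)

# Naive comparison across all families.
overall_gap = car_gap(families)

# The same comparison among families with similar resources.
similar = [row for row in families if 0.45 <= row[0] <= 0.55]
within_gap = car_gap(similar)

print(f"gap across all families: {overall_gap:.3f}")
print(f"gap among similar families: {within_gap:.3f}")
```

Across all families, car owners appear to land in much better neighborhoods. Among families with similar resources, the gap should shrink to roughly nothing, because in this sketch the car never mattered. Real data rarely hands you the confounder so neatly, which is exactly the problem.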
The point is that correlation is actually a pretty weak standard of evidence, even if the (mathematical) relationship it shows makes good sense. And there are many statistical reasons why you might see a spurious correlation. That’s why organizations like the Urban Institute and so many others put so much effort into developing what we call “rigorous” analytical techniques that (we think!) uncover actual relationships among real-world phenomena.
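One of those statistical reasons is simply looking at enough pairs. The sketch below is my own illustration, not Vigen's actual method: it generates 200 series of pure random noise, each ten "years" long (both numbers are arbitrary assumptions), and then goes hunting for the best-matched pair.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(42)  # fixed seed so the run is repeatable

N_SERIES = 200  # arbitrary: number of unrelated data series
N_YEARS = 10    # arbitrary: observations per series

# Every series is independent noise, so no pair is truly related.
series = [
    [random.gauss(0, 1) for _ in range(N_YEARS)]
    for _ in range(N_SERIES)
]

# Check all 19,900 pairs and keep the strongest correlation found.
best = max(abs(pearson(a, b)) for a, b in combinations(series, 2))

print(f"strongest correlation among unrelated series: {best:.2f}")
```

With short series and thousands of pairs to choose from, the winning pair should look impressively correlated even though nothing connects it. Search long enough and a high correlation is close to guaranteed.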
So, next time you see a high correlation coefficient, beware: there’s a decent chance it tells you next to nothing about how the world really works.