Raising the standards of data journalism
Looking back, 2014 may be known as the Year of Data-Driven Journalism. Journalists’ use of more and better data will probably, ultimately lead to better stories and a better understanding of important issues, but there are some growing pains (for example, here, here, and here). As a social science researcher, I find that some of these growing pains, well, just pain me.
Take Wednesday night’s post from Nate Silver at FiveThirtyEight on the decision by Sony Pictures to pull the Seth Rogen and James Franco movie The Interview. In it, Silver writes the following:
Both production budgets and Rotten Tomatoes ratings are predictive of box office grosses. A 1 percent gain in Rotten Tomatoes ratings is associated with a roughly $2 million increase in international box office gross (according to a linear regression analysis). And every additional $1 million in production budget translates to about $2 million more at the box office.
This is not merely a “correlation does not equal causation” critique. My more fundamental critique is: “Why are journalists less responsible for citation and documentation than researchers?” Any researcher in any field would be laughed out of the room if he or she were to run a regression without describing the method and at least providing a source for the data. Why are journalists who purport to be “driven by data” held to different standards? Why are we so keen on requiring journalists to source whom they interview, but not the data and methods they use?
I don’t know what data Silver used here (it might be the table of 22 data points shown earlier in the story), but my guess is that he ran some simple linear regression of box office grosses on Rotten Tomatoes ratings (which also leads me to believe that his conclusion above refers to a percentage point change, not a percent change). A natural question to ask is why use box office grosses rather than profitability in the first place? Isn’t it important to distinguish between two movies that have the exact same Rotten Tomatoes score, but one made a profit of $100 million while the other lost $100 million? That said, the simple regression also misses an awful lot of other aspects of movie profitability: release date, popularity of the main stars, production costs, type of production (explosions are really expensive), type of movie, film rating, and even weather and sporting events on opening weekend.
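To make the percent versus percentage point distinction concrete, here is a minimal sketch of the kind of simple regression I am guessing at, along with the bare-minimum documentation I would want to see next to it. To be clear about the assumptions: the six data points are invented for illustration and are not Silver's data, the function name ols_fit is my own, and I am assuming ratings are measured on a 0 to 100 scale and grosses in millions of dollars.

```python
# Illustrative only: these numbers are INVENTED for this sketch.
# They are not Silver's data and say nothing about real movies.
ratings = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]   # Rotten Tomatoes score, 0-100 scale
gross = [40.0, 75.0, 90.0, 140.0, 150.0, 195.0]  # box office gross, $ millions


def ols_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y that the fitted line accounts for
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {
        "n": n,
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - ss_res / ss_tot,
    }


fit = ols_fit(ratings, gross)

# Percentage POINT reading: one more point on the 0-100 scale.
effect_per_point = fit["slope"]

# PERCENT reading: a 1 percent change relative to a starting rating.
# From a rating of 50, a 1 percent change is only half a point.
base_rating = 50.0
effect_per_percent = fit["slope"] * (0.01 * base_rating)

# The documentation a reader needs: sample size, fit, and both readings.
print(f"n = {fit['n']}, R-squared = {fit['r_squared']:.3f}")
print(f"+1 percentage point -> ${effect_per_point:.2f} million")
print(f"+1 percent (from a rating of {base_rating:.0f}) -> ${effect_per_percent:.2f} million")
```

In this made-up example the two readings differ by a factor of two, and the gap changes with the starting rating. That is exactly why a sentence like "a 1 percent gain" needs the model and the units spelled out.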
And a quick side note here: people have been yelling at researchers and government agencies for years to provide their data in more accessible ways, but media organizations like FiveThirtyEight continue to publish their data tables in picture formats, which means readers who want to use the data have to type those values in by hand.
So Silver runs a regression—it might be on these mere 22 observations and it might not, he doesn’t say—and we could debate what’s in and what’s not in that regression. But he at least gives us a link to “linear regression analysis,” a link that will surely have model and data documentation, right? Nope. It’s a link to an “Introduction to linear regression analysis” primer page from a professor at Duke University. Not so much in the way of documentation there, eh?
Is the relationship between movie profitability and Rotten Tomatoes scores an important research topic? Not for me, really, and probably not for most folks here at the Urban Institute. But it probably is important for folks in Hollywood and for folks in marketing and advertising firms. These relationships are the sorts of things that help them determine what types of movies to produce, what products to place in those movies, and who is going to star in them.
But Silver, FiveThirtyEight, and other data-driven news organizations also write about issues in which I am interested: inequality, immigration, employment, and the minimum wage, to name a few. These organizations have big platforms (Nate Silver himself has 1.1 million followers on Twitter), and they are wading into important topics and affecting how people think about them. How are we supposed to trust data-driven journalism on big national issues if we can’t trust it on a story about a Seth Rogen movie?
If data-driven journalists want to be more like researchers, they can’t just start making up their own set of rules. I’m not arguing that the rules and ethics around academic and science publishing are correct, or that they even make sense, but they are based in decades of experience and debate. Basic responsibilities, like documenting data and explaining the methods, are fundamental tenets of research and should be followed. Does that mean Silver needs to write a National Bureau of Economic Research working paper for every little regression he runs? No, but a little documentation might help. Oh, and a little data too.
Illustration by Tim Meko, Urban Institute