The blog of the Urban Institute
December 19, 2014

Raising the standards of data journalism

Looking back, 2014 may be known as the Year of Data-Driven Journalism. Journalists’ use of more and better data will probably, in the end, lead to better stories and a better understanding of important issues, but there are some growing pains (for example, here, here, and here). As a social science researcher, some of these growing pains, well, just pain me.

Take Wednesday night’s post from Nate Silver at FiveThirtyEight on the decision by Sony Pictures to pull the Seth Rogen and James Franco movie The Interview. In it, Silver writes the following:

Both production budgets and Rotten Tomatoes ratings are predictive of box office grosses. A 1 percent gain in Rotten Tomatoes ratings is associated with a roughly $2 million increase in international box office gross (according to a linear regression analysis). And every additional $1 million in production budget translates to about $2 million more at the box office.

This is not merely a “correlation does not equal causation” critique. My more fundamental critique is: “Why are journalists less responsible for citation and documentation than researchers?” Any researcher in any field would be laughed out of the room if he or she were to run a regression without describing the method and at least providing a source for the data. Why are journalists who purport to be “driven by data” held to different standards? Why are we so keen on requiring journalists to source whom they interview, but not the data and methods they use?

I don’t know what data Silver used here (it might be the table of 22 data points shown earlier in the story), but my guess is that he ran some simple linear regression of box office grosses on Rotten Tomatoes ratings (which also leads me to believe that his conclusion above is a percentage point change, not a percent change). A natural question to ask is why use box office grosses rather than profitability in the first place? Isn’t it important to distinguish between two movies that have the same exact Rotten Tomatoes score, but one made a profit of $100 million while the other lost $100 million? That said, the simple correlation also misses an awful lot of other aspects of movie profitability: release date, popularity of the main stars, production costs, type of production (explosions are really expensive), type of movie, film rating, and even weather and sporting events on opening weekend.
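For illustration, here is the kind of garden-variety model Silver appears to describe: an ordinary least squares regression of gross on rating and budget. To be clear, the numbers below are fabricated purely to reproduce his reported slopes (roughly $2 million per rating point and $2 million per $1 million of budget); they are not FiveThirtyEight's data.

```python
import numpy as np

# Hypothetical inputs (NOT FiveThirtyEight's table): Rotten Tomatoes rating
# (percentage points), production budget ($M), international gross ($M).
# Gross is constructed as exactly 2*rating + 2*budget so the fitted slopes
# match the ~$2M-per-unit figures quoted in the post.
rating = np.array([20, 35, 50, 64, 72, 81, 90, 95], dtype=float)
budget = np.array([30, 45, 60, 40, 80, 55, 100, 70], dtype=float)
gross = 2 * rating + 2 * budget

# Design matrix with an intercept column; ordinary least squares via lstsq.
X = np.column_stack([np.ones_like(rating), rating, budget])
coef, _, _, _ = np.linalg.lstsq(X, gross, rcond=None)
intercept, beta_rating, beta_budget = coef

print(f"gross ~ {intercept:.1f} + {beta_rating:.2f} * rating "
      f"+ {beta_budget:.2f} * budget")
```

With real data you would also want standard errors, diagnostics, and a description of the sample (e.g., a statsmodels OLS summary), which is precisely the kind of documentation this post is asking journalists to provide.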

And a quick side note here: people have been yelling at researchers and government agencies for years to provide their data in more accessible ways, but media organizations like FiveThirtyEight continue to publish their data tables in picture formats, which means readers who want to use the data have to type those values in by hand.

So Silver runs a regression—it might be on these mere 22 observations and it might not, he doesn’t say—and we could debate what’s in and what’s not in that regression. But he at least gives us a link to “linear regression analysis,” a link that will surely have model and data documentation, right? Nope. It’s a link to an “Introduction to linear regression analysis” primer page from a professor at Duke University. Not so much in the way of documentation there, eh?

Is the relationship between movie profitability and Rotten Tomatoes scores an important research topic? Not for me, really, and probably not for most folks here at the Urban Institute. But it probably is important for folks in Hollywood and for folks in marketing and advertising firms. These relationships are the sorts of things that help them determine what types of movies to produce, what products to place in those movies, and who is going to star in them.

But Silver, FiveThirtyEight, and other data-driven news organizations also write about issues in which I am interested: inequality, immigration, employment, and the minimum wage, to name a few. These organizations have big platforms—Nate Silver himself has 1.1 million followers on Twitter—and they are wading into important topics and affecting how people think about them. How are we supposed to trust data-driven journalism on big national issues if we can’t trust it on a story about a Seth Rogen movie?

If data-driven journalists want to be more like researchers, they can’t just start making up their own set of rules. I’m not arguing that the rules and ethics around academic and science publishing are correct, or that they even make sense, but they are based in decades of experience and debate. Basic responsibilities, like documenting data and explaining the methods, are fundamental tenets of research and should be followed. Does that mean Silver needs to write a National Bureau of Economic Research working paper for every little regression he runs? No, but a little documentation might help. Oh, and a little data too.

Illustration by Tim Meko, Urban Institute


As an organization, the Urban Institute does not take positions on issues. Experts are independent and empowered to share their evidence-based views and recommendations shaped by research.


Jon, I agree with the spirit of the critique but I think you've picked a poor example. In the article you reference, the regression analysis was just a garden-variety linear regression on the 22 data points and three variables listed in the table. Why didn't the article say this explicitly? Because we thought it was self-evident.

These are always judgment calls. In a news article -- or an academic journal article -- do you need to specify the country wherein the city of London is located? If it's London, England, probably not. If it's London, Ontario -- well, yes. In this case, if we'd run the regression on data that wasn't fully listed in the table, or if we'd applied something "fancier" than a simple linear regression -- if it had been the equivalent of London, Ontario -- we absolutely would have disclosed that. Quite possibly, we'd have published the dataset on GitHub. Possibly also, we'd have engaged in a discussion of how sensitive the results were to model specification (something that neither academics nor journalists do enough). There may have been footnotes! Many FiveThirtyEight articles have them. (One article holds the record with 76 footnotes.)

We think a LOT about how to communicate methodological detail in articles. We will not always get the balance right. It's a huge challenge, especially because we want those details to be understandable to a lay audience. It's not easy to communicate a concept like overfitting to a lay audience, but we'll try, especially when it's relevant to the conclusion of the analysis. At the same time, you can detract from the empirically important details by including too many spurious ones, just as a chart can fail to communicate the empirically important relationships in the data if it includes too much "chartjunk." From an academic/nonprofit standpoint, this may still seem insufficient.
We're sympathetic to these critiques since our view of what it means to be "objective" is close to the academic/scientific understanding, i.e., objectivity is very much related to the notion of testability and replicability. And as Ben says, we've leaned more and more toward complete disclosure and transparency over time. I'd argue, though, that the level of transparency applied by FiveThirtyEight and other data-driven journalism projects is extremely high compared with that expected of "traditional" journalists, who sometimes make no pretense of "showing their work." -Nate
Spot on. Whether the example you have selected is the best example or not, your argument applies. Journalists must be held accountable for any "fact" that they publish, especially conclusions resulting from analysis that are presented as fact.
This is a reason why at my blog, when I run econometric analyses, I disclose everything. Yes, it's nerdy, and flies over the heads of most, but it is basic honesty to those who can understand it. Plus, most people don't use regression properly. At least most people on Wall Street or practitioners in finance don't, which is my main area of work.
Jon, Thanks for this critique. I can't speak for Nate's pieces, but since I wrote some of the others you linked to, let me respond very briefly [ed: er, not so brief in the end].

Let me start by saying that we're absolutely still trying to work out the right balance between sometimes competing priorities: transparency, accessibility, speed, rigor, etc. I don't think any of us would claim we get it right all the time, and I appreciate constructive critiques like this one.

I will note a couple of things. First, we are posting data and code on GitHub. We aren't posting everything, but we're posting more every day. In some cases, it's just data tables, but in other cases it's fully reproducible code. We hope to do more of that in the future.

Second, we make a distinction on the site between our blog, DataLab, and our feature stories. It's admittedly a distinction that isn't always that obvious to readers, which is something we're talking about. But at least as originally conceived, DataLab is a place for us to test out ideas, respond to news events in near-real time, provide updates on earlier stories, etc. Features are intended as more in-depth, fully fleshed-out analyses, complete with footnotes and explanations of our methodology. That doesn't mean DataLab posts should be sloppy -- we want everything on the site to meet our standards -- but it does mean they'll often be a bit rougher. (Several of the stories you linked to, including Nate's "Interview" post, were DataLab posts.)

Third (and I recognize that you acknowledged this in your critique), we are intentionally staking out a space somewhere between the daily news cycle and an academic (or even think-tank) timeline. That means we won't always have the opportunity to run the kind of robustness checks, alternative specifications, etc., that would be required in academia, and it means we may sometimes be forced to use the best data available now, rather than the best data that could conceivably exist.
This is a tricky balance -- how do you know when "good enough" is good enough? I think we're still figuring that out. But I think that's the right way to frame the conversation. If the goal is academic rigor, then we're going to be stuck on an academic timeline, which I think misses out on a lot of opportunities for data journalism to enhance the quality of the public debate. One thing that we're absolutely committed to, however, is transparency. If we think we know something, we should tell you how we know it. If something is a rough, order-of-magnitude estimate, we should make that clear. If we're basing something on preliminary or incomplete data, we should say so. I think we do a pretty good job of that on the whole, but readers should hold us to account when we don't. Ben Casselman, chief econ writer, FiveThirtyEight
Jon, For what it's worth, there are programs available that can extract data from graphs saved as images (e.g., JPEGs). There is a script called grabit on the MATLAB File Exchange that does the dirty work for you. (I know this was not your main complaint, but hopefully it helps when writers don't post the data table on GitHub.)
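Under the hood, graph-digitizing tools like grabit boil down to a linear mapping from pixel coordinates to data coordinates, calibrated from two known reference points per axis. A minimal sketch of that idea (all pixel locations and axis values below are made up for illustration):

```python
def make_axis_map(pix0, val0, pix1, val1):
    """Return a function mapping a pixel coordinate to a data value,
    assuming the axis is linear and calibrated by two known points."""
    scale = (val1 - val0) / (pix1 - pix0)
    return lambda pix: val0 + (pix - pix0) * scale

# Calibration: suppose the x-axis tick for 0 sits at pixel 100 and the tick
# for 50 at pixel 600; the y-axis tick for 0 sits at pixel 400 and the tick
# for 200 at pixel 50 (image y grows downward, which the map handles).
x_map = make_axis_map(100, 0, 600, 50)
y_map = make_axis_map(400, 0, 50, 200)

# "Clicked" pixel locations of plotted points -> recovered data values.
clicked = [(350, 225), (475, 120)]
points = [(x_map(px), y_map(py)) for px, py in clicked]
print(points)  # approximately [(25.0, 100.0), (37.5, 160.0)]
```

Real tools add image loading and click capture on top of this, but the coordinate transform is the whole trick; log-scaled axes just need the same mapping applied to the logs of the axis values.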