What does “big data” mean for researchers?
Changes in the value, kinds, and availability of data can have a profound impact on how places like the Urban Institute conduct research. As new kinds of data become more available for social and public policy research, it’s critical that researchers understand how to use—and the potential implications of using—these new and different sources of data.
A recent paper in the Journal of Economic Perspectives shows that the quality of traditional data sets, such as the Current Population Survey and the Survey of Income and Program Participation, is declining. Those data sets are used for a wide variety of research and provide many official government statistics, such as the unemployment rate and the poverty rate.
This is just one example of an important data set whose value is slipping. If the quality of these traditional data continues to decline, newer forms and sources of data will become even more important to researchers. “Big data” is of particular interest because such data sources could offer large samples, high frequencies, and unique collection methods. But it’s not clear to me that public policy researchers completely understand what is meant by “big data” and how to utilize such data in their work.
What does “big data” mean to researchers? Are administrative data “big data”? What about administrative data linked to survey data? How about Electronic Benefit Transfer (EBT) card data, which would consist of food transactions for the more than 46 million people receiving benefits? How do researchers get these data, and how do they use them? These are important questions researchers need to consider.
Because these other data sources may be more familiar and relevant to researchers, I think “new and nontraditional data” is perhaps a better phrase than “big data.” (Do you think “NANSOD” will take off?) And yes, new and nontraditional sources of data can offer researchers different ways to explore important public policy topics. But such sources of data are not a panacea. We can’t move away from in-depth, sophisticated analysis simply because we have new, large data from new sources. Although the supply of and demand for these data continue to increase, there is reason for concern that researchers (and others) will simply use these data without grounding them in high-quality analysis, theory, and previous work. Access to good data does not replace good analysis.
One example of how researchers could improve their use of new and nontraditional data is in using Twitter data. Instead of purchasing data from Twitter through a third-party company, researchers could learn how to use the Twitter Application Programming Interface (API) and extract the data themselves. And instead of purchasing pre-categorized sentiment data (for example, Twitter data labeled ‘positive,’ ‘neutral,’ or ‘negative’), researchers could use natural language processing algorithms or Amazon’s Mechanical Turk, which asks workers to complete “human intelligence tasks.” Doing so would be faster, would be easier to replicate, and would expand who does the categorization. (One might argue that it adds an explicit cost to the analysis, but remember that researchers’ time also has value!)
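To make the sentiment-categorization idea concrete, here is a deliberately minimal sketch in Python. It uses a toy hand-built word lexicon (the word lists and example texts are my own illustrations, not a real research instrument); an actual project would substitute a trained natural language processing model or Mechanical Turk labels, but the overall pipeline shape — take raw text in, emit a ‘positive’/‘neutral’/‘negative’ label out — is the same.

```python
# Toy sketch: labeling short texts as 'positive', 'neutral', or 'negative'
# with a simple word-count lexicon. The word lists below are illustrative
# assumptions only; real work would use a trained NLP classifier.

POSITIVE = {"great", "love", "helpful", "good", "excellent"}
NEGATIVE = {"bad", "hate", "broken", "terrible", "awful"}

def categorize(text: str) -> str:
    """Return a sentiment label based on counts of lexicon words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical example texts standing in for tweets pulled from the API.
tweets = [
    "I love the new benefits portal, very helpful",
    "The application site is broken again, how awful",
    "Office hours are 9 to 5 on weekdays",
]
for t in tweets:
    print(categorize(t))  # prints: positive, negative, neutral
```

Because the categorization logic is plain code rather than an opaque purchased data set, another researcher can rerun it, inspect it, and swap in a better classifier — exactly the replicability advantage described above.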
None of this is to say that there aren’t interesting uses of new and big sources of data already taking place in the realm of social policy research. Edward Glaeser and colleagues at Harvard and MIT, for example, just published a National Bureau of Economic Research working paper on the advantages of big data to study cities in different ways.
As the availability of data continues to grow, researchers need to understand what new data sources are available and how they can access and use them. As communities of economists, demographers, sociologists, and political scientists, researchers need to start coming together to better understand new and nontraditional data so that those data can be used more intelligently, more efficiently, and more responsibly.