Earlier this week, I wrote about how new and nontraditional forms of data hold promise for public policy researchers, but also raise a lot of new questions. I want to now skip ahead a bit and pose the following question: In a world where researchers understand how to access, download, and analyze new data sources, are they prepared to do so responsibly and ethically?
While data breaches at your favorite store or online dating website are now almost commonplace, researchers have worried about data privacy for some time. Anyone using administrative data from the Internal Revenue Service or Social Security Administration, for example, must go through extensive training to access the data and often must use it in secure buildings on local networks. The Census Bureau takes data privacy very seriously because without it, people might be less likely to answer surveys (issues of nonresponse and imputation notwithstanding).
But say a researcher matches credit card transaction data to administrative earnings records or data on participation in some government program. Each source might initially be considered "big data," but if the researcher merges the sources and then focuses on a specific, small group, the researcher or others may be able to identify individuals easily. Anonymity is fundamental to ethical research, but are researchers prepared to recognize these kinds of issues in these kinds of data? Or will the excitement of using new, real-time data overwhelm the requirements to be responsible?
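To make the re-identification risk concrete, here is a minimal sketch of a small-cell check on merged records. It is only an illustration, not a method from any agency or from the research discussed here: the field names (`zip3`, `age_band`), the made-up records, and the threshold `k=5` are all assumptions chosen for the example.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return each combination of quasi-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical merged records; every value here is invented for illustration.
merged = (
    [{"zip3": "200", "age_band": "30-39"}] * 6
    + [{"zip3": "200", "age_band": "60-69"}] * 1
    + [{"zip3": "201", "age_band": "30-39"}] * 3
    + [{"zip3": "201", "age_band": "40-49"}] * 2
)

# Grouping on one coarse field leaves every group with at least k records.
coarse = small_cells(merged, ["zip3"])

# Adding a second field splits the groups into cells small enough to flag.
fine = small_cells(merged, ["zip3", "age_band"])

print(coarse)
print(fine)
```

In this toy data, no group is flagged when records are grouped by `zip3` alone, but three cells fall below the threshold once `age_band` is added, including one that contains a single person. Each additional merged field can shrink the groups in the same way.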
Furthermore, will researchers be prepared to use analytic techniques that are accurate and replicable? How many graduate programs in economics are offering lessons in machine learning (which can shift the typical hypothesis-testing-driven research approach toward a data-driven one)? I have seen a number of papers using, for example, Twitter data, where the researchers are themselves hand-coding and categorizing responses. Because hand-coding is subject to error and might restrict researchers to small(er) sample sizes, it is neither responsible nor replicable.
As “big data” and nontraditional data become a larger part of the toolkit available to researchers, we must be prepared to answer difficult questions about data security and privacy, computing power, and analytic methodologies.