Many data sources are becoming less reliable. Can we still use them responsibly?
As people increasingly rely on data to inform their research, business decisions, and even exercise, the quality of that data becomes ever more important. How can businesses decide which markets to enter if the data are suspect? How are we expected to make our own exercise and eating decisions if our Fitbits don’t accurately measure steps, heart rates, and calories burned? How are policymakers expected to make decisions based on analyses that use unreliable data?
Using data responsibly and ethically is something all data creators and users need to take seriously. Researchers especially need to acknowledge that many of our bread-and-butter data sources may be becoming increasingly unreliable, perhaps because households are asked to take more surveys than in the past, because people have more privacy concerns, or because some believe the collected data don’t provide meaningful information.
There are a variety of reasons why data may be unreliable:
- Unit nonresponse: respondents fail (or refuse) to answer a survey at all;
- Item nonresponse: respondents skip specific questions;
- Misreporting: respondents provide incorrect information;
- Proxy reporting: one person responds on behalf of another; and
- Sample selection: the survey sample does not represent the population of interest.
Some of these errors can be addressed by using statistical techniques to impute missing responses, or by matching survey records to administrative data to correct for certain kinds of bias, but these fixes cannot replace actual responses. At what point does it become irresponsible, perhaps even unethical, for researchers to use data sources that no longer paint an accurate picture?
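To make the imputation idea concrete, here is a minimal sketch of one of the simplest such techniques, mean imputation, applied to item nonresponse. The data are invented for illustration, and real studies use far richer models (regression or multiple imputation), which is exactly why imputed values are no substitute for actual responses.

```python
from statistics import mean

# Toy survey responses: None marks item nonresponse (a skipped question).
# These are illustrative values, not from any real survey.
incomes = [42000, None, 58000, 61000, None, 39000]

# Mean imputation: fill each missing response with the mean of the
# observed responses. Simple, but it understates variance and assumes
# responses are missing at random -- assumptions that often fail.
observed = [x for x in incomes if x is not None]
fill = mean(observed)
imputed = [x if x is not None else fill for x in incomes]

print(imputed)  # both missing values replaced by the observed mean, 50000
```

Note the weakness this sketch exposes: if nonrespondents differ systematically from respondents (say, higher earners skip income questions), the imputed values inherit that bias rather than correcting it.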
An article in the Fall 2015 issue of the Journal of Economic Perspectives (JEP) points to an alarming decline in the quality of datasets familiar to many social science researchers, and shows that unit nonresponse is a particularly visible problem.
Take the National Health Interview Survey (NHIS), the principal source of information on the health of the US population. The NHIS covers broad topics such as health insurance coverage, health status, and health care access. Between 1997 and 2013, the nonresponse rate—the rate at which the interview units (individuals, households, and so on) were not interviewed—rose from 8 percent to 24 percent. The nonresponse rate accounts for people who can’t be contacted, who refuse to answer (because they don’t want to be bothered, are too busy, are not interested, or have other concerns), or who are not able to answer the questions because of illness, language, or other problems.
Other major surveys such as the Current Population Survey (used to gauge poverty), the General Social Survey (used to assess Americans’ attitudes on a range of issues), and the Consumer Expenditure Survey (a crucial input to calculating the consumer price index) have all seen dramatic increases in nonresponse rates as well.
The deterioration in the quality of these major surveys should be a big concern for researchers, and some are taking note. In its instructions for authors, the Journal of the American Medical Association, for example, states that “survey studies should have sufficient response rates (generally at least 60%) and appropriate characterization of nonresponders to ensure that nonresponse bias does not threaten the validity of the findings.” A 2013 National Research Council report states that “current trends in nonresponse, if not arrested, threaten to undermine the potential of household surveys to elicit information that assists in understanding social and economic issues.”
At some point, even with sophisticated data cleaning and statistical wrangling methods, data fail to tell reliable stories. For example, that same JEP article reports that spending on the Temporary Assistance for Needy Families program is underreported by more than a third in the Survey of Income and Program Participation, and by up to nearly three-quarters in the Consumer Expenditure Survey. Similarly, at least 30 percent of spending on unemployment insurance is not captured in each of four surveys they examined. At some point, researchers start reporting findings—and policies get designed and implemented—based on flawed data.
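The underreporting figures above come from a simple comparison: sum what respondents report and divide by what administrative records show was actually spent. The sketch below uses hypothetical aggregates (not the JEP article’s actual numbers), chosen only so the result mirrors the “more than a third” share reported there.

```python
# Hypothetical aggregates, in billions of dollars (illustrative only):
# administrative records show total program spending; the survey total
# is what respondents collectively report receiving.
admin_total = 10.0
survey_total = 6.5

# Underreporting rate: the share of actual spending the survey misses.
underreporting = 1 - survey_total / admin_total
print(f"{underreporting:.0%}")  # 35% of spending missing from the survey
```

Benchmarks like this are one of the few ways to detect misreporting at all, since individual survey answers can’t be checked against each respondent’s true behavior.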
The research community needs to recognize that in a world where data are becoming more and more important, the very data used to assess policy and to recommend future policy could be flawed. As a profession, it is time for researchers to have a more open discussion about how the decline in these traditional data sources will affect the work we do and the analyses we produce.