Using open Social Security data carefully
Open data can be a good thing. Making data more readily available and easier to use can lead to more insights, better use, and hopefully, better policy. But making more data available doesn’t always mean the best data are available, or that those data will be used correctly. Researchers and experts, then, are an important part of the data ecosystem, because their expertise can help identify—and potentially remedy—data caveats and errors.
Take, for example, Urban’s new Social Security Data Tool, which allows users to explore and visualize Social Security data in ways not available in the original format produced by the Social Security Administration (SSA). There are caveats to using those data that the casual user—or oftentimes, even the more experienced user—may not be aware of, and which may not be explicitly noted in the data or source notes.
The simplest example might be geographic data. You can easily create a map of the number of people receiving Disability Insurance benefits in each state in the Data Tool. You’ll find more people receiving benefits in Texas and California (the darker blues in the map below); of course, those states have larger populations, so it’s not entirely surprising to see such patterns. But the raw data file does not include state populations that would enable you to calculate the per capita prevalence rates. If you wanted to create your own population-adjusted visualization, you could use our tool to download the source data, then join it with Census data on state populations.
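As a rough illustration of that population adjustment, here is a minimal Python sketch. The state names and every number in it are hypothetical placeholders, not actual SSA or Census values, and the function name `per_capita_rates` is our own invention for this example.

```python
# Hypothetical figures for illustration only; NOT actual SSA or Census values.
beneficiaries = {"State A": 500_000, "State B": 60_000}
population = {"State A": 25_000_000, "State B": 2_000_000}


def per_capita_rates(counts, pops, per=1_000):
    """Return beneficiaries per `per` residents for states found in both inputs."""
    rates = {}
    for state, count in counts.items():
        # Skip states with no matching (or zero) population rather than guessing.
        if state in pops and pops[state] > 0:
            rates[state] = count / pops[state] * per
    return rates


rates = per_capita_rates(beneficiaries, population)

# List states from the highest to the lowest per capita rate.
for state, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {rate:.1f} per 1,000 residents")
```

In this made-up case, State A has far more beneficiaries, but State B has the higher rate (30 versus 20 per 1,000 residents), which is exactly the kind of pattern a map of raw counts would hide.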
Consider a more nuanced example: the percentage of people who are awarded Social Security retirement benefits each year. The data from the Social Security Administration allow you to explore these patterns by age group, and you can see that the percentage of all men claiming benefits in a given year who claimed benefits at age 62—the earliest age at which you can claim retirement benefits—was fairly constant between about 1985 and 2005 before declining slightly over the past decade.
Based on these data, you might be tempted to conclude that since 2005 there have been behavioral changes in the age at which benefits are claimed. The problem with this interpretation is that it mixes the behavior of the newest retirement-age cohort with that of older cohorts: when more members of an older cohort claim at age 62, fewer of them remain eligible to claim at later ages. Moreover, the number of people eligible to claim benefits at age 62 has increased since about 1997: the number of men turning 62 rose from about 830,000 in 1997 to 1.4 million in 2013. That pattern is driven by a number of things, especially the birth rate some 60-plus years ago (think: Great Depression and World War II, and Baby Bust followed by Baby Boom), that end up changing the relative number of people in each age group over time. Thus, it’s not that the data themselves are incorrect or misleading, but drawing conclusions from them requires more data, context, and analysis.
A better way of tracking age-related changes in Social Security claim rates is to conduct calculations on a cohort basis: determine the percentage of all people born in a given year (not the percentage of all claimants) who claim their benefits at each age. Those data are not publicly available, but Alicia Munnell and Anqi Chen at the Center for Retirement Research at Boston College obtained them from SSA. They calculated the claiming numbers using both approaches and found non-trivial differences. Using the publicly available ‘claim year’ data, the percentage of all men of any retirement age who claimed retirement benefits at age 62 fell from about 57 percent in 1996 to 42 percent in 2013; in the unpublished ‘birth year’ data, the percentage of men aged 62 who claimed at age 62 fell from 56 percent to 36 percent.
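To see why the two approaches can diverge, consider a deliberately simplified Python sketch. Everything in it is an assumption made for illustration: the cohort sizes are invented, only two claiming ages (62 and 66) exist, and every cohort behaves identically, with half claiming at each age. It is not a reconstruction of the Munnell and Chen calculations or of any SSA data.

```python
# Hypothetical cohort sizes (people born in each year); NOT real SSA data.
COHORT_SIZE = {1930: 1000, 1931: 1000, 1934: 1000, 1935: 1500}

# Assumed behavior, identical for every cohort: half claim at 62, half at 66.
CLAIM_SHARE_BY_AGE = {62: 0.5, 66: 0.5}


def claimants(claim_year, age):
    """Number of people claiming at `age` in `claim_year`."""
    birth_year = claim_year - age
    return COHORT_SIZE.get(birth_year, 0) * CLAIM_SHARE_BY_AGE[age]


def claim_year_share_at_62(claim_year):
    """'Claim year' view: age-62 claimants as a share of all claimants that year."""
    total = sum(claimants(claim_year, age) for age in CLAIM_SHARE_BY_AGE)
    return claimants(claim_year, 62) / total


def cohort_share_at_62(birth_year):
    """'Birth year' view: share of one birth cohort that claims at age 62."""
    return claimants(birth_year + 62, 62) / COHORT_SIZE[birth_year]


for year in (1996, 1997):
    print(f"claim year {year}: {claim_year_share_at_62(year):.2f}")
for year in (1934, 1935):
    print(f"birth year {year}: {cohort_share_at_62(year):.2f}")
```

In this toy setup the birth-year share stays at 0.50 for both cohorts, yet the claim-year share moves from 0.50 to 0.60 simply because the cohort turning 62 in 1997 is larger. Claiming behavior never changed; only the relative size of the age groups did.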
Whether you think it’s a good or bad thing that more people appear to be claiming benefits later is not the point. The point is that data often carry subtleties and nuances that, unless they are fully understood or explained, can lead to misinterpretation and even incorrect policy decisions. Moreover, no data set is ever entirely complete, so users need to understand what a given data set can and can’t tell them.