Why we created an open-source Social Security data tool
What does it mean for data to be open? When we recently created the Social Security data tool to explore and visualize Social Security data, one of our goals was to make the data behind it free and as open as possible. The tables displayed in the tool are pulled directly from a public document, the Social Security Administration’s (SSA) 2014 Annual Statistical Supplement, but we wanted to make them open, not just public. So what's the difference?
To be "open," data need to be easily accessible to as wide an audience as possible. Often, this means making the data available in a machine-readable format. For example, if you put some data in a plain-text document with the values separated by commas (called a CSV file), you could easily analyze, visualize, and explore those data. A CSV file is as simple as it gets, and you don't need any specialized software to read it, just the ubiquitous text editor. By contrast, if you wrote down some numbers on a napkin, or put an image of a table in a PDF document, a computer would have a lot of trouble figuring out what those numbers were. Extracting data from a PDF requires special software and lots of effort, and even “XLS” files assume the user has Excel (free alternatives exist, but the less a data format requires specific software, the more open it is). Basically, to have open, machine-readable data, you want the format to be simple and ubiquitous.
It's likely that a CSV file will open in Excel on your computer, so at first it might not seem too different from the XLS data available from SSA. However, a CSV is very simple, with a consistent format: one row of header values separated by commas, followed by rows of data that are similarly separated by commas. SSA’s XLS files contain formatting, hidden columns, nested headers, and many more complications. (Much of the work of translating the SSA data into the machine-readable format for the visualization tool was extracting the data from these XLS files; you can see the almost a thousand lines of code it took on GitHub.)
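To make the "one header row, then rows of data" layout concrete, here is a minimal sketch in Python using only the standard library. The column names and numbers are made up for illustration and are not taken from the SSA supplement.

```python
import csv
import io

# Hypothetical sample data for illustration only (not real SSA figures).
SAMPLE_CSV = """year,beneficiaries,average_benefit
2011,100,1200.50
2012,110,1225.00
2013,125,1250.75
"""


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = read_csv_rows(SAMPLE_CSV)

# Every value arrives as a string, so convert before doing arithmetic.
total_beneficiaries = sum(int(row["beneficiaries"]) for row in rows)
print(total_beneficiaries)
```

Because the format is so regular, a few lines are enough to load the whole table; nothing about formatting, hidden columns, or nested headers has to be untangled first.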
We’ve stored the tables and charts in JSON files. While a JSON file doesn't open in Excel, it's often the most useful format for programmers because it allows for more than the two dimensions (rows and columns) in a CSV or table. In the JSON files we created for this project, we include column headers and data, but also a "type" field, which states whether a given column is a percent, dollar amount, date, etc.
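As a rough sketch of why a per-column "type" field is handy, the Python below loads a small JSON table and uses each column's type to decide how to display its values. The field names ("title", "columns", "header", "rows") and the numbers are assumptions made for this example; the project's actual JSON layout may differ.

```python
import json

# Hypothetical table; the structure and values are illustrative only.
TABLE_JSON = """
{
  "title": "Example table",
  "columns": [
    {"header": "Year", "type": "date"},
    {"header": "Average benefit", "type": "dollar"},
    {"header": "Share of total", "type": "percent"}
  ],
  "rows": [
    [2012, 1225.0, 42.5],
    [2013, 1250.75, 43.0]
  ]
}
"""


def format_value(value, column_type):
    """Render one cell according to its column's declared type."""
    if column_type == "dollar":
        return f"${value:,.2f}"
    if column_type == "percent":
        return f"{value}%"
    return str(value)


table = json.loads(TABLE_JSON)
column_types = [column["type"] for column in table["columns"]]

# Pair each cell with its column type and format it accordingly.
formatted_rows = [
    [format_value(value, ctype) for value, ctype in zip(row, column_types)]
    for row in table["rows"]
]
print(formatted_rows)
```

A flat CSV has no natural place for this kind of information about a column; in JSON it can sit right next to the header it describes.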
JSON files, like CSVs, are simple and don't require any special software to read or write. JSON is also the preferred format when building an Application Programming Interface (API), which is a programmatic way to interact with data via the web. We didn't build an API for this dataset, since it's relatively small and straightforward, but these or similar data could one day power an API. Because the data are in JSON format, they are easy to search (find all tables with a "dollar" column type, for example), and they match the format most web APIs use.
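The "find all tables with a dollar column" search mentioned above could look something like the following Python sketch. The table titles and field names here are hypothetical, chosen only to illustrate the idea, and do not reflect the project's real files.

```python
import json

# Hypothetical collection of tables; names and structure are assumptions.
TABLES_JSON = """
[
  {"title": "Table A",
   "columns": [{"header": "Year", "type": "date"},
               {"header": "Average benefit", "type": "dollar"}]},
  {"title": "Table B",
   "columns": [{"header": "Year", "type": "date"},
               {"header": "Share", "type": "percent"}]},
  {"title": "Table C",
   "columns": [{"header": "Total payments", "type": "dollar"}]}
]
"""


def tables_with_column_type(tables, column_type):
    """Return titles of tables that have at least one column of the given type."""
    return [
        table["title"]
        for table in tables
        if any(column["type"] == column_type for column in table["columns"])
    ]


tables = json.loads(TABLES_JSON)
dollar_tables = tables_with_column_type(tables, "dollar")
print(dollar_tables)
```

The same function works for any other type, such as "percent" or "date", because the search relies only on the structure of the data, not on how a table happens to be laid out on a page.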
We also believe it takes more than publishing data in a machine-readable format to make them open. That’s why we’ve also added the ability to search and filter the data. Imagine if there were an encyclopedia on your desk, but the entries were not alphabetical and the index was missing. It would be full of data, but functionally useless. Our data tool allows you to search and filter the SSA data, and then to explore them visually. Not only are the data available, but it is possible to find what you need quickly and easily. While the preexisting SSA website organizes the data, you cannot search across column headers and table names, or quickly browse trends through visualizations. The old site had a table of contents, but our tool, through the search feature, has an index.
Whether to open data and source code is not an easy decision for many organizations. Along with issues of data privacy and security, there is also the investment an organization makes in creating or cleaning the data, analyzing the data, and creating tools or visualizations to share the data. In this case, we believe that providing all of the data and source code in a machine-readable format is important to help others who might also want to explore the data or analyze the Social Security system. To help you do that, we have posted all the data and the code that powers the project on Urban’s GitHub page.
We've only scratched the surface of what it means for data to be open. You can read deeper discussions of open data standards from organizations like the Open Knowledge Foundation, 18F, and the Sunlight Foundation. At Urban, as we continue to release data sets, we are striving to make them not only public, but truly open.
Illustration by Tim Meko/Urban Institute