Although publicly releasing disaggregated data can help researchers and policymakers enhance equity in policy design, these releases can also jeopardize people’s data privacy. A growing body of work at Urban is researching and applying statistical methods for protecting privacy in publicly released data, including “synthetic datasets,” which replace confidential records in a dataset with statistically representative pseudo-records.
Most synthetic datasets, such as the synthetic tax records Urban created and released with the Internal Revenue Service, are generated at the federal level, because resource constraints often create additional barriers for local governments. To understand the specific challenges local governments face in generating synthetic data, we partnered with the Western Pennsylvania Regional Data Center and the Department of Human Services (DHS) in Allegheny County, Pennsylvania, to synthesize their 2021 human services data.
These data cover individual-level usage of services, including child welfare services, behavioral health services, and homeless and housing supports provided by the Allegheny County DHS during 2021. Although Allegheny County does publicly release these data, they are aggregated (grouped by characteristic), and some information about municipalities with fewer than six individuals is suppressed.
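To make the aggregation-and-suppression rule concrete, here is a minimal sketch in Python. It is an illustration only, not the county's actual release pipeline: the record layout, the `aggregate_with_suppression` helper, and the municipality labels are all hypothetical, and the only detail taken from the text is the threshold of six.

```python
from collections import Counter

# Counts below this value are withheld, mirroring the "fewer than six" rule.
SUPPRESSION_THRESHOLD = 6


def aggregate_with_suppression(records, key, threshold=SUPPRESSION_THRESHOLD):
    """Count records per group and replace small counts with None (suppressed)."""
    counts = Counter(record[key] for record in records)
    return {
        group: (n if n >= threshold else None)
        for group, n in counts.items()
    }


# Hypothetical individual-level records: one row per person served.
records = (
    [{"municipality": "A", "service": "housing"}] * 12
    + [{"municipality": "B", "service": "housing"}] * 3
)

table = aggregate_with_suppression(records, "municipality")
print(table)
```

In this toy example, municipality "B" has only three individuals, so its count is withheld. That kind of information loss is what a disaggregated synthetic release aims to reduce.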
Our partnership made it possible for the county to release synthetic 2021 human services data in a disaggregated form, which can allow for greater public insight, more granular analysis of individual-level interactions with multiple services, and improved representation for racial and ethnic groups. Through this project, we learned several lessons that could inform other local governments that may be interested in creating synthetic data for public release.
1. Seek stakeholder input to define the synthetic data’s use case.
The tradeoff between privacy and utility (the quality of the data from the perspective of those using the data) is well known in the data privacy field. Achieving higher utility generally requires weakening privacy protections, and stronger protections generally reduce utility. This tradeoff means that synthetic data can replicate key elements of the original data but will never perfectly match them, so stakeholders must prioritize preserving the most important elements.
For Allegheny County stakeholders, the added value in releasing disaggregated human services data was the ability to follow an individual’s interactions with services over the course of the year. Centering and refining this use case drove all aspects of our strategy, quality evaluation, and tradeoff decisions. We specifically evaluated each version of our dataset on how well it captured month-to-month service usage for individuals. When pursuing their own synthetic datasets, local governments can ensure that domain experts weigh in on the ideal use case and make the final call on the balance between privacy and utility that best suits it.
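One way to picture a utility check centered on month-to-month service usage is sketched below. This is not the metric used in the project, which the text does not specify. It is a hypothetical illustration that assumes each person's usage is a list of monthly true/false flags, and it compares the confidential and synthetic data by the total variation distance between their month-to-month transition shares. The function names and toy data are invented for this example.

```python
from collections import Counter


def transition_shares(usage):
    """Share of each (used_this_month, used_next_month) pair across all people.

    `usage` maps a person ID to a list of booleans, one per month
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for months in usage.values():
        for current, following in zip(months, months[1:]):
            counts[(current, following)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def total_variation_distance(p, q):
    """Half the summed absolute difference between two share tables (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy data: two people observed over three months.
confidential = {1: [True, True, False], 2: [False, False, False]}
synthetic = {1: [True, False, False], 2: [False, False, True]}

distance = total_variation_distance(
    transition_shares(confidential), transition_shares(synthetic)
)
print(f"Transition distance: {distance:.2f}")
```

A lower distance means the synthetic data reproduce the month-to-month patterns more closely. In practice, domain experts would still need to decide which patterns matter most and how close is close enough.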
2. Creating a synthetic dataset involves an iterative and time-consuming process.
Making the tradeoff decisions described above is an iterative process to maximize the quality and privacy of the final product. For our synthetic dataset, we evaluated 77 versions before selecting a final output for release. We relied heavily on input from our partners as experts when determining how to measure quality and privacy in this context, a process that took place over more than eight months of biweekly meetings. This timeline could accelerate considerably for future releases of data with the same structure and use case, but would still require some level of manual review and input before approval. Because this process is so time-consuming, local governments should be aware that synthetic data cannot feasibly replace confidential data that require frequent or real-time updates.
3. Creating a synthetic dataset takes dedicated capacity and technical expertise.
Although the exact computational requirements will depend on the size and type of the data, generating synthetic dataset options can be expensive and might require using paid cloud services such as Amazon Web Services (AWS) to speed up the process. Allegheny County stakeholders approached this process equipped with expertise in their dataset and use cases, but synthesizing the data also involved many technical decisions that required advanced understanding of data privacy. Urban staff provided trainings on data privacy and synthetic data to staff members of Allegheny County’s government, but other interested local governments may need to consult the existing literature and data privacy experts. Local governments should be aware of the time, resources, and knowledge needed, and prepare for funding additional staff and computational resources, or allow for a corresponding delay in the release timeline if funding is not available.
4. Clearly communicate limitations to the public alongside the dataset.
To avoid misuse of the synthetic data, users must grasp the tradeoff between quality and privacy and understand the dataset’s limitations. For our project, we created a user guide for the synthetic human services data discussing how and why synthetic data were released and how to responsibly use the product. We also ensured that the documentation for the public data clearly indicates that the released data are pseudo-records, not identical representations of the confidential data. In addition to transparent documentation, local governments can consider other methods to engage with the public, such as focus groups or workshops.
Though not without challenges, synthetic data offer local governments a richer and more nuanced understanding of their services, which can empower them to design more equitable policies while preserving data privacy. As more local governments begin to explore synthetic datasets, it’s worth planning ahead and keeping these lessons in mind.