Technical Report: Generating a Fully Synthetic Human Services Dataset
A Technical Report on Synthesis and Evaluation Methodologies
Madeline Pickens, Jennifer Andre, Gabriel Morrison

Administrative data, or data collected about the operations of an organization, can help stakeholders understand the complexities of organizational performance. The Department of Human Services (DHS) in Allegheny County, Pennsylvania, serves one in five residents of the county every year through child welfare services, behavioral health services, aging services, developmental support services, homeless and housing supports, and family strengthening and youth supports. In the process, DHS collects administrative data about service usage to support care coordination, case management, and quality improvement.

Because this data is sensitive, DHS has not widely shared it at the individual level. To allow researchers, service providers, and members of the public to better understand the populations served by DHS, we partnered with the Allegheny County DHS and the Western Pennsylvania Regional Data Center (WPRDC) to create a fully synthetic version of the 2021 Integrated Services dataset. The final synthetic dataset replaces all of the underlying records that track usage of these services with statistically representative pseudo-records.

This technical report details the methodology we used and the technical decisions we made when creating and evaluating the synthetic service data, including:

  • background on the importance of data privacy and key definitions,
  • limitations of current Allegheny County Human Services data products and use cases for synthetic data,
  • key features of the service data and the resulting impact on our methodology,
  • preprocessing the service data to serve as a valid input for the synthetic data generation process,
  • modeling and synthesizing service usage and service distributions over time,
  • postprocessing the synthetic output to match the desired use case, and
  • evaluating candidate synthetic datasets for quality and privacy.

This partnership served as a pilot for synthetic data generation at the local level. We highlight takeaways for state and local governments interested in generating synthetic data to enhance privacy, including the importance of centering the use case when designing the synthetic dataset, the computational and staffing resources that could be required, and the implications of the bespoke nature of the synthetic data generation process.

Policy Centers: Office of Race and Equity Research
Research Methods: Research methods and data analytics; Data Governance and Privacy