The Statistics of Income (SOI) Division of the Internal Revenue Service collects and curates an enormously valuable trove of tax data, which researchers can use to assess the effects of tax policies and pursue other research questions, such as studying income inequality. Although only a limited number of government analysts and researchers can access the raw data, the IRS releases a public use file for external researchers and data practitioners. However, this public use file has become increasingly difficult to protect using traditional methods, because the vast amount of personal information available in public and private databases, combined with enormous computational power, creates unprecedented privacy risks.
SOI and researchers at the Urban Institute are developing a solution: synthetic data that represent the statistical properties of the administrative data without revealing any individual taxpayer information. Through this collaboration, researchers have conducted an extensive feasibility study on various formal privacy definitions and built a prototype validation server. The server would allow researchers to perform statistical analyses on administrative tax data, using programs they develop and test on the synthetic data, without ever revealing confidential taxpayer information. We plan to improve and extend the prototype validation server so that it can process complex statistical queries and be applied to more administrative tax datasets, including those with geographic identifiers and panel data. These new datasets will allow researchers to study racial disparities that occur as a result of tax policies and to control for state and time fixed effects in estimation.