As public datasets are being altered, discontinued, or left without contingency plans, blending data from multiple sources has become an increasingly valuable tool to inform policymaking.
Want to prevent traffic accidents? Policy solutions involve combining accident data from the US Department of Transportation with demographic data from the US Census Bureau. Want to understand which university resources, study habits, or campus activities are linked to higher graduation rates and other outcomes? Merging academic records with location-tracking data could help.
But blending data is not simple, and it carries significant privacy risks. A National Academies of Sciences, Engineering, and Medicine report offers a framework for how policymakers, community organizations, and any data user can balance usefulness with privacy protections.
Privacy concerns with blended data
Although a single dataset may seem to provide sufficient protection for the people it includes, combining the dataset with another can reveal sensitive information.
For instance, in the university example above, blending data could help universities better understand their students and inform their investments in certain resources, but it could also inadvertently disclose some students’ identities, a risk that grows if the data are used for surveillance.
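The linkage risk described above can be sketched in a few lines of Python. This is a minimal illustration, not an example from the report: the records, the names, and the choice of ZIP code, birth year, and major as matching fields are all hypothetical. It shows how a file with names removed can still be tied back to individuals when it is joined with a second file that shares the same descriptive fields.

```python
from collections import defaultdict

# Hypothetical "de-identified" academic records: names removed,
# but descriptive fields (quasi-identifiers) remain.
academic_records = [
    {"zip": "20001", "birth_year": 2003, "major": "Physics", "gpa": 2.1},
    {"zip": "20001", "birth_year": 2003, "major": "History", "gpa": 3.8},
    {"zip": "20002", "birth_year": 2004, "major": "History", "gpa": 3.2},
    {"zip": "20002", "birth_year": 2004, "major": "History", "gpa": 2.9},
]

# Hypothetical second source (for example, a public directory) with names.
directory = [
    {"name": "Student A", "zip": "20001", "birth_year": 2003, "major": "Physics"},
    {"name": "Student B", "zip": "20001", "birth_year": 2003, "major": "History"},
    {"name": "Student C", "zip": "20002", "birth_year": 2004, "major": "History"},
    {"name": "Student D", "zip": "20002", "birth_year": 2004, "major": "History"},
]

# Assumed matching fields; real linkages may use different ones.
QUASI_IDENTIFIERS = ("zip", "birth_year", "major")


def key(row):
    """Build the join key from the quasi-identifier fields."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(records, people):
    """Return {name: gpa} for every person whose key matches exactly one record."""
    records_by_key = defaultdict(list)
    for record in records:
        records_by_key[key(record)].append(record)

    names_by_key = defaultdict(list)
    for person in people:
        names_by_key[key(person)].append(person["name"])

    reidentified = {}
    for k, names in names_by_key.items():
        matches = records_by_key.get(k, [])
        # A one-to-one match ties a name to a supposedly anonymous record.
        if len(names) == 1 and len(matches) == 1:
            reidentified[names[0]] = matches[0]["gpa"]
    return reidentified


reidentified = link(academic_records, directory)
print(reidentified)
```

In this toy data, the two students with a unique combination of fields are re-identified, while the two who share a combination are not. Neither file discloses anyone's grades next to a name on its own.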
Consider another example from the report: the College Scorecard is a web-based tool from the US Department of Education (ED) that helps users compare colleges by costs, admissions, and outcomes without requiring access to individual-level data. To increase the tool’s usefulness, ED sought to integrate earnings data from tax records. Though blending the data helps families understand a student’s potential income and other outcomes, the data pose significant disclosure risks.
These examples highlight the need to balance privacy or disclosure risk against utility or usefulness, which involves carefully managing disclosure risks and prioritizing certain analyses over others. The more information is gained or combined, the greater the privacy risks, which makes it important to develop thoughtful dissemination strategies with stakeholders. Conversely, releasing less information or declining to combine data may deprive society of an invaluable public good. That’s why analysts and policymakers should implement technical and policy solutions together.
How analysts can balance usefulness with privacy
Efforts to thoughtfully modernize the national data infrastructure have received bipartisan support since the Foundations for Evidence-Based Policymaking Act of 2018 was enacted under the first Trump administration. These efforts continue with the current administration’s executive order to eliminate data silos, which will likely require safe, secure data blending among state and federal agencies.
The report offers a six-step framework to guide users through the full lifecycle of blended data: from deciding whether to blend data to maintaining or sunsetting the resulting product. Each step includes guiding questions and is illustrated through real-world examples, such as the College Scorecard:
- Determine auspice and purpose of the blended data project. ED wanted to incorporate earnings data from tax records to improve the tool, requiring a partnership with the Internal Revenue Service’s Statistics of Income (SOI) Division.
- Determine ingredient data files. To create the desired final data product, ED and SOI determined they needed student aid records from ED and individual earnings from SOI.
- Obtain access to ingredient data files. SOI cannot share individual earnings data directly with ED but can release aggregated statistics under Internal Revenue Code Section 6108(b).
- Blend ingredient data files. ED sent the student-level aid data to SOI, which then matched the records to IRS tax records using Social Security numbers, resulting in a blended dataset.
- Select approaches that meet the end objective of data blending. Initially, ED and SOI used traditional privacy-enhancing techniques, such as suppression and aggregation of values, to protect sensitive education and tax data. In 2020, aiming to release more-detailed data by credential and field of study, they adopted differential privacy (also called formal privacy), which protects sensitive data by adding statistical noise, such as random values drawn from a bell curve (normal distribution). To manage the tension between privacy and utility, they used CSExplorer, a tool that allowed them to simulate and evaluate different configurations of the formal privacy approach.
- Develop and execute a maintenance plan. Unfortunately, the College Scorecard does not disclose its privacy parameters or mitigation strategies publicly.
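The blending, aggregation, and noise-addition steps in the list above can be sketched as follows. This is an illustrative toy, not the method ED and SOI used and not how CSExplorer works: the records, the `person_id` field (a stand-in for a real identifier), and the noise scale `sigma` are all assumptions. In particular, `sigma` is not calibrated to any privacy guarantee here; a real differential privacy mechanism would bound each person's contribution and derive the noise scale from privacy parameters.

```python
import random
from collections import defaultdict

# Hypothetical ingredient files; every identifier and value is made up.
aid_records = [  # held by the education agency
    {"person_id": "001", "field": "Nursing"},
    {"person_id": "002", "field": "Nursing"},
    {"person_id": "003", "field": "History"},
    {"person_id": "004", "field": "History"},  # no matching tax record
]
earnings_records = [  # held by the tax agency
    {"person_id": "001", "earnings": 52000},
    {"person_id": "002", "earnings": 48000},
    {"person_id": "003", "earnings": 40000},
    {"person_id": "005", "earnings": 61000},  # not in the aid file
]


def blend(aid, earnings):
    """Inner-join the two files on the shared identifier."""
    earnings_by_id = {r["person_id"]: r["earnings"] for r in earnings}
    return [
        {**a, "earnings": earnings_by_id[a["person_id"]]}
        for a in aid
        if a["person_id"] in earnings_by_id
    ]


def noisy_mean_by_field(blended, sigma, rng):
    """Aggregate earnings by field of study, then add normal noise to each mean."""
    groups = defaultdict(list)
    for row in blended:
        groups[row["field"]].append(row["earnings"])
    return {
        field: sum(values) / len(values) + rng.gauss(0, sigma)
        for field, values in groups.items()
    }


def mean_abs_error(blended, sigma, trials, rng):
    """Estimate how far the noisy means fall from the exact means, on average."""
    exact = noisy_mean_by_field(blended, 0.0, rng)  # sigma 0 adds no noise
    total = 0.0
    for _ in range(trials):
        noisy = noisy_mean_by_field(blended, sigma, rng)
        total += sum(abs(noisy[f] - exact[f]) for f in exact) / len(exact)
    return total / trials


blended = blend(aid_records, earnings_records)
rng = random.Random(2020)  # fixed seed so the sketch is repeatable
for sigma in (500.0, 2000.0, 5000.0):
    print(sigma, round(mean_abs_error(blended, sigma, 1000, rng)))
```

Looping over several noise scales is a crude stand-in for the kind of simulation the list describes: more noise means stronger protection but larger error in the published figures, and comparing configurations makes that trade-off visible before anything is released.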
How to navigate the trade-offs of data blending amid a changing policy landscape
Even the most advanced technical solutions can fall short if they don’t account for varying laws and regulations, differing expectations of privacy across communities, and evolving public trust. That’s why policy solutions must
- be dynamic and responsive,
- offer multiple modes of data access to accommodate different risk tolerances, and
- use a shared taxonomy to enable effective communication between technical and nontechnical stakeholders.
By adopting a formal privacy framework, ED and SOI demonstrated responsiveness to evolving policy demands, such as the need to include more-granular information that required stronger protections.
However, they also made an important decision in how they implemented the formal privacy framework. The privacy framework adds random noise or changes to the outputs, which could make smaller student counts—such as the number of students in a certain field of study at a college—less reliable. To avoid misleading students and their families, ED and SOI decided to remove or suppress outputs that were potentially less accurate. This decision highlights how policy decisions shape even technically driven solutions.
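The suppression decision described above can be illustrated with a short sketch. The cell names, the counts, the noise scale `sigma`, and the `min_reliable` threshold are all hypothetical, and the rule shown (suppress any cell whose noisy count falls below a fixed threshold) is one simple possibility, not the rule ED and SOI adopted.

```python
import random


def release_counts(true_counts, sigma, min_reliable, rng):
    """Add normal noise to each count and suppress cells that look too small.

    Returns a dict mapping each cell to a rounded noisy count, or to None
    when the cell is suppressed.
    """
    released = {}
    for cell, count in true_counts.items():
        noisy = count + rng.gauss(0, sigma)
        # The check uses the noisy value, not the confidential true count.
        released[cell] = round(noisy) if noisy >= min_reliable else None
    return released


# Hypothetical student counts by field of study at one college.
true_counts = {"Nursing": 250, "Classics": 4}

rng = random.Random(7)  # fixed seed so the sketch is repeatable
released = release_counts(true_counts, sigma=5.0, min_reliable=20, rng=rng)
print(released)
```

With noise of this size, a cell of 250 students changes by a few percent at most, while a cell of 4 students could plausibly come out as 0 or 10. Publishing the second number would say more about the noise than about the college, which is the reasoning behind suppressing it.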
Though there is bipartisan commitment to modernizing data infrastructure, the current environment also presents risks. Without complete and consistent federal data, blending public and private datasets offers the opportunity to unlock powerful insights for evidence-based decisionmaking—but only if done responsibly.