Data analysts sometimes lump together populations with small sample sizes—often racial and ethnic groups—out of convenience. But doing so can have harmful effects on those communities.
Combining groups with different experiences can obscure specific communities’ circumstances and paint an inaccurate picture. This prevents people from seeing their lives and experiences reflected in the data, can lead to misleading analyses, and can keep decisionmakers from making policies that create better outcomes for those communities.
For example, the 2020 Census form included 15 checkboxes for aggregate racial categories with additional space to write in more detail, and nearly 34 million people (more than 10 percent of the US population) selected two or more race categories. This is a remarkable increase over the 9 million people who selected the option in 2010 and the nearly 7 million people who selected the option in 2000, the first year the Census Bureau allowed people to identify as multiple races.
Census results have significant policymaking implications—from congressional redistricting to influencing how billions of federal dollars are allocated. Given the number of people who select two or more races, better understanding each of these groups and how to aggregate their responses is critical for representative decisionmaking.
Though large groupings can mask important variation in data, small sample sizes can limit the ability to conduct rigorous and accurate analysis. Limits on time, budget, or survey methodology may prevent researchers from collecting more data about certain communities. As a result, researchers are often forced to work with the data that are available and may choose to combine groups so they can perform their analysis in a way that is statistically meaningful and thus useful for decisionmakers and stakeholders.
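To make the sample-size constraint concrete, here is a minimal sketch of how the margin of error for an estimated share grows as the sample shrinks. It assumes a simple random sample and the standard normal approximation; real surveys usually have weights and design effects that this ignores, and the function name and sample sizes are illustrative only.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95 percent margin of error for a share p estimated from n responses.

    Assumes simple random sampling; complex survey designs need design-based methods.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Illustrative sample sizes, not drawn from any real survey.
for n in (1000, 100, 30):
    print(f"n={n}: +/- {margin_of_error(0.5, n):.3f}")
```

Under these assumptions, a group with 30 respondents has a margin of error several times wider than a group with 1,000, which is the pressure that pushes analysts toward combining groups.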
In our Do No Harm Guide, we offer considerations for researchers and data analysts to ensure they approach their work through a lens of diversity, equity, and inclusion. A one-size-fits-all method for data aggregation doesn’t exist, so instead of prescribing specific recommendations, we highlight key considerations for how to thoughtfully work with and present data.
How to determine when aggregation is the best approach
Balancing representation with the ability to conduct methodologically sound research can be difficult. Here are some questions to ask yourself before aggregating groups:
- Do the data show these groups of people are experiencing similar effects?
- Do these groups have a shared history?
- Do members of the community or experts on that community confirm that analyzing and presenting these populations together is reasonable?
If the answer to any of these questions is “no,” combining groups is likely inappropriate. It could lead to an inaccurate portrayal of the realities people and communities face, hide the vulnerabilities and challenges of smaller populations, and therefore deprive them of necessary resources. For example, the Center for Health Policy Research team at the University of California, Los Angeles, told us that understanding the specific needs, languages, and cultures of people and communities in the aggregate Asian American and Pacific Islander group helped them more effectively communicate and solicit participation in a survey.
If the answer to all of these questions is “yes,” aggregating groups may be appropriate. If you do aggregate, clearly define which groups are included in each aggregated grouping and justify why they were analyzed and presented together. Try to avoid using the term “other,” and instead use a more empowering, inclusive label such as “another race(s),” “additional groups,” “all other self-descriptions,” or “identity not listed.” Transparency not only helps build trust with your reader but might also help motivate the collection of better, more detailed data for future analyses.
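One way to keep an aggregation transparent is to store the grouping as an explicit mapping and generate the reader-facing note from that same mapping. The sketch below is only an illustration: the category names ("Group A" through "Group D"), the mapping, and the function names are hypothetical placeholders, and the label "Additional groups" is one of the alternatives to "other" suggested above.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from detailed response categories to reporting labels.
# Which groups belong together is a substantive decision, not a coding one.
GROUP_MAP = {
    "Group A": "Group A",
    "Group B": "Group B",
    "Group C": "Additional groups",
    "Group D": "Additional groups",
}


def aggregate_counts(responses):
    """Count responses by reporting label.

    An unmapped category raises KeyError instead of being silently dropped.
    """
    return Counter(GROUP_MAP[r] for r in responses)


def grouping_notes():
    """Build notes listing which detailed categories each combined label contains."""
    members = defaultdict(list)
    for detailed, label in GROUP_MAP.items():
        if detailed != label:
            members[label].append(detailed)
    return [
        f'"{label}" combines: {", ".join(sorted(names))}'
        for label, names in members.items()
    ]


responses = ["Group A", "Group C", "Group B", "Group D", "Group A"]
print(aggregate_counts(responses))
print(grouping_notes())
```

Because the note is derived from the mapping, the published definition of each combined group cannot drift from what the analysis actually did. The justification for combining the groups still has to be written by the analyst.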
Alternatives to aggregating
If you determine it’s inappropriate to aggregate groups, there are a couple of options for balancing representation with analytical accuracy:
- In cases where you have the disaggregated data, present the estimates for the groups with smaller sample sizes to acknowledge the existence of these populations in the dataset. In such situations, communicate in notes or annotations that the estimates may not be statistically significant, and use visual cues to reflect the uncertainty in these estimates. In cases where the aggregation has been done for you, you might include a note explaining how more disaggregated data would enable you to better reflect the experiences of specific groups and communities.
- Show a blank chart or message to represent how data are missing. Doing so allows the analyst to include the group while drawing attention to the lack of data on the population and avoiding presenting analysis that may not be statistically significant. ProPublica used this approach in their July 2020 story, “What Coronavirus Job Losses Reveal About Racism in America.” When a user selects a group that has a small sample size in the dataset, the ProPublica story displays a message over the chart explaining how the data for that group are too limited to be reliable. This allowed ProPublica to be inclusive while not misinforming readers.
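Both options above can be sketched as a small display rule: show the estimate when the sample is large enough, and otherwise either flag it or replace it with a message. This is a hedged illustration, not ProPublica's implementation; the threshold of 50, the class and function names, and the message wording are all assumptions that would need to be set for a given survey.

```python
from dataclasses import dataclass

# Hypothetical minimum sample size; the right threshold depends on the survey design.
MIN_SAMPLE = 50


@dataclass
class GroupEstimate:
    group: str
    n: int        # number of respondents in the group
    rate: float   # estimated share, between 0 and 1


def display_value(est: GroupEstimate, min_sample: int = MIN_SAMPLE,
                  suppress: bool = False) -> str:
    """Return the text to show for a group, keeping small groups visible."""
    if est.n >= min_sample:
        return f"{est.group}: {est.rate:.1%}"
    if suppress:
        # Second option: keep the group in the chart but show a message, not a number.
        return f"{est.group}: data too limited to report (n={est.n})"
    # First option: show the estimate with an explicit caution.
    return f"{est.group}: {est.rate:.1%}* (small sample, n={est.n}; interpret with caution)"


estimates = [GroupEstimate("Group A", 400, 0.125), GroupEstimate("Group B", 20, 0.25)]
for est in estimates:
    print(display_value(est))
    print(display_value(est, suppress=True))
```

Either way, the group still appears in the output, so readers can see that it exists in the data even when a reliable number is not available.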
More data collection will be key to truly addressing underrepresentation
Data analysts and communicators can consider these options to ensure communities with small sample sizes are represented—without sacrificing accurate analysis. But ultimately, collecting more data about underrepresented communities and broadening the number of groups included in the data are what will ensure research reflects the lives of all people and that their voices and experiences are acknowledged in how data are analyzed and visualized.
The Urban Institute has the evidence to show what it will take to create a society where everyone has a fair shot at achieving their vision of success.