Paired Testing

Systematic discrimination against certain groups persists, often in subtle ways. Paired testing, also known as auditing, is an effective and intuitive way to test whether and in what form discrimination exists. In a paired test, two people are assigned fictitious identities and qualifications that are comparable in all key respects. The identities differ only on the characteristic being tested (for example, race or presence of a disability). Each tester in a pair then applies for the same opportunity (for example, a job or an apartment lease) and documents the interaction. With an appropriate sample of tests and statistical techniques, paired testing can identify systematic differences in how testers of different classes are treated.

For example, in a paired test focused on disability-based discrimination in the rental housing market, one person (referred to as the “protected tester”) would have a disability, and the other (the “control tester”) would not. Both testers would be assigned similar education levels, incomes, and family compositions. Both testers’ assigned incomes would ensure they are qualified for the rental unit selected for testing; if a difference in characteristics exists, it makes the protected tester slightly more qualified. Testers are rigorously trained to conduct themselves similarly when interacting with the provider, whether in person or via e-mail or telephone, with as close to the same level of pursuit, questioning, and responsiveness as feasible. Testers are also trained to respond to initial and follow-up contact in the same way. When a test is done well, the only important difference between the two testers is the factor on which discrimination might occur. Each tester works independently, and neither is told who the other member of the pair is, to avoid biasing the test data.

Testers document each stage of their experience thoroughly and systematically. Analysis of the documentation from both testers in a pair can measure differences in treatment that individual testers might not notice. Such analysis can also uncover subtle forms of discrimination. Overt discrimination can often be recognized as soon as it occurs: refusing to meet with a tester after seeing that he or she is disabled or a racial or ethnic minority, or making a patently discriminatory comment. Less obvious discrimination could take the form of a housing provider showing white testers, on average, three apartment units and minority testers, on average, two units. It could also include offering less information about rent specials to minority testers or steering them toward poorer-quality units. The individual testers would not realize they were treated differently from their matched partners, but analysis of many tests would reveal the differential treatment.
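The units-shown example above can be sketched as a paired comparison. This is a minimal illustration, not the method used in any particular study: the function name, the input lists, and the choice of a paired t statistic are all assumptions made here, and the numbers in the usage lines are invented for demonstration only.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference(control_units, protected_units):
    """Summarize within-pair differences in the number of units shown.

    Each position in the two lists is one paired test: the control
    tester's count and the protected tester's count for the same ad.
    Returns (mean difference, paired t statistic). The t statistic is
    None when every pair has the same difference (zero spread).
    """
    if len(control_units) != len(protected_units):
        raise ValueError("each test needs one count per tester")
    if len(control_units) < 2:
        raise ValueError("need at least two paired tests")

    # Positive values mean the control tester was shown more units.
    diffs = [c - p for c, p in zip(control_units, protected_units)]
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean difference
    t_stat = d_bar / se if se > 0 else None
    return d_bar, t_stat


# Invented counts for six hypothetical tests (illustration only).
control = [3, 4, 2, 3, 3, 4]
protected = [2, 3, 2, 2, 1, 3]
avg_gap, t_stat = paired_difference(control, protected)
```

No single pair in this toy data looks dramatic, which is the point of the paragraph above: the gap only becomes visible when differences are averaged over many tests. A real analysis would also need to account for how tests are clustered within sites.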

Sampling

Paired testing’s analytical rigor also depends on careful sampling. A good paired testing study will ensure that the agents tested (employers, housing providers) are representative of all units (e.g., jobs, available rental housing, or mortgage brokers) in the study area. The most recent studies of rental housing discrimination have allocated the tests by zip codes, according to where rental housing is located in a particular metropolitan area. Ads are then chosen from online sites, such as Craigslist, to correspond to those zip codes. Past studies of employment and housing discrimination relied on job and housing advertisements in an area’s major newspapers. Because many opportunities are likely conveyed through private channels, using newspapers or Craigslist might be a conservative strategy for estimating the frequency of discrimination (Fix and Struyk 1993).
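One way to picture allocating tests "according to where rental housing is located" is proportional allocation. The sketch below is an assumption about how such an allocation could work, not a description of the studies' actual procedure: the function name, the zip code labels, and the unit counts are hypothetical, and the largest-remainder rounding rule is simply one common choice.

```python
def allocate_tests(rental_units_by_zip, total_tests):
    """Split a fixed number of tests across zip codes in proportion
    to each zip code's share of rental units (largest-remainder rounding).
    """
    total_units = sum(rental_units_by_zip.values())
    if total_units <= 0:
        raise ValueError("need at least one rental unit in the study area")

    # Exact (fractional) number of tests each zip code would get.
    quotas = {z: total_tests * units / total_units
              for z, units in rental_units_by_zip.items()}

    # Start with the whole-number part of each quota.
    allocation = {z: int(q) for z, q in quotas.items()}

    # Hand out the remaining tests to the largest fractional remainders.
    leftover = total_tests - sum(allocation.values())
    by_remainder = sorted(quotas, key=lambda z: quotas[z] - allocation[z],
                          reverse=True)
    for z in by_remainder[:leftover]:
        allocation[z] += 1
    return allocation


# Hypothetical zip codes and rental-unit counts (illustration only).
plan = allocate_tests({"zip_A": 500, "zip_B": 300, "zip_C": 200}, total_tests=10)
```

After allocation, ads would be drawn within each zip code until its quota of tests is filled.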

The sample of tests should also be large enough that any conclusions drawn cannot be attributed to chance. For example, a difference in how testers are treated by a given employer might stem from discrimination or from some other reason having nothing to do with the testers’ characteristics; data from many tests provide more convincing evidence regarding whether and in what parts of an interaction discrimination occurs. Finally, national estimates need to be based on tests conducted in a relatively large and representative set of metropolitan areas. The locations are typically chosen using a stratified random sample, designed to ensure that the sample is representative in terms of the size and geographic location of the areas. The 2012 housing discrimination study, for example, was conducted in a total of 28 metropolitan areas.
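The stratified selection of metropolitan areas described above might look roughly like the following. This is a simplified sketch under stated assumptions: the strata (a size class crossed with a region), the field names, and the equal number of draws per stratum are all hypothetical, and real designs often weight strata rather than sampling them equally.

```python
import random
from collections import defaultdict


def stratified_sample(areas, per_stratum, seed=0):
    """Draw up to `per_stratum` areas at random from each stratum.

    `areas` is a list of dicts with hypothetical keys 'name',
    'size_class', and 'region'. A stratum is one (size_class, region)
    combination, so every combination present is represented.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible

    # Group area names by stratum.
    strata = defaultdict(list)
    for area in areas:
        strata[(area["size_class"], area["region"])].append(area["name"])

    chosen = []
    for key in sorted(strata):
        names = sorted(strata[key])
        k = min(per_stratum, len(names))  # small strata contribute all they have
        chosen.extend(rng.sample(names, k))
    return chosen


# Placeholder areas, not real study sites.
areas = [
    {"name": "Metro 1", "size_class": "large", "region": "South"},
    {"name": "Metro 2", "size_class": "large", "region": "South"},
    {"name": "Metro 3", "size_class": "small", "region": "South"},
    {"name": "Metro 4", "size_class": "large", "region": "West"},
]
sample = stratified_sample(areas, per_stratum=1)
```

Sampling within strata, rather than from the full list at once, is what guards against drawing, say, only large areas from one part of the country.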

Determining the key metric

Though intuitive definitions of discrimination may seem obvious—for example, “unfair treatment of equally qualified people”—it’s important to have a definition that stands up to the rigor of quantification and statistical testing. For example, in any given situation or number of tests, one person may receive less favorable treatment for idiosyncratic reasons that have nothing to do with discrimination based on the characteristic of interest. It’s therefore important to have a definition—and sufficient data collection and statistical rigor—that allows researchers to distinguish between random and systematic outcomes.

Recent Urban Institute testing studies report, for each outcome, the share of tests in which the control tester was favored over the protected tester, the share in which the protected tester was favored over the control tester, and the difference between the two shares. Statistical hypothesis tests are used to determine whether that difference is statistically distinguishable from zero. This difference, known as the net measure, captures the overall degree of disadvantage the protected tester faces in the market as a whole. That said, it is a conservative measure of systematic discrimination against the protected class, because it not only subtracts random differences from the gross measure of control-favored treatment, but may also subtract some differences that reflect systematic favoring of the protected testers.
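The gross and net measures can be computed directly from per-test outcomes. The sketch below is illustrative only. The text does not say which hypothesis test the studies use, so the exact sign test here is an assumption chosen for simplicity, and the function names and outcome labels are hypothetical.

```python
from math import comb


def net_measure(outcomes):
    """Return (control-favored share, protected-favored share, net measure).

    `outcomes` holds one label per paired test: 'control', 'protected',
    or 'equal', naming which tester (if either) was favored.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one test")
    control = sum(o == "control" for o in outcomes)
    protected = sum(o == "protected" for o in outcomes)
    # Net measure = gross control-favored share minus protected-favored share.
    return control / n, protected / n, (control - protected) / n


def sign_test_p_value(control_favored, protected_favored):
    """Two-sided exact sign test on tests where one tester was favored.

    Under the null hypothesis of no systematic difference, each such
    test is equally likely to favor either tester (probability 0.5).
    """
    n = control_favored + protected_favored
    if n == 0:
        return 1.0
    k = min(control_favored, protected_favored)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented outcomes for 100 hypothetical tests (illustration only).
outcomes = ["control"] * 30 + ["protected"] * 20 + ["equal"] * 50
gross_control, gross_protected, net = net_measure(outcomes)
p_value = sign_test_p_value(30, 20)
```

In this toy data the control tester is favored in 30 percent of tests and the protected tester in 20 percent, a net measure of 10 percentage points. Tests with equal treatment lower both shares but play no part in the sign test, which looks only at tests where one tester was favored.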

Drawbacks

As a method, paired testing has strengths and drawbacks. With representative samples and sufficient numbers of tests, paired testing can provide rigorous evidence of systematic discrimination, even if that discrimination is subtle. For that reason, the results of high-quality paired testing studies often carry political appeal: they can provide strong evidence and tell a story with clear policy implications.

Critics of paired testing have raised several concerns. Paired testing involves deception and engages human subjects in research without their knowledge. In addition, the tests can cost agents time as they interact with testers who will never accept a job offer, rent an apartment, or accept services. Proponents of paired testing argue that those costs are usually nominal, and that the deception is not harmful because, by advertising a job, housing, or service, the agents are inviting members of the public to contact them. Agents would not expect every applicant or prospective client to follow through, so having testers pose as bona fide interested parties should impose minimal, if any, harm. Because testing for research purposes does not involve building an enforcement case, legal risk to agents can be mitigated through data protection and confidentiality processes.

Another criticism holds that testers might already suspect or want to show discrimination by an agent and thus subconsciously or intentionally document test interactions in a way that indicates discrimination. While all self-reported data are subject to this type of bias, rigorous training, strong supervision, and appropriate monetary incentives for testers and supervisors are believed to reduce that risk.

One further drawback is that paired testing can only be carried out in the most accessible parts of the process being examined. For example, employment testing can examine hiring, but not promotions. In-person tests of hiring are necessarily focused on entry-level jobs, where applicants need little specialized knowledge. Broader studies, in which applicants submit résumés with racially identifiable names, are limited to examining the rate at which employers call applicants back. Rental housing testing can examine information exchanges related to available rental units, but testers may not submit an application. Testing therefore does not uncover whether there is discrimination in the application review and acceptance phase of the process, much less in how a tenant is treated once the unit is obtained.

Related research

The Urban Institute has used paired testing for decades to investigate and document the extent of labor and housing market discrimination. Urban Institute paired testing studies include:

Reference

Fix, Michael, and Raymond J. Struyk, eds. 1993. Clear and Convincing Evidence: Measurement of Discrimination in America. Washington, DC: Urban Institute Press.