Often, we want to explain the gap between two groups in wages, poverty levels, or some other outcome of interest. An observed gap in average wages between groups, for example between men and women or across racial and ethnic groups, can be ascribed partly to differences in characteristics across groups, such as levels of education, and partly to differences in the returns to those characteristics, or "coefficients" (the coefficient on education represents the monetary "return" to education).
The iconic method of assigning portions of an observed gap to characteristics on one hand and coefficients on the other comes from Blinder (1973) and Oaxaca (1973). If we assume that each of the two groups' outcomes is described by a linear regression of the outcome, y (for example, log wages), on characteristics, X, we can write down two linear regression models:
y1 = X1b1 + u1 or E[y1|X1] = X1b1, which means E[y1] = E[X1]b1
y2 = X2b2 + u2 or E[y2|X2] = X2b2, which means E[y2] = E[X2]b2
and we can ask whether the mean wage in group 1 would be the same if it had the average characteristics of group 2, by calculating its expected outcome with those average characteristics but its own returns. This gives a "counterfactual" average wage E[X2]b1. Adding and subtracting this counterfactual, the gap in average wages can be written as the sum of two terms,
E[y1] - E[y2] = (E[X1] - E[X2])b1 + E[X2](b1 - b2)
where the first term measures the difference in mean wages due to the difference in mean characteristics and the second measures the difference in mean wages due to differences in coefficients. A modification of this method can be used for logit and probit models.
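The two-term split of the gap in average wages can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data are simulated, the function names (`ols`, `oaxaca_blinder`) and all numeric values are hypothetical, and group 1's coefficients are used to weight the characteristics term, as in the counterfactual E[X2]b1 described above.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = Xb + u (X must include a constant column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def oaxaca_blinder(X1, y1, X2, y2):
    """Split the mean gap E[y1] - E[y2] into characteristics and coefficients parts."""
    b1, b2 = ols(X1, y1), ols(X2, y2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    gap = y1.mean() - y2.mean()
    characteristics = (xbar1 - xbar2) @ b1   # (E[X1] - E[X2]) b1
    coefficients = xbar2 @ (b1 - b2)         # E[X2] (b1 - b2)
    return gap, characteristics, coefficients

# Simulated log wages (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 5000
educ1 = rng.normal(13.0, 2.0, n)   # group 1: more schooling on average
educ2 = rng.normal(12.0, 2.0, n)
X1 = np.column_stack([np.ones(n), educ1])
X2 = np.column_stack([np.ones(n), educ2])
y1 = 1.0 + 0.10 * educ1 + rng.normal(0.0, 0.3, n)
y2 = 0.9 + 0.08 * educ2 + rng.normal(0.0, 0.3, n)

gap, chars, coefs = oaxaca_blinder(X1, y1, X2, y2)
print(f"gap={gap:.3f}  characteristics={chars:.3f}  coefficients={coefs:.3f}")
```

Because each regression includes a constant, the fitted values average to the sample mean of y in each group, so the two terms add up to the raw gap up to floating-point error. Weighting by group 2's coefficients instead would generally give a different split of the same gap.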
A related method from Juhn, Murphy, and Pierce (1993) starts with the same two linear regression models, where y1 and y2 are the outcome variables in the two samples (such as the log of hourly wages), X1 and X2 are the observable explanatory variables (characteristics), b1 and b2 are the vectors of estimated coefficients or "returns" to characteristics (observable "prices"), and u1 and u2 are residuals (reflecting unmeasured prices and quantities). Group 1 might be men and group 2 women, or group 1 might be white workers and group 2 black workers.
If F1(.) and F2(.) denote the cumulative distribution functions of the residuals, then p_i1 = F1(u_i1 | X_i1) is the rank of individual i's residual in the residual distribution of model 1, and we can write u_i1 = F1^(-1)(p_i1 | X_i1), where F1^(-1)(.) is the inverse of the cumulative distribution function. Let F(.) be a reference residual distribution (for example, the average residual distribution over both samples) and let b be an estimate of benchmark coefficients (for example, the coefficients from a pooled model over the whole sample). We can then construct three sets of outcomes for each group: (1) hypothetical outcomes with quantities varying between the groups but fixed prices (coefficients) and a fixed residual distribution; (2) hypothetical outcomes with varying quantities and varying prices but a fixed residual distribution; and (3) outcomes with varying quantities, varying prices, and a varying residual distribution, which are simply the originally observed outcome values.
Differences across groups in a summary statistic of these outcomes (for example, the mean or a percentile gap) can then be attributed to differences in observable quantities, differences in observable prices (the returns to those quantities), and differences in unobservable quantities and prices.
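The three-way attribution can be sketched as follows. This is an illustrative sketch under several assumptions that are not in the original papers' text here: residual ranks are computed unconditionally (ignoring the conditioning on X), the reference distribution F is taken to be the pooled set of group-specific residuals, the benchmark coefficients b come from a pooled regression, and the summary statistic is a 90-10 percentile spread. The function names and simulated values are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = Xb + u (X must include a constant column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def residual_ranks(u):
    """Percentile rank of each residual within its own group, strictly inside (0, 1)."""
    ranks = np.argsort(np.argsort(u))  # 0-based rank of each element
    return (ranks + 0.5) / len(u)

def jmp_decompose(X1, y1, X2, y2, stat):
    """Attribute stat(y1) - stat(y2) to quantities, prices, and unobservables."""
    b1, b2 = ols(X1, y1), ols(X2, y2)
    b = ols(np.vstack([X1, X2]), np.concatenate([y1, y2]))  # benchmark prices (assumed: pooled OLS)
    u1, u2 = y1 - X1 @ b1, y2 - X2 @ b2
    ref = np.concatenate([u1, u2])  # reference residual distribution (assumed: pooled residuals)
    # Map each individual's own-group rank into the reference distribution.
    r1 = np.quantile(ref, residual_ranks(u1))
    r2 = np.quantile(ref, residual_ranks(u2))
    q_gap = stat(X1 @ b + r1) - stat(X2 @ b + r2)      # quantities vary only
    qp_gap = stat(X1 @ b1 + r1) - stat(X2 @ b2 + r2)   # quantities and prices vary
    total = stat(y1) - stat(y2)                        # observed outcomes
    return {"total": total, "quantities": q_gap,
            "prices": qp_gap - q_gap, "unobservables": total - qp_gap}

def spread_90_10(y):
    """Gap between the 90th and 10th percentiles, a simple inequality measure."""
    return np.percentile(y, 90) - np.percentile(y, 10)

# Simulated log wages (assumed values, for illustration only).
rng = np.random.default_rng(1)
n = 5000
educ1 = rng.normal(13.0, 2.5, n)
educ2 = rng.normal(12.0, 2.0, n)
X1 = np.column_stack([np.ones(n), educ1])
X2 = np.column_stack([np.ones(n), educ2])
y1 = 1.0 + 0.10 * educ1 + rng.normal(0.0, 0.4, n)   # wider residual spread in group 1
y2 = 0.9 + 0.08 * educ2 + rng.normal(0.0, 0.3, n)

out = jmp_decompose(X1, y1, X2, y2, spread_90_10)
for name, value in out.items():
    print(f"{name}: {value:.3f}")
```

The three components sum to the total difference by construction, since each is a difference between successive hypothetical gaps. Passing a different statistic (for example `np.mean` or `np.median`) decomposes that statistic instead.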
Blinder, Alan S. 1973. “Wage Discrimination: Reduced Form and Structural Estimates.” Journal of Human Resources 8 (4): 436–55.
Juhn, Chinhui, Kevin M. Murphy, and Brooks Pierce. 1993. “Wage Inequality and the Rise in Returns to Skill.” Journal of Political Economy 101 (3): 410–42.
Oaxaca, Ronald. 1973. “Male–Female Wage Differentials in Urban Labor Markets.” International Economic Review 14 (3): 693–709.