Social Genome Project

A large fraction of American children who are born into low-income families grow up to become low-income adults. The Social Genome Model is an analytic data tool designed to help identify and assess which policy interventions might most effectively and efficiently break that pattern of intergenerational disadvantage.

The Social Genome Model is a model of social mobility from birth through adulthood. With the model, researchers can explore how targeted social policies or changes might influence social mobility across various outcomes and across life-course stages from childhood through adolescence and adulthood. For example, if we could reduce the number of children born below a healthy weight, what improvements would we see in school performance, high school graduation, and adult incomes? Similarly, if we could improve reading scores in elementary school, how much improvement might we see in high school grade point averages and graduation rates?

The Social Genome Model can be a relatively quick and inexpensive laboratory in which to investigate ideas about the effects of earlier life interventions on later life outcomes, and a prelude and guide to the development of full field studies. The model is best understood as a helpful tool in a larger toolbox of resources for policy research. For example, given a randomized controlled treatment trial of an early childhood intervention and outcome, the Social Genome Model can simulate how that trial outcome is associated with a web of other outcomes through the life course. As with any statistical tool, and especially considering the broad scope of the questions it poses, the Social Genome Model must be employed with care and its results checked for their construct validity and methodological soundness.

Data sources

The Social Genome Model is actually two models, each based on its own survey data. The original version of the Social Genome Model uses data from the National Longitudinal Survey of Youth 1979 (NLSY79), focusing on children born to women in the cohort (the CNLSY). Data from the CNLSY provide information on American-born children from birth through adolescence, and in some cases into early adulthood. To extend the model’s ability to examine outcomes through young adulthood and middle age, we draw on data from the original NLSY79 sample. We articulate the data from the NLSY79 and CNLSY to construct a panel study of respondents from birth through age 40.

An additional version of the Social Genome Model uses longitudinal panels from the National Longitudinal Survey of Youth 1997 (NLSY97). This cohort was born in the early 1980s, was ages 12 to 16 at the initial interview in 1997, and has been followed through members' late 20s as of the most recent panel.

Multivariate regressions across life stages

In both versions of the Social Genome Model—the version based on the NLSY79/CNLSY and that based on the NLSY97—relationships between variables across life stages were estimated using ordinary least squares regression for continuous outcomes and a linear probability model for dichotomous outcomes. The Social Genome Model for both surveys uses substantial imputation of missing values.
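As a rough illustration of the two estimators named above, the sketch below fits a one-predictor least-squares line to a continuous outcome and to a 0/1 outcome; the second fit is a linear probability model. The function name `fit_line`, the variable names, and the toy data are assumptions made for this example only. They are not the model's variables or estimates, and the actual model uses many predictors per regression.

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: one earlier-stage score and two later outcomes.
reading_score = [1.0, 2.0, 3.0, 4.0, 5.0]
gpa = [2.0, 2.4, 2.8, 3.2, 3.6]   # continuous outcome -> ordinary least squares
graduated = [0, 0, 1, 1, 1]       # dichotomous outcome coded 0/1 -> linear probability model

gpa_intercept, gpa_slope = fit_line(reading_score, gpa)
grad_intercept, grad_slope = fit_line(reading_score, graduated)

# In a linear probability model the slope reads as the change in the
# probability of the outcome per one-unit change in the predictor.
predicted_prob_at_top_score = grad_intercept + grad_slope * 5.0
```

With this toy data the predicted probability at the highest score exceeds 1. That is a known property of linear probability models: their fitted values are not confined to the interval from 0 to 1.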

For the version of the Social Genome Model based on the NLSY79 and the CNLSY, the regressions proceed in five sets of estimations:

The first set of regressions estimates outcomes for each age-5 variable (early childhood, EC) as a function of all the variables for circumstances at birth (CAB).
The second set estimates outcomes for each age-11 variable (middle childhood, MC) as a function of all EC and CAB variables.
The third set estimates outcomes for each age-19 variable (adolescent, ADOL) as a function of all MC, EC, and CAB variables.
The fourth set estimates outcomes for each age-29 variable (transition to adulthood, TTA) as a function of all ADOL and CAB variables. Because of articulation between panel studies, it is not possible to include EC and MC variables directly in the TTA estimates, but their effects are indirectly estimated to the extent they are mediated through the array of ADOL variables.
The fifth and final set of regressions estimates outcomes for each age-40 variable (adulthood, AD) as a function of all TTA, ADOL, and CAB variables.
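The five sets of regressions above can be summarized as a mapping from each outcome stage to the stages that supply its predictors. The sketch below encodes that mapping and separates direct from indirect influence. The stage abbreviations come from the text; the dictionary and function names are hypothetical.

```python
# Predictor stages for each outcome stage, as listed in the five sets above.
PREDICTOR_STAGES = {
    "EC": ["CAB"],
    "MC": ["EC", "CAB"],
    "ADOL": ["MC", "EC", "CAB"],
    "TTA": ["ADOL", "CAB"],          # EC and MC enter only through ADOL
    "AD": ["TTA", "ADOL", "CAB"],
}
STAGE_ORDER = ["CAB", "EC", "MC", "ADOL", "TTA", "AD"]


def direct_stages(stage):
    """Later stages whose regressions include `stage` variables as predictors."""
    return [s for s in STAGE_ORDER if stage in PREDICTOR_STAGES.get(s, [])]


def downstream_stages(stage):
    """Later stages reachable from `stage`, directly or through mediating stages."""
    affected = []
    for later in STAGE_ORDER[STAGE_ORDER.index(stage) + 1:]:
        predictors = PREDICTOR_STAGES[later]
        if stage in predictors or any(p in affected for p in predictors):
            affected.append(later)
    return affected


print(direct_stages("EC"))      # stages with EC variables on the right-hand side
print(downstream_stages("EC"))  # stages EC can reach, including via ADOL
```

For early childhood, the direct list stops at adolescence, while the downstream list continues through adulthood. This matches the statement above that EC and MC effects on TTA outcomes are captured only as far as they are mediated through ADOL variables.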

For the version of the Social Genome Model based on the NLSY97, the regression pattern proceeds similarly, although the ages vary slightly. Because of the limitations of the single longitudinal panel, there are no regressions for early childhood or mid-adulthood.

Given the likelihood that many processes across life stages could vary by race and gender, the model’s parameters were estimated using separate batteries of regressions for black men, black women, nonblack men, and nonblack women.

Choosing interventions, simulating results

We use the estimated parameters and the actual values for the baseline characteristics to create a synthetic baseline prediction of each outcome for every individual in our target population. The procedure for creating the baseline predictions varies by the nature of the variable (continuous or dichotomous) and life stage (ADOL or earlier, or TTA or later). To simulate the likely effects of a given policy, we alter the characteristics and circumstances directly affected by the policy and then assess how that change flows through a child’s subsequent outcomes through the life course. For each hypothetical intervention, we observe the changes from the synthetic baseline predictions as the effects of each intervention propagate across all life stages following the intervention.
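The propagation idea described above can be sketched with a two-equation chain: predict each later outcome from earlier values, change one baseline characteristic, predict again, and compare. Everything in the sketch is hypothetical, including the variable names and the coefficients, which are not estimates from the model. It also ignores the differences in baseline procedure by variable type and life stage that the text mentions.

```python
# Hypothetical coefficients, for illustration only (not estimates from the model).
COEFS = {
    "mc_reading": {"intercept": 40.0, "cab_low_birth_weight": -5.0},
    "adol_gpa": {"intercept": 1.0, "mc_reading": 0.04, "cab_low_birth_weight": -0.1},
}
OUTCOME_ORDER = ["mc_reading", "adol_gpa"]  # earlier life stage first


def predict_path(baseline_chars):
    """Predict each outcome in life-stage order, feeding earlier predictions forward."""
    values = dict(baseline_chars)
    for outcome in OUTCOME_ORDER:
        equation = COEFS[outcome]
        values[outcome] = equation["intercept"] + sum(
            coef * values[name]
            for name, coef in equation.items()
            if name != "intercept"
        )
    return values


# One hypothetical child born below a healthy weight.
child = {"cab_low_birth_weight": 1.0}
baseline = predict_path(child)

# Simulated policy: the same child, born at a healthy weight instead.
simulated = predict_path({**child, "cab_low_birth_weight": 0.0})

# The estimated effect is the change from the baseline prediction at each stage.
effects = {name: simulated[name] - baseline[name] for name in OUTCOME_ORDER}
print(effects)
```

The change in the later outcome combines a direct term from the altered characteristic with an indirect term carried through the intermediate outcome, which is the sense in which an effect "flows through" subsequent life stages.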

We can control many dimensions of the hypothetical policy interventions, including the life stages at which the interventions occur, the variables that are changed, the target population for the intervention, the fraction of the target population affected, and the size of the effect on the affected. We can set the extent of the hypothetical interventions using any criteria; generally, we draw on published research based on randomized controlled trials or on aspirational policy targets.
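One way to picture these dimensions is as the fields of a small specification object that is applied to a population before the simulation is run. The sketch below is an assumption about how such a specification could look, not the model's actual interface. The class, field, and variable names are hypothetical, and the random draw of affected individuals is one simple choice among several.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Intervention:
    stage: str                          # life stage at which the intervention occurs
    variable: str                       # variable changed by the intervention
    is_target: Callable[[dict], bool]   # defines the target population
    fraction_affected: float            # share of the target population affected
    effect_size: float                  # change applied to each affected person


def apply_intervention(population, intervention, seed=0):
    """Return a copy of the population with the intervention applied."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    targets = [i for i, person in enumerate(population) if intervention.is_target(person)]
    n_affected = round(len(targets) * intervention.fraction_affected)
    chosen = set(rng.sample(targets, n_affected))
    result = []
    for i, person in enumerate(population):
        updated = dict(person)
        if i in chosen:
            updated[intervention.variable] += intervention.effect_size
        result.append(updated)
    return result


# Hypothetical population of ten children, half from low-income families.
population = [{"low_income": i % 2 == 0, "mc_reading": 30.0} for i in range(10)]

# Hypothetical policy: raise middle-childhood reading scores by 4 points
# for 60 percent of low-income children.
boost = Intervention("MC", "mc_reading", lambda p: p["low_income"], 0.6, 4.0)
after = apply_intervention(population, boost)
```

The altered population would then be passed through the estimated equations so that the change propagates to later life stages.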

Limits and potential of the model

It is impossible to construct a truly comprehensive model of the effects of social policy across the life course, so all our work with the Social Genome Model involves careful analysis and discussion not only of its results, but of the specific and general limitations of each study. For example, our use of regression models involves well-known problems with omitted variables and selection effects on the variables that we can measure. Further, while our data sources are rich and of high quality overall, establishing and maintaining the representativeness of data across a longitudinal panel study is always a concern, and different variables within the samples will involve their own unique data and measurement issues. Finally, the Social Genome Model involves a necessary trade-off of the ideal of observing as many successive life stages as possible for each respondent against the ideal of measuring data in the most current period available; as a result, it is not able to fully satisfy either ideal.

Nevertheless, the Social Genome Model serves an indispensable function when its limits are properly understood. In most cases, randomized controlled trials have neither the time nor the funding to measure the longer-term effects of a policy intervention. In such cases, the Social Genome Model can extend the information provided by randomized controlled trials, providing imperfect yet empirically based, conceptually informed, and multidimensional estimates to help guide policymakers and interested citizens facing important decisions.

Applications and ongoing development

Urban’s work with this new methodological tool is just beginning. This site will be updated regularly as new Social Genome studies and reports come online.

Researchers at Urban have several studies in progress based on the Social Genome Model:

an examination and comparison of interventions from infancy to young adulthood, for improving socioeconomic outcomes of black men
comparisons of intergenerational income mobility for black men, nonblack men, black women, and nonblack women
examinations of aspirational interventions into an array of circumstances at birth, for outcomes at all subsequent childhood and adult life stages
a study of how responses to early life interventions vary among black men, nonblack men, black women, and nonblack women.

Urban is collaborating in the ongoing development of the Social Genome Model with the Brookings Institution (the previous home of the Social Genome Project) and Child Trends. Brookings research using the Social Genome Model has focused on the effects of academic or social skills on later success—for example, the effects of educational attainment and achievement on earnings and, by extension, on income. Child Trends is using the Social Genome Model to analyze adolescence and the transition to adulthood.

Urban is also continuing to improve and expand the Social Genome Model. The improvements involve benchmarking and validating different pieces of the regression modeling, imputation, and simulation programs. The expansions include incorporating new variables at currently studied life stages and planning for articulating the existing Social Genome Model with other simulation models at Urban, such as DYNASIM.
