Inequality in Socio-Emotional Skills: A Cross-Cohort Comparison

We examine changes in inequality in socio-emotional skills very early in life in two British cohorts born 30 years apart. We construct comparable scales using two validated instruments for the measurement of child behaviour and identify two dimensions of socio-emotional skills: 'internalising' and 'externalising'. Using recent methodological advances in factor analysis, we establish comparability in the inequality of these early skills across cohorts, but not in their average level. We document for the first time that inequality in socio-emotional skills has increased across cohorts, especially for boys and at the bottom of the distribution. We also formally decompose the sources of the increase in inequality and find that compositional changes explain half of the rise in inequality in externalising skills. On the other hand, the increase in inequality in internalising skills seems entirely driven by changes in returns to background characteristics. Lastly, we document that socio-emotional skills measured at an earlier age than in most of the existing literature are significant predictors of health and health behaviours. Our results show the importance of formally testing comparability of measurements to study skills differences across groups, and in general point to the role of inequalities in the early years for the accumulation of health and human capital across the life course.

Any opinions expressed in this paper are those of the author(s) and not those of IZA. Research published in this series may include views on policy, but IZA takes no institutional policy positions. The IZA research network is committed to the IZA Guiding Principles of Research Integrity. The IZA Institute of Labor Economics is an independent economic research institute that conducts research in labor economics and offers evidence-based policy advice on labor market issues. Supported by the Deutsche Post Foundation, IZA runs the world's largest network of economists, whose research aims to provide answers to the global labor market challenges of our time. Our key objective is to build bridges between academic research, policymakers and society. IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character. A revised version may be available directly from the author.

Introduction
Human capital is a key determinant of economic growth and performance and of the resources an individual creates and controls over the life cycle (?). Human capital is also important for various determinants of individual well-being, ranging from life satisfaction to health (?). In recent years, the process of human capital accumulation has received considerable attention (?). There is growing consensus that human capital is a multidimensional object, with different domains playing different roles in the labour market as well as in the determination of other outcomes, including the process of human development. It is also recognised that human capital is the output of a very persistent process, where early-years inputs play an important and long-lasting role (?).
And yet, there are still large gaps in our knowledge of the process of human capital development. These gaps are partly driven by the scarcity of high-quality longitudinal data measuring the evolution over the life cycle of different dimensions of human capital. Moreover, there is a lack of consensus on the best measures and on the tools to collect high-quality data. As a consequence, even when data are available in different contexts, their comparability is problematic (?).
In this paper, we focus on an important dimension of human capital, which has been receiving increasing attention in the last few years: socio-emotional skills. It has been shown that gaps in socio-emotional skills emerge at very young ages, and that, in the absence of interventions, they are very persistent across the life cycle (?). However, there is surprisingly little evidence on how inequality in this important dimension of human capital has changed across cohorts. In this paper, we start addressing this gap and focus on the measurement of these skills in two British cohorts: the one of children born in 1970 and the one of children born in 2000. We consider the measurement of socio-emotional skills during early childhood, as these skills have been shown, in a variety of contexts (?), to have important long-run effects. Our goal is to characterise the distributions of socio-emotional skills in these cohorts and compare them. In the last part of the paper, we also consider the predictive power of different socio-emotional skills for health and socioeconomic outcomes.
We proceed in four steps. First, we construct a novel scale of childhood behavioural traits from two validated instruments and assess its comparability across cohorts. By performing exploratory and multiple-group factor analyses, we determine that two dimensions are a parsimonious representation of socio-emotional skills for both cohorts. Consistent with previous literature, we label them as 'internalising' and 'externalising' skills, the former relating to the ability of children to focus their drive and determination, and the latter relating to their ability to engage in interpersonal activities. Importantly, for the first time in economics, we study the comparability of the measures in the two cohorts. In particular, we test for measurement invariance of the items we use to estimate the latent factors. Intuitively, if one assumes that a set of measures is related to a latent unobserved factor of interest, one can think of this relationship as being driven by the saliency of each measure and the level. If one uses a given measure as the relevant metric for the relevant factor, its saliency will determine the scale of the factor, while some other parameters, which could be driven by the difficulty of a given test or the social norms and attitudes towards a certain type of behaviour, determine the average level of the factor. Comparability of estimated factors across different groups (such as different cohorts) assumes that both the parameters that determine the saliency of a given set of measures and the level of the factors do not vary across groups. We find that, for the measures we use and for both factors, we cannot reject measurement invariance for the saliency parameters. However, we strongly reject measurement invariance for the level parameters. These results imply that while we can compare the inequality in skills across the two cohorts, we cannot determine whether the average levels of the two factors are larger or smaller in one of the two cohorts.
While this result hinders a comparison in the level of skills, it is of interest per se to find that mothers of children born in England thirty years apart assess behaviours differently, so that differences in the raw scales cannot be unequivocally interpreted as differences in the underlying skills. We believe this is an important finding which deserves a greater degree of attention in the economic literature.
Second, given the results we obtain on measurement invariance, we proceed to compare the inequality in the two types of socio-emotional skills across the two cohorts, for both boys and girls. We find that the most recent cohort is more unequal in both dimensions of socio-emotional skills than the 1970 cohort. This result is particularly apparent for boys, and when looking at differences by maternal background. Third, we formally decompose the increase in inequality in skills into changes in the composition of maternal characteristics and changes in the returns to those characteristics, using recently developed methods based on Recentered Influence Function (RIF) regressions. In doing so, we provide the first application of this method to the child development literature.
Fourth, we study whether the socio-emotional skills we observe at a young age are an important determinant of a variety of adolescent (and adult, for the older BCS cohort) outcomes. We find that socio-emotional skills at age five are more predictive than cognitive skills for unhealthy behaviours like smoking and measures of health capital such as body mass index. The effect of cognition, instead, dominates for educational and labour market outcomes.
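The RIF regressions mentioned in the third step can be illustrated with a minimal sketch, assuming the statistic of interest is a quantile of the skill distribution. It uses the standard recentered influence function of a quantile, RIF(y; q) = q + (tau - 1{y <= q}) / f(q). The simulated data, the single binary background characteristic `d`, the kernel density estimate and the function name `rif_quantile` are all illustrative assumptions; the inequality statistics and covariates used in the paper are not reproduced here.

```python
import math
import random
import statistics

def rif_quantile(values, tau):
    """Return (q, rif) where rif[i] = q + (tau - 1{y_i <= q}) / f(q).

    q is a simple empirical tau-th quantile and f(q) is a Gaussian-kernel
    density estimate with a rule-of-thumb bandwidth (illustrative choices).
    """
    n = len(values)
    q = sorted(values)[min(n - 1, int(tau * n))]
    h = 1.06 * statistics.stdev(values) * n ** (-0.2)
    density = sum(math.exp(-0.5 * ((q - v) / h) ** 2) for v in values)
    density /= n * h * math.sqrt(2 * math.pi)
    return q, [q + (tau - (v <= q)) / density for v in values]

# Hypothetical data: a skill score that is lower, on average, when a binary
# background characteristic d is False.
rng = random.Random(7)
d = [rng.random() < 0.5 for _ in range(4000)]
skill = [rng.gauss(0.5 if di else 0.0, 1.0) for di in d]

q10, rif = rif_quantile(skill, tau=0.10)

# With a single dummy regressor, the OLS slope of the RIF on d equals the
# difference in mean RIF between the two groups.
mean_rif_1 = statistics.fmean(r for r, di in zip(rif, d) if di)
mean_rif_0 = statistics.fmean(r for r, di in zip(rif, d) if not di)
effect_on_q10 = mean_rif_1 - mean_rif_0
print(f"10th percentile: {q10:.3f}, RIF-regression slope: {effect_on_q10:.3f}")
```

Because the RIF averages to (approximately) the quantile itself, regressing it on covariates gives an approximation to how a change in the distribution of those covariates shifts that quantile, which is the building block of a composition-versus-returns decomposition.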
Our key contribution in this paper is to bring together two important strands of the literature: on the one hand, the literature on child development and early interventions; on the other hand, the literature on the measurement and the evolution of different types of inequality. While the former literature has provided robust evidence on the long-term impacts of a variety of early life circumstances, it has not systematically focused on describing and disentangling the sources of inequality in early human development; at the same time, the latter literature has carefully studied measures such as income, wages and wealth, overlooking other important, yet harder to measure, dimensions. In bridging these two literatures, we also apply recent methodological advances in factor analysis and show the importance of testing and constructing comparable aggregates. The methodology that we apply in this paper is likely to be relevant in many other settings, for example when measuring trends in inequality in other dimensions (such as satisfaction, mental health or well-being) whose measurement might have changed over time. Lastly, it is worth emphasizing that, while learning about the evolution and the determinants of inequality in socio-emotional skills is an interesting exercise in its own right, the ultimate goal of such research would be to uncover how much inequalities in early human development contribute to income or wealth inequality later in life. The present paper constitutes a first step towards such an endeavour.
The rest of the paper is organised as follows. We start in section 2 by reviewing the main literature on determinants and consequences of socio-emotional traits. In section 3, we briefly introduce the data we use in the analysis. In section 4, we present the methods we use to identify the number of dimensions in socio-emotional skills and how we estimate the latent factors that represent them. In section 5, we discuss the comparability of factors estimated with a given set of measures from different groups and the measurement invariance tests we use. Section 6 reports our empirical results on changes in inequality in socio-emotional skills and their predictive power for later outcomes. Section 7 concludes the paper.

Literature
The importance of cognition in predicting life course success is well established in the economics literature.
However, in recent years the role played by 'non-cognitive' traits has been increasingly investigated. These traits include constructs as different as psychological and preference parameters such as social and emotional skills, locus of control and self-esteem, personality traits (e.g. conscientiousness), and risk aversion and time preferences. Given the vastness of this literature, we briefly review below the main papers on the determinants and consequences of socio-emotional traits which are more directly related to our work, and we refer to other sources for more exhaustive reviews (????).
Consequences of socio-emotional traits. One of the first papers to pioneer the importance of 'non-cognitive' variables for wages is ?. ? suggest that non-cognitive skills are at least as important as cognitive abilities in determining a variety of adult outcomes. ?, using data based on personal interviews conducted by a psychologist during the Swedish military enlistment exam, show that both cognitive and noncognitive abilities are important in the labour market, but for different outcomes: low noncognitive abilities are more correlated with unemployment or low earnings, while cognitive ability is a stronger predictor of wages for skilled workers. ?, using data on young men from the US National Education Longitudinal Survey, shows that eighth-grade misbehaviour is important for earnings over and above eighth-grade test scores. ? find that childhood emotional health (operationalised using the same mother-reported Rutter scale we use in the 1970 British cohort study) at ages 5, 10 and 16 is the most important predictor of adult life satisfaction and life course success.
There are only a few studies in economics specifically studying 'non-cognitive' traits and health behaviours. ? and ? are the first to consider three early endowments, including child socio-emotional traits and health in addition to cognition, using rich data from the 1970 British cohort study. They find strong evidence that non-cognitive traits promote health and healthy behaviours, and that not accounting for them overestimates the effects of cognition; additionally, they document that child cognitive traits are more important predictors of employment and wages than socio-emotional traits or early health. ? uses the US Panel Study of Income Dynamics (PSID) and finds that future orientation and self-efficacy (related to emotional stability) are associated with less alcohol consumption and more exercise. ? use the Australian HILDA data and find that an internal locus of control (also related to emotional stability, perceived control over one's life) is related to better health behaviours (diet, exercise, alcohol consumption and smoking). ? use the Longitudinal Study of Young People in England and find that individuals with external locus of control, low self-esteem, and low levels of work ethics are more likely to engage in risky health behaviours.
? construct measures of personality from maternal ratings at 10 and 16 in the British Cohort Study and find that their measure of conscientiousness is positively associated with education and economic outcomes, and negatively associated with body mass index and smoking. ? review the interdisciplinary literature and provide a new analysis of the British Cohort Study, including a particular focus on the role of social and emotional skills (defined using a rich set of measurements of the age 10 sweep) in transmitting 'top job' status between parents and their children. ? show that the association between personality traits and health behaviours also holds in a high-IQ sample (the Terman Sample). ? use, instead, early risky and reckless behaviours to measure socio-emotional endowments, and confirm their predictive power for education, log wages, smoking and health-related work limitations.
Few papers attempt to make cross-cohort comparisons about the importance of socio-emotional skills. ? (one of the closest studies to ours) examine cognitive skills, non-cognitive traits, educational attainment and labour market attachment as mediators of the decline in inter-generational income mobility in the UK between the 1958 and the 1970 cohorts. The authors take great care in selecting non-cognitive items to be as comparable as possible across cohorts, from the Rutter scale at age 10 for the 1970 cohort and from the Bristol Social Adjustment Guide for the 1958 cohort; however, they do not carry out formal tests of measurement invariance and they do not construct factor scores fully comparable across cohorts as we do. Another paper related to ours is the one by ?, who study recent trends in income, racial, and ethnic school gaps in several dimensions of school readiness, including academic achievement, self-control, and externalizing behavior, at kindergarten entry, using comparable data from the Early Childhood Longitudinal Studies (ECLS-K and ECLS-B) for cohorts born from the early 1990s to the 2000-2010 period in the US. They find that readiness gaps narrowed modestly from 1998 to 2010, particularly between high- and low-income students and between White and Hispanic students. ? study the sources of differences in social mobility between the US and Denmark; for the US, they use the antisocial, headstrong and hyperactivity subscales from the Behavior Problem Index (BPI) in the Children of the NLSY79 (CNLSY), while for Denmark they use orderliness/organization/neatness grades from the Danish written exams. 1 They find that, in both countries, cognitive and non-cognitive skills acquired by age 15 are more important for predicting educational attainment than parental income. Lastly, ? uses two sets of skill measures and comparable covariates across survey waves for the NLSY79 and the NLSY97, 2 and finds that the labour market return to social skills was much greater in the 2000s than in the mid-1980s and 1990s. ? examine differences in socio-emotional and cognitive development among 11-year-old children in the UK Millennium Cohort Study and the US Early Childhood Longitudinal Study-Kindergarten Cohort, and find that family resources explain some cross-national differences; however, there appears to be a broader range of family background variables in the UK that influence child development. Importantly, none of these papers making comparisons across countries, cohorts or ethnic groups tests for measurement invariance as we do.
1 As the authors note (footnote 41) "Our measures of non-cognitive skills in the two countries are clearly not equivalent. The Danish measure of non-cognitive skills is more related to an orderliness/effort measure while the US measure is related to behavioral problems".
2 He uses the following four variables as measures of social skills in the NLSY79: self-reported sociability in 1981 and at age 6 (retrospective), the number of clubs in which the respondent participated in high school and participation in high school sports; and the following two variables in the NLSY97: two questions that capture the extroversion factor from the Big 5 Personality Inventory (since measures comparable to the NLSY79 are not available in the NLSY97).
Determinants of socio-emotional traits. Equally flourishing has been the literature on the determinants of child socio-emotional skills, which ranges from reduced-form, correlational or causal estimates, to more structural approaches. One of the first papers, by ?, shows that a variety of family and school characteristics predict classroom behaviour. ? study the intergenerational impacts of maternal education, using data from the NLSY79 and an instrumental variable strategy; they find strong effects in terms of reduction in children's behavioural problems. ? and ? both estimate production functions for child cognitive and socio-emotional development (in the US and Colombia, respectively), and find an important role played by parental investments. ? estimate production functions for child socio-emotional skills (internalising and externalising behaviour) at age 11 in the UK Millennium Cohort Study, and find that the effects of parental inputs which improve the home environment vary as a function both of the level of the inputs themselves and of the development of the child.
Interventions targeting Social and Emotional Learning (SEL) in a school setting have been shown to lead to significant improvements in socio-emotional skills, attitudes, behaviours, and academic performance (?), and a substantial positive return on investments (?); after-school programs have been proved to be equally effective (?).
Additionally, it has been shown that a key mechanism through which early childhood interventions improve adult socioeconomic and health outcomes is by boosting socio-emotional skills, measured as four teacher-reported behavioural outcomes in the project STAR 3 (?), reductions in externalising behaviour (from the Pupil Behavior Inventory) at ages 7-9 in the Perry Preschool Project (??), or improvements in task orientation at ages 1-2 in the Abecedarian Project (?).
In sum, even if the literature on the determinants and consequences of socio-emotional skills has been booming, most papers use skills measured in late childhood or in adolescence; and no paper in economics formally tests for invariance of measurements across different groups and constructs fully comparable scores. In this paper, we use measures of child socio-emotional development at age 5, hence before the start of elementary school; and we construct comparable scales across the two cohorts we study (the 1970 and the 2000 British cohorts), so that we can investigate changes in inequality in early development, their determinants, and consequences, in a parallel fashion.

Data
We use information from two nationally representative longitudinal studies in the UK, which follow the lives of children born approximately 30 years apart: the British Cohort Study (BCS) and the Millennium Cohort Study (MCS). The BCS includes all individuals born in Great Britain in a single week in 1970. The cohort members' families (and subsequently the members themselves) were surveyed on multiple occasions. For this paper we augment the information collected at the five-year survey with data from birth, adolescence (16), and adulthood (30, 38, 42). The MCS follows individuals born in the UK between September 2000 and January 2002. We use the first survey, carried out at 9 months of age, and the sweeps at around 5 and 14 years of age. 4 Our main focus is on socio-emotional skills of children around age five. We take advantage of the longitudinal nature of the cohorts by merging information from surveys before and after age five. From the birth survey, we include information on gestational age and weight at birth, previous stillbirths, parity, maternal smoking in pregnancy, maternal age, height, and marital status. From the five-year survey, we extract maternal education, employment status, and the father's occupation. All the above variables are transformed or recoded to maximise comparability between the two studies. Furthermore, we add some adolescent outcomes such as smoking and BMI, with the caveat that these are surveyed at different ages: 16 in BCS and 14 in MCS. Finally, for the 1970 cohort we also include measures of adult educational attainment, BMI, and income. Variable definitions are available in Table A1.
3 Student's effort, initiative, non-participatory behavior, and how the student is seen to 'value' the class.
4 All data are publicly available at the UK Data Service (??????????).
Ideally, we would compare socio-emotional skills alongside cognitive skills. However, the cognitive tests administered to each cohort have no overlap, even at the item level. We thus use the available cognitive tests in each cohort to estimate simple confirmatory factor models with a single latent dimension, separately by cohort (see Table A1 for the tests used). Unlike the other indicators in our analysis, cognitive skills are thus not comparable across cohorts.
Another complication arises from the fact that, unlike the British Cohort Study, the Millennium Cohort Study has a stratified design. It oversamples children living in administrative areas characterised by higher socioeconomic deprivation and larger ethnic minority populations (?). We rebalance the MCS sample to make it nationally representative by excluding from the analysis a fraction of observations from the oversampled areas, proportionally to their sampling probability. 5 Finally, we also restrict our sample to individuals born in England and to cases where there is complete information on socio-emotional skills at five years of age. The final sample contains 9,545 individuals from the British Cohort Study, and 5,572 from the Millennium Cohort Study. Summary statistics for the full and estimation samples are displayed in Table 1. After the rebalancing step, the MCS estimation sample closely mirrors the full sample in terms of average observable characteristics, thus preserving representativeness.
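The rebalancing step can be sketched as follows, under stated assumptions. Each record carries a hypothetical `sampling_prob` field giving the probability with which its stratum was sampled, and a record is retained with probability equal to the smallest sampling probability divided by its own. The field names, the probabilities and the function `rebalance` are illustrative and are not taken from the MCS documentation or from the exact procedure used in the paper.

```python
import random

def rebalance(records, seed=0):
    """Thin oversampled strata so that every record ends up with the same
    effective inclusion probability (an illustrative scheme)."""
    rng = random.Random(seed)
    base = min(r["sampling_prob"] for r in records)
    kept = []
    for r in records:
        keep_prob = base / r["sampling_prob"]  # 1.0 for the least-sampled stratum
        if rng.random() < keep_prob:
            kept.append(r)
    return kept

# Hypothetical sample: the 'disadvantaged' stratum is sampled twice as often.
sample = (
    [{"stratum": "advantaged", "sampling_prob": 0.01} for _ in range(4000)]
    + [{"stratum": "disadvantaged", "sampling_prob": 0.02} for _ in range(4000)]
)
balanced = rebalance(sample, seed=42)
n_adv = sum(r["stratum"] == "advantaged" for r in balanced)
n_dis = len(balanced) - n_adv
print(n_adv, n_dis)
```

Dropping observations in this way trades sample size for a self-weighting sample, so that later steps such as factor analysis can be run without survey weights.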

Dimensions of socio-emotional skills
Child socio-emotional skills are an unobservable and difficult to measure construct. Over recent years, the measurement of such skills has evolved and, over time, different measures have been used. As we discuss below, this makes the comparison of socio-emotional skills across different groups, assessed with different tools, difficult.
A common approach to inferring a child's socio-emotional development is based on behavioural screening scales. As part of these tools, mothers (or teachers) indicate whether their children exhibit a series of behaviours (the items of the scale). In the British and Millennium Cohort Studies, two different scales were employed. In the BCS, the Rutter A Scale was used (?) while in the MCS mothers were administered the Strengths and Difficulties Questionnaire (SDQ, ??). The SDQ was created as an update to the Rutter scale.
It encompasses more recent advances in child psychopathology, and emphasises positive traits alongside undesirable ones (?). ? administered both scales to a sample of children, and showed that the scores are highly correlated, and the two measures do not differ in their discriminatory ability. The Rutter and SDQ scales are reproduced in Table A2; they have 23 and 25 items, respectively. In the child psychiatry and psychology literatures, the Rutter and SDQ scales are regarded as measures of behavioural problems and mental health. However, in our analysis we follow the economics literature and, after having recoded them accordingly, we interpret them as measures of positive child development (?).
While the Rutter and SDQ scales are similar in their components (since the latter was developed from the former, see ?), there is no a priori reason to expect them to be directly comparable. First, the overlap of behaviours described in the two scales is only partial, given that, by design, the SDQ also includes strengths, in addition to weaknesses. Second, the wording of each item is slightly different, both in the description and in the options that can be selected as answers. Third, the different ordering of the items within each scale might lead to order effects. Fourth, and no less importantly, the interpretation of each behaviour by respondents living 30 years apart (1975 vs 2006) might differ due to a host of evolving societal norms. Nonetheless, the level of comparability of the two scales is higher than that of other scales used in comparative work in the literature reviewed in section 2.
As our goal is to compare socio-emotional skills across the two cohorts, we construct a new scale by retaining the items that are worded in a similar way across the two original Rutter and SDQ scales, and making some slight coding adjustments to maximise comparability. In what follows, we will consider the included items to be the same measure in the two cohorts. The wording of the items we will be using in the analysis is presented in Table 2: we retain 13 items for the BCS (two of them are grouped) and 11 for the MCS with a high degree of comparability. We exclude from the analysis items that were completely different between the two questionnaires to maximise comparability between the two cohorts, as is standard good practice in the psychometric literature (see for example ?). 6 More details on the derivation of the scale are available in Appendix A.
Item-level prevalence by cohort and gender is in Table A3. We see that, in general, there are more similarities across genders within the same cohort, than across cohorts. For the majority of items, there is a lower prevalence of problematic behaviours in the MCS than in the BCS; however, four items (distracted, tantrums, fearful, aches) show a higher prevalence in 2006 than in 1975. Regardless, a simple cross-cohort comparison of item-level prevalence is misleading because of changing perceptions and norms about what constitutes problematic behaviour in children. The analysis in section 5 tackles this issue.
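The point about changing perceptions and norms can be illustrated with a minimal simulation sketch, which is not the estimation procedure used in the paper. It assumes a single binary item generated by thresholding a latent propensity at zero; the function name `simulate_prevalence`, the parameter values and the normal distributions are all illustrative assumptions. Two hypothetical cohorts share the same latent skill distribution and the same loading, but respondents in one cohort report the behaviour more readily (a different intercept).

```python
import random

def simulate_prevalence(intercept, loading, n=20000, seed=0):
    """Share of children for whom a binary item is reported as present.

    Latent propensity: x_star = intercept + loading * theta + u, with
    theta ~ N(0, 1) the latent trait and u ~ N(0, 1) an independent error.
    The item is observed as 1 when x_star > 0 (hypothetical threshold).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        x_star = intercept + loading * theta + u
        hits += x_star > 0
    return hits / n

# Same loading and same latent distribution; only the intercept differs.
prev_a = simulate_prevalence(intercept=-1.0, loading=0.8, seed=1)
prev_b = simulate_prevalence(intercept=-0.5, loading=0.8, seed=1)
print(f"cohort A prevalence: {prev_a:.3f}")
print(f"cohort B prevalence: {prev_b:.3f}")
```

The raw prevalence differs across the two hypothetical cohorts even though the latent trait is identically distributed, which is why a gap in item-level prevalence alone does not identify a gap in the underlying skills.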
In the remainder of this section, we analyse the properties of the new scale. Following a common approach, we proceed in two steps. First, we carry out an exploratory step, where we study the factor structure of our scale. The aim of this step is to examine the correlation between observed measures in a data-driven way, imposing the least possible assumptions. Here, we establish how many latent dimensions of socio-emotional skills the scale is capturing, and which items of our scale are measuring which dimension.
As a second step, we set up a confirmatory factor model. This model fixes the number of latent dimensions, and imposes a dedicated measurement structure, based on the insights obtained in the exploratory step. This is the model to which we apply the measurement invariance analysis of Section 5.

Exploratory analysis
The original Rutter scale, used in the BCS cohort, distinguishes behaviours into two subscales: anti-social and neurotic (?). This two-factor conceptualisation has been validated using data from multiple contexts, and the latent dimensions have been broadly identified as externalising and internalising behaviour problems. 7 The Strengths and Difficulties Questionnaire, used in the MCS cohort, was instead conceived to have five subscales of five items each. The five subscales are: hyperactivity, emotional symptoms, conduct problems, peer problems, and prosocial. This five-factor structure has been validated in many contexts (?); lower-dimensional structures have also been suggested (?). Recent research has shown that there are some benefits to using broader subscales that correspond to the externalising and internalising factors in Rutter, especially in low-risk or general population samples (?). Indeed, the internalising and externalising dimensions were introduced in psychology by ?, who showed that they are the two main factors underlying a wide range of psychological measures; as noted in ?, more than 75,000 articles have been published on internalizing and externalizing problems.
We use exploratory factor analysis (EFA) to assess the factor structure of our new scale, composed of 11 items of the Rutter scale in the BCS and the corresponding items of the SDQ in the MCS. 8 We start by investigating the number of latent constructs that are captured by the scale, using different methods developed in the psychometric literature, and recently adopted by the economics literature. The results are displayed in Table A4. As pointed out in ?, there is relatively little agreement among procedures; this is the case especially for the Rutter items in the BCS data, where different methods suggest retaining between 1 and 3 factors, while most methods suggest retaining 2 factors for the SDQ items in the MCS.
Given the test results, we perform a series of exploratory factor analyses, assuming a one-, two- or three-factor structure, respectively. The results for the 1-factor solution, reported in Table A5, show relatively similar loadings for both males and females across the two cohorts, of slightly bigger magnitude for the last four items in the MCS than in the BCS; thus, we retain the 1-factor solution for the measurement invariance analysis, in the first instance. The results for the 3-factor solution, instead, also reported in Table A5, show a less homogeneous picture: 9 while the magnitude of the loadings is relatively similar across the two cohorts for the first factor, items 3 and 5 only load on the second factor for the MCS, not for the BCS; more importantly, the EFA clearly shows that the third factor only loads on one single item (item number 9, "solitary") for both cohorts. Given that a one-item factor implies that the item perfectly proxies for the factor, we are not able to test for measurement invariance in this case. Hence, the 3-factor solution is not supported by our EFA results. Last, the two-factor EFA is shown in Table A6 and delivers a neat and sensible separation between items: similarly-worded items load on the same factor across the two cohorts, and also the magnitude of the respective loadings (measuring the strength of the association between the item and the factor) is very similar. Following previous research, we name the first dimension Externalising skills (EXT, indicating low scores on the items restless, squirmy/fidgety, fights/bullies, distracted, tantrums, and disobedient) and the second dimension Internalising skills (INT, indicating low scores on the items worried, fearful, solitary, unhappy, and aches). 10

7 See for example ?????. However, in some cases a three-factor structure was found to better fit the data, with the externalising factor separating into two factors seemingly capturing aggressive and hyperactive behaviours (??).
8 Factor-analytic methods have long been used in psychology, and in recent years they have become increasingly popular in economics, especially to meaningfully aggregate high-dimensional items measuring different aspects of common underlying dimensions of human development. The EFA is performed decomposing the polychoric correlation matrix of the items and using weighted least squares, and the solution is rescaled using oblique factor rotation (oblimin). We use the R package psych, version 1.8.4 (?).
9 We do not perform the EFA with 3 factors for males because this solution is never chosen by any test for the number of factors for the MCS, see Table A4.
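As an illustration of how the number of factors can be probed, the following is a minimal sketch of Horn's parallel analysis, one commonly used retention rule. It is not the procedure used in the paper: we do not know whether parallel analysis is among the methods in Table A4, the paper works with polychoric correlations in R, and here we use Pearson correlations on simulated continuous data. The function names, loadings (0.7), and factor correlation (0.3) are all hypothetical choices.

```python
import numpy as np


def parallel_analysis(data, n_sims=100, percentile=95, seed=0):
    """Suggest a number of factors via Horn's parallel analysis.

    Compares the eigenvalues of the observed correlation matrix with those
    obtained from uncorrelated normal data of the same shape.
    """
    rng = np.random.default_rng(seed)
    n_obs, n_items = data.shape
    # Eigenvalues of the observed (Pearson) correlation matrix, descending.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    # Eigenvalues from random data with the same dimensions.
    random_eigs = np.empty((n_sims, n_items))
    for s in range(n_sims):
        noise = rng.standard_normal((n_obs, n_items))
        corr = np.corrcoef(noise, rowvar=False)
        random_eigs[s] = np.sort(np.linalg.eigvalsh(corr))[::-1]
    reference = np.percentile(random_eigs, percentile, axis=0)
    # Count the leading eigenvalues that exceed the random reference.
    exceeds = observed > reference
    n_factors = n_items if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed


def simulate_two_factor_items(n_obs=2000, seed=1):
    """Hypothetical data: items 1-6 load on one factor, items 7-11 on another."""
    rng = np.random.default_rng(seed)
    factor_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
    factors = rng.multivariate_normal(np.zeros(2), factor_cov, size=n_obs)
    loadings = np.zeros((11, 2))
    loadings[:6, 0] = 0.7
    loadings[6:, 1] = 0.7
    noise = rng.standard_normal((n_obs, 11)) * np.sqrt(1 - 0.7**2)
    return factors @ loadings.T + noise


n_factors, eigenvalues = parallel_analysis(simulate_two_factor_items())
print(n_factors, eigenvalues.round(2))
```

With real ordinal items, different retention rules can disagree, as the text notes for the Rutter items, so a sketch like this is best read as one input among several.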

Factor model
After having studied the factor structure underlying the 11 common items in the previous section, we now specify a multiple-group factor analysis model to formally quantify the strength of the relationship between the observed items in our scale and the latent socio-emotional skills, and to test for invariance across cohorts.
We specify two groups of children $c \in \{BCS, MCS\}$, corresponding to the two cohorts. Each individual child is denoted by $j = 1, \dots, N_c$, where $N_c$ is the number of children in cohort $c$. For each child $j$ in cohort $c$, we observe categorical items $X_{ijc}$ with $i = 1, \dots, 11$ corresponding to the eleven maternal reports in Table 2. Following the EFA results above, we specify two models: one in which we assume that each child is characterised by only one latent skills vector, and another in which we assume that each child is characterised by a latent bi-dimensional vector of externalising and internalising socio-emotional skills $\theta_{jc} = (\theta^{EXT}_{jc}, \theta^{INT}_{jc})$. Children are assumed to have a latent continuous propensity $X^*_{ijc}$ for each item $i = 1, \dots, I$. We model this propensity as a function of item- and cohort-specific intercepts $\nu_{ic}$ and loadings $\lambda_{ic}$, and the child's latent skills $\theta_{jc}$, plus an independent error component $u_{ijc}$. The propensity for each item can be written as follows:

$$X^*_{ijc} = \nu_{ic} + \lambda_{ic} \theta_{jc} + u_{ijc} \quad \text{for } i = 1, \dots, 11$$

or more compactly:

$$X^*_{jc} = \nu_c + \Lambda_c \theta_{jc} + u_{jc}$$

We make the common assumption of a dedicated (or congeneric) factor structure, where each measure is assumed to load on only one latent dimension (???). We mirror the structure found in the exploratory factor analysis above, and assume that all items load on one factor for the 1-factor solution (Table A5), and that items 1-6 load exclusively on the externalising factor and items 7-11 on the internalising factor for the 2-factor solution (Table A6). 11 The discrete ordered nature of the observed measures $X_{ijc}$ is incorporated by introducing item- and cohort-specific threshold parameters $\tau_{ic}$ (?). The observed measures as a function of the propensities $X^*$ can then be written as follows:

$$X_{ijc} = s \quad \text{if } \tau_{s,ic} \le X^*_{ijc} < \tau_{s+1,ic}, \quad s = 0, 1, 2$$

with $\tau_{0,ic} = -\infty$ and $\tau_{3,ic} = +\infty$. Notice that we recode all ordered items to have higher values for better behaviours, so that our latent vectors can be interpreted as favourable skills and not behavioural problems. 12

10 Internalising and externalising dimensions emerge from the exploratory step on our novel 11-item scale. Appendix B performs the same exploratory steps on the full set of Rutter items in BCS and SDQ items in MCS. It confirms that the items we select for our subscale have a broadly consistent covariance structure even when factor-analysed with the others in their original scales. Appendix C considers the robustness of our results to the exclusion of the items of the scale that perform most poorly.
11 The dedicated factor structure in the two-factor case corresponds to a sparse loading matrix, i.e.:

$$\Lambda_c = \begin{pmatrix} \lambda_{1c} & 0 \\ \vdots & \vdots \\ \lambda_{6c} & 0 \\ 0 & \lambda_{7c} \\ \vdots & \vdots \\ 0 & \lambda_{11c} \end{pmatrix}$$

12 The model implies the following expression for the mean and covariance structure of the latent propensities:

$$E[X^*_{jc}] = \nu_c + \Lambda_c E[\theta_{jc}], \qquad \Sigma_c = \Lambda_c \, \mathrm{Var}(\theta_{jc}) \, \Lambda_c' + \Psi_c$$

where $\Psi_c$ is the covariance matrix of the error terms. As per the traditional factor analysis approach, we impose a normal distribution on the latent skills and error terms. Recent work has also used mixtures of normals for the latent factors distribution, e.g. ?.

Measurement invariance
- For all groups, normalise a reference loading to 1 for each factor.
- Set invariant across groups one threshold per item (e.g. $\tau_{0,Ai} = \tau_{0,Bi}$ for two groups A and B), and an additional threshold in the reference items above.
- In the first group:
  - Set all intercepts $\nu$ to zero.
The first two parameterisations (WEΔ and WEΘ) normalise the mean and variance of factors to the same constants in both groups, and they leave all loadings and thresholds to be freely estimated; they only differ in whether the additional required normalisation is imposed on the variances of the error terms (Ψ) or on the diagonal of the covariance matrix of the measures (Σ). The MT parameterisation instead proceeds by identifying parameters in one group first, and then imposing cross-group equality constraints to identify parameters in other groups (?). Still, all of these parameterisations are statistically equivalent. The measurement invariance analysis in this paper is based on the Theta parameterisation (WEΘ), but results are independent of this choice. The restrictions in (4.1), (4.2), and (5.1) define the so-called configural model.
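To make the measurement model concrete, the sketch below simulates one three-category item from a latent propensity and a pair of finite thresholds. It is only an illustration under assumed values: the loading (0.8), intercepts (0.0 and 0.4), thresholds, and function names are hypothetical and not estimates from the paper. The two simulated groups share skills, loading, and thresholds and differ only in the intercept, which is enough to change the observed category shares.

```python
import numpy as np


def latent_propensity(theta, nu, lam, u):
    """X* = nu + lam * theta + u for a single item."""
    return nu + lam * theta + u


def discretise(x_star, tau):
    """Map propensities to ordered categories 0, 1, ..., len(tau).

    tau holds the finite thresholds (tau_1, tau_2); tau_0 = -inf and
    tau_3 = +inf are implicit. Category s is assigned when
    tau_s <= X* < tau_{s+1}.
    """
    return np.searchsorted(np.asarray(tau), x_star, side="right")


rng = np.random.default_rng(0)
n_children = 5000
theta = rng.standard_normal(n_children)  # latent skill (one dimension)
u = rng.standard_normal(n_children)      # independent error component
tau = (-1.0, 0.5)                        # hypothetical thresholds

# Same skills, loading and thresholds; only the intercept differs.
items_a = discretise(latent_propensity(theta, nu=0.0, lam=0.8, u=u), tau)
items_b = discretise(latent_propensity(theta, nu=0.4, lam=0.8, u=u), tau)

shares_a = np.bincount(items_a, minlength=3) / n_children
shares_b = np.bincount(items_b, minlength=3) / n_children
print(shares_a, shares_b)
```

Because the children in both groups have identical latent skills by construction, any difference between the two sets of category shares comes from the intercept alone.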

Nested models
Any comparison between socio-emotional skills across the two cohorts requires that the measures at our disposal have the same relationship with the latent constructs of interest in both cohorts. In other words, the items in our new scale must measure socio-emotional skills in the same way in the BCS and MCS data.
In the framework of factor analysis, measurement invariance is a formally testable property. In this paper, we follow the recent identification methodology by ?. The configural model defined above in section 5.1 serves as the starting point. Measurement invariance is then assessed by comparing the configural model to a series of hierarchically nested models. These models place increasing restrictions on the item parameters, constraining them to be equal across groups. Their fit is then compared to that of the configural model. Intuitively, if the additional cross-group restrictions have not significantly worsened model fit, one can conclude that a certain level of invariance is achieved.
In the case where the available measures are continuous, MI analysis is straightforward (?). The hierarchy of the nested models usually proceeds by testing loadings first, and then intercepts (to establish metric and scalar invariance -see ?). Invariance of systems with categorical measures, such as the scale we examine in this paper, is less well understood. In particular, the lack of explicit location and scale in the measures introduces an additional set of parameters compared to the continuous case (thresholds τ). This makes identification reliant on more stringent normalisations. A first comprehensive approach for categorical measures was proposed by ?. New identification results in ? indicate that, in the categorical case, invariance properties cannot be examined by simply restricting one set of parameters at a time. This is because the identification conditions used in the configural baseline model, while being minimally restrictive on their own, become binding once certain additional restrictions are imposed. In light of this, they propose models that identify structures of different invariance levels. They find that some restrictions cannot be tested alone against the configural model, because the models they generate are statistically equivalent. This is true of loading invariance, and also of threshold invariance in the case when the number of categories of each ordinal item is 3 or less. Furthermore, they suggest that comparison of both latent means and variances requires invariance in loadings, thresholds, and intercepts. A summary of the approach by ? is available in Table 3.
Consider some examples from our application. A loading and threshold invariance model restricts every item's loading λ and threshold τ parameters to have the same value in the two cohorts. It assumes that the items in our scale have the same relationship with latent skills across the two cohorts. In other words, items have the same salience, or informational content relative to skills. If this model fits as well as the configural model, we can be confident that the socio-emotional skills of children in the two cohorts can be placed on the same scale, and their variances can be compared. To see why, consider equation (4.1). If the loading matrix Λ is the same across cohorts, any difference in latent skills Δθ will correspond to the same difference in latent propensities ΔX*. Equality of thresholds τ ensures that propensities X* map into observed items X in the same way. A loading, threshold, and intercept invariance model additionally restricts every item's intercept ν across cohorts. A good relative fit of this model indicates that socio-emotional skills can be compared across cohorts in terms of their means as well. To see why, consider the following. Since the λ and ν parameters are the same across cohorts, a child in the BCS cohort with a given level of latent skills θ will have the same expected latent item propensities X* as a child with the same skills in the MCS cohort. Again, equality of thresholds τ fixes the mapping between X* and X. 13 We estimate the sequence of models detailed in Table 3

Measurement invariance results
Comparison of χ² values across models is a common likelihood-based strategy. However, tests based on Δχ² are known to display high Type I error rates with large sample sizes and more complex models such as our own (?). In fact, for all invariance levels in our application a chi-squared difference test would point to a lack of measurement invariance. The use of approximate fit indices (AFIs) is therefore recommended alongside χ². While these indices successfully adjust for model complexity (?), they do not have a known sampling distribution. This makes it necessary to rely on simulation studies, which derive rules of thumb indicating what level of ΔAFI is compatible with invariance.
Again, just like in the broader context of measurement invariance, most evidence regarding the performance of AFIs pertains to scenarios with continuous measures. The root mean squared error of approximation (RMSEA) and the Tucker-Lewis index (TLI) are traditionally the most used AFIs in empirical practice.
Simulation evidence by ? shows that these indices can show correlation between overall and relative fit, and suggests relying on additional indices, such as the comparative fit index (CFI, ?), McDonald non-centrality index (MFI, ?), and Gamma-hat index (?). Subsequent simulation studies, e.g. ? and ?, have updated these thresholds for the continuous case. In particular, ? shows in two Monte Carlo studies that the standardised root mean square residual (SRMR) is more sensitive to lack of invariance in factor loadings than in intercepts or residual variances, while the CFI and RMSEA are equally sensitive to all three types of lack of invariance; he suggests the following thresholds for rejecting measurement invariance: ΔRMSEA > .015. However, it is not advisable to directly extrapolate rules of thumb derived from simulations with continuous measures to the categorical case (?). Recent studies have advanced the simulation-based evidence on the performance of AFIs in measurement invariance analysis with categorical measures. ? find that the cutoffs from ? might not generalise well to problems estimated by WLSMV, but this is mostly confined to smaller sample sizes and detection of small degrees of non-invariance. More recently, ? find that a ΔRMSEA threshold of .010 is appropriate for testing equality of slopes and thresholds when the sample size is large, like in our case.
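The ΔAFI comparison described above can be sketched as follows. This is a minimal illustration using textbook single-group formulas for RMSEA and CFI; estimation software may apply multi-group or estimator-specific corrections that are not reproduced here. All χ² values, degrees of freedom, and the sample size are hypothetical and are not taken from Tables A10 or A11.

```python
import math


def rmsea(chi2, df, n_obs):
    """Textbook RMSEA: sqrt(max(chi2 - df, 0) / (df * (n_obs - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n_obs - 1)))


def cfi(chi2, df, chi2_baseline, df_baseline):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2 - df, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model, 0.0)
    return 1.0 if d_baseline == 0 else 1.0 - d_model / d_baseline


def delta_afi(restricted, configural):
    """Change in each index when moving from the configural to the restricted model."""
    return {name: restricted[name] - configural[name] for name in configural}


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
n_obs = 10001
baseline = {"chi2": 20100.0, "df": 100}
configural = {"chi2": 1080.0, "df": 80}
restricted = {"chi2": 1290.0, "df": 90}

fit = {}
for label, model in (("configural", configural), ("restricted", restricted)):
    fit[label] = {
        "rmsea": rmsea(model["chi2"], model["df"], n_obs),
        "cfi": cfi(model["chi2"], model["df"], baseline["chi2"], baseline["df"]),
    }

changes = delta_afi(fit["restricted"], fit["configural"])
print(fit, changes)
```

In this made-up example the restricted model raises RMSEA by roughly .001, which would sit below a ΔRMSEA cutoff of .010 or .015; whether such cutoffs are appropriate depends on the simulation evidence discussed in the text.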
In any case, we present a range of fit indices to provide a more complete assessment of measurement invariance. We present the measurement invariance results for the 1-factor model in Table A10, and those for the 2-factor model in Table A11. First, by comparing the fit of each nested model across the 1-factor and the 2-factor models, it is clear that the 1-factor model fits the data significantly worse than the 2-factor model, according to all the criteria considered. 15 Hence, from now on we adopt the two-factor solution, which is also consistent with the child psychology literature cited above: as mentioned above, we name the two factors externalising and internalising skills. We now examine the measurement invariance properties of our chosen two-factor solution in greater detail. Looking at Panel A of Table A11, we see that the overall fit of the configural model for the chosen 2-factor solution is satisfactory according to all indices, with CFI around .95 and RMSEA just above .05. As expected, given our large sample size, χ²-based tests reject measurement invariance at all levels. The model with restricted thresholds and loadings exhibits a comparable fit to the configural model, according to all the AFIs. In particular, the ΔAFIs fall within the ranges suggested in ?, ? and ?; see also ? for a review of updated guidelines for measurement invariance.
Invariance of loadings and thresholds across cohorts implies that the items in our scale are equally salient in their informational content, and that the latent propensities have equal mapping into the observed items.
However, further restricting intercepts results in a model where invariance is rejected across the board.
In other words, intercept parameters in our model (ν) are estimated to be different between maternal reports in the British and Millennium Cohort Studies. This means that, for a given level of latent skills, mothers in MCS tend to assess behaviours differently from mothers in BCS. Thus, cohort differences in scores on our scale cannot be unequivocally interpreted as differences in the underlying skills, since they might also reflect differences in reporting. 16 This is an important finding, which has to our knowledge never been acknowledged in the economic literature. How can this lack of comparability be explained? A possible interpretation is connected with secular evolution of social and cultural norms about child behaviours. For example, commonly held views of what constitutes a restless, distracted, or unhappy child might have changed between 1975 and 2006. 17 To summarise, our measurement invariance analysis shows partial comparability of socio-emotional skills across cohorts. In particular, the variance of skills can be compared across cohorts, but mean cohort differences do not necessarily reflect differences in skills.
We can use scores from our scale to compare children within the same cohort-gender group, but not across cohorts. However, we can also compare within-cohort differences between groups of children, across cohorts. As an example, consider two groups of children A and B in the BCS cohort, and two groups of children C and D in the MCS. We cannot compare the mean level of skills between groups A and C, but we can compare the mean difference between groups A and B with the mean difference between groups C and D. This is the approach we take for the rest of the paper. Refraining from direct cross-cohort comparisons, we interpret the significance and magnitude of within-cohort differences across the cohorts.

15 It is worth noting that only threshold and loading invariance can be established in the 1-factor case as well, i.e. intercept invariance is never achieved.
16 We do not present fit results for the threshold-only invariance model, as it is statistically equivalent to the configural model and thus its fit is mathematically the same; see Table 3 in ?. The ages at which socio-emotional skills are observed vary slightly between BCS and MCS, due to different sampling and fieldwork schedules. In the MCS cohort, the age distribution has significantly higher variance. In Panel B of Table A11, we restrict the sample to 59 to 61 months, where the overlap between BCS and MCS is maximised. In Panel C, we repeat the analysis with the full sample, but excluding the poorest-performing items (5 and 11); see Appendix C for details. In Panels A and B of Table A12, we restrict to male and female children respectively. In all these cases, invariance of thresholds and loadings is confirmed, but invariance of intercepts is rejected. We can thus rule out that the lack of intercept invariance comes from differences in ages or non-invariance across child gender.

Results
Parameter estimates from our factor model are presented in Table A13. As discussed in the previous section, loadings and thresholds are constrained to have the same value across groups. Intercepts are normalised to zero, and error variances to one, for the reference group, males in the BCS cohort. We use the estimates from this model to predict a score for each child in our sample along the latent externalising and internalising socio-emotional skill dimensions. 18 We plot the distribution of the scores in Figure 1. The unit of measurement is standard deviations of the distribution in the subsample of males in the BCS. Given our measurement invariance results in section 5, we stress that the location of these scores should not be directly compared across cohorts. However, the shape of the distribution can be given a cross-cohort interpretation.
This result is in sharp contrast with what is shown by the simple distribution of sum scores in A1: using raw scores we see an increase in mass only at the top of the distribution, while the factor scores clearly show that there is more mass in both tails of the distribution of the 2000 than of the 1970 cohort.

Inequality in socio-emotional skills
We find that, both unconditionally and for specific groups, inequality in socio-emotional skills at age five has increased between 1975 and 2005/6. Table 4 shows unconditional inequality statistics, using quantile differences in the distribution of skills by gender and cohort. With the exception of internalising skills in female children, all distributions have widened substantially between the BCS and MCS cohorts. The gap for both externalising and internalising skills between the 90th and the 10th percentiles for males has increased by approximately half a standard deviation. The increase in the gap is more pronounced in the bottom half of the distribution. For females, we see a narrowing at the top (90-50), but a widening at the bottom (50-10) of the distribution, again for both externalising and internalising skills.
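The quantile-difference statistics used in this paragraph can be computed along the following lines. This is a minimal sketch on simulated scores, not the paper's data: the function names are hypothetical, and the "later cohort" is constructed by stretching the lower half of the earlier distribution so that inequality rises only at the bottom.

```python
import numpy as np


def quantile_gaps(scores):
    """Return the 90-10, 90-50 and 50-10 percentile gaps of a score vector."""
    p10, p50, p90 = np.percentile(scores, [10, 50, 90])
    return {"90-10": p90 - p10, "90-50": p90 - p50, "50-10": p50 - p10}


def gap_changes(scores_early, scores_late):
    """Change in each gap between an earlier and a later cohort."""
    early, late = quantile_gaps(scores_early), quantile_gaps(scores_late)
    return {name: late[name] - early[name] for name in early}


# Hypothetical scores: the later cohort has a longer lower tail.
rng = np.random.default_rng(0)
early = rng.standard_normal(10_000)
late = np.where(early < 0, 1.5 * early, early)

changes = gap_changes(early, late)
print(changes)
```

Because the gaps are differences of quantiles within a cohort, they are unaffected by adding a constant to every score in that cohort, which is why they remain interpretable when only the scale, and not the location, of the scores is comparable across cohorts.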
Inequality has also increased conditional on socioeconomic status. Figure 2 shows mean skills by maternal education. We compare mothers who continued education with mothers who left school at the minimum compulsory leaving age, according to their year of birth. Given the lack of comparability in the level of skills across cohorts, we normalise the mean in the 'Compulsory' group to zero for both cohorts. For both males and females, and for both externalising and internalising skills, the difference in the socio-emotional skills of their children between more and less educated mothers has increased. The size of the increase is around .1 to .15 of a standard deviation. The increase is particularly pronounced for males, for whom it goes from .20 to .30 for externalising and from .12 to .24 for internalising. We then examine the same patterns as in the previous figures, but conditional on other family background indicators. The aim is to disentangle the relative contribution of each indicator to socio-emotional skills, and how it has changed in the thirty years between the two cohorts. Table 5 shows coefficients from linear regressions of socio-emotional skills at five on contemporaneous and past socioeconomic indicators, by cohort and gender. Coefficients for indicators in BCS and MCS are presented side by side, together with the p-value of the hypothesis that coefficients are the same in the two cohorts. 20 Overall, the importance of maternal socioeconomic status (education and in particular employment) in determining socio-emotional skills has increased from the BCS to the MCS children. The 'premium' in skills for children of better educated and employed mothers is significantly larger, for both boys and girls, internalising and externalising skills. At the same time, the penalty for having a blue-collar father, or not having a father figure at all in the household, has significantly declined across the two cohorts, especially for girls. 
Being born to an unmarried mother, and to a mother who smoked during pregnancy, is associated with a higher penalty for both dimensions of socio-emotional skills in the latter cohort. 21 Children of non-white ethnicity have worse internalising and externalising skills in the MCS, a penalty almost absent in the BCS (where the prevalence of non-white children was much lower). Firstborn boys and girls in the BCS have worse skills, but this difference disappears in the MCS. Lastly, we document an increase in the returns to birth weight, which is more pronounced for boys.
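The logic of comparing within-cohort gaps, as done for the maternal-education comparison in Figure 2, can be sketched as below. The scores are made up for illustration (four children per cohort, with group means chosen to echo a gap of .20 and .30), and the function name is hypothetical. The last step checks that a cohort-wide shift in the location of scores leaves the within-cohort gap unchanged.

```python
import numpy as np


def normalised_gap(scores, is_reference):
    """Mean score of the comparison group after centring on the reference group."""
    scores = np.asarray(scores, dtype=float)
    is_reference = np.asarray(is_reference, dtype=bool)
    centred = scores - scores[is_reference].mean()
    return centred[~is_reference].mean()


# Hypothetical factor scores for four children per cohort.
# True marks children whose mother left school at the compulsory age.
reference = np.array([True, True, False, False])
bcs_scores = np.array([-0.1, 0.1, 0.1, 0.3])
mcs_scores = np.array([-0.2, 0.2, 0.2, 0.4])

gap_bcs = normalised_gap(bcs_scores, reference)
gap_mcs = normalised_gap(mcs_scores, reference)
change_in_gap = gap_mcs - gap_bcs

# Shifting every MCS score by a constant leaves the within-cohort gap unchanged.
gap_mcs_shifted = normalised_gap(mcs_scores + 5.0, reference)
print(gap_bcs, gap_mcs, change_in_gap, gap_mcs_shifted)
```

A uniform shift is a simplification of intercept non-invariance, which in the factor model is item-specific; the sketch only conveys why differences within a cohort are more robust than levels across cohorts.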
These changes in the relative importance of pregnancy factors and family background characteristics for child socio-emotional skills at age 5 need to be interpreted in the light of the significant changes in the prevalence of such characteristics across cohorts. As shown in Table 1, the age of the mother at birth, and the proportions of mothers who did not smoke in pregnancy, had post-compulsory education, and were in employment when the child was aged 5, have substantially increased; at the same time, the proportion of households with no father figure has increased, and the proportion of women unmarried at birth is much higher in the 2000 than in the 1970 cohort. Also, as noted, the ethnic structure of the population has changed, with a higher proportion of non-white children in the MCS than in the BCS. In general, this has been a period of significant societal changes, with an almost continual rise in the proportion of women in employment, an older age at first birth and a rise in dual-earner families (?).
Hence, we lastly attempt to disentangle whether and to what extent the observed changes in inequality in socio-emotional skills across the two cohorts can be attributed to changes in returns (or penalties) to characteristics such as maternal education, or to compositional changes. To this end, we use the method recently developed by ? 22 as an extension of the Oaxaca-Blinder (OB) decomposition to any distributional measure, which we apply here for the first time to changes in inequality in early childhood development. This two-stage procedure first decomposes distributional changes into a 'composition effect' and a 'coefficient effect' using a reweighting method; then it further divides these two components into the contribution of each explanatory variable, using Recentered Influence Function (RIF) regression (?).
Following ?, we first perform an OB decomposition using the BCS sample and the counterfactual sample (BCS reweighted to be as MCS) 23 to get the pure composition effect, using the BCS as reference coefficients.
The total unexplained effect in this decomposition corresponds to the specification error, and allows us to assess the importance of departures from the linearity assumption. Second, we perform the decomposition using the MCS sample and the counterfactual sample, to obtain the pure coefficient effect (the 'unexplained' part); the explained effect in this decomposition corresponds to the reweighting error, which allows us to assess the quality of the reweighting.

20 [...] are extremely similar to the linear estimates in Table 5, and are available from the authors upon request.
21 It is important to underscore that there has been a significant rise in cohabitation between 1975 and 2006. It is likely that unmarried mothers in the two cohorts have very different characteristics. The choice of this indicator is due to the absence of information on cohabitation in the birth survey for the BCS cohort.
22 See also ? for a recent survey of decomposition methods in economics.
23 We use a logit model to construct the weights and the post-double selection lasso (?) to select the covariates, among the set of the baseline variables in Table 1 and their pairwise interactions.
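The RIF regression step of the procedure can be sketched as follows for a single quantile. This is a minimal illustration under assumptions of our own: the density at the quantile is estimated with a Gaussian kernel and a rule-of-thumb bandwidth, the data are simulated with one binary covariate, and the function names are hypothetical. The reweighting stage and the covariate selection described in the text are not reproduced.

```python
import numpy as np


def gaussian_kde_at(y, point):
    """Gaussian kernel density estimate at one point (rule-of-thumb bandwidth)."""
    y = np.asarray(y, dtype=float)
    bandwidth = 1.06 * y.std(ddof=1) * y.size ** (-1 / 5)
    z = (y - point) / bandwidth
    return np.exp(-0.5 * z**2).mean() / (bandwidth * np.sqrt(2 * np.pi))


def rif_quantile(y, tau):
    """Recentered influence function of the tau-th quantile, per observation."""
    y = np.asarray(y, dtype=float)
    q = np.quantile(y, tau)
    density = gaussian_kde_at(y, q)
    return q + (tau - (y <= q)) / density


def rif_regression(y, covariates, tau):
    """OLS of the quantile RIF on covariates (an intercept is added)."""
    design = np.column_stack([np.ones(len(y)), covariates])
    coefs, *_ = np.linalg.lstsq(design, rif_quantile(y, tau), rcond=None)
    return coefs


# Hypothetical data: one binary background characteristic.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=5000)
y = 0.3 * x + rng.standard_normal(5000)

rif_10 = rif_quantile(y, 0.10)
coefs_10 = rif_regression(y, x, 0.10)
print(rif_10.mean(), np.quantile(y, 0.10), coefs_10)
```

The sample mean of the RIF is close to the quantile itself, so an OLS regression of the RIF on covariates splits that quantile into covariate contributions. For a quantile difference such as 90-10, the RIF is the difference between the two quantile RIFs.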
In Figure 5 we present the results of the RIF decomposition for changes in five measures of inequality in socio-emotional skills for the boys, both externalising (top figure) and internalising (bottom figure). The results indicate that different factors explain the rise in inequality in the two skills: on the one hand, compositional changes explain, on average, half of the cross-cohort increase in inequality in externalising skills, regardless of the measure considered; 24 on the other hand, the increase in inequality in internalising skills seems to be entirely explained (even over-explained) by changes in returns (or penalties) to background characteristics. Composition and coefficient effects are further decomposed in the contribution of each covariate, and the results presented in Table A14 and in Table A15. We see in Table A14 that mother's age and marital status at birth are the two variables that best account for the compositional changes, driving the increase in inequality in externalising skills among the boys for the quantile differences and the variance, respectively. This is hardly surprising, given that we have seen in Table 1 that the average age of the mother at birth has increased by approximately three years (from 26 to 29 years old), and that the proportion of unmarried mothers has increased dramatically, from 5% in the BCS to 36% in the MCS. The baseline covariates, instead, do a less impressive job at explaining the changes in coefficients underlying the increase in inequality in internalising skills (Table A15, note the changes in returns to maternal employment go in the direction of reducing inequality). This can be partly explained by the fact that, due to lack of comparable measures across cohorts, we have been unable to account for important determinants of a child's internalising behaviour, such as for example maternal mental health. 
We also notice that, for the quantile differences 75-25 and 90-50, the composition effect is significant but negative; in other words, compositional changes linked to maternal marriage status would have led to a reduction in inequality, especially at the top of the distribution. Reassuringly, both the specification and the reweighting error are not significantly different from zero. Lastly, the results are not so clear-cut for the girls, who experienced a more muted increase in inequality, concentrated at the bottom of the distribution. The RIF results displayed in Table A16 show that no single contributing factor emerges.

Socio-emotional skills and adolescent/adult outcomes
In this last section, we study the predictive power of socio-emotional skills for adolescent and adult outcomes, to gain some insights as to whether inequality in the early years could translate into later life inequalities. We contribute to a vast interdisciplinary literature by examining medium-and long-term impacts of skills measured at an earlier age than in previous studies, i.e. well before the start of formal education.
Showing that these early skills are predictive of different later outcomes across various domains provides a key rationale for the role of early intervention in reducing life course inequalities. In practice, we proceed by regressing health and socioeconomic outcomes measured in adolescence and adulthood on the socio-emotional skills scores at age five obtained by our factor model, controlling for the harmonised family background variables at birth and age five (see Table A1). 25 We present results with and without controlling for cognitive skills. As detailed in Section 3, the available cognitive measures are not comparable across cohorts. Still, we control for a factor score that summarises all information on cognitive skills that is available in each cohort, regardless of their comparability.

24 The coefficient effects are also sizeable, but imprecisely estimated, with the exception of the variance component.
Socio-emotional skills at five years of age are predictive of adolescent health behaviour and outcomes in both cohorts. 26 Table 6 examines adolescent smoking and BMI for both cohorts; Table A19 reports the results for the same outcomes in adulthood (at age 42), for the BCS only. Externalising skills are negatively correlated with subsequent smoking and BMI in both cohorts, for both genders. Recall that a child with high externalising skills exhibits less restless and hyperactive behaviour, and has less anti-social conduct. Our findings are consistent with the body of evidence reviewed in section 2, which shows that better socio-emotional skills (measured using different scales and at various points during childhood and adolescence) are negatively associated with smoking. At the same time, internalising skills are positively correlated with smoking (only in the 1970 cohort) and BMI (only for girls), although less strongly than externalising skills.
This apparently counterintuitive result makes sense in light of the items in our internalising scale shown in Table 2. A child with better internalising skills is less solitary, neurotic, and worried. From this perspective, he/she is likely more sociable and subject to peer influence in health behaviours. This is consistent with the evidence in ?, who find a positive association between child emotional health (measured with items from the internalizing behaviour subscale of the Rutter scale at age 10 in the BCS) and smoking at age 42.
Furthermore, in recent work ? have shown personality to be a key mechanism through which peers affect smoking behaviour. We have also tested the robustness of these findings by jointly estimating by maximum likelihood the measurement system (with the partial invariance constraints) and the two outcome equations for smoking and BMI. The results, presented in Table A18, are qualitatively similar to those obtained with the two-step method. 27 Conditional on socio-emotional skills, cognition has limited predictive power for these behaviours, and only for girls. 28 This is in line with the evidence in ?, who show that not accounting for non-cognitive traits (in their paper, a self-regulation factor measured at age 10) overestimates the importance of cognition for predicting health and health behaviours, using data from the British cohort study. Along the same lines, ? use rich data on child personality and socio-emotional traits collected at ages 7, 11 and 16 in the 1958 British birth cohort, 29 and show that these traits rival the importance of cognition in explaining the education gradient in health behaviours (including smoking and BMI). We show that child socio-emotional skills have greater predictive power than cognition for health outcomes and behaviours even when measured at an earlier age than in previous work.

25 In tables A17 and A19, we show that the conclusions in this section are not sensitive to the factor scoring methodology used. 'Raw' scores, obtained by a simple unweighted average of the item categories in the 11-item subscale have basically equal predictive power to factor scores.
26 Unfortunately the strength of the association cannot be directly compared, since the outcomes are measured at different ages: 16 and 14 years for BCS and MCS, respectively.
27 Note that the magnitudes are not exactly comparable because in one-step ML estimation the residual variances of the nonbinary measurements also need to be fixed for identification. The remaining parameter estimates are also very similar to those in Table A13 and available from the authors upon request.
28 We do not observe significant associations between early socio-emotional skills and other risky behaviours like drug-taking and alcohol consumption. One possible reason might be the relatively young age at which these skills are measured. Results are available upon request.
29 They use the Rutter scale and the Bristol Social Adjustment Guide.
Cohort members from the British Cohort Study are now well into their adulthood. For this cohort, we can examine the association between socio-emotional skills at age five and adult education and labour market outcomes. The structure of Table 7 is similar to Table 6, but it considers educational achievement, employment, and earnings (conditional on being in paid employment) for the BCS cohort members. For these outcomes, the predictive power of cognitive skills outweighs that of socio-emotional skills, which are only predictive of educational attainment, and whose predictive power for males is driven to insignificance after controlling for cognition. This is consistent with the evidence in ?, who show that cognitive endowments at age 10 are more predictive (than socio-emotional and health ones) for employment and wage outcomes in the BCS. Again, we show that the greater predictive power of cognition for socioeconomic outcomes holds even when considering earlier-life measures of child development.

Conclusion
In this paper we have studied inequality in a dimension of human capital which has received less attention [...]

Second, we have formally tested for measurement invariance across the two cohorts (for each gender) of the 11 items comprising the two externalising and internalising scales, following recent methodological advances in factor analysis with categorical outcomes. We have found only partial support for measurement invariance, with the implication that we have only been able to compare how inequality in these socio-emotional skills has changed across the two cohorts, but not whether their average level is higher or lower in one of them. These results sound a warning to research in this area which routinely compares levels of skills across different groups (at different times, or of different gender), without first establishing their comparability.
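A stylised one-factor sketch may help to see why partial invariance can permit comparisons of dispersion but not of levels. The notation is assumed here for illustration and is not the paper's own specification, which has two factors and categorical items: $x^{*}_{ijc}$ is the latent response of child $i$ to item $j$ in cohort $c$, $\nu_{jc}$ an item intercept allowed to differ by cohort, $\lambda_{j}$ a loading held equal across cohorts, and $\theta_{ic}$ the skill with mean $\mu_{c}$ and variance $\sigma^{2}_{c}$.

```latex
\begin{align}
x^{*}_{ijc} &= \nu_{jc} + \lambda_{j}\,\theta_{ic} + \varepsilon_{ijc}, \\
\mathbb{E}\left[x^{*}_{ijc}\right] &= \nu_{jc} + \lambda_{j}\,\mu_{c}, \\
\operatorname{Var}\left[x^{*}_{ijc}\right] &= \lambda_{j}^{2}\,\sigma^{2}_{c} + \operatorname{Var}\left[\varepsilon_{ijc}\right].
\end{align}
```

Under these assumptions, a cohort difference in the mean of an item cannot be attributed separately to $\nu_{jc}$ or to $\mu_{c}$, so average skill levels are not comparable. The skill variance $\sigma^{2}_{c}$ enters only through the common loading $\lambda_{j}$, so its change across cohorts can still be assessed once the residual variances are fixed for identification.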
Third, after having computed comparable scores for both externalising and internalising skills, and for both boys and girls, we have compared how inequality in these skills has changed across the 1970 and the 2000 cohorts. We have documented for the first time that inequality in these early skills has increased, especially for boys. The cross-cohort increase in the gap is more pronounced at the bottom of the distribution (50-10 percentiles). We have also documented changes in conditional skills gaps across cohorts. In particular, the difference in the socio-emotional skills of their children between mothers of higher and lower socio-economic status (education and employment) has increased. The increase in cross-cohort inequality is even starker when comparing children born to mothers who smoked during pregnancy. On the other hand, the skills penalty arising from the lack of a father figure in the household has substantially declined. Moreover, we have formally decomposed the increase in inequality into compositional changes and changes in returns to maternal characteristics, providing the first child development application of the method recently developed by ?. We have found that half of the increase in inequality in externalising skills across cohorts can be explained by compositional changes, with maternal age and marital status at birth being the most important factors; on the other hand, the increase in inequality in internalising skills seems to be entirely driven by changes in returns to maternal characteristics.
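The inequality measure referred to above is a difference between quantiles of the score distribution (for example the 50-10 gap), reported with bootstrap confidence intervals. The sketch below illustrates that calculation on simulated placeholder scores. The helper names, the interpolation rule for percentiles and the percentile-method interval are assumptions. It resamples the scores only and does not re-estimate the factor model in each replication.

```python
import random


def percentile(values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a list of numbers."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def quantile_gap(values, upper, lower):
    """Difference between two percentiles, e.g. upper=50, lower=10."""
    return percentile(values, upper) - percentile(values, lower)


def bootstrap_ci(values, upper, lower, reps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the quantile gap."""
    rng = random.Random(seed)
    n = len(values)
    gaps = []
    for _ in range(reps):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        gaps.append(quantile_gap(sample, upper, lower))
    return (percentile(gaps, 100 * alpha / 2),
            percentile(gaps, 100 * (1 - alpha / 2)))


# Simulated placeholder scores, NOT data from either cohort.
sim = random.Random(1)
scores = [sim.gauss(0.0, 1.0) for _ in range(500)]
gap_50_10 = quantile_gap(scores, 50, 10)
ci_low, ci_high = bootstrap_ci(scores, 50, 10)
```

Computing the same gap separately by cohort and gender, and comparing the intervals, mirrors the structure of the comparison described in the text.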
Fourth, we have contributed to the literature on the predictive power of socio-emotional skills by showing that even skills measured at a much earlier age than in previous work are significantly associated with outcomes both in adolescence and adulthood. In particular, socio-emotional skills are more significant predictors of health and health behaviours (smoking and BMI), while cognition has greater predictive power for socioeconomic outcomes (education, employment and wages). Our results ultimately show the importance of inequalities in the early years development for the accumulation of health and human capital across the life course.

BCS wording: Complains of headaches + Complains of stomach-ache or has vomited
MCS wording: Often complains of head-aches, stomach-ache or sickness

Notes: Itm. is item number. Factor is the latent construct to which the item loads: EXT is Externalising skills, INT is Internalising skills. Cat. is the number of categories in which the item is coded: 2 denotes a binary item (applies/does not apply) and 3 denotes a 3-category item. Title is a short label for the item. Wording columns show the actual wording in the scales used in each of the cohort studies. Items denoted by (+) are positively worded in the original scale.

Threshold invariance
· Restricts thresholds τ to be equal across groups
· Statistically equivalent to configural (when measures have 3 categories or less)

Notes: The table shows differences between quantiles of the distributions of socio-emotional skills, by gender and cohort. Bootstrap confidence intervals with 1,000 repetitions are in brackets. The factor scores for socio-emotional skills are estimated using an empirical Bayes modal approach, using the parameter estimates from the factor model in Table A13. These distributions are shown in Figure 1.

Notes: The table shows coefficients from linear regressions of children's socio-emotional skills at five years of age on family background characteristics. The dependent variable is a factor score obtained from the factor model in Section 4. Col. (1) and (2) show coefficients and standard errors in parentheses, for male children in the BCS and MCS cohorts separately. The latter are obtained using 1,000 bootstrap repetitions, taking into account the factor estimation stage that precedes the regression. Col. (3) shows the p-value of a test that the coefficient is the same in the two cohorts. Col. (4) to (6) repeat for female children. Col. (7) to (12) repeat for internalising skills. All estimates additionally control for region of birth, mother height, number of previous stillbirths at child's birth, preterm birth, a dummy for missing gestational age, and number of other children in the household at child age 5. See Table A1 for a description of the variables used. *** p≤0.01, ** p≤0.05, * p≤0.1.

Col. (3) additionally controls for cognitive ability at age five. This is a simple factor score obtained by aggregating the available cognitive measures. All standard errors in parentheses are obtained using 1,000 bootstrap repetitions, taking into account the factor estimation stage that precedes the regression. Col. (4) to (6) repeat for female cohort members. All estimates additionally control for region of birth, maternal education (5), maternal employment (5), father occupation (5), maternal background (age, height, nonwhite ethnicity, number of children in the household), pregnancy (firstborn child, number of previous stillbirths, mother smoked in pregnancy, preterm birth, (log) birth weight). See Table A1 for a description of the variables used. *** p≤0.01, ** p≤0.05, * p≤0.1.

Externalising skills (5) .044** (.022) .024 (.022) .069*** (.020) .053*** (.020) Internalising skills (5

Notes: The figure shows unconditional mean values of socio-emotional skills scores by gender, cohort, and mother's education at age five. Mother's education is a dummy for whether the mother continued schooling past the minimum leaving age, based on her date of birth. The four panels on top present mean and 95% confidence intervals. Given that we cannot compare means of skills, all scores are normalised to take value zero for the 'Compulsory' category, so that the gradient is emphasised. The bottom two panels present the unconditional distribution of mother's education.

Notes: The figure shows unconditional mean values of socio-emotional skills scores by gender, cohort, and mother's pregnancy smoking. Maternal smoking is a dummy for whether the mother reported smoking during pregnancy. The four panels on top present mean and 95% confidence intervals. Given that we cannot compare means of skills, all scores are normalised to take value zero for the 'Non-smoker' category, so that the gradient is emphasised. The bottom two panels present the unconditional distribution of mother smoking status in pregnancy.

Table A14 and Table A15. Bootstrapped standard errors over the entire procedure (500 replications) were used to compute the p-values. *** p≤0.01, ** p≤0.05, * p≤0.1.

Appendices
Appendix A Deriving a common scale of socio-emotional skills
In the BCS, maternal reports on child socio-emotional skills are measured using the Rutter A Scale (?); see Panel A of Table A2. The Rutter items are rated on three levels: 'Does not apply', 'Somewhat applies', 'Certainly applies'. Since they all indicate negative behaviours, we recode all of them in reverse, i.e. 'Certainly applies' = 0, 'Somewhat applies' = 1, 'Does not apply' = 2. We augment the 19-item Rutter Scale with three additional parent-reported questions from the parental questionnaire, items A, B, and D.
These are rated on four levels: 'Never in the last 12 months', 'Less than once a month', 'At least once a month', 'At least once a week'; we recode them into binary indicators, coding 'Never in the last 12 months' and 'Less than once a month' as 1 and the other two categories as 0. To increase comparability between the two scales, we also merge together two pairs of Rutter items: 4 and 19 (to mirror SDQ item 12 "Often fights with other children or bullies them"), and A and B (to mirror SDQ item 3 "Often complains of head-aches, stomach-ache or sickness"); we assign the lower category of the two original items to the newly obtained item. We also recode the three-category Rutter items 5 and 14 to binary to mimic the split in the MCS, where they are worded positively.
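The BCS recoding steps described above (reverse coding, collapsing the frequency items to binary, and merging item pairs by keeping the lowest category) can be sketched as follows. The dictionary and function names are hypothetical, the responses are invented for illustration, and missing responses are not handled here.

```python
# Reverse-coded Rutter categories: a higher value means a higher level of skills.
RUTTER_REVERSED = {
    "certainly applies": 0,
    "somewhat applies": 1,
    "does not apply": 2,
}

# Four-level frequency items (A, B, D) collapsed to binary indicators.
FREQUENCY_BINARY = {
    "never in the last 12 months": 1,
    "less than once a month": 1,
    "at least once a month": 0,
    "at least once a week": 0,
}


def recode(response, mapping):
    """Map a raw response label to its recoded value (labels are case-insensitive)."""
    return mapping[response.strip().lower()]


def merge_pair(first, second):
    """Merge two recoded items by keeping the lowest category of the pair."""
    return min(first, second)


# Hypothetical responses for one child (not taken from the BCS data).
item_4 = recode("Does not apply", RUTTER_REVERSED)
item_19 = recode("Somewhat applies", RUTTER_REVERSED)
fights_or_bullies = merge_pair(item_4, item_19)

item_a = recode("Never in the last 12 months", FREQUENCY_BINARY)
item_b = recode("At least once a week", FREQUENCY_BINARY)
headaches_or_stomach = merge_pair(item_a, item_b)
```

Taking the minimum means the merged item flags the behaviour whenever either of the original items does, which matches the "or" wording of the corresponding SDQ items.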
In the MCS, we use the 25-item Strengths and Difficulties Questionnaire (?); see Panel B of Table A2.
All items are recorded on a 4-point scale: 'Not true', 'Somewhat true', 'Certainly true', 'Can't say'. We set the latter option to missing and recode the rest so that a greater value represents a higher level of skills, as for the BCS items, i.e. 'Certainly true' = 0, 'Somewhat true' = 1, 'Not true' = 2 for the negatively-worded items (and the opposite for the positively-worded ones). For comparability with the BCS Rutter scale, we dichotomise items 3 and 5, and dichotomise and invert items 7 and 14.
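A minimal sketch of the MCS recoding follows, under stated assumptions. The function names are hypothetical, and the cut-point used to dichotomise an item is left as an explicit argument because the text does not state where the split is made.

```python
SDQ_ORDER = ["not true", "somewhat true", "certainly true"]


def recode_sdq(response, positively_worded=False):
    """Recode an SDQ response so that a higher value means a higher level of skills.

    Returns None (missing) for "Can't say".
    """
    label = response.strip().lower()
    if label == "can't say":
        return None
    raw = SDQ_ORDER.index(label)  # 0 = not true, 1 = somewhat true, 2 = certainly true
    return raw if positively_worded else 2 - raw


def dichotomise(value, cut):
    """Collapse a 0-2 item to binary: 1 if value >= cut, else 0; None stays missing.

    The cut-point is a caller-supplied assumption; the text does not state it.
    """
    if value is None:
        return None
    return 1 if value >= cut else 0


# Hypothetical responses (not taken from the MCS data).
negative_item = recode_sdq("Certainly true")
positive_item = recode_sdq("Certainly true", positively_worded=True)
unknown_item = recode_sdq("Can't say")
```

Keeping "Can't say" as a missing value rather than a fourth category leaves the remaining three categories on the same 0-2 scale as the recoded Rutter items.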

Appendix B Robustness of exploratory analysis
In this section, we repeat the exploratory analysis step in Section 4.1 for the full set of Rutter and SDQ items.
This is to show that the factor structure emerging from the exploratory analysis of the 11-item subscale is consistent with what would emerge considering the original scales in their entirety. Again, we proceed by first assessing the optimal number of factors, and then examining the loadings obtained from exploratory factor analysis.
Results for the optimal number of factors as indicated by different approaches are in Table A7. Similarly to the 11-item subscale, there is not much agreement between methods. Since the purpose of this section is to assess the robustness of the 11-item subscale, we adopt a conservative approach by estimating EFA models with the largest number of factors suggested, i.e. five. In this way, we allow for richer factor solutions that have more power to disprove our simpler two-factor solution for the novel subscale. Table A8 presents factor loadings for the Rutter scale in BCS, with the addition of the "headaches/stomachaches" and "tantrums" items (see Appendix A). The split between externalising and internalising items that we recover in the 11-item scale is almost entirely preserved in the full scale, as seen by items loading on factors 1 and 2. The only exception is the "headaches/stomachaches" item, which seems to load on a separate factor. We therefore carry out robustness checks for the measurement invariance analysis excluding this item in Appendix C.
The same analysis is repeated for the SDQ scale in MCS in Table A9. An internalising, emotional dimension (factor 3) emerges neatly, and coherently with the analysis on our subscale. The externalising items from our subscales are split across two dimensions in this full-scale EFA: one more related to hyperactivity (factor 2) and one to conduct problems (factor 4). This is consistent with the original structure of the SDQ (?).

Appendix C Robustness of item choice
In deriving our novel 11-item scale, we construct two items for the BCS cohort based on questions that are not in the original Rutter scale, namely those concerning "headaches/stomachaches" and "tantrums" (see Appendix A above for details). Concerns might arise that introducing these items might somewhat invalidate our main conclusions, rather than provide additional informational content on children's externalising and internalising behaviours and symptoms.
In fact, exploratory factor analysis (on both the full Rutter and SDQ scales and on the 11-item subscale) shows that these items, numbered 5 and 11, perform poorly and exhibit relatively low factor loadings. As a robustness check, we replicate the main results of the paper by excluding them from the subscale.
Panel C of Table A11 shows that the measurement invariance analysis yields the same qualitative results once these two items are excluded. Figure A2 shows a scatter plot of the factor scores obtained from the factor model with and without items 5 and 11. They exhibit very high correlation, thus indicating that our results in Section 6 would not substantially change if we omitted the two items with the least informational content.

Higher education (d); Employed (d); (log) gross weekly pay. Higher education is defined as having a university degree or its vocational equivalent. It corresponds to level 4 or 5 in the National Vocational Qualification (NVQ) equivalence. Employed is a dummy for being in paid employment or self-employment, either full or part time. Gross weekly pay is weekly pre-tax pay from the respondent's main activity, conditional on being a paid employee.

Notes: Variables denoted by d are binary or categorical.

Notes: The table displays the factor loadings obtained from exploratory factor analysis (EFA) on our novel scale, separately by cohort. The EFA is performed decomposing the polychoric correlation matrix of the items and using weighted least squares, and the solution is rescaled using oblique factor rotation (oblimin). We use the R package psych, version 1.8.4 (?).

Notes: The table displays the factor loadings obtained from exploratory factor analysis (EFA) on the full set of Rutter items from the BCS. Items denoted by * are used in the 11-item scale. The EFA is performed decomposing the polychoric correlation matrix of the items and using weighted least squares, and the solution is rescaled using oblique factor rotation (oblimin). We use the R package psych, version 1.8.4 (?).

Notes: The table shows coefficients from linear regressions of cohort members' adolescent outcomes on their externalising and internalising socio-emotional skills at five years of age. Col. (1) shows the mean of the outcome for males. Col. (2) regresses the outcome on the scores obtained from the factor model in Section 4. Col. (3) additionally controls for cognitive ability at age five. This is a simple factor score obtained by aggregating the available cognitive measures. All standard errors in parentheses are obtained using 1,000 bootstrap repetitions, taking into account the factor estimation stage that precedes the regression. Col. (4) replaces the factor scores used in col. (3) with simpler sum scores (see Figure A1). Col. (5) to (8) repeat for female cohort members. All estimates additionally control for region of birth, maternal education (5), maternal employment (5), father occupation (5), maternal background (age, height, nonwhite ethnicity, number of children in the household), pregnancy (firstborn child, number of previous stillbirths, mother smoked in pregnancy, preterm birth, (log) birth weight). See Table A1 for a description of the variables used. *** p≤0.01, ** p≤0.05, * p≤0.1.

Notes: The table shows coefficients from the joint estimation of the (partially invariant) measurement system for externalising and internalising socio-emotional skills at five years of age and of the cohort members' adolescent outcomes. Estimation is by maximum likelihood; see Table 6 for the corresponding results obtained via the two-step process. All estimates control for region of birth, maternal education (5), maternal employment (5), father occupation (5), maternal background (age, height, nonwhite ethnicity, number of children in the household), pregnancy (firstborn child, number of previous stillbirths, mother smoked in pregnancy, preterm birth, (log) birth weight). See Table A1 for a description of the variables used. *** p≤0.01, ** p≤0.05, * p≤0.1.

Notes: The table shows coefficients from linear regressions of cohort members' adolescent and adult outcomes on their externalising and internalising socio-emotional skills at five years of age. Col. (1) shows the mean of the outcome for males. Col. (2) regresses the outcome on the scores obtained from the factor model in Section 4. Col. (3) additionally controls for cognitive ability at age five. This is a simple factor score obtained by aggregating the available cognitive measures. Col. (4) uses sum scores (see Figure A1) instead of factor scores. All standard errors in parentheses are obtained using 1,000 bootstrap repetitions, taking into account the factor estimation stage that precedes the regression. Col. (5) to (8) repeat for female cohort members. All estimates additionally control for region of birth, maternal education (5), maternal employment (5), father occupation (5), maternal background (age, height, nonwhite ethnicity, number of children in the household), pregnancy (firstborn child, number of previous stillbirths, mother smoked in pregnancy, preterm birth, (log) birth weight). See Table A1 for a description of the variables used. *** p≤0.01, ** p≤0.05, * p≤0.1.

Figure A3: Item-level inequality by mother's education Notes: The graph displays the ratio between the prevalence of each item in our scale in children of educated vs uneducated mothers, by cohort and gender. All items that have three categories in the scale have been dichotomised. For example, if the prevalence of the 'Restless' behaviours among children of mothers with compulsory schooling in the BCS cohort is 7.5%, and 5% among mothers with post-compulsory schooling, the ratio will be 1.5. The error bars represent 95% confidence intervals.

Figure A4: Item-level inequality by mother's pregnancy smoking Notes: The graph displays the ratio between the prevalence of each item in our scale in children of mothers who smoked in pregnancy vs nonsmokers, by cohort and gender. All items that have three categories in the scale have been dichotomised. For example, if the prevalence of the 'Restless' behaviours among children of smoker mothers in the BCS cohort is 7.5%, and 5% among non-smoker mothers, the ratio will be 1.5. The error bars represent 95% confidence intervals.
Figure A5: Item-level inequality by father's occupation Notes: The graph displays the ratio between the prevalence of each item in our scale in children of white collar vs blue collar fathers, by cohort and gender. All items that have three categories in the scale have been dichotomised. For example, if the prevalence of the 'Restless' behaviours among children of blue collar fathers in the BCS cohort is 7.5%, and 5% among white collar fathers, the ratio will be 1.5. The error bars represent 95% confidence intervals.
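The item-level ratios plotted in Figures A3 to A5 follow the worked example in the notes: a prevalence of 7.5% in one group and 5% in the other gives a ratio of 1.5. A minimal sketch of that arithmetic follows, with hypothetical function names and invented counts; the confidence intervals shown in the figures are not reproduced here.

```python
def prevalence(flags):
    """Share of children for whom the dichotomised item is flagged (1) rather than not (0)."""
    return sum(flags) / len(flags)


def prevalence_ratio(group_numerator, group_denominator):
    """Ratio of the item prevalence in one group to the prevalence in the other."""
    return prevalence(group_numerator) / prevalence(group_denominator)


# Invented counts chosen to match the worked example in the notes:
# 15 of 200 children flagged (7.5%) versus 10 of 200 (5%).
blue_collar = [1] * 15 + [0] * 185
white_collar = [1] * 10 + [0] * 190
restless_ratio = prevalence_ratio(blue_collar, white_collar)
```

In the worked examples the higher prevalence is in the numerator, so a ratio above 1 indicates that the behaviour is more common in the less advantaged group.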