Small population bias and sampling effects in stochastic mortality modelling

We propose the use of parametric bootstrap methods to investigate the finite sample distribution of the maximum likelihood estimator for the parameter vector of a stochastic mortality model. Particular emphasis is placed on the effect that the size of the underlying population has on the distribution of the MLE in finite samples, and on the dependency structure of the resulting estimator: that is, the dependencies between estimators for the age, period and cohort effects in our model. In addition, we study the distribution of a likelihood ratio test statistic where we test a null hypothesis about the true parameters in our model. Finally, we apply the LRT to the cohort effects estimated from observed mortality rates for females in England and Wales and males in Scotland.


Introduction
Stochastic mortality models are widely used as risk management tools in the insurance and pensions industry with the main application being the generation of plausible scenarios for future mortality rates. Many stochastic mortality models have been introduced in the last few decades. When new models have been developed the objective was mostly to improve the goodness of fit of the model to mortality data observed in relatively large populations: the Lee-Carter model and its refinements (e.g. [3,23,31]) have been developed to provide a good fit to the mortality rates observed in the United States, England and Wales and the population of UK male assured lives; while the Cairns-Blake-Dowd ( [6]) model (CBD) was introduced for modelling the England and Wales males population at higher ages.
In contrast, actuaries will often face the problem of modelling the mortality experience of much smaller populations, for example, the members of a pension scheme. Empirical research has found that mortality rates of smaller populations exhibit significantly more variability compared to the observed rates in larger populations. Furthermore, models that fit large countries well, might not be appropriate for smaller populations, for example, [3] showed that the Lee-Carter model provides a rather poor fit to the mortality experience of smaller populations. A related issue is that empirical data from smaller populations might only be available for a relatively short period, which makes mortality projections rather uncertain. As a result, a number of recent papers have aimed to develop models specifically for smaller populations: for example, the Saint Model of [18].
A common assumption for many of the proposed models is that the observed numbers of deaths are realisations of random variables with a Poisson distribution given the underlying mortality rates. The estimation of parameters of any such model is therefore based on samples from a Poisson distribution, and, as always in statistics, parameter uncertainty is related to the sample size. Furthermore, many results about the distribution of estimators and corresponding confidence intervals rely on the Maximum Likelihood theorem and large sample sizes.
The increased uncertainty about estimated parameters for small populations results in high levels of uncertainty about projected mortality rates. As a consequence future realised mortality rates will not only diverge from projected rates due to future sampling variation caused by the Poisson distribution, but might also diverge from projections since the projections themselves are uncertain.
In the actuarial literature, simulation techniques have been proposed for dealing with uncertain parameters and projected mortality rates. For example, [24] investigated mortality uncertainty by applying a block bootstrap method on the Lee-Carter model, and [4] proposed Poisson bootstrap methods for mortality forecasting. [6] studied the parameter uncertainty of the two factor CBD model by adopting a Bayesian approach. Czado et al. and Pedroza [12,29] carried out the first Bayesian analysis using Markov Chain Monte Carlo (MCMC) of the Lee-Carter model, with further work by [21,22]. Reichmuth and Sarferaz [30] applied MCMC to a version of the [31] model.  applied MCMC to a twopopulation Age-Period-Cohort model by combining the Poisson likelihood for the deaths counts with time series likelihood functions for the latent random period and cohort effects.
However, to the best of our knowledge, bootstrap methods have not been applied in a systematic way to investigate the impact of the size of a population on parameter and projection uncertainty. This is the focus of our research in this paper. We firstly apply Poisson parametric bootstrap methods to investigate how the variation of parameter estimates and projections is affected by the size of a population. The specific mortality model that we consider is a second generation CBD model with added cohort effect: see Sect. 2 for details. We vary the size of the population by assigning weights to a chosen benchmark population, e.g. England and Wales males. In simulation studies we find that the size of the population has a significant effect on the variation of parameter estimates and projections.
Although we apply a weight to the benchmark population (i.e. scale it down), we ensure that the mortality rates of the constructed small populations are equal to the fitted mortality rates of the benchmark population. In such a situation, uncertainty in projected mortality rates will be reduced if information from the benchmark population parameter estimates can be used for fitting smaller populations. This raises the question of how we can test for systematic differences between the parameters driving mortality rates in a small population and a given null hypothesis about those parameters, where the null hypothesis might have been obtained from a model fitted to a much larger population. If no significant differences can be found then it seems reasonable to use elements of the large population model fit to assist in generation of scenarios for the small population. We therefore investigate the properties of a likelihood ratio (LR) test for all or some of the estimated parameters, and, in particular, consider the distribution of the test statistic based on the bootstrap simulations. This allows us to investigate the power of the LR test and the effect of varying population sizes on the rejection rates. We find that the population size has a strong effect on the probability of a type II error. This is particularly relevant for pension schemes since the acceptance of an incorrect null hypothesis might lead to inaccurate mortality assumptions. To investigate the financial consequences of the resulting misspecified model, we consider annuity prices based on different assumptions about the underlying parameters of our model.
We apply the LR test in an empirical study. The null hypothesis for that study is the estimated cohort effect for males in England and Wales. With this null hypothesis we then carry out hypothesis tests using, first, mortality data for females in England and Wales and, second, males in Scotland to check if their cohort effects are significantly different from the estimated cohort effect for males in England and Wales. We find for both populations that the estimated cohort effect is significantly different from that in the null hypothesis.
The remainder of the paper is organised as follows. Section 2 introduces the model, assumptions and the notations we apply. Section 3 discusses the process of simulation and investigates the distribution of the maximum likelihood estimates, the correlation between the estimates and how these will be affected by changing the population size. In Sect. 4, we investigate the effect of the population size on forecasting by projecting the parameters as well as the mortality rates. Section 5 introduces a likelihood ratio test for testing systematic deviations of the true parameters from a given null hypothesis. The power of the likelihood ratio test is also analysed and we then investigate how significant the impact of shifting and scaling parameters is on the fitted mortality rates and corresponding annuity prices in Sect. 6. Finally, Sects. 7 and 8 include the LRT for testing a null hypothesis about the cohort effect only, and an empirical example for this test is provided. Section 9 provides our final conclusions.

The model
We denote by D(t, x) the number of deaths during calendar year t ¼ t 1 ; . . .; t n y at age x ¼ x 1 ; . . .; x n a and by E(t, x) the corresponding central exposure to risk.
We will fit the following Poisson model to the observed death data, see [8]: Dðt; xÞjh $ Poisðmðh; t; xÞEðt; xÞÞ ð1Þ mðh; t; xÞ ¼ À logð1 À qðh; t; xÞÞ ð2Þ where the parameter vector h is given by t ; c ð4Þ c Þ with the following interpretations: • j ðiÞ t is a period effect in year t ¼ t 1 ; . . .; t n y for each i ¼ 1; 2; 3, • j ¼ fj ð1Þ ; j ð2Þ ; j ð3Þ g, where j ðiÞ ¼ fj x is the mean of the age range we use for our analysis, and •r 2 x is the mean of ðx À xÞ 2 .
The reason for including the cohort effect is that it is a well established feature in some populations such as England and Wales [8]. We do not claim that this model is necessarily the best model for the datasets to be considered. However we select the model based on a particular set of model selection criterion studied in Ref. [8]. The choice of ''M7'' here reflects the work of Ref. [8] namely that we want to use a model that fits the males from England and Wales well. It is well known that the parameters in model (3) are not identifiable without imposing constraints on their values. Nielsen and Nielsen [27] discussed the impact of identifiability problems within stochastic mortality models on parameter estimation, hypothesis testing and forecasting. Currie [11] discussed modelling with M7 by writing the model as a generalized linear model with a non-full rank design matrix. We follow Ref. [8] and apply the following constraints on h: X c2C c ð4Þ where C ¼ t 1 À x n a ; . . .; t n y À x 1 is the set of all years of birth in a given dataset. In this study, identifiability constraints are defined as part of our model system to ensure all parameters are identifiable and provide a coherent framework for the consideration of confidence intervals and for hypothesis testing. One can freely adopt any reasonable set of constraints to the model and the study would be focusing on the results given the selected constraints.
To estimate the parameters in (3) we apply maximum likelihood estimation. The log-likelihood function in our model is Dðt; xÞlog½Eðt; xÞmðh; t; xÞ À Eðt; xÞmðh; t; xÞ À log½Dðt; xÞ! ð5Þ where mðh; t; xÞ is given by (2) and (3). It is worth noticing that both the fitted mortality rates and the log-likelihood function lðh; D; EÞ are invariant to the choice of the identifiability constraints. As mentioned earlier, in this paper we are concerned with the consequences of small exposures, or population sizes, on the distribution of the maximum likelihood estimator (MLE)ĥ of h. To study the distribution of the MLEĥ we will simulate death data D(t, x) from the model in (1-3) using a given parameter vector h 0 and different exposure sizes.
To ensure that our results are relevant for typical values of h we first fit our model to death and exposure data observed in England and Wales during the years 1961 to 2011 for males aged 50 to 89. Note that we do not claim that this is the only choice of dataset. Any large population plus any model that is known to fit it well can be used for this study. The reason for the choice of dataset is that we have familiarity with the England and Wales data and the selected model fits the similar dataset well in Ref. [8]. We then fix h 0 to be equal to the estimated parameter vectorĥ EW for this data. Note that this is only an example for the true parameter vector h 0 and our analysis can be applied to other choices of h 0 . Mortality data for England and Wales are obtained from the Human Mortality Database. 1 Note that we do not exclude short cohorts from the estimation since we are interested in how the MLE fits the short cohorts and the impact of small population sizes on the estimates.
The different exposure sizes used to simulate data in the remainder of this paper will be relative to the exposure E 0 ðt; xÞ for a benchmark population. For reasons of practical relevance and consistency with our choice of h 0 the benchmark population is the male population in England and Wales unless stated otherwise.

Distribution of MLE in finite samples
For any given parameter vector h 0 and benchmark exposure E 0 ðt; xÞ we define the small-sample exposure as E w ðt; xÞ ¼ wE 0 ðt; xÞ for a constant w 1. Table 1  A more general approach would be to consider a weights matrix W ¼ fw t;x g allowing for weights to depend on age and calendar year. This would be particularly relevant when our proposed methodology is applied to investigate the mortality of members of a pension scheme with a very different age structure than the age structure of the overall population in England and Wales. However, for clarity of presentation, we only consider a constant weight applied to all ages and calendar years.

MLE
To obtain MLEs forĥ w j for each simulated scenario j and each w we maximise the log-likelihood function lðh; D w j ; E w Þ as given in (5) subject to the constraints in (4), that is,ĥ Classical sampling theory tells us that ffiffiffi ffi w pĥw j À h 0 À! Dist Nð0; HÞ; as w ! 1 for some positive semi definite matrix H (see Appendix A for further discussion). Therefore, we would expect that, even in a finite sample, the co-variance of the distribution ofĥ w j is approximately to w À1 H and the correlations between different components ofĥ w j are approximately independent of the relative population size w. Using the simulated sampleĥ w 1 ; . . .;ĥ w N we can investigate the finite-sample covariance and correlation matrices ofĥ w . In Fig. 1 we plot a graphical representation of the correlation matrices ofĥ w j that we obtain for two values of w. We conclude from Fig. 1 that there are no significant differences between the empirical correlation matrices obtained from different population sizes, as predicted. However, individual components ofĥ w are not independent from each other as we would expect given the model in (1)(2)(3).
To investigate the finite-sample distribution of the MLEĥ w further we plot the empirical mean together with 90% confidence intervals for each of the components ofĥ w in Fig. 2. We find for all population sizes considered that the empirical means of the simulated estimates fluctuate around the true parameter values h 0 (solid line), which indicates that the MLE is approximately unbiased for all considered population sizes. However, the standard deviation of the estimator depends strongly on the size of the population, increasing significantly as the exposures get smaller as can be seen from the width of the confidence intervals.
The relative levels of the lines in the graphs on the right hand side of Fig. 2 show that the level of fluctuation increases approximately by a factor ffiffi ffi n p if the population size is reduced by a factor 1 / n, which is consistent with the asymptotic covariance matrix being proportional to 1 / w. It also suggests that the variance is generally stable for all the period effects over years, which is not the case for the cohort effect with a wave shaped pattern. We notice that the standard deviation of c ð4Þ c;w widens out considerably at both ends, reflecting the reducing number of observations that we have for the younger and older cohorts. It is worth recalling that the finite-sample distribution of the MLEĥ w varies if different sets of constraints are defined in the model system: that is, for a given w, the shapes of the various confidence intervals might be different if other identifiability constraints are used. The impact of the identifiability constraints in our study can be removed by calculating the following quantities of the point estimates:   sample distribution of these quantities and the corresponding standard deviation are shown in Fig. 3, where unsurprisingly the right column implies that our conclusion regarding to the proportional relationship between the variance and the population size holds.

Mortality projections
While fitting the model in (1)(2)(3) to observed mortality data only requires the estimation of the period effects j t ¼ ðj c , projecting mortality rates into the future requires a model for values of j t for t [ t n y where t n y is the last year for which mortality data are available. Similarly, future values of the cohort effect c ð4Þ are also required.
The most common approach to obtain future values of j and c ð4Þ is to consider these parameter vectors as observed trajectories of stochastic processes and fit a parametric time series model to each trajectory. In the following we will fit a threedimensional random walk to j t and a stationary AR(1) model to c ð4Þ c , as in Ref. [8]. We will then discuss the estimation of the parameters of those models based on the values of h 0 andĥ w j for different values of w. This will allow us to investigate the impact of the relative population size w on the estimators for the parameters of the j and c ð4Þ processes.
For the estimation of those parameters and the projections of the period effects and the cohort effect we will consider two approaches. Firstly, we will use a frequentest approach to obtain point estimates of the process parameters ignoring any uncertainty about those estimates. In our further analysis we will follow a Bayesian approach to incorporate parameter uncertainty into our mortality projections.

Projecting period effects
As mentioned above, we model the period effects j t as a three-dimensional normal random walk. where t Þ 0 are independent random vectors with a multivariate standard normal distribution. The parameter vector l is the 3 Â 1 drift vector of the random walk and L is the 3 Â 3 Cholesky decomposition of the covariance matrix V ¼ LL 0 .

Point estimators
Having generated N scenarios for the number of deaths according to (6) and having estimated the parameter vectorĥ w j in each scenario as in (7), we can now apply the The corresponding estimators for l and V for the true trajectory h 0 are defined similarly.

Bayesian estimation-parameter uncertainty
As mentioned earlier we model uncertainty about the parameters l and V by applying a Bayesian approach to estimation. We denote by p the density of the prior joint distribution of the two parameters. Assuming that we have no prior knowledge about the true values of l and V, we use the Jeffreys prior density pðl; VÞ / jVj À 3 2 ; where |V| is the determinant of V (see for example, [16]). Using this prior distribution in each scenario j, the posterior distribution is given by the inverse Wishart distribution for V and a multivariate normal distribution for l, that is, wherel w j andV w j are the estimates obtained fromĥ w j as defined in (9) and (10). By comparing the densities in the two columns of that figure we observe that the additional parameter uncertainty increases the variance of the empirical distributions of the drift estimators. This can be explained by investigating the source of uncertainty to the drift. The variation to the point estimatorl w ðiÞ with no allowance for parameter uncertainty comes from the Poisson noise in the number of deaths (a) Density Fig. 4 The impact of population size on the distribution of the random walk drift, from population of

Empirical comparison
England and Wales (vertical line). The left column is the density of drift without allowance to the parameter uncertainty; the right column is the density of drift with allowance to the parameter uncertainty from the bootstrap simulations, while the variance of the Bayesian estimatorl w ðiÞ with allowance for extra parameter uncertainty also includes the uncertainty (Eq. 12) from the posterior distribution given the Poisson noise. We also find in Fig. 4 that the size of a population affects the uncertainty about the drift vector l. The variance of the empirical finite sample distribution of both estimators,l andl decreases significantly when the population size increases, although the difference between w ¼ 1 and w ¼ 0:01 is rather small as is particularly obvious for the Bayesian estimatorl Fig. 5 However, for smaller values of w we find that the population size has a much more pronounced effect on the variance. For example, the range of likely values of l 0:001 is significantly wider than the range of values ofl 0:1 andl 1 reflecting the uncertainty about l w that we have already observed in Fig. 2 top left. The same argument applies to the point estimatorsl.
To investigate parameter uncertainty further we calculate the standard deviations for the distributions ofl andl in Fig. 4. Those standard deviations are shown in Table 2. We observe that the standard deviation of the point estimatorl is increased approximately by a factor To investigate the impact of the relative population size w and the inclusion of parameter uncertainty on the empirical distribution of the estimated co-variance matrix V of the random walk in (8)  The corresponding estimated covariance matrices, V EW , for England and Wales based on the single sample paths of j t and c c and the mean of Bayesian estimator are Table 2 The finite sample standard deviation ofl andl Comparing the mean values ofV andṼ with the estimates obtained from the England and Wales data we find significant differences in the estimated covariances.
In particular, for smaller populations (e.g. w=0.01) sampling variation pushes up significantly estimates of the covariance matrix. In addition, sampling variation also widens the distribution of V around these mean values for smaller values of w. On the other hand, for a given value of w, the inclusion of full Bayesian parameter uncertainty moving fromV toṼ has rather less of an impact. Finally, the projected parameters based on the Bayesian estimatesl andṼ are shown in Fig. 7. As we expected, the prediction intervals reflecting the uncertainty about future values of the period effects are very wide for small populations. The plots also suggest that the means of the co-variances are right biased compared to the estimate for England and Wales. The variance of projection for all the populations are much higher than the estimates, due to the additional normal randomness added in the forecasting model by simulating the sample paths for j and c ð4Þ . However, the left column shows that there is no obvious proportional relationship between the population size and projection variance. By investigating the mean co-variance matrices, we find that the increase of E½V w 3;3 from w ¼ 0:01 to w ¼ 1 is of the highest among the three period effects, which suggests that the standard deviation of projection for j

Projecting the cohort effect
As mentioned earlier we fit an AR(1) model to the cohort effect. We will not investigate how additional parameter uncertainty influences mortality projections, but will only use point estimates for the parameters in the AR(1) model. To be precise, our model is given by Figure 2 shows that the variance of the estimated cohort effect is very large for the very early and very late years of birth, in particular, for w ¼ 0:01 and 0.001. This is a consequence of the very few observations available for those cohorts. We therefore remove the cohorts with six or less observations. Cohorts are removed equally from the beginning and the end. However, removing short cohorts could significantly influence the estimated values of the parameters. To investigate the effect of removing short cohorts in more Small population bias and sampling effects... 207 detail we plot the empirical densities for the parameters in (13) based on the estimated parameters in each simulated scenario j for w ¼ 1. We find that the distribution ofâ 0 is not significantly affected by removing cohorts which is also the case for the estimated variance of c when more than 4 cohorts are removed in total.
In contrast,â c is shifted to the left as more cohorts are removed. Further we notice that the variance of the estimators for all three parameters stays approximately unchanged regardless of how many cohorts are removed. After having removed cohorts with six or less observations from the data, we fit the AR(1) model in (13) to the rest of the cohort effects. The resulting density of the parameter estimates of the model are shown in Fig. 6. All of the parameter estimates and the standard deviation of error terms appear to be biased relative to the estimate for England and Wales, regardless of the size of population. However, we find that reducing the population size will greatly increase the mean bias as well as the uncertainty.
We now forecast the cohort effect fromĉ 1956 to 1961 now comes from the Poisson and Normal randomness, which is not as great as variation at the two tails of the estimates observed in Fig. 2 where no cohorts have been removed. Within the sample, the confidence intervals are narrower for cohorts with greater numbers of observed years (ranging from 7 to 40) and greater numbers of deaths since variance is reduced by having more number of observations.

Projected mortality rates
Based on the projected period and cohort effects we can now turn to the projection of mortality rates using our model in (1)-(3). Figure 8 shows the twenty-year forward projections of mortality rates at ages 65 and 85. We compare the predicted rates with and without the allowance for parameter uncertainty for all the constructed populations with the projections based on the England and Wales data. Unsurprisingly, the uncertainty about future mortality rates increases as the forecast Reducing the population size results in greater uncertainty about mortality forecasts for both ages. For example, the uncertainty is much greater for the smaller populations (w ¼ 0:01; 0:001) at both ages 65 and 85. This means that there is considerable uncertainty about future mortality scenarios for a relatively small pension scheme with significant implications for the risk management of such a scheme.
Comparing parts (a) and (b) of Fig. 8 we find that the inclusion of parameter uncertainty for the drift parameter l adds further uncertainty about the projected mortality rates. This reflects the additional randomness from not having a sufficiently long period of observed rates. We notice that the difference of variance between including and excluding parameter uncertainty increases as time increases. Thus parameter uncertainty becomes much less important when only relatively short forecast horizons are considered. Similar results can be found in Fig. 6 of [6] which shows the log scaled variance of both, with and without parameter uncertainty, for the survival index. Our findings are also in line with results obtained by [20] who have found that the uncertainty about the drift of the period effect in a Lee-Carter model has little impact on the uncertainty of short term projections while it significantly affects the uncertainty of long-term projections. This supports our conclusion that the differences in the variances are tiny when the projection horizon t is very small, and become more significant for long term projection. We notice that for age 65 the intervals are not smooth in some years due to the cohort effect.
We also notice that age seems to affect the amount of uncertainty around the central projection differently in small and large populations. To illustrate this further we consider the standard deviation of projected mortality rates as a function of age for a fixed projection horizon. Figure 9 shows the log-scaled standard deviation of the projected mortality rates in the calendar year 2030 with respect to age. We find in this figure that the variance is an increasing function of age if the population size is rather large. In contrast, we find for the smallest population (w ¼ 0:001) that the variance only starts to increase from about age 70 while it is constant or slightly decreasing for younger ages. As we found in Fig. 8, at age 65 (and also at age 85), the three largest populations have prediction intervals which are of similar width. However, Fig. 9 shows that the much wider prediction intervals for the two smaller populations seem to be less affected by age with the relative increase in the standard deviation from age 65 to 85 being smaller than for the large populations. We are also interested in how much of the forecast variation is due to the impact of sampling variation and parameter uncertainty on the covariance matrix, V, and the drift, l. To investigate this, we consider four experiments outlined below. Note that we still projected the cohort effect, given the point estimates for population w with the method introduced in Sect. 4.2 and we sample from the empirical distribution of l andV without considering the Bayesian posterior.
1. Project mortality rates for each constructed population, while fixing the parameters l and V of the random walk to the estimates obtained from the England and Wales data.
Small population bias and sampling effects... 211 2. Project mortality rates for each constructed population, while fixing only the drift l to the corresponding EW estimates and sample realisations ofV from its empirical distribution. 3. Project mortality rates for each constructed population, while fixing only the variance matrix V to the corresponding EW estimates and sample the drift parameter from the empirical distribution ofl. The results are shown in Fig. 10. We find that fixing parameters has a significant effect on mortality forecasting when populations are very small (w = 0.001) in Fig. 10a. We can see that the widths of prediction intervals for our experiments 1 and 3 are much narrower than for experiments 2 and 4, and the difference of variance is greater for long term projections. The major difference between these two scenarios is that we fix the co-variance matrix V to its estimate obtained from England and Wales data in experiments 1 and 3. Thus we conclude that a major source of uncertainty for our mortality forecasts comes from the bias in the estimated covariance matrix for small populations.

Summary
To summarise, forecasts levels of uncertainty in future mortality are biased upwards for two reasons. First, and most obvious, the Poisson noise in the data biases up estimates of the random walk covariance matrix to a significant extent (Fig. 10). Second, when we include a Bayesian analysis of parameter uncertainty, uncertainty in the random walk drift resulting from observation over a relatively small number of years is pushed up by the small population bias in the covariance matrix, V. This has its greatest impact in longer term forecasts, and less impact in the short term.

Likelihood ratio test for systematic parameter difference
We have seen that the size of a population has a substantial impact on the level of uncertainty about the parameters of the model in (1-3) when this model is fitted to the population's mortality data. This raises the question whether the estimated c Þ for a small a population are significantly different from those in a given, typically much larger, reference population. To address this question we apply a likelihood ratio test to test for significant deviations of estimated parameters from a given null hypothesis using the maximum likelihood estimatorĥ w j defined in (7) for simulated mortality data D w j as in (6). We are particularly interested in the finite sample distribution of the test statistic as compared to its asymptotic distribution. As in Sect. 3 we will use simulated deaths scenarios to investigate the finite sample distribution and the power of the likelihood ratio test (LRT) applied to mortality data. We will start with a short review of the LRT.

Review of likelihood ratio test
The LRT used in this study follows the generalized form of the LRT as defined in Ref. [19]. For a random variable X with a distribution that depends on a parameter vector h, the likelihood function is defined as usual: where f i ð:jhÞ is the probability density function of X i given the parameter vector h. We assume that h :¼ ðh r ; h s Þ is a vector of r þ s parameters. The null hypothesis and alternative for the LRT concern only the parameters in h r , that is, In order to calculate the test statistic, we first find the MLE of ðh r ; h s Þ, which leads to the unconditional maximum of the likelihood function h :¼ ðĥ r ;ĥ s Þ :¼ arg max ðh r ;h s Þ Lðxjh r ; h s Þ: ð15Þ We then find the MLE of h s assuming that the null hypothesis is fulfilled, that is, In generalh s h s ðh r0 Þ 6 ¼ĥ s . We use the notationh s ðh r0 Þ to emphasis thath s is conditional on the value of h r0 . We now define the test statistic in the usual way: [33] proved that when H 0 holds, C asymptotically follows a central v 2 distribution with r degrees of freedom. From the central limit theorem, it follows that the v 2 r distribution can be approximated by a normal distribution with mean r, given r is sufficiently large. 2 Thus we expect that the distribution of C should approximately be symmetric around r. Before we start testing our null hypothesis, it is worth considering the testability of the hypothesis. 3 In our approach the constraints in Equation (4) in Sect. 2 are part of the model and therefore the effective number of parameters that are identifiable is the total number of parameters reduced by the number of constraints. In this paper, we formulate the constraints in terms of the cohort effect c since we will in particular consider the case h r ¼ c in our empirical study. If the test is about one of the period effects we could reformulate the constraints in terms of that period effect (strictly, therefore, a different model). In that way, the constraints are always fulfilled under H 0 . In short, the constraints should be chosen such that the null hypothesis fulfils the constraints. In other words, we are testing the null hypothesis that the mortality experience is generated by mortality rates that follow model M7 with the constraints in Equation (4) and h r ¼ h r 0 .
In the reminder of this section we will consider a null hypothesis about the entire parameter vector h setting s ¼ 0. In Sect. 7 we will then consider a null hypothesis about the cohort effect c only, that is s [ 0.

Finite sample distribution of LRT
As mentioned above, we now consider a test for systematic parameter differences involving all period effects and the cohort effect, that is, To investigate the finite sample properties of the LRT in small populations we apply a parametric bootstrap procedure in which we simulate N mortality scenarios, estimate the parameter vector h as in Sect. 3 and apply the LRT in each scenario. More precisely we use the following steps to find a bootstrap approximation of the finite sample distribution of C: for different values of w and for each scenario j ¼ 1. . .N we 1. simulate D w j as in (6), 2. find the estimateĥ w j as in (7), 3. calculate the realisation of the LRT statistic C w j as in (18) and 4. calculate the p-value P w j based on the asymptotic v 2 -distribution as P w The degrees of freedom of the v 2 distribution in step 4 should be the effective number of parameters denoted by a, which is the total number of parameters r less the number of constraints, that is where n y is the number of years, and n c ¼ n y þ n a À 1 is total number of cohorts in a given dataset without removing short cohorts. In our case, n y ¼ 51, n c ¼ 51 þ 40 À 1 ¼ 90, hence a ¼ 240. After applying the parametric bootstrap method we can generate the distribution of the test statistic. We expect that the distribution of C w should be approximately symmetric around 240. For any population size w we can now find the empirical distribution function of C W based on the sample C w 1 ; . . .; C w N . Furthermore, if the asymptotic v 2 approximation is accurate, the p-values P w 1 ; . . .; P w N should be independent and uniformly distributed on [0, 1]. The cumulative distribution of the test statistic C w and the pvalues P w for all considered population sizes w are shown in Fig. 11 for N ¼ 1000. Figure 11a shows that the empirical distribution of C w is indeed centred around a ¼ 240. We also observe in Fig. 11b that the cumulative distribution function of the p-values resembles the distribution function of the uniform distribution on [0, 1]. Both results indicate that the v 2 approximation for the distribution of C w under the null hypothesis is very good for all values of w considered.

Power of the likelihood ratio test
In the last section, we carried out the likelihood ratio test for the parameter difference and found that the v 2 approximation does not fail to capture the feature of the test statistic C w when H 0 holds. We will now investigate how the population size affects the power of LRT. In general, the power of a binary hypothesis is the probability of correctly accepting the alternative hypothesis when it is true. 4 As usual the power of a test is defined as ProbðReject H 0 j H 1 is TrueÞ: To evaluate the power of the LRT with a parametric bootstrap procedure similar to the one used in the previous section we need to generate scenarios under the alternative. So far we have considered a very general alternative h r 6 ¼ h r0 . We will now need to specify this alternative further. To this end we define four alternative models and investigate the power assuming that the ''true'' data generating model is one of those alternatives. The four models we consider for the alternative shift or scale one of the period effects or the cohort effect estimated from the England and Wales data. More specifically, the alternatives we consider are: in h ð1Þ . We note that a more general alternative could be considered by allowing for combinations of the above. However, we wish to focus on the impact of misspecifying individual parameters and the power of the test to detect those misspecification.
We can now proceed as in the previous section with simulating death counts and then apply the LRT for different alternatives and different values of k. We define  (2) and (3). Note that death counts also depend on k.
Using the simulated death counts D w;ðiÞ j we obtain the MLEĥ w;ðiÞ j as in (7). We then use the asymptotic v 2 -distribution to test the null hypothesis that the parameters of our model are equal to the parameters obtained from the England and Wales populations. The p-values P w;ðiÞ j ¼ P w;ðiÞ j ðkÞ are then calculated as in step 4 in the previous section, and the null hypothesis is rejected in any scenario j for which P w;ðiÞ j \0:05, that is, the significance level of the test is 0.05. The power of the LRT for any fixed alternative i, relative population size w and fixed k is the proportion of the simulated p-values which are less than 0.05, that is, we count the number of scenarios for which the null hypothesis is rejected. More specifically, we define the random variables  where p w;ðiÞ ðkÞ is the (unknown) power of the LRT if alternative h ðiÞ with parameter k is the true parameter set for the simulated death counts. Therefore, the empirical rejection rate R w;ðiÞ ðkÞ is an unbiased estimator for the power p w;ðiÞ ðkÞ and the estimated standard deviation of R w;ðiÞ ðkÞ can easily be found from (20) in the usual way.
Then we investigate sensitivity of the power with respect to the size of k and the size of population w. for each of the four cases, h ð1Þ ; . . .; h ð4Þ , we consider a set of values for k that are regularly spaced. Figure 12 shows the obtained estimates R w;ðiÞ j ðkÞ for the power as a function of k for different relative population sizes w. Note that for each alternative h ðiÞ and any fixed k we have simulated N ¼ 100 scenarios, which is less than in the previous section. The reason is that we need to simulate those scenarios for each combination of i (alternative) and k, which makes the total number of simulated scenarios very large. Unsurprisingly, the power of the LRT is increasing in k for any h ðiÞ and relative population size w; the more we shift/scale the null hypothesis, the easier it is for the test to detect any shift/scaling. For the three period effects, decreasing the population size will greatly reduce the capability of LRT to detect the same amount of shift to a single parameter. We can also compare these plots with the earlier Fig. 2 which includes distributions of parameter estimates resulting from sampling variation. By way of example, for w ¼ 0:01 the width of the confidence interval in Fig. 2e for j   ð3Þ t;w is about 0.005. This is much larger than the shifts that are considered in the power plot in Fig. 12. The reason why the latter values are so much lower is because we apply a systematic adjustment to all of the j ð3Þ t;w , in contrast to random adjustments (due to sampling variation) in the former.

Impact of parameter misspecification on mortality rates and annuities
We now investigate how significant the impact of shifting and scaling parameters is on the fitted mortality rates and corresponding annuity prices. We consider again the four alternatives in the previous section. For each of those and for each relative population size w we determine the value of k that results in a power of 50% of the LRT, that is, there is a 50% probability that the LRT will detect the wrong model and reject the null hypothesis. Those values are denoted by k w;ðiÞ 0:5 and shown in Table 3.  We then calculate fitted mortality rates using the model in (2) and (3) with the following parameter constellations:    parameters which produce significant changes in the fitted mortality rates are hard to detect with an LRT when the exposures are small.
From a financial point of view the effect on fitted mortality rates is only relevant in so far as annuity prices are affected. We will therefore consider the following annuities and discuss the effect of the four alternatives specified above on their values: where v is the discount factor, SðT þ t; xÞ is the survival index for the probability of an individual aged x exactly at the start of year T, that will survive for the next t years. We assume the interest rate of i ¼ 2%. The reason for investigating the deferred annuity is that Fig. 2g suggests that the estimates of cohort effect at c ¼ 1946 is approximately zero and the effect of scaling cohort estimates may not be obvious on the annuity price but more obvious for .
We project the period and cohort effects in h w;ðiÞ 0:5 (i ¼ 1; 2; 3; 4) andĥ EW forward for 35 years as in Sect. 4 where we use the point estimates defined in (9) and (10) for the parameters of the random walk for the shifted period effects, that is, we do not consider uncertainty about the drift and variance matrix of the random walk. Annuity prices are calculated for each sample path and we then calculate the average annuity price for each w with the ith parameter shifted or scaled. The results are shown in Tables 4 and 5.
The effects of shifting the period effects and scaling the cohort effect are somewhat varied. As might be expected, the impact on prices is most obvious for w ¼ 0:01. The impact on both types of annuity is straightforward to see for j ð1Þ : the shift pushes up mortality rates at all ages and lowers prices. For j ð2Þ there is more impact on the age-65 annuity than the age-55 deferred annuity as the shift lowers mortality at younger ages and raises it at higher ages. For j ð3Þ , also, the impact is different at different ages. Finally, for c ð4Þ , the impact of scaling simply depends on the sign and magnitude of the value of c ð4Þ for the cohort being priced.
Generally shifting or scaling the parameter estimates has no obvious effect on the annuity price and smaller populations can be affected more. Thus for testing a null hypothesis H 0 : h w¼0:01 r ¼ĥ EW , if we accept H 0 when, in fact, they are actually different (type II error) the financial consequence of this type II error will be small in our case. In other words, the fact that we have accepted H 0 means that h w¼0:01 r , while not identical, must be very close toĥ EW , and that, therefore, any error in pricing will also be very small.

Likelihood ratio test for the cohort effect
The general form of the LRT as reviewed in Sect. 5.1 allows us to test a null hypothesis about parts of the parameter vector h (restricted by the specified identifiability constraints as part of the model) rather than the entire h ¼ ðj  The shift is determined when it results in 50% power for each population w ¼ 1; 0:1; 0:01, which are shown in Table 3. We assume an interest rate of 2% Table 5 The impact of shifting each parameter separately on the price of a ten-year deferred twenty fiveyear temporary annuity for an individual aged at 55 Parameter shifted England and Wales w ¼ 1 The shift is determined when it results in 50% power for each population w ¼ 1; 0:1; 0:01, which are shown in Table 3. We assume an interest rate of 2% reduces the dimension of the parameter vector which needs to be estimated from the small population where parameter uncertainty is rather strong as we have seen in Sect. 3. The example we have in mind is a pension fund that uses national mortality data to improve its mortality models, or when the mortality experience in a small country is modelled based on the combined experience of other similar countries.
In the reminder of this section we will use the LRT to test a null hypothesis about the cohort effect c. In our general setting of Sect. 5.1 this means that Our null hypothesis is then that c ¼ c 0 where c 0 is a given vector of cohort effects, for which we later use an estimated cohort effect from a different population. We can now write the hypotheses as in (14) and proceed as in Sect. 5.2 to find the distribution of the LRT statistic in (17) for a finite sample of death counts from a small population.
For practical relevance we base our simulation study on the female and male populations in England and Wales. We choose c 0 ¼ĉ EW , which is the estimated cohort effect from the mortality data for males in England and Wales. It is worth noting that, asĉ EW already satisfies the identifiability constraints, the null hypothesis H 0 : c ¼ c 0 has no testability problems under the given identifiability constraints defined in the model system. To investigate finite sample properties of C we will need to specify a full parameter vector h to simulate scenarios for the death counts. Having fixed the cohort effect c 0 we choose the period effects to be the estimated period effects from data for the female population in England and Wales assuming that the cohort effect for those data is actually c 0 . As we are mainly interested in small populations we will consider deaths count scenarios for populations which have exposures equal to wE 0 where E 0 is here the exposure for the female population in England and Wales.
More specifically, we first find the MLEh s ¼ arg max h s Lðxjh r0 ¼ c 0 ; h s Þ of the period effects h s ¼ ðj t Þ from data for females assuming that the cohort effect is indeed c 0 (which is the estimated cohort effect for males), see (16). Note that no constraints are applied for findingh s since the cohort is fixed and therefore there is no identifiability problem. We then generate N realisations of the value of the test statistic C w for different values of the relative population size w using the following algorithm: 1. Simulate death counts D w j as in (6) using the parameter vector t ; c 0 Þ to obtain scenarios D w j for different values of the relative population size w. The period effectsj are estimated from data for females with the cohort effect fixed to c 0 . The exposure is wE 0 where E 0 is the exposure for the female population in England and Wales. 2. Find the MLEh s;j of period effects j in scenario j assuming that the null hypothesis holds, as in (16). 3. Find the unrestricted MLEĥ j as in (15). 4. Calculate the value of the test statistic C w j in (17) in each scenario j. 5. Calculate the p-values P w j based on the asymptotic v 2 -distribution with a degrees of freedom, where a is the number of parameters (cohorts) r minus the number of constraints as in Sect. 5.2. For our data set we obtain a ¼ 87.
The simulated distribution functions of the LRT statistic C w and the p-values P w are shown in Fig. 14. The results suggest that changing the size of the population has no significant impact on the distribution of C w and that the p-values are roughly uniformly distributed for all w, which is an indication that the v 2 -approximation works well for our data set as we have also found in Sect. 5.2 where the full parameter vector was tested.

Empirical examples
We apply the LRT for the cohort effect in two empirical studies.

Females vs. males in England and Wales
The population for which we wish to test the cohort effect first is the female population in England and Wales that we already considered in our simulation study. Our null hypothesis is therefore that the true cohort effect for the female population in England and Wales is equal to the estimated cohort effect for males in England and Wales. Note that this is different from testing the hypothesis that the male and female population share the same (true) cohort effect since we ignore the uncertainty about the estimated cohort effect for males. To illustrate the difference between the two cohort effects we plot in Fig. 15 the estimated cohort effects for females and males. There are fairly strong similarities between the two curves after about 1910, but there are also significant qualitative differences before 1900. To check empirically, that these differences are not simply the result of the identifiability constraints, one can plotĉ ð4Þ;M Àĉ ð4Þ;F . If this looks quadratic then the differences could, simply, be due to the identifiability constraint. But for these data, a plot ofĉ ð4Þ;M Àĉ ð4Þ;F would clearly not be quadratic (exhibiting more of a cubic shape).
This difference can be confirmed more formally using the LRT with the null hypothesis that the females have the same cohort effect as the previously estimated males cohort effect. The test statistic C is approximately 6311, which is an extremely high value for a v 2 -distribution with 87 degrees of freedom and is also very high compared to the values of C observed in our simulation study, see Fig. 14. The p-value is therefore very close to zero, and we reject the null hypothesis that the cohort effect fro the mortality of the female population is the same as the previously estimated cohort effect for the male population.

Male mortality in Scotland vs. England and Wales
A second, and more intriguing, empirical example concerns the cohort effects estimated from mortality data for the male population in England and Wales versus the male population in Scotland. Figure 16 compares the independently-estimated cohort effects with a confidence interval added around the Scottish estimates. Compared to Fig. 15, the two curves here look much more similar, with the pattern ofĉ ð4Þ;EW Àĉ ð4Þ;S again not like a quadratic function with respect to cohort year c. On the other hand, we find that most of the cohort effects for males in England and Wales lie outside of the confidence interval calculated for Scottish males. This suggests that although the two populations have similar pattern for the cohort estimates, the difference might still be significant. For the LRT we again choose c 0 ¼ĉ EW and then test the hypothesis that the true cohort effect for Scottish males is equal to c 0 . The 99% quantile of a v 2 -distribution with 87 degrees of freedom is approximately 121. For the test statistic we find C ¼ 193:37 and we therefore reject the null hypothesis and conclude that the cohort effect in Scotland is significantly different from the estimated cohort effect for England and Wales. This indicates that there might be factors in the Scottish male population that result in significant differences throughout time. However, we might speculate that there is a common cohort effect, that is, for some reason, magnified in Scotland. Investigating this in detail is beyond the scope of this paper, but we speculate that a magnified effect might be the result of socio-economic differences between the two populations: for example, cohort effects might be greater in lower socio-economic groups.

Conclusion
In this paper, we investigated the finite sample distribution of the maximum likelihood estimators for the parameters of a stochastic mortality model. We found that the size of a population has a significant effect on the uncertainty about the estimated parameters and mortality projections. In particular, we found that there exists a bias in the estimated covariance matrix of the random walk fitted to the period effects when the size of the underlying population is small. As a consequence, prediction intervals are rather wide for small populations even when parameter uncertainty is ignored.
To investigate if parameters estimated from larger populations can be used to generate scenarios for smaller populations we investigated how a likelihood ratio Fig. 16 The estimates of cohort effect, for the males of England and Wales (solid line )and Scotland (dotted line), age 50 to 89 last birthday, over year 1961 to 2011. The dashedlines are the CI for the cohort effect of Scotland. The upper bound is 95% quantile ofthe distribution and the lower bound is 5% quantile test performs when applied to the mortality experience of a small population. We found that the finite sample distribution of the test statistic is very close to the asymptotically correct v 2 distribution and, therefore, the observed rejection rates are close to the chosen significance level. However, we also found that the power of the test depends strongly on the population size with the ability of the test to detect deviations from the null hypothesis being significantly reduced when the size of the underlying populations is small.
A brief investigation of annuity prices has shown that the misspecification of parameters has a limited financial impact. Considering shifts in the parameter values which the LR test would detect with a 50% chance we have seen that the impact of a small population size is significant for deferred annuities. To have a complete picture of possible further financial consequences, a more detailed study is required, which is beyond the scope of this paper.
In our empirical analysis we then applied the LRT, and found that neither of the mortality rates of the female population in England and Wales and the male population in Scotland should be modelled with a cohort effect estimated from the male population in England and Wales.
In this paper, we used the traditional two-stage fitting approach whereby the period and cohort effects are estimated using the Poisson maximum likelihood method in the first stage and a time series model is fitted to these effects in the second stage. We have found that sampling variation in the small population datasets has significant impact, which can then obscure the true signal in those effects, and giving rise to misleading forecasts. Bayesian approaches that combines the two stages into one, e.g., [29],  and [12]) can be used to provide a way to address this problem. However, as use of the two-stage approach is widespread (perhaps because of its relative simplicity) we have, here, attempted the first systematic analysis of the impact of population size on parameter estimates and forecasts using the two-stage approach. In this way, users of the two-stage approach will be better informed about its limitations as well as understanding how the likelihood ratio test might be used to exploit data from larger populations.
where f ðt; x; h w Þ ¼D w t;x log½mðt; x; h w Þ gðt; x; h w Þ ¼wE w t;x mðt; x; h w Þ and h(t, x) is not a function with respect to h w . The form of the column vector h w with 4ny þ na À 1 dimensions is The second derivative of lðh w ; D w t;x ; E w t;x Þ is For every pair of (t, x), mðt; x; h w Þ is a single value, thus the second derivative of f and g with respect to h w is a Hessian matrix with 4ny þ na À 1 rows and columns, with the form as