More often than not, we are presented with lies in our media and even in our academic papers. This is not because researchers consciously produce dishonest material, but because the extreme complexity of statistical analysis methods is constantly underestimated. In a workshop paper available via Scholars at Harvard, Gary King (1985) describes “a set of serious theoretical mistakes appearing with troublingly high frequency throughout the quantitative political science literature”. These mistakes include simple things such as the omission of exogenous variables in multivariate linear regressions and “are all based on faulty statistical theory or on erroneous statistical analysis”. Often, these mistakes could be spotted (and prevented) by going back to the core of statistical analysis: the Gauss-Markov assumptions. In this article, I will explain a common issue in micro-econometric research: the age-period-cohort (APC) problem.
While doing micro-econometric research using panel data, one risks violating the Gauss-Markov assumption of no perfect multicollinearity. This classical assumption dictates that none of the independent variables can be an exact linear combination of the others, nor can an independent variable be constant over all observations. Linear dependence between independent variables is very common when one works with micro-econometric panel data, as it is often interesting to study age, period and cohort effects on a dependent variable. The issue here is that for an individual $i$ in year $t$:
$$a_{it} = t - c_i \tag{1}$$
where $t$ denotes calendar year, $a_{it}$ is the age (in years) of individual $i$ at year $t$, and $c_i$ is the year of birth of individual $i$. Given (1) and considering the case where age, period and cohort effects are additive (which we do by default using a linear model without interaction effects), the following issue arises:
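Let $y_{it}$ be some dependent variable which we regress on age $a_{it}$, calendar year $t$ and year of birth $c_i$, with age effect $\alpha$, period effect $\beta$ and cohort effect $\gamma$, while including an intercept $\mu$. Because age equals calendar year minus year of birth, $a_{it} - t + c_i = 0$, and the model can be written as follows for any constant $k$:
$$\begin{aligned}
y_{it} &= \mu + \alpha\, a_{it} + \beta\, t + \gamma\, c_i + \varepsilon_{it} \\
&= \mu + \alpha\, a_{it} + \beta\, t + \gamma\, c_i + k\,(a_{it} - t + c_i) + \varepsilon_{it} \\
&= \mu + (\alpha + k)\, a_{it} + (\beta - k)\, t + (\gamma + k)\, c_i + \varepsilon_{it}
\end{aligned} \tag{2}$$
This clearly shows that $\alpha$, $\beta$ and $\gamma$ cannot be independently estimated, since the first and the last line of equation (2) are equal for every $k$. Therefore, we cannot simply run a linear regression using these three independent variables without making any adjustments, as this would violate the Gauss-Markov assumption of no perfect multicollinearity. Since we will not specify a dependent variable, nor other independent variables, throughout this article, we will assume that the expected value of $\varepsilon_{it}$ conditional on the regressors is zero for all $i$ and $t$. Therefore, we will only have difficulties with the no perfect multicollinearity assumption and not also with the strict exogeneity assumption.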
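The linear dependence described above can be checked numerically. The following is a minimal sketch on a made-up panel (the birth years, calendar years and coefficient values are assumptions chosen purely for illustration): the design matrix with intercept, age, year and year of birth has rank 3 instead of 4, and shifting the three slopes by an arbitrary constant leaves the fitted values unchanged.

```python
import numpy as np

# Hypothetical panel: every combination of birth year and calendar year.
cohort, year = np.meshgrid(np.arange(1950, 1960), np.arange(2000, 2010))
cohort = cohort.ravel().astype(float)
year = year.ravel().astype(float)
age = year - cohort  # age = calendar year - year of birth

# Design matrix with intercept, age, period and cohort.
X = np.column_stack([np.ones_like(age), age, year, cohort])
rank = np.linalg.matrix_rank(X)  # one less than the number of columns

# Two different coefficient vectors (intercept, age, period, cohort)
# that produce exactly the same fitted values.
coef_a = np.array([1.0, 0.5, 0.2, 0.1])  # made-up values
k = 3.0                                  # arbitrary shift
coef_b = coef_a + k * np.array([0.0, 1.0, -1.0, 1.0])
same_fit = np.allclose(X @ coef_a, X @ coef_b)

print(rank, X.shape[1], same_fit)
```

Any OLS routine that returns coefficients for this design matrix has silently picked one of infinitely many equally good solutions.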
The point identification problem inherent to the linear age-period-cohort model has been widely researched. De Ree & Alessie (2011) mention that this problem has been a point of discussion since the 1970s and investigate which information can be extracted from the data without making any assumptions. Mason & Fienberg (1985) published a book collecting most of the noteworthy literature on the APC problem up to that point, including a paper by Heckman & Robb that describes the use of proxies for period and cohort effects which capture the underlying processes causing these effects. This idea was later applied by researchers in different fields, such as health care (Portrait et al. (2003)) and labour economics (Euwals et al. (2010)). In the first sections of Browning et al. (2012), the APC problem is explained extensively and many references are made to potential and explored applications. For more background on the APC problem, I recommend reading this paper.
There are three operations one should undertake in sequence to circumvent the issues around the APC problem:
1. Find some “performance indicators” on your data without doing regression.
2. Make an assumption on the age effect $\alpha$, the period effect $\beta$ or the cohort effect $\gamma$, so that you can run a regression close to equation (2) to obtain results. Typically, you do this step several times with different assumptions.
3. See which model from step 2 has the same “performance indicators” as the ones that you extracted from your data in step 1.
It is interesting to note that the APC problem is an issue that looks small at first sight, yet generates substantial extra statistical work even though it violates only one Gauss-Markov assumption; this makes it an excellent example of the frailty of statistical analysis without the right tools.
Step 1
By replacing model (2) with a model where each age, each year and each year of birth is assigned a dummy variable (so basically by creating a load of dummies), De Ree & Alessie (2011) show a method to identify age, period and cohort profiles orthogonal to their linear trends without making any assumptions on the data. This gives insight into the curvature of the age, period and cohort effects, since they are usually not linear (e.g. the effect of being 40 years old on an individual’s income is clearly not half of the effect of being 80 years old). This process is quite complicated, but also extremely interesting and impressive, so I recommend that the interested reader read the methodology section of De Ree & Alessie (2011). For now, we will accept that they developed this method to extract some “performance indicators” from data suffering from the APC problem without making any assumptions.
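To give a feel for why deviations from the linear trend can be recovered without identifying assumptions, here is a minimal sketch on made-up, noise-free data. It is not the procedure of De Ree & Alessie (2011), only an illustration of the underlying idea; the helper names `dummies` and `detrend`, the grid of ages and years, and the assumed effect sizes are all hypothetical. The full dummy model is underidentified, yet the age coefficients of any least-squares solution, once their intercept and linear trend are removed, match the detrended true age profile.

```python
import numpy as np

def dummies(values):
    """One 0/1 column per distinct value, in sorted order."""
    levels = np.unique(values)
    return (values[:, None] == levels[None, :]).astype(float), levels

def detrend(profile, x):
    """Remove the intercept and linear trend in x from a profile."""
    Z = np.column_stack([np.ones(len(x)), x])
    fitted = Z @ np.linalg.lstsq(Z, profile, rcond=None)[0]
    return profile - fitted

# Hypothetical noise-free data on a full age-by-year grid.
age, year = np.meshgrid(np.arange(20, 30), np.arange(2000, 2010))
age, year = age.ravel().astype(float), year.ravel().astype(float)
cohort = year - age

def true_age_effect(a):
    # Assumed curved age profile (made-up numbers).
    return 0.3 * a - 0.05 * (a - 25.0) ** 2

y = true_age_effect(age) + 0.1 * year + 0.2 * (cohort - 1975.0)

A, age_levels = dummies(age)
P, _ = dummies(year)
C, _ = dummies(cohort)
X = np.column_stack([A, P, C])

# Minimum-norm least-squares solution of the underidentified dummy model.
coef = np.linalg.lstsq(X, y, rcond=None)[0]
age_coef = coef[: A.shape[1]]

estimated_curvature = detrend(age_coef, age_levels)
true_curvature = detrend(true_age_effect(age_levels), age_levels)
print(np.round(estimated_curvature, 3))
print(np.round(true_curvature, 3))
```

The level and slope of the age profile remain unidentified in this sketch; only its shape around the linear trend is pinned down by the data.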
Step 2
Essentially, some kind of assumption is necessary to identify a model which controls for all three APC variables and actually portrays both the linear and the non-linear effects. Such an assumption should be based on existing knowledge or literature, so as to avoid arbitrary results. There are three well-known approaches that are useful to know:
1. The proxy variable approach.
2. The functional form approach.
3. The clustering “approach”.
Approach 1
Euwals et al. (2010) make the following educated assumption: the year effect on female labour force participation correlates with the unemployment rate. This is because an increase in the unemployment rate discourages those without a job from joining the labour market. They reason that replacing the year effect dummies by the unemployment rate per education group in that year is a valid way to prevent perfect multicollinearity: the unemployment rate at time $t$ is obviously not a linear combination of an observation’s age at time $t$ and its year of birth. Euwals et al. (2010) used this assumption, which is often called the proxy variable approach, to prevent perfect multicollinearity for data on Dutch women.
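As a minimal sketch of the proxy variable idea, the snippet below replaces the calendar-year regressor with an unemployment series and compares the rank of the two design matrices. The unemployment figures are made up, and a single rate per year is used rather than a rate per education group, so this illustrates the mechanism only, not the specification of Euwals et al. (2010).

```python
import numpy as np

cohort, year = np.meshgrid(np.arange(1950, 1960), np.arange(2000, 2010))
cohort, year = cohort.ravel().astype(float), year.ravel().astype(float)
age = year - cohort

# Hypothetical unemployment rate (in %) per calendar year; made-up numbers.
unemployment_by_year = dict(zip(
    range(2000, 2010),
    [3.1, 2.9, 3.4, 4.6, 5.2, 5.5, 4.8, 4.1, 3.6, 4.3],
))
unemployment = np.array([unemployment_by_year[int(t)] for t in year])

ones = np.ones_like(age)
X_apc = np.column_stack([ones, age, year, cohort])            # age, period, cohort
X_proxy = np.column_stack([ones, age, unemployment, cohort])  # period replaced by proxy

rank_apc = np.linalg.matrix_rank(X_apc)      # rank deficient
rank_proxy = np.linalg.matrix_rank(X_proxy)  # full column rank
print(rank_apc, rank_proxy)
```

Full rank only means the coefficients can be computed; whether they are meaningful still depends on the proxy actually capturing the period effect.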
Approach 2
In the same paper, Euwals et al. apply a transformation function to the cohort variable that suits female labour force participation, which is known as the functional form (specification) approach. They correctly observe that over the past years there have been large increases in the female labour market participation rate, and that it will be impossible to observe similar growth forever, as labour market participation obviously has a maximum of 100%. Therefore, there is reason to suspect that the cohort effect might not be linear, but logarithmic (with growth slowly levelling off). So instead of using dummies for cohort effects, they use a non-linear transformation of the cohort variable as a regressor, which automatically removes the exact linear dependency between age, period and cohort.
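The functional form approach can be sketched in the same way. Below, the cohort regressor is replaced by a logarithm of years since an arbitrary base year; this particular transformation, the base year 1949 and the coefficient values are assumptions for illustration and not the specification used by Euwals et al. (2010). With noise-free data generated from the transformed model, least squares recovers the coefficients uniquely.

```python
import numpy as np

cohort, year = np.meshgrid(np.arange(1950, 1960), np.arange(2000, 2010))
cohort, year = cohort.ravel().astype(float), year.ravel().astype(float)
age = year - cohort

# Assumed functional form: cohort effect is logarithmic in years since 1949.
log_cohort = np.log(cohort - 1949.0)

# Intercept, age, period, transformed cohort.
X = np.column_stack([np.ones_like(age), age, year, log_cohort])

# Noise-free outcome built from made-up coefficients,
# to check that they are recovered.
true_coef = np.array([2.0, 0.3, 0.1, 1.5])
y = X @ true_coef

estimated_coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank, np.round(estimated_coef, 4))
```

Identification here comes entirely from the assumed curvature: if the true cohort effect also had a linear component, it would be absorbed by the age and period coefficients.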
Approach 3
I put “approach” in quotation marks here because, although it is often used, many econometricians dispute it because of its possible inaccuracy (yet another way to get results that are most probably untrue). Researchers assume that, for instance, the cohort effect does not change with every single year of birth. This sounds reasonable, as the impact of year of birth on income is most probably not enormous between 1891 and 1894. Therefore, one could cluster cohort dummies together in blocks of e.g. 5 years, instantly relieving the linear dependency between the APC variables.
For instance, an individual born in year $c$ and an individual born in year $c + 1$ can be in the same cohort block (which observationally gives them the same year of birth), while they do not have the same age. Consequently, individuals $i$ and $j$ in a given period $t$ can have a different age while being part of the same cohort block, i.e. for $i \neq j$ with $c_i \neq c_j$ and both years of birth in the same block, we have $a_{it} \neq a_{jt}$. Age is then no longer an exact linear function of period and (clustered) cohort, so the regressors are no longer perfectly multicollinear.
Although this approach appears very logical, in reality it gives peculiar results. One could apply the clustering approach with age blocks of 2 years and of 5 years on the same data, with the same dependent and independent variables, and observe diametrically opposed results! This means that the size of your clusters determines the outcome of your research, which it absolutely should not. I will finish this section as I started it: an identifying assumption should be based on existing knowledge or literature, so as to avoid arbitrary results.
Figure 1: age profiles on the same data using 5-year and 2-year age clusters. Source: De Ree & Alessie (2011)
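The mechanics of the clustering “approach” can be sketched as follows, here with cohort blocks on a made-up panel (the helper names `design_with_cohort_blocks` and `rank_deficiency` are hypothetical). With blocks of one year the cohort dummies still span year of birth, so the design matrix remains rank deficient; with wider blocks the exact dependency disappears.

```python
import numpy as np

cohort, year = np.meshgrid(np.arange(1950, 1970), np.arange(2000, 2010))
cohort, year = cohort.ravel(), year.ravel()
age = year - cohort

def design_with_cohort_blocks(width):
    """Intercept, linear age, linear year, and dummies for cohort blocks."""
    block = (cohort - cohort.min()) // width
    levels = np.unique(block)[1:]  # first block is the reference category
    block_dummies = (block[:, None] == levels[None, :]).astype(float)
    return np.column_stack([np.ones(age.size), age, year, block_dummies])

def rank_deficiency(width):
    """Number of columns minus the rank of the design matrix."""
    X = design_with_cohort_blocks(width)
    return int(X.shape[1] - np.linalg.matrix_rank(X))

deficiency = {width: rank_deficiency(width) for width in (1, 2, 5)}
print(deficiency)
```

Both the 2-year and the 5-year blocks pass this rank check, which is exactly the concern raised above: the check cannot tell you which block width, if any, gives credible estimates.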
Step 3
Once you have chosen which assumptions you would like to make, you can estimate the different models using the different identifying assumptions. At this point, there is no way to tell which of the identifying assumptions represents the data most faithfully. The only thing one can do is compare the results to the estimates orthogonal to the linear trend and ‘guesstimate’ the steepness of the linear trend from existing literature on your specific research topic (we all know, for instance, that there will be a positive linear cohort effect on income). Finally, even when one takes every extra step necessary and starts from a sound set of assumptions, it is still impossible to be certain that the chosen model is the right one.
The Gauss-Markov assumptions can be violated very easily, and the APC problem describes just one violation of one of the assumptions. Through a series of strenuous steps (the first one being by far the most demanding statistically), it is still possible to obtain decent results while circumventing the APC problem. This is part of the beauty of the science; however, it also raises concerns, since running ordinary OLS in any computer program without all of these steps will still return some results. The computer only does the algebra, so the person who gives the computer its orders should know the statistics. Otherwise, we will keep reading inaccuracies in our media “based on faulty statistical theory or on erroneous statistical analysis.”
References
Browning M, Crawford I, Knoef M (2012) The age-period cohort problem: set identification and point identification, The Institute for Fiscal Studies & Department of Economics, UCL, working paper CWP02/12
De Ree J, Alessie R (2011) Life satisfaction and age: Dealing with underidentification in age-period-cohort models, Social Science & Medicine 73, 177-182
Deaton A, Paxson C (1994) Intertemporal Choice and Inequality, Journal of Political Economy, Vol. 102, 437-467
Euwals R, Knoef M, Van Vuuren D (2010) The trend in female labour force participation: what can be expected for the future?, Springer-Verlag 2010, published online
King G (1985) How not to lie with statistics, New York University & Scholars at Harvard
Mason W M, Fienberg S E (1985) Cohort analysis in social research – Beyond the identification problem, Springer-Verlag
Portrait F, Alessie R, Deeg D (2002) Disentangling the age, period, and cohort effects using a modeling approach, Tinbergen Institute Discussion Paper 2002-120/3
This article was written by Wouter Nientker