Including collinear variables (such as previous employment state and low pay state)

Andrew Black

Join Date: Aug 2014

Posts: 7
#1

23 Aug 2014, 04:36

Hello,
I am trying to estimate a model for low pay, which is binary (above or below a threshold). Two of the variables I want to include are whether the individual was unemployed last period and whether they were low paid last period. However, Stata always omits the lagged unemployment indicator due to collinearity. The model is:

probit lowpay L.lowpay L.unemp variable1 variable2.....

and the message is "L.unemp omitted due to collinearity". I suppose this must be because, if you were unemployed last period, there is no data on your pay last period. However, I know that it is possible to include both, as several papers that I have read have done so. Could anyone give me any advice on how to include both? I have tried using the "collinear" option but Stata still drops it.
Thanks so much
Joe Molitoris

Join Date: Aug 2014

Posts: 5
#2

23 Aug 2014, 05:27

Hi Andrew,

It would be helpful if you could post the output.

How have you defined "low pay"? Specifically, did you also define people with no pay as people with low pay? If so, then they will always fall into the "Low pay" category. Also, is the dependent variable based on labor income or wages or does it also capture non-labor income, such as welfare benefits or capital gains? What I am getting at is this: if all unemployed individuals are listed as having zero or missing income, and thus show no variation in lagged pay within the unemployed group, then no matter how you categorize "low pay", the unemployment dummy will be collinear with one of those categories.

On another note, if you have continuous data for wages, why not take advantage of that instead of using a binary model?
Andrew Black

Join Date: Aug 2014

Posts: 7
#3

23 Aug 2014, 05:53

Thanks for your response

Low pay is defined as being paid below a certain threshold, and a binary variable has been created for this. This binary variable is missing for those who are not in employment, and it does not take non-labour income into account, as the concern is specifically paid employment. The ultimate goal is to model transitions in and out of low pay, so it has to be a binary variable rather than a continuous one.
What you are saying does make sense, and that must of course be the problem I'm encountering, but in the literature people have included both in the same kind of model as I'm estimating.
This is the relevant output (the rest is just coefficients etc. on the other variables, and it is lbelnmw and lunemp together that cause the problem). I don't know how to present it properly yet so I hope it's legible.

. probit belnmw lbelnmw lunemp schyears potexp potexp2 sex health $ethnicity $years ireland scot wales lonse mastat

note: lunemp omitted because of collinearity
note: year_8 omitted because of collinearity
Iteration 0: log likelihood = -6004.5698
Iteration 1: log likelihood = -4839.2359
Iteration 2: log likelihood = -4776.7988
Iteration 3: log likelihood = -4776.4514
Iteration 4: log likelihood = -4776.4514

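Probit regression                                 Number of obs   =      28173
                                                  LR chi2(20)     =    2456.24
                                                  Prob > chi2     =     0.0000
Log likelihood = -4776.4514                       Pseudo R2       =     0.2045

------------------------------------------------------------------------------
      belnmw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lbelnmw |   1.410978   .0361735    39.01   0.000      1.34008    1.481877
      lunemp |          0  (omitted)
    schyears |  -.0651803   .0057091   -11.42   0.000      -.07637   -.0539906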
.....
Joe Molitoris

Join Date: Aug 2014

Posts: 5
#4

23 Aug 2014, 06:13

For future reference: you can paste the output in a more reader-friendly way by clicking on the A in the upper right-hand corner of the text box, and then clicking on the # button.

If the lagged low wage variable has a missing value whenever lagged unemployment =1, then Stata won't be able to estimate a coefficient for unemployment because all unemployed people will have missing lagged wages. If you think of the effect of lagged unemployment as conditional upon lagged wages, it can't be estimated if there are only missing lagged wages. When you include variables in your regressions, those observations with missing values will not be included in the analysis sample. In your case, this means all individuals with lagged unemployment=1 are being dropped because they have missing lagged wage values.
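
A quick way to check this in your data, as a sketch only: I am reusing the variable names from your output, and the shortened covariate list is just for illustration.

```stata
* Sketch only: belnmw, lbelnmw and lunemp are the variable names from post #3

* How many observations have lagged unemployment = 1 AND a non-missing
* lagged low pay indicator?
count if lunemp == 1 & !missing(lbelnmw)

* Cross-tabulate the two lagged indicators, showing missing values explicitly
tabulate lunemp lbelnmw, missing

* After fitting the model, look at lunemp within the estimation sample only
probit belnmw lbelnmw lunemp schyears potexp potexp2 sex health
tabulate lunemp if e(sample)
```

If the count is zero, lunemp equals 0 for every observation in the estimation sample, so it is collinear with the constant and Stata drops it.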
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#5

23 Aug 2014, 06:16

Originally posted by Andrew Black:

I don't know how to present it properly yet so I hope it's legible

Copy/paste exact Stata results and use the CODE delimiters to present them. In this particular case, the formatting is not that bad. But it's good practice to present code and results using appropriate formatting.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
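
As an illustration of point 3, a made-up data fragment could be posted along these lines. The names pid, year, unemp and lowpay, and all the values, are hypothetical and are not taken from your data.

```stata
* Hypothetical toy panel: variable names and values are invented for illustration
clear
input pid year unemp lowpay
1 1 0 1
1 2 1 .
1 3 0 0
2 1 1 .
2 2 0 1
2 3 0 1
end

* Declare the panel structure so that the lag operator can be used
xtset pid year
generate lunemp  = L.unemp
generate llowpay = L.lowpay

list, sepby(pid) noobs
```

Anyone can paste this into Stata and experiment with it. In this fragment llowpay is missing wherever lunemp is 1, which is the pattern under discussion.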
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#6

23 Aug 2014, 07:08

Andrew: in addition to the wise words from Roberto and Joe, may I suggest that it would be helpful if you were to provide explicit bibliographic references to the papers that you are referring to (full bibliographic details, including DOI or URL if possible). Then readers might be able to comment in a more enlightened fashion. As someone who's written a bit on low pay dynamics using UK panel data (Google on "Cappellari Jenkins low pay"), and without knowing precisely which papers you are trying to emulate, my reaction is:

It would be rather contrary to the main literature to include lagged unemployment state as a predictor in a one-equation probit model for low pay.

There is an additional issue too, which you don't refer to but which is relevant given your inclusion of lagged low pay status -- the so-called "initial conditions" issue. The error term in your probit plausibly consists of 2 components: a time-invariant individual component and an idiosyncratic error. The former is likely correlated with the low pay status first observed. Not accounting for this is likely to lead to biased estimates of your coefficient of interest -- the coefficient on lagged low pay (a "state dependence" parameter).

There are various ways of dealing with all these sorts of problems, most of which to date have involved two sorts of approach:

One is fitting some sort of multivariate probit model accounting for the relevant selections in the states. So, for example, equations for (1) low pay <-- lagged low pay, covariates; (2) [initial conditions] initial low pay <-- covariates, plus a covariate which acts as an "instrument". If you want to include factors like lagged (un)employment, then you need to model jointly and take account of the relevant sample selections. Start with a paper such as Stewart, M. B., and Swaffield, J. K. (1999). ‘Low pay dynamics and transition probabilities’, Economica, 66, 23–42, and then work forwards to e.g. Stewart, M. B. (2007). The interrelated dynamics of unemployment and low pay. Journal of Applied Econometrics, 22(3), 511–531. A paper of mine with Cappellari has equations for low pay, employment, initial conditions, and panel retention, with the appropriate selections controlled for and cross-equation error correlations. The code we used draws on code that we put out with a Stata Journal article, 6(2), 2006, which is freely downloadable from the SJ website.

Another approach is to use a so-called "dynamic random effects probit" (DREP) model. Googling on that should bring up references. [It deals with initial conditions, though not with the additional complications arising if you want lagged (un)employment in there.] For DREP models, Jeff Wooldridge (Forum member) has proposed a neat way to fit the models to account for initial conditions that is much easier to implement than the so-called Heckman approach: see Wooldridge, J. M. (2005). Simple solutions to the initial conditions problem in dynamic, nonlinear panel data models with unobserved heterogeneity. Journal of Applied Econometrics, 20(1), 39–54. His method can be implemented using xtprobit and its Stata 13 counterparts.
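
To give a flavour of the Wooldridge-type approach, here is a minimal sketch, not a definitive implementation. It assumes a panel identified by pid and year (hypothetical names), a 0/1 variable lowpay, and placeholder time-varying covariates x1 and x2. Using within-person means of the covariates rather than their full history is a common simplification, and it is an assumption of this sketch.

```stata
* Sketch only: pid, year, lowpay, x1 and x2 are placeholder names
xtset pid year

* Low pay status in each person's first observed wave (the initial condition);
* this is missing if the person has no observed pay in that wave
bysort pid (year): generate lowpay0 = lowpay[1]

* Within-person means of the time-varying covariates
bysort pid: egen x1bar = mean(x1)
bysort pid: egen x2bar = mean(x2)

* Random effects probit with lagged low pay, the initial condition,
* and the covariate means
xtprobit lowpay L.lowpay lowpay0 x1 x2 x1bar x2bar, re
```

The coefficient on L.lowpay is then the state dependence parameter. Note that this sketch does nothing about selection into employment.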
Andrew Black

Join Date: Aug 2014

Posts: 7
#7

23 Aug 2014, 08:18

Thanks Stephen for a fantastic answer. Thanks Joe and Roberto as well for advice on communicating results on here.
I am aware of the initial conditions problem, and I will use the Wooldridge method to deal with it. However, I want to estimate a pooled probit for the sake of comparison, and I thought it would be easier to discuss my query here using that. Also, thanks a lot for the reference to the Stata Journal article. I'm familiar with a lot of your work and was especially interested in the 2008 RSS paper, but as I'm new to Stata it seemed beyond me.
("Cappellari, L., Jenkins, S.P., 2008b. Estimating low pay transition probabilities accounting for endogenous selection mechanisms. Journal of the Royal Statistical Society: Series C (Applied Statistics) 57, 165–186")

One of the papers I was referring to was the "Stewart, M. B. (2007). The interrelated dynamics of unemployment and low pay. Journal of Applied Econometrics, 22(3), 511–531" paper you mention. In Table III of that paper he reports results with unemployment status as the dependent variable, and includes both unemployment status and low pay status at t-1 (top two lines of that table) for both pooled and Heckman estimators. This is essentially the same principle: when I run a pooled probit on my data with unemployment as the dependent variable, I can't include both lagged unemployment status and lagged low wage status as he does. Also, an IZA paper by Jones, Jones, Murphy and Sloane runs a pooled probit and a Wooldridge model including both lagged variables.

The dynamic random effects probit model is my chosen method because it is easier to program in Stata. However, it does trouble me that I haven't seen anyone use that method and control for endogeneity of employment retention as you and Cappellari do in your 2008 paper; clearly, looking at low pay dynamics requires that people are being paid at all. I apologise as this is not on the original topic, but is controlling for that possible with a dynamic RE probit? I wondered if a Heckman correction would be possible: running a probit for employment in the next period and using the inverse Mills ratio as a variable in the Wooldridge specification for the low wage equation. Would this be made inconsistent or difficult by the initial conditions etc.?
Thanks again
Andrew Black

Join Date: Aug 2014

Posts: 7
#8

23 Aug 2014, 08:20

Apologies, reference for the Jones et al paper: http://www.iza.org/en/webcontent/pub...act?dp_id=2595
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#9

23 Aug 2014, 09:11

Thanks for the citation(s). It's unclear to me how Jones et al. in the IZA DP paper (Table 4A) that you cite get around the problem of having both lagged low pay status and lagged non-employment status as predictors. I think you'll have to chase the authors and ask precisely how they defined those variables to take account of the issue that you first raised above.
I don't think there is a "straightforward" way! Hence the more complicated modelling approaches to take account of the selective nature of whether someone has a job at period t-1. If you don't have employment at t-1, then the lagged low pay status indicator variable is not equal to zero; its value is missing!
Mark Stewart got around this problem in one particular way in one of the sections of his great 2007 JAE paper -- but I recall that he doesn't have both lagged statuses in all his equations. Not entirely satisfied with Mark's approach to the observability issue, Lorenzo Cappellari and I took another approach in our papers [the Applied Statistics one and ‘Transitions between low pay and unemployment’, Chapter 8, pp. 57–79, in S. Polachek and K. Tatsiramos (eds), Research in Labor Economics, Volume 28, Elsevier, Amsterdam, 2008].
For your exploratory/initial modelling, e.g. getting your DREP model sorted out, I'd recommend dropping the lagged (non)employment predictor.
Andrew Black

Join Date: Aug 2014

Posts: 7
#10

24 Aug 2014, 03:21

Thanks again Stephen. As you say, I'll try to ask the authors, but I'll work without the lagged (non)employment predictor in the meantime. I have to say, though, that Mark presents a coefficient and average partial effects for both lagged indicators in every model he reports in that paper. I have started researching the multivariate probit models; maybe I can supplement the Wooldridge approach with results from these. Just to be clear, when you say Mark's approach to "observability", are you referring to his treatment of the potential selection bias arising from ignoring attrition and employment retention?
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#11

24 Aug 2014, 03:37

I looked again quickly at Mark's paper. Mark Stewart (2007, e.g. Table III) has current unemployment status as the depvar, and lagged unemployment and lagged low pay status as predictors in his DREP model. Unemployment is a 0/1 variable. Low pay status is also 0/1 (low pay; high pay), but presumably unobserved in current and previous period if unemployment indicator = 1. I don't know what values Mark attributed to lagged low pay in the cases when lagged unemployment indicator = 1. [The potential endogeneity of sample retention is a separate issue.]
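
One purely illustrative possibility (I am not claiming that this is what Mark or anyone else did) is to treat the lagged states as three mutually exclusive categories, so that those unemployed at t-1 are placed in a category of their own rather than having a missing lagged status. A sketch, reusing Andrew's variable names with a shortened covariate list; lagstate is a hypothetical new variable:

```stata
* Hypothetical coding, not confirmed as the one used in any of the cited papers
* lagstate: 0 = higher paid, 1 = low paid, 2 = unemployed at t-1
generate lagstate = 0 if lbelnmw == 0
replace  lagstate = 1 if lbelnmw == 1
replace  lagstate = 2 if lunemp == 1

* Check the coding against the original indicators
tabulate lagstate lunemp, missing

* Higher paid at t-1 is the base category
probit belnmw i.lagstate schyears potexp potexp2 sex health
```

This gets both lagged statuses into the equation mechanically, but it does nothing about the selection issues discussed earlier in the thread.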