  • Running a regression with total annual hours worked and total annual wages

    I am currently running two fixed effects regressions. One is to illustrate the impact of gambling on total annual hours worked. The other is to illustrate the impact of gambling on total annual wages.

    I have decided to use Log Wages and also Log Hours, which means that all individuals with zero wages and/or zero hours worked are excluded (so only employed individuals are in the sample).

    Currently, pretty much all of the variables that are significant in the regression for total annual wages are also significant in the regression for total hours worked. Is this an issue? I don't think that I can realistically find any variables that wouldn't be in both equations... I've read some literature and people use variables like non-labour income for hours worked and experience for wages, but I don't have these variables in my dataset. I tried using marital status (which some literature says only affects hours worked) but I find it significant for wages too, and in my head, I can understand why it would be.

    If it is important for me to have different variables in each equation, should I maybe ignore the regression results and put marital status and number of kids solely in the regression for hours worked?

    Also, given that my panel covers only 6 years (with individuals ranging from ages 21 to 40), do you think I can say that age shouldn't make a difference to total hours worked, since those are prime working ages? My regression shows that age does make a difference to hours worked, though...
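
    Roughly, my setup looks like this (just a sketch; the variable names below are placeholders for the ones in my actual dataset):
    Code:
    * placeholder names: id, year, annual_wages, annual_hours, gamble, married, kids
    xtset id year
    gen log_wages = log(annual_wages)   // missing whenever annual wages are zero
    gen log_hours = log(annual_hours)   // missing whenever annual hours are zero
    xtreg log_wages gamble married kids i.year, fe vce(cluster id)
    xtreg log_hours gamble married kids i.year, fe vce(cluster id)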

  • #2
    I have decided to use Log Wages and also Log Hours, which means that all individuals with zero wages and/or zero hours worked are excluded (so only employed individuals are in the sample).
    That is almost certainly a bad idea. It makes your model meaningless because it does not generalize to people who do not work or do not earn wages from their work. But since these are your outcome variables, it means that it is impossible to predict the outcome for any case unless you already know the outcome, at least to the extent of whether or not it is zero. Using a Poisson model is probably a much better idea, and because it is based on a logarithmic link, it will accomplish most of what a log transformation with -xtreg, fe- would if that were sensible to do.
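
    A minimal sketch of what I have in mind, with invented variable names (adapt to your own data, and check that your version of Stata allows -vce(robust)- with -xtpoisson, fe-):
    Code:
    * fixed-effects Poisson regression of hours on the gambling variable; invented names
    xtset id year
    xtpoisson annual_hours gamble married kids i.year, fe vce(robust)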

    If it is important for me to have different variables in each equation, should I maybe ignore the regression results and put marital status and number of kids solely in the regression for hours worked?
    But why do you think it might be important to have different variables in each equation? It should surprise nobody that hours worked and wages earned would exhibit a pretty strong correlation, and that any variable that is a good predictor of one is likely to be a good predictor of the other. You should select your variables on the basis of their expected relevance to the outcomes you are modeling. If that means you use the same predictors for both outcomes, there is no problem with that.



    • #3
      Originally posted by Clyde Schechter
      That is almost certainly a bad idea. It makes your model meaningless because it does not generalize to people who do not work or do not earn wages from their work. But since these are your outcome variables, it means that it is impossible to predict the outcome for any case unless you already know the outcome, at least to the extent of whether or not it is zero. Using a Poisson model is probably a much better idea, and because it is based on a logarithmic link, it will accomplish most of what a log transformation with -xtreg, fe- would if that were sensible to do.
      Hi Clyde,

      Thank you so much for the pointers and sorry for the basic questions.
      Last edited by Solomon Lin; 07 May 2016, 20:29.



      • #4
        when logging variables with 0 values, it's possible to circumvent the problem by:
        1) using two-part models (google it; there's also a user-written Stata package for this, called tpm if I remember correctly) -- a minimal sketch is at the end of this post.
        2) logging the variable + 1.

        meaning, for 2, instead of:
        Code:
        gen log_y = log(y)
        which will result in missing values for every observation where y=0, as you well know.
        you can use:
        Code:
        gen log_y = log(y+1)
        which will result in a value of 0 (log(1)=0) for every observation where y=0.

        obviously, if you go with the second option, you need to make it clear that that's what you're doing and not "hide" it or something.
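
        for 1, the simplest hand-rolled version looks something like this (made-up variable names, and ignoring the panel structure for the moment):
        Code:
        * part 1: does the person work at all?  part 2: hours among workers only
        gen byte works  = (annual_hours > 0) if !missing(annual_hours)
        gen log_hours   = log(annual_hours) if annual_hours > 0
        logit   works     gamble married kids, vce(robust)
        regress log_hours gamble married kids, vce(robust)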



        • #5
          Solomon Lin No need to apologize for your questions. First, they are not beginner questions, in the first place. More important, no question, no matter how elementary, is unwelcome on Statalist. Yes, we expect people to first consult the help files and manuals, and do a little thinking, but that doesn't always lead to an answer, and sometimes it isn't clear where in those resources the answer may lie. And we were all beginners at one time.

          Ariel Karlinsky I think two part models are an excellent approach to this situation, especially if there is reason to think that the determinants of a zero outcome may be rather different from the determinants of non-zero outcomes--a reasonable guess in Solomon's situation. I disagree, however, with using log(y+1). The problem is that the choice of 1 as an offset is totally arbitrary. If the goal is just to get rid of zeroes, any positive offset lower than the minimum positive value in the data will clobber the zeroes and preserve ordinal properties. But the corresponding values of log(0+offset) can range from some maximum value all the way down to negative infinity. The results of the subsequent analysis are often sensitive to the choice of the offset used because these data points are bound to be outliers and may also exert considerable leverage if their predictors are not located in the center. It can be an awful mess, especially if the zeroes constitute more than just a handful of observations.
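
          If you want to see that sensitivity concretely, a quick check along these lines (invented variable names; data already -xtset-) will usually show the coefficient drifting as the arbitrary offset shrinks:
          Code:
          * invented names; compare the slope as the offset added before logging changes
          foreach off in 1 0.1 0.001 {
              gen double logy_off = log(y + `off')
              quietly xtreg logy_off x, fe
              display "offset = `off'   coefficient on x = " %9.4f _b[x]
              drop logy_off
          }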

          I think there are three different reasons people have for log-transforming an outcome. (Maybe there are more that I'm not aware of.) Sometimes it's done because theory or exploratory analysis has suggested that the actual form of the outcome-predictor relationship is logarithmic, except perhaps where 0's pop up. In that case, using a generalized linear model with a log link, Poisson being one such model but not the only one, is both theoretically justified and works beautifully: the zeroes are not a problem as they just represent sampling variation around a predicted non-zero value.

          Another reason people log-transform is just because the range of magnitudes covered by the outcome variable is very large and they want to compress it. Again, a generalized linear model with a log link is often helpful here. Or the purpose may be served by another range-compressing transformation such as the square root (if there are no negative values) or the cube root (which handles negative values as well), as well as other less frequently used possibilities.

          The third reason people sometimes use a log transformation is because they want to interpret their regression coefficients as elasticities or semi-elasticities. Here, too, a generalized linear model with log link will fit the bill.
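
          In Stata terms, any of these might look something like the following (invented names; the Poisson family is shown here, but it is not the only family that can be paired with a log link):
          Code:
          glm annual_hours gamble married kids, family(poisson) link(log) vce(cluster id)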



          • #6
            Originally posted by Clyde Schechter
            I think two part models are an excellent approach to this situation, especially if there is reason to think that the determinants of a zero outcome may be rather different from the determinants of non-zero outcomes--a reasonable guess in Solomon's situation. I disagree, however, with using log(y+1). The problem is that the choice of 1 as an offset is totally arbitrary. If the goal is just to get rid of zeroes, any positive offset lower than the minimum positive value in the data will clobber the zeroes and preserve ordinal properties. But the corresponding values of log(0+offset) can range from some maximum value all the way down to negative infinity. The results of the subsequent analysis are often sensitive to the choice of the offset used because these data points are bound to be outliers and may also exert considerable leverage if their predictors are not located in the center. It can be an awful mess, especially if the zeroes constitute more than just a handful of observations.
            Hi Clyde,

            Thanks again for the detailed reply. I have looked into various methods for dealing with selection bias (e.g., Heckman and Poisson), but I think the main issue is that I need to be able to run both:

            1) xtreg, fe cluster
            2) xtivreg2, fe cluster

            Basically, I need to be able to run fixed effects estimation and also fixed effects estimation with IVs (with robust or clustered standard errors).
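
            Concretely, something along these lines (placeholder names, with z standing in for my instrument; year dummies omitted for brevity):
            Code:
            * xtivreg2 is community-contributed (ssc install xtivreg2)
            xtreg    log_hours gamble married kids, fe vce(cluster id)
            xtivreg2 log_hours married kids (gamble = z), fe cluster(id)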

            I will look into the two-part model, but I believe that Heckman cannot be used with panel data (at least in Stata), and I also can't see an instrumental-variable form of the Poisson regression (I can only find xtpois).

            I think ultimately, I may have to live with the bias.

