Returns to education regression

Ryan Matthews

Join Date: Nov 2018

Posts: 11
#1

29 Nov 2018, 20:57

Hi everyone, I am new to Stata and have a question about a returns-to-education regression I would like to run. My goal is to find out whether returns to education differ for U.S. citizens vs. non-citizens. This is the functional form that I think would be most appropriate: log(wage) = B0 + B1*years of education + B2*male + B3*experience + B4*experience^2 + B5*citizenship. What are your thoughts? Do you recommend I add or remove anything? Also, I was thinking about adding age and age^2 terms, but I calculate experience as (age - years of education - 6), so I thought that would cause collinearity between age and experience. I would appreciate any feedback.
Tags: None
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

29 Nov 2018, 21:40

Welcome to the Stata Forum / Statalist,

Theoretically speaking, your query relates to modeling. Since it is a broad question, posted without output or commands, I cannot see what kind of reply would suffice to guarantee that your decisions are correct.

Since you said you're new to Stata, you may start by reading the help for the -regress- command. Then, please take a look at the postestimation tests.

Best regards,

Marcos
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

29 Nov 2018, 22:45

The model proposed in #1 would not answer a question about whether the returns to education differ between citizens and non-citizens. It would, instead, answer the different question: do citizens and non-citizens have different wage levels, after adjustment for, among other things, education. To determine whether the returns to education differ, you would need to add to this model an interaction term between citizenship and years of education.

I also worry a bit about using log wages as the outcome variable. If there are people who are not working in your data set, or whose income comes only from non-wage sources, these people will have a wage of 0, and you cannot take the logarithm of that. A better alternative, if this happens in your data set, is to use a generalized linear model with a log link function, and one that is especially attractive here might be Poisson regression.
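As a rough sketch of the log-link alternative described above (not the poster's actual data): the example below uses the nlsw88 dataset that ships with Stata, with -union- standing in for a two-level citizenship variable and -grade- for years of education. Those stand-ins, and the variable name exper, are assumptions for illustration only. That dataset contains only women, so there is no male indicator here.

```stata
* Sketch only: nlsw88 ships with Stata and stands in for a CPS extract.
* union is a stand-in grouping variable, NOT citizenship.
sysuse nlsw88, clear

* Potential experience as defined in #1: age - years of education - 6
generate exper = age - grade - 6

* Poisson regression on wage in levels, with robust standard errors.
* Coefficients are on the log scale; a wage of 0 would not be dropped.
poisson wage i.union##c.grade c.exper##c.exper, vce(robust)

* The same model written as a GLM with an explicit log link
glm wage i.union##c.grade c.exper##c.exper, family(poisson) link(log) vce(robust)
```

Stata prints a note that the outcome is not a count; with robust standard errors that is generally regarded as acceptable for this use.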

Given what is already in your model, an age term would indeed be collinear with experience and education, so if you do add it, Stata will omit it (or one of the others with which it is collinear). The separate effects of age, education and experience (as you define experience) are not identifiable, so pick the two you want to focus on and forget about the other.
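A minimal way to see this for yourself, again assuming the nlsw88 example data as a stand-in (the names lnwage and exper are made up for the illustration): because experience is built as age - education - 6, the three variables are exactly linearly dependent.

```stata
* Sketch only: nlsw88 stands in for the poster's data.
sysuse nlsw88, clear
generate lnwage = ln(wage)

* Experience defined exactly as in #1
generate exper = age - grade - 6

* age = exper + grade + 6 holds for every observation, so the three
* regressors are perfectly collinear and Stata drops one of them.
regress lnwage grade exper age
```

The output should flag one of the three regressors as omitted because of collinearity; which one gets dropped is Stata's choice, not a substantive finding.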

Finally, this is going to be a fairly complex model, and interpretation of the results will be greatly facilitated by the use of the -margins- command after you run the regression. Bear in mind that in order to use the -margins- command you must properly use factor-variable notation to (a) distinguish categorical from continuous variables, (b) construct quadratic terms, and (c) construct interaction terms. If you fail to do that correctly, your -margins- results will be incorrect. Since you are new to Stata, I suggest you read -help fvvarlist- first, and then introduce yourself to the -margins- command with Richard Williams' excellent handout: https://www3.nd.edu/~rwilliam/stats/Margins01.pdf
1 like
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

30 Nov 2018, 00:34

Ryan:
as an aside to the previous helpful advice, I would check whether your model suffers from endogeneity: individual ability (which I could not find among your predictors) is correlated with both educational attainment and wage negotiation.
See chapter 6 of https://www.stata.com/bookstore/micr...metrics-stata/.

Kind regards,
Carlo
(Stata 19.0)
1 like
Comment
Ryan Matthews

Join Date: Nov 2018

Posts: 11
#5

30 Nov 2018, 08:11

A huge thank you to everyone, especially to Clyde. I am so appreciative. I did not think of adding an interaction term. I will run a probit or logit command to find the margins. Thank you for the additional reading.
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

30 Nov 2018, 10:49

Be aware that a model with a log link function (as mentioned in #3) is neither a logit nor a probit model.

To run such model(s) you need a binary DV.

To learn more about factor notation, please read this Stata Tip, written by Maarten Buis and published in the Stata Journal.

Best regards,

Marcos
2 likes
Comment
Ryan Matthews

Join Date: Nov 2018

Posts: 11
#7

30 Nov 2018, 14:36

Yes, I am sorry, it has nothing to do with probability. I do have another question about F-tests. My regression would be: log(wage) = B0 + B1*years of education + B2*male + B3*experience + B4*experience^2 + B5*citizenship + B6*citizenship*average years of education. Would I do an F-test on B0 and B5? Would my restricted model leave out B6, the interaction term between citizenship and average years of education? I also feel like my B2 male coefficient is messing things up and that I should get rid of it for interpretation purposes?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#8

30 Nov 2018, 15:35

If you are looking for a significance test for the difference in marginal returns to education between citizens and non-citizens, that would just be the t- or z-test (depending on what regression you are using) that you get in the output along with coefficient B6. If you would prefer a likelihood ratio test, and assuming your regression is one that is estimated by maximum likelihood, you can get that by running the model with and without the interaction term and then doing a likelihood ratio test for the difference between the models.
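To make the two options concrete, here is a hedged sketch using the nlsw88 example data, with -union- as a stand-in for citizenship and -grade- for years of education (both are assumptions, as are the names lnwage, exper, full and restricted). -testparm- gives the Wald test of the interaction, and -lrtest- compares the models with and without it.

```stata
* Sketch only: nlsw88 and union are stand-ins, not the poster's data.
sysuse nlsw88, clear
generate lnwage = ln(wage)
generate exper = age - grade - 6

* Full model: the returns to education may differ by group
regress lnwage i.union##c.grade c.exper##c.exper

* Wald test of the interaction; with a two-level group this is
* equivalent to the t-test shown in the output for that coefficient
testparm i.union#c.grade
estimates store full

* Restricted model without the interaction, on the same sample
regress lnwage i.union c.grade c.exper##c.exper if e(sample)
estimates store restricted

* Likelihood-ratio test of the restricted model against the full model
lrtest full restricted
```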

I can't honestly find a way to express in words what an F-test on B0 and B5 would represent as a statistical hypothesis. It seems incoherent to put those two together. I'm wondering what you have in mind here?

I also feel like my B2 male coefficient is messing things up and that I should get rid of it for interpretation purposes?

In what sense do you perceive the male variable to be "messing things up"? Even apart from how you might answer that question, it is, in general, not a good idea to add variables to or remove them from a model based on how they affect the results. That's not science; it's noise mining. The inclusion or exclusion of variables from models should be decided before you actually run any model, and should be based on the best scientific knowledge of the content matter available to you. Sometimes you are forced to remove a variable from a model for technical reasons: its coefficient is not identifiable in your data, or the variable is a "perfect predictor" or turns out to be collinear with other variables, or makes the model unable to converge (which, in turn, usually reflects one of the previously mentioned problems). But those are situations where there is no choice but to omit a variable you would otherwise have included.
Comment
Ryan Matthews

Join Date: Nov 2018

Posts: 11
#9

30 Nov 2018, 16:45

Thank you for the feedback. When I ran my regression, Stata reported collinearity and omitted two of my citizenship dummy variables, even though I had already left one out as the base group. I thought that maybe having two types of dummy variables, one for gender and then the four I have for citizenship status, caused Stata to omit them.
reg loghrwage citizen yearsofeducation avrgyearsofeducation*citz exper exper2 dniu dnotcitizen dbornabroad1
Here dniu = not in universe, dnotcitizen = not a citizen, and dbornabroad1 = born abroad. I left out the U.S. citizen dummy.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

30 Nov 2018, 17:48

You are making this much more complicated than it needs to be, because you are not using factor-variable notation, and it's going to be even harder when it comes time to interpret the results. Read -help fvvarlist-. There you will learn that you should not create a separate variable for the square of exper (your exper2), and you should not create your own dummy variables. It should be like this:

Code:

regress loghrwage i.citizenship##c.yrs_of_education i.male c.experience##c.experience
margins citizenship, dydx(yrs_of_education)

to implement the model discussed in #1 through #3 earlier and output (from -margins-) the marginal effect of yrs_of_education in each category of the citizenship variable.

I don't understand what your variable avrgyearsofeducation is supposed to be and how it relates to the original model. I also don't get what some of your homebrew dummy variables are supposed to be. For example, if an observation is not in universe, that ordinarily means you exclude it from the analysis altogether. And I'm guessing that the citizenship variable needs to have two levels: one for US citizen, one for non-citizen. What is dbornabroad1 about? That wasn't mentioned in the earlier discussion. Is it a category of citizenship, or is it something else that is just a separate covariate you are adding to the model?

If you use this approach, you will only run into collinearity among your indicator ("dummy") variables if there is a problem in your data (such as, say, all your citizens being male, or something like that).
Comment
Ryan Matthews

Join Date: Nov 2018

Posts: 11
#11

30 Nov 2018, 18:00

Thank you.
Comment
Ryan Matthews

Join Date: Nov 2018

Posts: 11
#12

02 Dec 2018, 11:03

I have one more question, which I hope you don't mind answering. My data come from the Current Population Survey (https://cps.ipums.org/cps-action/var...bility_section). The citizen variable has four codes: naturalized citizen, born abroad to American parents, not a citizen, and not in universe. As Clyde suggested, I decided not to include not in universe, but browsing through the data, I see that the birthplace for all not-in-universe respondents is the United States. These individuals have data for all other variables, such as years of education. I am unsure what to do with the not-in-universe observations.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#13

02 Dec 2018, 11:22

Well, I can't answer the question. You will need to review the survey documentation to determine why those observations were designated "not in universe." Then you have to figure out whether to include them or not based on that. The designation "not in universe" is usually given to observations that, for whatever reason, do not meet the inclusion criteria for a study. But the fact that they didn't meet inclusion criteria in somebody else's study doesn't necessarily make them inappropriate for yours. If they are included in your study, you will also need to decide whether they remain a separate citizenship category for the purposes of your study or whether they need to be reallocated to one (or more) of the other categories. (If you do decide to keep them in your study and have them as a separate category, I suggest you find a new name for the category.)
Comment
Ryan Matthews

Join Date: Nov 2018

Posts: 11
#14

02 Dec 2018, 11:55

Thank you for the feedback. I believe reallocating it to the U.S. citizen category is what I should do.
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#15

21 Dec 2018, 05:00

It might be useful to look at some of Miles Corak’s work. While his focus was on comparisons between countries, his work usually includes splines with knot locations at points related to the attainment of education credentials. In other words, in the context of the US we would expect a sudden shift in the slope and intercept between the earnings of individuals who’ve only completed 11 grades and those of individuals who attained a high school diploma at the end of the subsequent year.
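For anyone wanting to try the spline idea in Stata, a minimal sketch with -mkspline- follows. It assumes the nlsw88 example data as a stand-in, and the knots at 12 and 16 years of schooling (roughly high school and college completion) are my assumption, not something taken from Corak's papers. The variable names are made up for the illustration.

```stata
* Sketch only: nlsw88 stands in for a CPS extract; the knot locations
* at 12 and 16 years of schooling are assumed for illustration.
sysuse nlsw88, clear
generate lnwage = ln(wage)
generate exper = age - grade - 6

* Linear spline in schooling: one slope below 12 years, another
* between 12 and 16, and a third above 16
mkspline ed_lt12 12 ed_12to16 16 ed_gt16 = grade

* Each spline coefficient is the return to a year of schooling
* within that segment
regress lnwage ed_lt12 ed_12to16 ed_gt16 c.exper##c.exper
```

Note that a linear spline only lets the slope change at each knot. To allow a jump in the intercept at a credential as well, you would also need an indicator for having reached it.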
1 like
Comment
