Sample Selection Bias

Olivia Roberts

Join Date: Dec 2020

Posts: 21
#1

13 Mar 2021, 07:35

I have some data from the labour force survey and I'm estimating a wage regression. I want to include degree classification in the wage regression. However, I have noticed that the number of observations in the regression drops when I include this variable, since not everyone has done a degree and hence there is missing data. How do I overcome this sample selection issue? I understand I shouldn't create dummy variables and replace the missing data with 0s because this would skew the results. But I am unsure how to deal with it. Any help?
Tags: categorical, data, observations, regression, sample selection
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

13 Mar 2021, 08:52

Treating this as a substantive rather than purely statistical issue, I'd say that "no degree" is a legitimate and non-missing value of "degree," which I'd code with an (ordinally) categorical value lower than any other non-missing value. Presumably you've thought of something like this. Is there some reason you think of this as misleading?
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

13 Mar 2021, 08:54

The first thing to ask yourself is why educational attainment is missing for some respondents.

Also, when you say "degree classification", what does "degree" mean exactly? I don't mean to be pedantic; it's merely that I don't quite understand. In the US, at least, "degree" would imply a bachelor's degree you obtain from a four-year university program, and perhaps an associate's degree (two-year university program) as well. If the variable really is type of university degree, and some people have it missing because they have high school or lower education, then that would be really nice. If the issue is that some people really refuse to respond, then it's harder.

If it's a large public survey sponsored by a government or big NGO, then the survey should have documentation, and you should read it for clues - I realize that the manuals are intimidating at first glance, having had to familiarize myself with some, but everyone starts somewhere. Of course, these large public surveys will be designed by people who know what they are doing, and they will train the interviewers to prompt the respondents or clarify the question if they don't respond. Since you said "*the* labor force survey" (emphasis mine), I'd tend to assume an official survey, hence I'm more surprised that there's missing data, and I'd guess that it was more a deliberate choice - but we don't know, because there's not enough information to answer.

This sort of extremely general question is hard to help with, and any specific information you can provide will aid immeasurably. If you give us the name of the survey, it's possible but definitely not guaranteed that someone here might have experience with the survey and might be able to provide a more definite answer.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.
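As a minimal sketch of what that might look like, assuming the dataset is already loaded and contains variables named lnwage, edlevel and classification (hypothetical names, not taken from the original post):

```stata
* Read the documentation first
help dataex

* Post a small extract: the first 20 observations of the relevant variables.
* lnwage, edlevel and classification are hypothetical variable names.
dataex lnwage edlevel classification in 1/20
```

The output can then be pasted into a post between code delimiters so that others can recreate the example data.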

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

13 Mar 2021, 10:52

From Olivia's "not everyone has done a degree," I was assuming that she meant that they were *known* to have not completed a degree. My suggestion is incorrect if I misinterpreted her.
Olivia Roberts

Join Date: Dec 2020

Posts: 21
#5

13 Mar 2021, 11:11

Sorry if I had not been clear - I meant an undergraduate degree from the UK; a 3- or 4-year course.

Yes Mike, you are correct, I did mean that. Sorry for the confusion!

Essentially, I have already included an education level variable, but I wanted to further add a degree classification (1st, 2:1, 2:2, etc.) variable to see if this further affected the wage regression. Therefore, Mike, I didn't think a 'no degree' category would be correct, because I already have information on who has a degree from the education variable. Does that make sense? What would you suggest?
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#6

15 Mar 2021, 11:34

I'm understanding "education level" to be some kind of categorical variable, with "no degree" being the lowest possible value. Presuming that is correct, I'd just make the "classification" values into categories of that education level variable. I'm not familiar in any detail with the UK degree classification scheme, but let's say, for example, that education level is 1 "no univ. degree" 2 "undergrad. univ. degree" 3 "post grad univ. degree." Further, let's assume you have a classification variable coded as (say) missing, 1, 2.1, 2.2, etc. I mean that you could do something like:

Code:

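replace edlevel = 9 if edlevel == 3 // make room for new categories
// float() on both sides guards against precision mismatches when testing for 2.1 or 2.2
replace edlevel = 3 if (edlevel == 2) & (float(classification) == float(2.1))
replace edlevel = 4 if (edlevel == 2) & (float(classification) == float(2.2))
// ... etc.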

If you mean instead that "edlevel" is something like numeric years of schooling, and you also want to do something to assess the additional effect of a particular undergrad classification, I don't have any great ideas beyond just coding "no degree" as a categorical classification and including c.edlevel and i.classification in your model, which I understand is an imperfect representation.
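A rough sketch of that second suggestion might look like the following. It is only an illustration under stated assumptions: lnwage (log wage), edyears (years of schooling), hasdegree (0/1 indicator for a known degree) and degclass are hypothetical names, and classification is assumed to be coded 1, 2.1, 2.2 as in the example above.

```stata
* Sketch only: lnwage, edyears, hasdegree and degclass are hypothetical names.
* Recode to integer categories so later comparisons do not rely on 2.1 / 2.2.
generate byte degclass = .
replace degclass = 0 if hasdegree == 0                       // known non-graduates
replace degclass = 1 if float(classification) == float(1)    // 1st
replace degclass = 2 if float(classification) == float(2.1)  // 2:1
replace degclass = 3 if float(classification) == float(2.2)  // 2:2

label define degclass 0 "no degree" 1 "1st" 2 "2:1" 3 "2:2"
label values degclass degclass

* Continuous schooling plus categorical classification, "no degree" as base
regress lnwage c.edyears i.degclass
```

In this sketch, graduates whose classification is genuinely unknown keep a missing degclass and still drop out of the estimation sample, while known non-graduates are retained in the "no degree" base category.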
Olivia Roberts

Join Date: Dec 2020

Posts: 21
#7

15 Mar 2021, 12:10

That's really helpful, Mike, thank you!
