LSDV and collinearity

Maria Ravio

Join Date: Nov 2022

Posts: 8
#1

21 Nov 2022, 16:32

Hi everyone,
I am having trouble implementing a simple least-squares dummy variable (LSDV) model.
The model I am implementing is the following:

reg log_wage i.vet_yes c.age i.vet_yes#c.age vetcountry female $control daustria dcanada dczechrepublic ddenmark destonia dfinland dfrance dgermany direland djapan dkorea dnetherlands dnorway dpoland dslovakrepublic dspain dsweden duk dusa [pw=weight_adjusted] , vce(robust)

In particular, vetcountry equals one if the country is vocationally oriented and zero otherwise.
The main problem is that, when I run this regression, Stata drops two variables (dcanada and dgermany: one vocationally oriented and one generally oriented country).
Am I right to believe that the coefficient on vetcountry is estimated only because two country dummies were dropped from the model? If so, does that mean the estimated coefficient on vetcountry might not be reliable? And do you have any suggestions on how to overcome this problem, as I really need to estimate the effect of this essential independent variable?

Many thanks
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17725
#2

22 Nov 2022, 01:26

Maria:
welcome to this forum.
why bother with -regress- when -xtreg,fe- is available?
In addition:
1) as per FAQ, please share what you typed (as you did) and what Stata gave you back (as you didn't) and/or share an excerpt/example of your dataset via -dataex-. Thanks;
2) please note that, unlike in -xtreg-, the -robust- option of -regress- accounts for heteroskedasticity only, whereas my guess is that you want -vce(cluster clusterid)- standard errors.

Kind regards,
Carlo
(Stata 19.0)
Andrew Musau

Join Date: Oct 2014

Posts: 10254
#3

22 Nov 2022, 01:28

You can think of LSDV as one method to estimate the fixed effects model. Including an indicator for each country save one amounts to including country fixed effects, so if "vetcountry" is your treatment variable, then it must vary within countries. Otherwise, it is collinear with the country effects and you will not be able to identify its coefficient. In any case, whether you estimate the model using LSDV or the within estimator (-xtreg, fe-), there is no need to include the indicators manually. Create a categorical variable called country and use factor variables. See

Code:

h fvvarlist

Also, White standard errors are not appropriate with panel data. See https://www.princeton.edu/~mwatson/papers/ecta6489. Instead, cluster on the panel variable.

Note: Crossed with #2 that makes the same points.
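
The point about factor variables and country effects can be sketched on simulated data. Everything below is an assumption made for illustration: the data are generated on the spot (they are not the PIAAC file), and the variable names country, vetcountry, vet_yes, age and log_wage only mimic those used in this thread.

```stata
* Hypothetical simulated data (not the PIAAC file): 20 countries, 100 people each
clear
set seed 12345
set obs 2000
generate country    = ceil(_n/100)
generate vetcountry = country <= 10        // country-level: constant within country
generate vet_yes    = runiform() < 0.5     // individual-level: varies within country
generate age        = runiformint(25, 60)
generate log_wage   = 2 + 0.2*vet_yes + 0.01*age + rnormal()

* LSDV with factor-variable notation: i.country supplies the dummies and the
* base category. One further term is flagged as omitted, because vetcountry is
* an exact linear combination of the country dummies.
regress log_wage i.vet_yes c.age i.country vetcountry, vce(cluster country)

* Within estimator: same slopes for vet_yes and age; vetcountry is omitted
xtset country
xtreg log_wage i.vet_yes c.age vetcountry, fe vce(cluster country)
```

Both commands should report an omission due to collinearity; which term Stata flags in the -regress- call can depend on the order of the regressors, which is consistent with two dummies rather than vetcountry being dropped in #1.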

Maria Ravio

Join Date: Nov 2022
Posts: 8

22 Nov 2022, 02:05

Dear all,
thank you so much for getting back to me!
So, the main problem is that Vetc is constant within country, as a country's educational system can be either vocational or general.
I have tried to use dataex as requested by Carlo:

Code:

input double cntryid float(log_trim_earnhrppp_cont_dcl vet_yes centered_age vetcnew female)
40  2.397895 1  -20.08927 1 0
40  2.985682 1   4.910728 1 1
40  2.747271 1   9.910728 1 0
40  2.397895 1   4.910728 1 0
40   2.85647 1   9.910728 1 1
40  2.747271 1 -15.089272 1 1
40  3.152736 1 -.08927155 1 0
40  3.152736 1  14.910728 1 0
40  2.985682 1   4.910728 1 1
40  3.443618 1   9.910728 1 1
40         . 0   4.910728 1 0
40  2.747271 1 -.08927155 1 0
40 2.6390574 1  -20.08927 1 1
40 2.6390574 1 -15.089272 1 0
40  2.533697 1  -5.089272 1 1
40   2.85647 1 -.08927155 1 1
40  3.443618 0  -5.089272 1 1
40  3.443618 1   9.910728 1 0
40  3.443618 1  14.910728 1 0
40  2.985682 1   4.910728 1 0
40  2.747271 1 -.08927155 1 0
40  3.152736 1  14.910728 1 0
40 2.6390574 1   9.910728 1 1
40  2.985682 0  -5.089272 1 1
40  2.397895 1   9.910728 1 1
40         . 1 -.08927155 1 0
40  2.533697 0   9.910728 1 1
40  3.443618 1  14.910728 1 0
40         . 1 -.08927155 1 0
40  2.533697 1 -15.089272 1 1
40 2.6390574 1  -20.08927 1 0
40         . 0  -5.089272 1 0
40  2.747271 1  -5.089272 1 0
40  3.443618 1   9.910728 1 0
40         . 1  -5.089272 1 1
40         . 1 -.08927155 1 1
40 2.2512918 1 -10.089272 1 1
40         . 1  -5.089272 1 0
40 2.6390574 1   9.910728 1 1
40 2.6390574 1 -.08927155 1 1
40  2.397895 1  -20.08927 1 1
40  2.747271 1 -10.089272 1 1
40         . 1   20.41073 1 1
40         . 0  14.910728 1 1
40  2.397895 1  -20.08927 1 1
40  2.533697 1 -.08927155 1 1
40         . 1 -.08927155 1 1
40  2.747271 1   9.910728 1 1
40         . 1  -5.089272 1 0
40  3.152736 0   9.910728 1 1
40  2.533697 1   9.910728 1 0
40 1.9315214 1 -15.089272 1 1
40   2.85647 1   4.910728 1 1
40  2.533697 1 -15.089272 1 0
40  2.747271 0  -5.089272 1 0
40         . 1   4.910728 1 0
40 2.6390574 1   9.910728 1 0
40 2.2512918 1 -10.089272 1 0
40         . 1 -15.089272 1 0
40         . 1   4.910728 1 0
40   2.85647 1 -.08927155 1 0
40  3.152736 1 -.08927155 1 0
40  2.397895 1  -20.08927 1 1
40  2.985682 1   4.910728 1 1
40         . 1  14.910728 1 0
40  2.747271 1  14.910728 1 0
40         . 0 -10.089272 1 0
40         . 0   9.910728 1 0
40 2.2512918 1  -5.089272 1 0
40  2.397895 0 -10.089272 1 1
40         . 0  -5.089272 1 1
40 2.6390574 0 -15.089272 1 1
40 2.6390574 1 -15.089272 1 1
40  2.985682 0 -15.089272 1 0
40  2.985682 1   9.910728 1 0
40  3.152736 1  -5.089272 1 0
40 2.2512918 1 -10.089272 1 1
40  3.152736 1 -15.089272 1 1
40   2.85647 1 -10.089272 1 1
40  2.533697 1   9.910728 1 1
40  3.443618 0  -5.089272 1 0
40  2.985682 1   9.910728 1 1
40         . 1  -20.08927 1 0
40   2.85647 1 -10.089272 1 1
40         . 1   9.910728 1 0
40         . 1 -10.089272 1 1
40         . 0 -.08927155 1 0
40         . 1   4.910728 1 1
40  3.443618 1   20.41073 1 0
40         . 1   4.910728 1 1
40 1.9315214 1 -15.089272 1 1
40 2.6390574 1   4.910728 1 1
40         . 1   20.41073 1 1
40  3.443618 0   4.910728 1 0
40         . 0   20.41073 1 0
40         . 1   4.910728 1 0
40  3.443618 0   20.41073 1 0
40  2.397895 1  14.910728 1 0
40 2.2512918 1   9.910728 1 1
40         . 1  -5.089272 1 0
end
label values cntryid CNTRYID
label def CNTRYID 40 "Austria", modify

The dataset used is PIAAC (cross-sectional), and my model is similar to the one used by Hampf and Woessmann (2016) https://papers.ssrn.com/sol3/papers....ct_id=2871126#

Here are my results:

[Image: results.PNG]

Vetcnew is a very important variable for my study and I really need to include it in my model. Is there any way around this problem?

Thank you so much for your time.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17725
#5

22 Nov 2022, 02:45

Maria:
unfortunately, your data excerpt cannot help, as data for Austria only are reported.
That said:
1) it is not clear to me whether your dataset is a panel or a repeated cross-section (and the paper you point to does not seem to be clear in this respect either);
2) if there's perfect collinearity, there's nothing you can do but change your set of predictors;
3) going -fe- means getting rid of time-invariant variables;
4) again, I think you should go -vce(cluster clusterid)- instead of robust (provided that you decide to stick with -regress-, something that I would not recommend if you're actually dealing with a panel dataset).

Kind regards,
Carlo
(Stata 19.0)
Maria Ravio

Join Date: Nov 2022

Posts: 8
#6

22 Nov 2022, 02:58

Thank you so much Carlo for your quick reply. I do appreciate your help and your feedback.
1) My study (and the one in the paper I mentioned before) uses the first cycle of PIAAC (data collected in 2011-2012). My understanding is that it is a simple cross-section (one point in time), as I only use the first cycle.
"The PIAAC strategy entails repeating a cross-sectional survey design at regular intervals. The data collection for the first round of PIAAC was conducted in parallel in 24 countries—including Germany—in 2011/2012. This first round, the results of which were published in 2013, marked the starting point of this multi-cycle program. Further cycles are planned at 10-year intervals and will allow us to monitor and analyze how key skills are changing in our adult populations."
2) If I do include the predictor (vetcnew, the only one for which I have collinearity problems), does it mean that the estimate I get is biased?
3) Going fe will make me drop vetcnew...
4) I have to stick with regress, as I do not have a panel dataset.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17725
#7

22 Nov 2022, 03:14

Maria:
1) and 4) thanks for clarifying. If you are interested in the first wave of data only, you can only go -regress-. In addition, even if you were interested in >1 wave of data, you should not go -xtreg- either, because the original study is a repeated cross-sectional one;
2) and 3) the issue with perfect collinearity is that it goes against the original goal of each and every regression model, that is disentangling the contribution of each predictor (when adjusted for the other ones) to variations in the conditional mean of the regressand. When the contributions of two or more variables cannot be disentangled, the only fix is to eliminate the culprits from the right-hand side of the regression equation. Therefore, to make a long story short, you simply have to change your model specification if you want a coefficient for -vetcnew-. Unfortunately, this fix comes at the risk that your model is poor/misspecified (see -linktest-) due to the omission of possibly relevant predictors.
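
Before changing the specification, it can help to check which regressors are constant within country, since those are the ones that cannot be separated from a full set of country dummies. The snippet below is only a sketch: the six rows are made-up toy values, not PIAAC records, and the temporary variable varies is a name introduced here.

```stata
* Toy data with hypothetical values (not PIAAC records)
clear
input cntryid vetcnew female
 40 1 0
 40 1 1
124 0 0
124 0 1
276 1 1
276 1 0
end

* For each regressor, flag countries where the smallest and largest value differ
foreach v of varlist vetcnew female {
    bysort cntryid (`v'): generate byte varies = `v'[1] != `v'[_N]
    quietly count if varies
    display "`v': " cond(r(N) > 0, "varies within country", "constant within country")
    drop varies
}
```

A variable reported as constant within country is collinear with the country dummies, so its coefficient cannot be identified while all of them stay in the model.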

Kind regards,
Carlo
(Stata 19.0)
Maria Ravio

Join Date: Nov 2022

Posts: 8
#8

22 Nov 2022, 03:14

Carlo,
So the only way for me to include vetcnew is to leave out the country dummies. Am I right?

Last edited by Maria Ravio; 22 Nov 2022, 03:17.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17725
#9

22 Nov 2022, 03:24

Maria:
the explanation for what you experience may be the following:
1) in the first case, Stata drops two countries: one because of perfect collinearity and one to protect your regression from the so-called dummy trap (https://en.wikipedia.org/wiki/Dummy_...le_(statistics)). As you know, if you do not use the highly recommended -fvvarlist- notation for categorical variables (and interactions), you should leave one category out (that is, include countries-1 dummies);
2) in the second case, in all likelihood only the protection from the dummy trap applies, so Stata omits one category only.
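
The two omissions in the first case can be made explicit with a short derivation. This is a sketch in notation that does not appear in the thread: $D_{ic}$ is the dummy for country $c$, $V_i$ stands for vetcountry, $\mathcal{V}$ is the set of vocationally oriented countries, $x_i$ collects the remaining regressors, and $a$ and $b$ denote one vocational and one general country whose dummies are dropped.

```latex
% Two exact linear relations among the regressors (constant included)
\begin{align}
  \sum_{c=1}^{C} D_{ic} &= 1
    && \text{(dummy trap: the dummies sum to the constant)} \\
  V_i &= \sum_{c \in \mathcal{V}} D_{ic}
    && \text{(vetcountry is a sum of country dummies)}
\end{align}
% After dropping the dummies of a (vocational) and b (general):
\begin{align}
  E[y_i \mid x_i, \text{country } b] &= \alpha + x_i'\delta \\
  E[y_i \mid x_i, \text{country } a] &= \alpha + \beta_V + x_i'\delta
\end{align}
```

Two independent linear relations mean two terms have to go. Under these assumptions $\beta_V$ is just the adjusted wage gap between the two omitted reference countries, so it would change with whichever pair of dummies is dropped and does not measure an average effect of vocational orientation.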

Kind regards,
Carlo
(Stata 19.0)
Maria Ravio

Join Date: Nov 2022

Posts: 8
#10

22 Nov 2022, 03:33

Thank you so much Carlo. It is very clear now.
So I guess the only way for me to estimate the effect of vetcnew is to leave out the country dummies, that is, to use plain OLS rather than LSDV. I just wonder: given that the analysis spans countries with different institutional structures (taxes, transfers, etc.), can this create problems, given that my dependent variable is the gross wage? What are the direct consequences of not including those dummies?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17725
#11

22 Nov 2022, 04:00

Maria:
another option may be grouping the countries in macro areas (say, ddenmark destonia dfinland dnorway dsweden as -Nordic_Baltic_Europe-) and see what happens.
I would also test the functional form of the regressand via -linktest-.

Kind regards,
Carlo
(Stata 19.0)

Maria Ravio

Join Date: Nov 2022
Posts: 8

#12

22 Nov 2022, 04:59

Dear Carlo,
I have followed your advice and i have generated the following dummies:

Code:

* Region indicators built from the ISO numeric country codes in cntryid
generate west_europe  = inlist(cntryid, 40, 250, 528, 276)                  // Austria, France, Netherlands, Germany
generate east_europe  = inlist(cntryid, 203, 616, 703)                      // Czech Republic, Poland, Slovak Republic
generate north_europe = inlist(cntryid, 208, 233, 246, 372, 578, 752, 826)  // Denmark, Estonia, Finland, Ireland, Norway, Sweden, UK
generate south_europe = cntryid == 724                                      // Spain
generate usa          = cntryid == 840
generate canada       = cntryid == 124
generate japan        = cntryid == 392
generate korea        = cntryid == 410

I then ran the regression and got the following results. Now only Korea drops out, which I think makes sense given the dummy trap. Do you agree?
Do you think I should go for -vce(cluster cntryid)-?

[Image: results2.PNG]


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17725
#13

22 Nov 2022, 05:16

Maria:
yes, sounds good.
Two further pieces of advice:
1) try -linktest- to check the functional form of the regressand (loosely speaking, -linktest- checks if the model specification is correct);
2) go -vce(cluster cntryid)- instead of -robust-.
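
A minimal sketch of these two suggestions, on simulated data rather than the PIAAC file. The grouping variable region, the country-level shock u_c and all coefficients are assumptions chosen for illustration; only the commands themselves are the ones named above.

```stata
* Hypothetical simulated data (not PIAAC): 20 countries, 100 people each
clear
set seed 2022
set obs 2000
generate cntryid = ceil(_n/100)
generate vetcnew = mod(cntryid, 2)          // country-level indicator
generate region  = ceil(cntryid/4)          // 5 coarse regions of 4 countries
generate age     = runiformint(25, 60)

* Shock shared by everyone in the same country (source of within-country correlation)
generate u_c = rnormal()
bysort cntryid: replace u_c = u_c[1]
generate log_wage = 2 + 0.2*vetcnew + 0.01*age + 0.3*u_c + rnormal()

* Same point estimates, different standard errors
regress log_wage vetcnew c.age i.region, vce(robust)
regress log_wage vetcnew c.age i.region, vce(cluster cntryid)

* Specification check: a significant _hatsq hints at a misspecified model
linktest, vce(cluster cntryid)
```

Clustering on cntryid allows the errors of people in the same country to be correlated in an arbitrary way, whereas -robust- treats every observation as independent. For a regressor that only varies across countries, such as vetcnew, the clustered standard errors are therefore usually larger. With only about 20 clusters, the clustered standard errors are themselves imprecise, so they should be read with some caution.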

Kind regards,
Carlo
(Stata 19.0)
Maria Ravio

Join Date: Nov 2022

Posts: 8
#14

22 Nov 2022, 05:24

Carlo,
May I ask what difference using vce(cluster cntryid) would make? Is it that vce(cluster cntryid) takes autocorrelation into account as well?
From a quick glance, it seems that the majority of my coefficients turn insignificant.

I will try linktest on all my models (I have 7 different specifications to test). Thank you for suggesting this useful test!

many many thanks!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17725
#15

22 Nov 2022, 08:09

Maria:
could you please share via CODE delimiters what you typed and what Stata gave you back when you applied -vce(cluster cntryid)-? Thanks.

Kind regards,
Carlo
(Stata 19.0)
