  • Difference between using only dummies and only one categorical variable

    Hello,

    I am a little confused about my model. I am regressing log wages on being an immigrant or not, and I have split the immigrant population into different arrival waves. Now I do not know whether I have to use a dummy for each immigrant arrival wave, or whether I can just use the categorical variable whose values are the immigrant arrival waves plus natives. Is there a difference between the two following models?

    1. Model: the variable -arrival-, with natives (=9999) as the reference group
    Code:
    svy: regress lnhourlyw_w i.ib9999.arrival if year==2004
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata   =         1                  Number of obs     =    10,726
    Number of PSUs     =    10,726                  Population size   =    1,317,293
    Design df         =    10,725
    F(   6,  10720)   =    99.52
    Prob > F          =    0.0000
    R-squared         =    0.0279
    
        
    ------------------------------------------------------------------------------
                 |             Linearized
     lnhourlyw_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         arrival |
        pre 1980 |  -.1686351   .0151419   -11.14   0.000     -.198316   -.1389543
         1980-84 |  -.1678049   .0202635    -8.28   0.000     -.207525   -.1280847
         1985-89 |  -.2158353   .0165672   -13.03   0.000    -.2483101   -.1833604
         1990-94 |  -.2542113   .0122076   -20.82   0.000    -.2781405   -.2302822
         1995-99 |  -.1508089   .0222109    -6.79   0.000    -.1943463   -.1072715
         2000-04 |  -.0889885   .0228124    -3.90   0.000     -.133705    -.044272
                 |
           _cons |   3.689774   .0057737   639.07   0.000     3.678457    3.701092
    ------------------------------------------------------------------------------
    2. Model: dummies created for each immigrant arrival wave from the variable -arrival-, such that the intercept represents the native reference group

    Code:
    svy: regress lnhourlyw_w i.arvpre1980 i.arv1980 i.arv1985 i.arv1990 i.arv1995 i.arv2000 if year==2004
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata   =         1                  Number of obs     =    10,726
    Number of PSUs     =    10,726                  Population size   =    1,317,293
    Design df         =    10,725
    F(   6,  10720)   =    99.52
    Prob > F          =    0.0000
    R-squared         =    0.0279
    
            
    ------------------------------------------------------------------------------
                 |             Linearized
     lnhourlyw_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    1.arvpre1980 |   .0406217   .0139277     2.92   0.004     .0133207    .0679226
       1.arv1980 |   .0414519   .0174006     2.38   0.017     .0073435    .0755603
       1.arv1985 |  -.0065785   .0148694    -0.44   0.658    -.0357253    .0225684
       1.arv1990 |  -.0449545   .0120761    -3.72   0.000    -.0686259   -.0212832
       1.arv1995 |   .0584479   .0187726     3.11   0.002     .0216502    .0952457
       1.arv2000 |   .1202683   .0192005     6.26   0.000     .0826317    .1579048
           _cons |   3.480518   .0087417   398.15   0.000     3.463382    3.497653
    ------------------------------------------------------------------------------
    I see that the coefficients are different, but I don't see why, since the reference group in both models is natives.

  • #2
    It looks like it actually is changing the reference group in some way (how is i.ib9999.arrival coded? Also, what is the omitted category for time?). But I pasted your coefficients from the first regression in column (1) below, and those from the second regression in column (2). When I compute col (2) - col (1), the differences are all the same number; I just don't know which category it represents.

    [Attached image: Statalist - coefficients.png]

    As text:
    Variable      (1)           (2) - (1)     (2)
    pre 1980     -0.1686351     0.2092568     0.0406217
    1980-84      -0.1678049     0.2092568     0.0414519
    1985-89      -0.2158353     0.2092568    -0.0065785
    1990-94      -0.2542113     0.2092568    -0.0449545
    1995-99      -0.1508089     0.2092568     0.0584479
    2000-04      -0.0889885     0.2092568     0.1202683

    • #3
      Note that the constant term in the first model is 3.689774 and in the second model it is 3.480518; the difference is -0.209256. Thus the second model has a lower estimate for the reference group (natives) and offsets this by increasing the coefficients for all other groups, so that the estimates for those groups are the same in both models. That is, the estimate for pre-1980 in the first model is 3.689 - 0.168 and in the second model is 3.480 + 0.040, so with the exception of the reference group, the estimates for each group are the same in both models.

      Without any footnotes to back me up, I'm going to guess that the second model is incorrect: by hiding the dependence of the dummy variables on each other, svy is computing variances as if i.arvpre1980 were independent of i.arv1980, and so forth, and this is not the case.

      If anyone reading this can comment with more authority than I have on this, please confirm or refute my mere supposition.
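
      One way to check that the group-level estimates coincide is to compare the predicted group means from the two specifications, for example along these lines (a sketch only; the variable names are taken from the posts above, and I am assuming the wave dummies are genuine 0/1 indicators):

      Code:
      * Model 1: predicted log wage for each arrival group, natives (9999) as base
      svy: regress lnhourlyw_w ib9999.arrival if year==2004
      margins arrival

      * Model 2: the pre-1980 group estimate is the constant plus its dummy coefficient
      svy: regress lnhourlyw_w i.arvpre1980 i.arv1980 i.arv1985 i.arv1990 i.arv1995 i.arv2000 if year==2004
      lincom _cons + 1.arvpre1980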

      • #4
        This looks weird to me: i.ib9999.arrival. What is the extra i. doing there?

        My first guess, though, is that one of the dummies you created is labeled or computed wrong. Double check the coding or show us the coding so we can assess it. (My Master's thesis includes a dummy variable coded 0/3. I don't think that is what I intended.)

        As a sidelight, you should be using the svy subpop() option instead of the if qualifier. See pp. 3-4 of

        https://www3.nd.edu/~rwilliam/stats3/SvyCautionsX.pdf
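
        With your variable names, that would look roughly like this (an untested sketch; note also that ib9999.arrival by itself should be enough to set natives as the base category):

        Code:
        svy, subpop(if year==2004): regress lnhourlyw_w ib9999.arrival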
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam

        • #5
          This evening I will present an overview of the coding of these variables. Many thanks to all of you for your support!

          • #6
            As Mr. Williams guessed, my coding was wrong because of a mistake in my thinking. After doing it the right way, there is no difference between the two models. Sorry for the trouble.

            I really appreciate everyone's support. Rethinking my data helped me move a step further.
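
            For anyone finding this thread later, here is a minimal sketch of the kind of 0/1 coding that makes the two models agree. The actual category codes of -arrival- in my data may differ, so treat the values below as placeholders:

            Code:
            * illustrative only: assumes -arrival- codes the waves 1-6 for immigrants and 9999 for natives
            gen byte arvpre1980 = arrival == 1 if !missing(arrival)
            gen byte arv1980    = arrival == 2 if !missing(arrival)
            gen byte arv1985    = arrival == 3 if !missing(arrival)
            gen byte arv1990    = arrival == 4 if !missing(arrival)
            gen byte arv1995    = arrival == 5 if !missing(arrival)
            gen byte arv2000    = arrival == 6 if !missing(arrival)
            * natives (arrival == 9999) are 0 on every dummy and are absorbed by the constant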
