  • Difference between using only dummies and only one categorical variable

    Hello,

    I am a little confused about my model. I am regressing log wages on being an immigrant or not, and I have split the immigrant population into different arrival waves. Now I do not know whether I have to use a dummy for each immigrant arrival wave, or whether I can just use the categorical variable whose values are the immigrant arrival waves plus natives. Is there a difference between the two following models?

    1. Model: the variable -arrival-, with natives (=9999) as the reference group
    Code:
    svy: regress lnhourlyw_w i.ib9999.arrival if year==2004
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata   =         1                  Number of obs     =    10,726
    Number of PSUs     =    10,726                  Population size   =    1,317,293
    Design df         =    10,725
    F(   6,  10720)   =    99.52
    Prob > F          =    0.0000
    R-squared         =    0.0279
    
        
    ------------------------------------------------------------------------------
                 |             Linearized
     lnhourlyw_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         arrival |
        pre 1980 |  -.1686351   .0151419   -11.14   0.000     -.198316   -.1389543
         1980-84 |  -.1678049   .0202635    -8.28   0.000     -.207525   -.1280847
         1985-89 |  -.2158353   .0165672   -13.03   0.000    -.2483101   -.1833604
         1990-94 |  -.2542113   .0122076   -20.82   0.000    -.2781405   -.2302822
         1995-99 |  -.1508089   .0222109    -6.79   0.000    -.1943463   -.1072715
         2000-04 |  -.0889885   .0228124    -3.90   0.000     -.133705    -.044272
                 |
           _cons |   3.689774   .0057737   639.07   0.000     3.678457    3.701092
    ------------------------------------------------------------------------------
    2. Model: dummies created for each immigrant arrival wave from the variable -arrival-, such that the intercept represents the native reference group

    Code:
    svy: regress lnhourlyw_w i.arvpre1980 i.arv1980 i.arv1985 i.arv1990 i.arv1995 i.arv2000 if year==2004
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata   =         1                  Number of obs     =    10,726
    Number of PSUs     =    10,726                  Population size   =    1,317,293
    Design df         =    10,725
    F(   6,  10720)   =    99.52
    Prob > F          =    0.0000
    R-squared         =    0.0279
    
            
    ------------------------------------------------------------------------------
                 |             Linearized
     lnhourlyw_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    1.arvpre1980 |   .0406217   .0139277     2.92   0.004     .0133207    .0679226
       1.arv1980 |   .0414519   .0174006     2.38   0.017     .0073435    .0755603
       1.arv1985 |  -.0065785   .0148694    -0.44   0.658    -.0357253    .0225684
       1.arv1990 |  -.0449545   .0120761    -3.72   0.000    -.0686259   -.0212832
       1.arv1995 |   .0584479   .0187726     3.11   0.002     .0216502    .0952457
       1.arv2000 |   .1202683   .0192005     6.26   0.000     .0826317    .1579048
           _cons |   3.480518   .0087417   398.15   0.000     3.463382    3.497653
    ------------------------------------------------------------------------------
    I see that the coefficients are different, but I don't see why, since the reference group in both models is natives.

  • #2
    It looks like it actually is changing the reference group in some way (how is i.ib9999.arrival coded? Also, what is the omitted category for time?). But I pasted your coefficients from the first regression in column (1) below, and those from the second regression in column (2). When I compute col (2) - col (1), the differences are all the same number; I just don't know which category it represents.

    [Attached image: Statalist - coefficients.png]

    As text:
    Variable      (1)           (2) - (1)     (2)
    pre 1980     -0.1686351     0.2092568     0.0406217
    1980-84      -0.1678049     0.2092568     0.0414519
    1985-89      -0.2158353     0.2092568    -0.0065785
    1990-94      -0.2542113     0.2092568    -0.0449545
    1995-99      -0.1508089     0.2092568     0.0584479
    2000-04      -0.0889885     0.2092568     0.1202683

    • #3
      Note that the constant term in the first model is 3.689774 and in the second model it is 3.480518; the difference is -0.209256. Thus the second model has a lower estimate for the reference group (natives) and offsets this by increasing the coefficients for all other groups, so that the estimates for those groups are the same in both models. That is, the estimate for pre-1980 in the first model is 3.689 - 0.168 and in the second model is 3.480 + 0.040, so with the exception of the reference group, the estimates for each group are the same in both models.

      Without any footnotes to back me up, I'm going to guess that the second model is incorrect: by hiding the dependence of the dummy variables on each other, svy is computing variances as if i.arvpre1980 were independent of i.arv1980, and so forth, and this is not the case.

      If anyone reading this can comment with more authority than I have on this, please confirm or refute my mere supposition.
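
      One way to check that the group-level estimates coincide is to compare the predicted group means from the two specifications, for example along these lines (a sketch only; the variable names are taken from the posts above, and I am assuming the wave dummies are genuine 0/1 indicators):

      Code:
      * Model 1: predicted log wage for each arrival group, natives (9999) as base
      svy: regress lnhourlyw_w ib9999.arrival if year==2004
      margins arrival

      * Model 2: the pre-1980 group estimate is the constant plus its dummy coefficient
      svy: regress lnhourlyw_w i.arvpre1980 i.arv1980 i.arv1985 i.arv1990 i.arv1995 i.arv2000 if year==2004
      lincom _cons + 1.arvpre1980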

      • #4
        This looks weird to me: i.ib9999.arrival. What is the extra i. doing there?

        My first guess, though, is that one of the dummies you created is labeled or computed wrong. Double check the coding or show us the coding so we can assess it. (My Master's thesis includes a dummy variable coded 0/3. I don't think that is what I intended.)

        As a sidelight, you should be using the svy subpop() option instead of the if qualifier. See pp. 3-4 of

        https://www3.nd.edu/~rwilliam/stats3/SvyCautionsX.pdf
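
        With your variable names, that would look roughly like this (an untested sketch; note also that ib9999.arrival by itself should be enough to set natives as the base category):

        Code:
        svy, subpop(if year==2004): regress lnhourlyw_w ib9999.arrival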
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam

        • #5
          This evening I will present an overview of the coding of these variables. Many thanks to all of you for your support!

          • #6
            As Mr. Williams guessed, my coding was wrong because of a mistake in my thinking. After doing it the right way, there is no difference between the two models. Sorry for the trouble.

            I really appreciate everyone's support. Rethinking my data helped me move a step further.
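
            For anyone finding this thread later, here is a minimal sketch of the kind of 0/1 coding that makes the two models agree. The actual category codes of -arrival- in my data may differ, so treat the values below as placeholders:

            Code:
            * illustrative only: assumes -arrival- codes the waves 1-6 for immigrants and 9999 for natives
            gen byte arvpre1980 = arrival == 1 if !missing(arrival)
            gen byte arv1980    = arrival == 2 if !missing(arrival)
            gen byte arv1985    = arrival == 3 if !missing(arrival)
            gen byte arv1990    = arrival == 4 if !missing(arrival)
            gen byte arv1995    = arrival == 5 if !missing(arrival)
            gen byte arv2000    = arrival == 6 if !missing(arrival)
            * natives (arrival == 9999) are 0 on every dummy and are absorbed by the constant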
