Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Discrepancy of result linear regression with coded variable between Stata and R

    Possibly I am unable to grasp something that is most obvious with a linear regression model whereby a coded binary variable is regressed on a variable for reading accuracy with values bounded between 0 and 1. Note that I am reading a paper on this subject and I am exercising one of its examples (Smithson & Verkuilen, 2006).
    The case data includes 44 Australian primary school children, the dependent variable is a test score for reading accuracy and the regressors are a binary indicator for dyslexia (yes=1/no=0) and a nonverbal iq score. You can get this data:
    Code:
    clear
    input accuracy dyslexia iq
    .64662 0 .59
    .66535 0 .471
    .70281 0 -.043
    .70905 0 -.795
    .73402 0 -.281
    .76524 0 .59
    .77148 0 -.281
    .88386 0 -.676
    .88386 0 .827
    .91508 0 .471
    .95878 0 1.144
    .98376 0 1.144
    .99 0 1.856
    .99 0 -.201
    .99 0 1.223
    .99 0 -.399
    .99 0 -.043
    .99 0 -.914
    .99 0 1.619
    .99 0 .59
    .99 0 .907
    .99 0 1.738
    .99 0 .59
    .99 0 1.777
    .99 0 .511
    .45932 1 -.795
    .53424 1 -.993
    .54048 1 .709
    .54673 1 -1.745
    .56546 1 -.162
    .5717 1 1.223
    .57794 1 -.083
    .57794 1 -1.191
    .59043 1 -1.666
    .60916 1 -.874
    .60916 1 .313
    .62165 1 -1.507
    .64038 1 -.162
    .65286 1 -.281
    .66535 1 -1.27
    .67159 1 -.518
    .68408 1 -.439
    .69032 1 -1.745
    .70281 1 -1.23
    end
    
    label data "Case data from: https://www.ncbi.nlm.nih.gov/pubmed/28306155"
    Next I run the most simple linear regression model (for educational purposes, which the paper discusses to point at the need for an alternative method, i.e. beta regression):
    Code:
    set cformat %9.6f
    fre dyslexia                // 0=Not 1=Indicated
    sum iq                        // Non-verbal IQ score
    regress accuracy iq            // Result identical to R
    which result is:
    Code:
          Source |       SS           df       MS      Number of obs   =        44
    -------------+----------------------------------   F(1, 42)        =     19.99
           Model |  .444360883         1  .444360883   Prob > F        =    0.0001
        Residual |  .933498675        42  .022226159   R-squared       =    0.3225
    -------------+----------------------------------   Adj R-squared   =    0.3064
           Total |  1.37785956        43  .032043246   Root MSE        =    .14908
    
    ------------------------------------------------------------------------------
        accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              iq |   0.101650   0.022734     4.47   0.000     0.055771    0.147528
           _cons |   0.772764   0.022475    34.38   0.000     0.727407    0.818121
    ------------------------------------------------------------------------------
    This result compares perfectly when the same model is run in R (assuming that the required R packages have been installed before):
    Code:
    library("lmtest")
    library("betareg")
    data("ReadingSkills", package = "betareg")
    rs_ols_IQ <- lm(accuracy ~iq, data = ReadingSkills)
    coeftest(rs_ols_IQ)
    which result will be:
    Code:
    t test of coefficients:
                Estimate Std. Error t value  Pr(>|t|)    
    (Intercept) 0.772764   0.022475 34.3828 < 2.2e-16 ***
    iq          0.101650   0.022734  4.4713 5.805e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    My 'problem' starts when the coded variable for dyslexia is regressed.
    First the R syntax:
    Code:
    rs_ols_DYSL <- lm(accuracy ~ dyslexia , data = ReadingSkills)
    coeftest(rs_ols_DYSL)
    which result will be:
    Code:
    t test of coefficients:
    
                 Estimate Std. Error t value  Pr(>|t|)    
    (Intercept)  0.752735   0.015691 47.9739 < 2.2e-16 ***
    dyslexia    -0.146861   0.015691 -9.3599 7.783e-12 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Note that the standard error of the constant and the variable are identical (0.015691).
    In Stata that is different:
    Code:
    regress accuracy dyslexia    // Result NOT identical to R
    which result is:
    Code:
          Source |       SS           df       MS      Number of obs   =        44
    -------------+----------------------------------   F(1, 42)        =     87.61
           Model |  .931356641         1  .931356641   Prob > F        =    0.0000
        Residual |  .446502918        42  .010631022   R-squared       =    0.6759
    -------------+----------------------------------   Adj R-squared   =    0.6682
           Total |  1.37785956        43  .032043246   Root MSE        =    .10311
    
    ------------------------------------------------------------------------------
        accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        dyslexia |  -0.293723   0.031381    -9.36   0.000    -0.357052   -0.230393
           _cons |   0.899596   0.020621    43.62   0.000     0.857981    0.941212
    ------------------------------------------------------------------------------
    Stata provides different standard errors for the constant and the variable, respectively 0.020621 and 0.031381.
    More importantly, also their coefficients differ, respectively 0.899596 and -0.293723, compared to the R result of, respectively 0.752735 and -0.146861.
    How can I explain this difference between R and Stata?
    My best guess is that R regresses binary variables (and possible categorial variables as well) in a different manner.
    If so why? And how, or better put: can Stata regress with some option or particular way of coding, deliver the same regression results as R?
    I think it is important to understand the reasons behind this difference as well as to be able to generate it in Stata for sake of replication purposes.
    Note that this difference persists when both terms are included or an interaction term. For R that is:
    Code:
    # Both variables
    rs_ols1 <- lm(accuracy ~ dyslexia + iq, data = ReadingSkills)
    coeftest(rs_ols1)
    # Interaction
    rs_ols <- lm(accuracy ~ dyslexia * iq, data = ReadingSkills)
    coeftest(rs_ols)
    which result will be:
    Code:
    # Both variables
    t test of coefficients:
                 Estimate Std. Error t value  Pr(>|t|)    
    (Intercept)  0.754714   0.015624 48.3052 < 2.2e-16 ***
    dyslexia    -0.132353   0.019029 -6.9555 1.905e-08 ***
    iq           0.025230   0.019068  1.3232    0.1931    
    
    # Interaction
    t test of coefficients:
                 Estimate Std. Error t value  Pr(>|t|)    
    (Intercept)  0.733935   0.018645 39.3641 < 2.2e-16 ***
    dyslexia    -0.137588   0.018645 -7.3794 5.586e-09 ***
    iq           0.020976   0.018614  1.1269    0.2665    
    dyslexia:iq -0.035555   0.018614 -1.9102    0.0633 .  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Which is in Stata:
    Code:
    * Both variables
    regress accuracy dyslexia iq    // Result NOT identical to R
    * Interaction
    regress accuracy dyslexia##c.iq    // Result NOT identical to R
    which results in:
    Code:
    * Both variables
          Source |       SS           df       MS      Number of obs   =        44
    -------------+----------------------------------   F(2, 41)        =     45.46
           Model |  .949642004         2  .474821002   Prob > F        =    0.0000
        Residual |  .428217554        41  .010444331   R-squared       =    0.6892
    -------------+----------------------------------   Adj R-squared   =    0.6741
           Total |  1.37785956        43  .032043246   Root MSE        =     .1022
    
    ------------------------------------------------------------------------------
        accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        dyslexia |  -0.264707   0.038057    -6.96   0.000    -0.341565   -0.187848
              iq |   0.025230   0.019068     1.32   0.193    -0.013278    0.063737
           _cons |   0.887067   0.022526    39.38   0.000     0.841575    0.932560
    ------------------------------------------------------------------------------
    
    * Interaction
          Source |       SS           df       MS      Number of obs   =        44
    -------------+----------------------------------   F(3, 40)        =     33.48
           Model |  .985438392         3  .328479464   Prob > F        =    0.0000
        Residual |  .392421167        40  .009810529   R-squared       =    0.7152
    -------------+----------------------------------   Adj R-squared   =    0.6938
           Total |  1.37785956        43  .032043246   Root MSE        =    .09905
    
    -------------------------------------------------------------------------------
         accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
       1.dyslexia |  -0.275176   0.037290    -7.38   0.000    -0.350541   -0.199811
               iq |   0.056532   0.024699     2.29   0.027     0.006613    0.106450
                  |
    dyslexia#c.iq |
               1  |  -0.071111   0.037227    -1.91   0.063    -0.146350    0.004129
                  |
            _cons |   0.871523   0.023299    37.41   0.000     0.824433    0.918613
    -------------------------------------------------------------------------------
    Again, maybe it is all too obvious, but for me it would be helpful to understand why we get this difference in results between R and Stata, and how the R result can be replicated in Stata?
    Last edited by ericmelse; 07 Nov 2018, 08:08.
    http://publicationslist.org/eric.melse

  • #2
    This is surely something you must have checked but, out of curiosity, I wonder whether the categorical variables in R were taken "as factors", as they probably should.
    Best regards,

    Marcos

    Comment


    • #3
      Something must have gone wrong when importing the data in R, because I can't reproduce this behaviour.
      Code:
      * Input code omitted, copy from original post
      
      reg accuracy dyslexia
      ------------------------------------------------------------------------------
          accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
        1.dyslexia |  -.2937227    .031381    -9.36   0.000    -.3570522   -.2303933
             _cons |   .8995964   .0206214    43.62   0.000     .8579808     .941212
      ------------------------------------------------------------------------------
      
      export delimited using "dyslexia.csv", replace
      Code:
      data = read.csv('dyslexia.csv',header=TRUE, sep=",")
      lmodel = lm(accuracy ~dyslexia, data=data)
      summary(lmodel)
      
       
       Coefficients:            
      Estimate Std. Error t value
      (Intercept)  0.89960    0.02062   43.62
      dyslexia    -0.29372    0.03138   -9.36

      Comment


      • #4
        Using the factor function will return the same results as Stata:

        Code:
        > rs_ols_IQ <- lm(accuracy ~factor(dyslexia), data = ReadingSkills)
        > 
        > coeftest(rs_ols_IQ)
        
        t test of coefficients:
        
                             Estimate Std. Error t value  Pr(>|t|)    
        (Intercept)          0.899596   0.020621 43.6245 < 2.2e-16 ***
        factor(dyslexia)yes -0.293723   0.031381 -9.3599 7.783e-12 ***
        ---
        Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        Comment


        • #5
          Dear all, thank you for thinking with me.
          Dear Scott, yes, you are right, with this syntax I get that same result in R, identical to Stata's result (something to be comfortable with).
          So, should I understand it that R computes the variable dyslexia, when not set as a factor, as if it is data on a ratio scale?
          Is it possible to model that with Stata?
          http://publicationslist.org/eric.melse

          Comment


          • #6
            Scott Merryman
            Using the factor function will return the same results as Stata
            As stated in #2, that's exactly my suspicion!

            Thanks for the demonstration, Scott.

            ericmelse :

            So, should I understand it that R computes the variable dyslexia, when not set as a factor, as if it is data on a ratio scale?
            Yes, as underlined in #2. We should "tell" R that the variable is categorical so as to get the right result (logically, whenever the variable is categorical).

            Code:
            Is it possible to model that with Stata?
            I failed to understand why modeling Stata to produce wrong results.
            Last edited by Marcos Almeida; 07 Nov 2018, 12:03.
            Best regards,

            Marcos

            Comment


            • #7
              Yes, Marcos, but note that the (published) R models are all using the (factor) variable NOT set as such. We can agree that is possible not proper, possibly wrong. But I am puzzled about:
              1. What is the substantive interpretation of these model results, with near identical standard errors? Where are these s.e.'s coming from? Is this simply 'wrong' modelling, or is there some underpinning to model like this?
              2. Can the same (possibly wrong) result be 'forced' somehow with a Stata model (like regress or possibly using a sem model)?
              I am very curious about this and looking for answers.
              http://publicationslist.org/eric.melse

              Comment


              • #8
                R is coding the variable as contrast - see the output form attributes()
                Code:
                > rs_ols_IQ <- lm(accuracy ~dyslexia, data = ReadingSkills)
                > coeftest(rs_ols_IQ)
                
                t test of coefficients:
                
                             Estimate Std. Error t value  Pr(>|t|)    
                (Intercept)  0.752735   0.015691 47.9739 < 2.2e-16 ***
                dyslexia    -0.146861   0.015691 -9.3599 7.783e-12 ***
                ---
                Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
                
                > attributes(as.factor(ReadingSkills$dyslexia))
                $`levels`
                [1] "no"  "yes"
                
                $class
                [1] "factor"
                
                $contrasts
                      
                no  -1
                yes  1
                To obtain the same results in Stata:

                Code:
                . recode dy (0 = -1)
                (dyslexia: 25 changes made)
                
                . reg accuracy dyslexia
                
                      Source |       SS           df       MS      Number of obs   =        44
                -------------+----------------------------------   F(1, 42)        =     87.61
                       Model |  .931356641         1  .931356641   Prob > F        =    0.0000
                    Residual |  .446502918        42  .010631022   R-squared       =    0.6759
                -------------+----------------------------------   Adj R-squared   =    0.6682
                       Total |  1.37785956        43  .032043246   Root MSE        =    .10311
                
                ------------------------------------------------------------------------------
                    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                    dyslexia |  -.1468614   .0156905    -9.36   0.000    -.1785261   -.1151966
                       _cons |    .752735   .0156905    47.97   0.000     .7210703    .7843998
                ------------------------------------------------------------------------------

                Comment


                • #9
                  Dear Scott, many thanks for your explanation and example syntax. I can work further with this.
                  http://publicationslist.org/eric.melse

                  Comment

                  Working...
                  X