Discrepancy of result linear regression with coded variable between Stata and R

ericmelse

Join Date: May 2014
Posts: 445

Discrepancy of result linear regression with coded variable between Stata and R

07 Nov 2018, 08:04

Possibly I am unable to grasp something that is most obvious with a linear regression model whereby a coded binary variable is regressed on a variable for reading accuracy with values bounded between 0 and 1. Note that I am reading a paper on this subject and I am exercising one of its examples (Smithson & Verkuilen, 2006).
The case data includes 44 Australian primary school children, the dependent variable is a test score for reading accuracy and the regressors are a binary indicator for dyslexia (yes=1/no=0) and a nonverbal iq score. You can get this data:

Code:

clear
input accuracy dyslexia iq
.64662 0 .59
.66535 0 .471
.70281 0 -.043
.70905 0 -.795
.73402 0 -.281
.76524 0 .59
.77148 0 -.281
.88386 0 -.676
.88386 0 .827
.91508 0 .471
.95878 0 1.144
.98376 0 1.144
.99 0 1.856
.99 0 -.201
.99 0 1.223
.99 0 -.399
.99 0 -.043
.99 0 -.914
.99 0 1.619
.99 0 .59
.99 0 .907
.99 0 1.738
.99 0 .59
.99 0 1.777
.99 0 .511
.45932 1 -.795
.53424 1 -.993
.54048 1 .709
.54673 1 -1.745
.56546 1 -.162
.5717 1 1.223
.57794 1 -.083
.57794 1 -1.191
.59043 1 -1.666
.60916 1 -.874
.60916 1 .313
.62165 1 -1.507
.64038 1 -.162
.65286 1 -.281
.66535 1 -1.27
.67159 1 -.518
.68408 1 -.439
.69032 1 -1.745
.70281 1 -1.23
end

label data "Case data from: https://www.ncbi.nlm.nih.gov/pubmed/28306155"

Next I run the most simple linear regression model (for educational purposes, which the paper discusses to point at the need for an alternative method, i.e. beta regression):

Code:

set cformat %9.6f
fre dyslexia                // 0=Not 1=Indicated
sum iq                        // Non-verbal IQ score
regress accuracy iq            // Result identical to R

which result is:

Code:

      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(1, 42)        =     19.99
       Model |  .444360883         1  .444360883   Prob > F        =    0.0001
    Residual |  .933498675        42  .022226159   R-squared       =    0.3225
-------------+----------------------------------   Adj R-squared   =    0.3064
       Total |  1.37785956        43  .032043246   Root MSE        =    .14908

------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          iq |   0.101650   0.022734     4.47   0.000     0.055771    0.147528
       _cons |   0.772764   0.022475    34.38   0.000     0.727407    0.818121
------------------------------------------------------------------------------

This result compares perfectly when the same model is run in R (assuming that the required R packages have been installed before):

Code:

library("lmtest")
library("betareg")
data("ReadingSkills", package = "betareg")
rs_ols_IQ <- lm(accuracy ~iq, data = ReadingSkills)
coeftest(rs_ols_IQ)

which result will be:

Code:

t test of coefficients:
            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 0.772764   0.022475 34.3828 < 2.2e-16 ***
iq          0.101650   0.022734  4.4713 5.805e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

My 'problem' starts when the coded variable for dyslexia is regressed.
First the R syntax:

Code:

rs_ols_DYSL <- lm(accuracy ~ dyslexia , data = ReadingSkills)
coeftest(rs_ols_DYSL)

which result will be:

Code:

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  0.752735   0.015691 47.9739 < 2.2e-16 ***
dyslexia    -0.146861   0.015691 -9.3599 7.783e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that the standard error of the constant and the variable are identical (0.015691).
In Stata that is different:

Code:

regress accuracy dyslexia    // Result NOT identical to R

which result is:

Code:

      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(1, 42)        =     87.61
       Model |  .931356641         1  .931356641   Prob > F        =    0.0000
    Residual |  .446502918        42  .010631022   R-squared       =    0.6759
-------------+----------------------------------   Adj R-squared   =    0.6682
       Total |  1.37785956        43  .032043246   Root MSE        =    .10311

------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    dyslexia |  -0.293723   0.031381    -9.36   0.000    -0.357052   -0.230393
       _cons |   0.899596   0.020621    43.62   0.000     0.857981    0.941212
------------------------------------------------------------------------------

Stata provides different standard errors for the constant and the variable, respectively 0.020621 and 0.031381.
More importantly, also their coefficients differ, respectively 0.899596 and -0.293723, compared to the R result of, respectively 0.752735 and -0.146861.
How can I explain this difference between R and Stata?
My best guess is that R regresses binary variables (and possible categorial variables as well) in a different manner.
If so why? And how, or better put: can Stata regress with some option or particular way of coding, deliver the same regression results as R?
I think it is important to understand the reasons behind this difference as well as to be able to generate it in Stata for sake of replication purposes.
Note that this difference persists when both terms are included or an interaction term. For R that is:

Code:

# Both variables
rs_ols1 <- lm(accuracy ~ dyslexia + iq, data = ReadingSkills)
coeftest(rs_ols1)
# Interaction
rs_ols <- lm(accuracy ~ dyslexia * iq, data = ReadingSkills)
coeftest(rs_ols)

which result will be:

Code:

# Both variables
t test of coefficients:
             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  0.754714   0.015624 48.3052 < 2.2e-16 ***
dyslexia    -0.132353   0.019029 -6.9555 1.905e-08 ***
iq           0.025230   0.019068  1.3232    0.1931    

# Interaction
t test of coefficients:
             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  0.733935   0.018645 39.3641 < 2.2e-16 ***
dyslexia    -0.137588   0.018645 -7.3794 5.586e-09 ***
iq           0.020976   0.018614  1.1269    0.2665    
dyslexia:iq -0.035555   0.018614 -1.9102    0.0633 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Which is in Stata:

Code:

* Both variables
regress accuracy dyslexia iq    // Result NOT identical to R
* Interaction
regress accuracy dyslexia##c.iq    // Result NOT identical to R

which results in:

Code:

* Both variables
      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(2, 41)        =     45.46
       Model |  .949642004         2  .474821002   Prob > F        =    0.0000
    Residual |  .428217554        41  .010444331   R-squared       =    0.6892
-------------+----------------------------------   Adj R-squared   =    0.6741
       Total |  1.37785956        43  .032043246   Root MSE        =     .1022

------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    dyslexia |  -0.264707   0.038057    -6.96   0.000    -0.341565   -0.187848
          iq |   0.025230   0.019068     1.32   0.193    -0.013278    0.063737
       _cons |   0.887067   0.022526    39.38   0.000     0.841575    0.932560
------------------------------------------------------------------------------

* Interaction
      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(3, 40)        =     33.48
       Model |  .985438392         3  .328479464   Prob > F        =    0.0000
    Residual |  .392421167        40  .009810529   R-squared       =    0.7152
-------------+----------------------------------   Adj R-squared   =    0.6938
       Total |  1.37785956        43  .032043246   Root MSE        =    .09905

-------------------------------------------------------------------------------
     accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
   1.dyslexia |  -0.275176   0.037290    -7.38   0.000    -0.350541   -0.199811
           iq |   0.056532   0.024699     2.29   0.027     0.006613    0.106450
              |
dyslexia#c.iq |
           1  |  -0.071111   0.037227    -1.91   0.063    -0.146350    0.004129
              |
        _cons |   0.871523   0.023299    37.41   0.000     0.824433    0.918613
-------------------------------------------------------------------------------

Again, maybe it is all too obvious, but for me it would be helpful to understand why we get this difference in results between R and Stata, and how the R result can be replicated in Stata?

Last edited by ericmelse; 07 Nov 2018, 08:08.

http://publicationslist.org/eric.melse

Tags: interaction, R vs Stata, regression, standard error

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

07 Nov 2018, 08:12

This is surely something you must have checked but, out of curiosity, I wonder whether the categorical variables in R were taken "as factors", as they probably should.

Best regards,

Marcos
Comment

Jesse Wursten

Join Date: Jan 2016
Posts: 915

07 Nov 2018, 08:22

Something must have gone wrong when importing the data in R, because I can't reproduce this behaviour.

Code:

* Input code omitted, copy from original post

reg accuracy dyslexia
------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  1.dyslexia |  -.2937227    .031381    -9.36   0.000    -.3570522   -.2303933
       _cons |   .8995964   .0206214    43.62   0.000     .8579808     .941212
------------------------------------------------------------------------------

export delimited using "dyslexia.csv", replace

Code:

data = read.csv('dyslexia.csv',header=TRUE, sep=",")
lmodel = lm(accuracy ~dyslexia, data=data)
summary(lmodel)

 
 Coefficients:            
Estimate Std. Error t value
(Intercept)  0.89960    0.02062   43.62
dyslexia    -0.29372    0.03138   -9.36

Comment

Scott Merryman

Join Date: Mar 2014
Posts: 898

07 Nov 2018, 09:49

Using the factor function will return the same results as Stata:

Code:

> rs_ols_IQ <- lm(accuracy ~factor(dyslexia), data = ReadingSkills)
> 
> coeftest(rs_ols_IQ)

t test of coefficients:

                     Estimate Std. Error t value  Pr(>|t|)    
(Intercept)          0.899596   0.020621 43.6245 < 2.2e-16 ***
factor(dyslexia)yes -0.293723   0.031381 -9.3599 7.783e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Comment

ericmelse

Join Date: May 2014

Posts: 445
#5

07 Nov 2018, 11:53

Dear all, thank you for thinking with me.
Dear Scott, yes, you are right, with this syntax I get that same result in R, identical to Stata's result (something to be comfortable with).
So, should I understand it that R computes the variable dyslexia, when not set as a factor, as if it is data on a ratio scale?
Is it possible to model that with Stata?

http://publicationslist.org/eric.melse
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

07 Nov 2018, 11:55

Scott Merryman

Using the factor function will return the same results as Stata

As stated in #2, that's exactly my suspicion!

Thanks for the demonstration, Scott.

ericmelse :

So, should I understand it that R computes the variable dyslexia, when not set as a factor, as if it is data on a ratio scale?

Yes, as underlined in #2. We should "tell" R that the variable is categorical so as to get the right result (logically, whenever the variable is categorical).

Code:

Is it possible to model that with Stata?

I failed to understand why modeling Stata to produce wrong results.

Last edited by Marcos Almeida; 07 Nov 2018, 12:03.

Best regards,

Marcos
Comment
ericmelse

Join Date: May 2014

Posts: 445
#7

07 Nov 2018, 12:07

Yes, Marcos, but note that the (published) R models are all using the (factor) variable NOT set as such. We can agree that is possible not proper, possibly wrong. But I am puzzled about:
1. What is the substantive interpretation of these model results, with near identical standard errors? Where are these s.e.'s coming from? Is this simply 'wrong' modelling, or is there some underpinning to model like this?
2. Can the same (possibly wrong) result be 'forced' somehow with a Stata model (like regress or possibly using a sem model)?
I am very curious about this and looking for answers.

http://publicationslist.org/eric.melse
Comment

Scott Merryman

Join Date: Mar 2014
Posts: 898

07 Nov 2018, 12:23

R is coding the variable as contrast - see the output form attributes()

Code:

> rs_ols_IQ <- lm(accuracy ~dyslexia, data = ReadingSkills)
> coeftest(rs_ols_IQ)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  0.752735   0.015691 47.9739 < 2.2e-16 ***
dyslexia    -0.146861   0.015691 -9.3599 7.783e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> attributes(as.factor(ReadingSkills$dyslexia))
$`levels`
[1] "no"  "yes"

$class
[1] "factor"

$contrasts
      
no  -1
yes  1

To obtain the same results in Stata:

Code:

. recode dy (0 = -1)
(dyslexia: 25 changes made)

. reg accuracy dyslexia

      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(1, 42)        =     87.61
       Model |  .931356641         1  .931356641   Prob > F        =    0.0000
    Residual |  .446502918        42  .010631022   R-squared       =    0.6759
-------------+----------------------------------   Adj R-squared   =    0.6682
       Total |  1.37785956        43  .032043246   Root MSE        =    .10311

------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    dyslexia |  -.1468614   .0156905    -9.36   0.000    -.1785261   -.1151966
       _cons |    .752735   .0156905    47.97   0.000     .7210703    .7843998
------------------------------------------------------------------------------

Comment

ericmelse

Join Date: May 2014

Posts: 445
#9

07 Nov 2018, 12:37

Dear Scott, many thanks for your explanation and example syntax. I can work further with this.

http://publicationslist.org/eric.melse
Comment

Announcement

Discrepancy of result linear regression with coded variable between Stata and R

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment