I may be missing something obvious about a linear regression model in which a variable for reading accuracy, with values bounded between 0 and 1, is regressed on a coded binary variable. I am reading a paper on this subject and working through one of its examples (Smithson & Verkuilen, 2006).
The case data cover 44 Australian primary school children. The dependent variable is a test score for reading accuracy, and the regressors are a binary indicator for dyslexia (yes=1/no=0) and a nonverbal IQ score. You can get this data:
Next I run the simplest linear regression model (for educational purposes; the paper discusses it to motivate an alternative method, namely beta regression):
The result is:
This result matches exactly when the same model is run in R (assuming that the required R packages have been installed):
The result is:
My 'problem' starts when accuracy is regressed on the coded variable for dyslexia.
First the R syntax:
The result is:
Note that the standard errors of the constant and the variable are identical (0.015691).
In Stata that is different:
The result is:
Stata reports different standard errors for the constant and the variable: 0.020621 and 0.031381, respectively.
More importantly, the coefficients also differ: 0.899596 and -0.293723 in Stata, compared with 0.752735 and -0.146861 in R.
How can I explain this difference between R and Stata?
My best guess is that R handles binary variables (and possibly categorical variables as well) differently.
If so, why? And can Stata, with some option or a particular way of coding, deliver the same regression results as R?
I think it is important to understand the reasons behind this difference, and to be able to reproduce the R result in Stata for the sake of replication.
Note that this difference persists when both terms are included, or when an interaction term is added. For R that is:
The result is:
In Stata that is:
The result is:
Again, maybe it is all too obvious, but it would help me to understand why R and Stata give different results, and how the R result can be replicated in Stata.
The case data cover 44 Australian primary school children. The dependent variable is a test score for reading accuracy, and the regressors are a binary indicator for dyslexia (yes=1/no=0) and a nonverbal IQ score. You can get this data:
Code:
clear
input accuracy dyslexia iq
.64662 0 .59
.66535 0 .471
.70281 0 -.043
.70905 0 -.795
.73402 0 -.281
.76524 0 .59
.77148 0 -.281
.88386 0 -.676
.88386 0 .827
.91508 0 .471
.95878 0 1.144
.98376 0 1.144
.99 0 1.856
.99 0 -.201
.99 0 1.223
.99 0 -.399
.99 0 -.043
.99 0 -.914
.99 0 1.619
.99 0 .59
.99 0 .907
.99 0 1.738
.99 0 .59
.99 0 1.777
.99 0 .511
.45932 1 -.795
.53424 1 -.993
.54048 1 .709
.54673 1 -1.745
.56546 1 -.162
.5717 1 1.223
.57794 1 -.083
.57794 1 -1.191
.59043 1 -1.666
.60916 1 -.874
.60916 1 .313
.62165 1 -1.507
.64038 1 -.162
.65286 1 -.281
.66535 1 -1.27
.67159 1 -.518
.68408 1 -.439
.69032 1 -1.745
.70281 1 -1.23
end
label data "Case data from: https://www.ncbi.nlm.nih.gov/pubmed/28306155"
Code:
set cformat %9.6f
fre dyslexia              // 0=Not 1=Indicated
sum iq                    // Non-verbal IQ score
regress accuracy iq       // Result identical to R
Code:
      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(1, 42)        =     19.99
       Model |  .444360883         1  .444360883   Prob > F        =    0.0001
    Residual |  .933498675        42  .022226159   R-squared       =    0.3225
-------------+----------------------------------   Adj R-squared   =    0.3064
       Total |  1.37785956        43  .032043246   Root MSE        =    .14908
------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          iq |   0.101650   0.022734     4.47   0.000     0.055771    0.147528
       _cons |   0.772764   0.022475    34.38   0.000     0.727407    0.818121
------------------------------------------------------------------------------
Code:
library("lmtest")
library("betareg")
data("ReadingSkills", package = "betareg")
rs_ols_IQ <- lm(accuracy ~ iq, data = ReadingSkills)
coeftest(rs_ols_IQ)
Code:
t test of coefficients:
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.772764   0.022475 34.3828 < 2.2e-16 ***
iq           0.101650   0.022734  4.4713 5.805e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
First the R syntax:
Code:
rs_ols_DYSL <- lm(accuracy ~ dyslexia, data = ReadingSkills)
coeftest(rs_ols_DYSL)
Code:
t test of coefficients:
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.752735   0.015691 47.9739 < 2.2e-16 ***
dyslexia    -0.146861   0.015691 -9.3599 7.783e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In Stata that is different:
Code:
regress accuracy dyslexia // Result NOT identical to R
Code:
      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(1, 42)        =     87.61
       Model |  .931356641         1  .931356641   Prob > F        =    0.0000
    Residual |  .446502918        42  .010631022   R-squared       =    0.6759
-------------+----------------------------------   Adj R-squared   =    0.6682
       Total |  1.37785956        43  .032043246   Root MSE        =    .10311
------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    dyslexia |  -0.293723   0.031381    -9.36   0.000    -0.357052   -0.230393
       _cons |   0.899596   0.020621    43.62   0.000     0.857981    0.941212
------------------------------------------------------------------------------
Note that this difference persists when both terms are included, or when an interaction term is added. For R that is:
Code:
# Both variables
rs_ols1 <- lm(accuracy ~ dyslexia + iq, data = ReadingSkills)
coeftest(rs_ols1)
# Interaction
rs_ols <- lm(accuracy ~ dyslexia * iq, data = ReadingSkills)
coeftest(rs_ols)
Code:
# Both variables
t test of coefficients:
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.754714   0.015624 48.3052 < 2.2e-16 ***
dyslexia    -0.132353   0.019029 -6.9555 1.905e-08 ***
iq           0.025230   0.019068  1.3232    0.1931
# Interaction
t test of coefficients:
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.733935   0.018645 39.3641 < 2.2e-16 ***
dyslexia    -0.137588   0.018645 -7.3794 5.586e-09 ***
iq           0.020976   0.018614  1.1269    0.2665
dyslexia:iq -0.035555   0.018614 -1.9102    0.0633 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Code:
* Both variables
regress accuracy dyslexia iq        // Result NOT identical to R
* Interaction
regress accuracy dyslexia##c.iq     // Result NOT identical to R
Code:
* Both variables
      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(2, 41)        =     45.46
       Model |  .949642004         2  .474821002   Prob > F        =    0.0000
    Residual |  .428217554        41  .010444331   R-squared       =    0.6892
-------------+----------------------------------   Adj R-squared   =    0.6741
       Total |  1.37785956        43  .032043246   Root MSE        =     .1022
------------------------------------------------------------------------------
    accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    dyslexia |  -0.264707   0.038057    -6.96   0.000    -0.341565   -0.187848
          iq |   0.025230   0.019068     1.32   0.193    -0.013278    0.063737
       _cons |   0.887067   0.022526    39.38   0.000     0.841575    0.932560
------------------------------------------------------------------------------
* Interaction
      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(3, 40)        =     33.48
       Model |  .985438392         3  .328479464   Prob > F        =    0.0000
    Residual |  .392421167        40  .009810529   R-squared       =    0.7152
-------------+----------------------------------   Adj R-squared   =    0.6938
       Total |  1.37785956        43  .032043246   Root MSE        =    .09905
-------------------------------------------------------------------------------
     accuracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
   1.dyslexia |  -0.275176   0.037290    -7.38   0.000    -0.350541   -0.199811
           iq |   0.056532   0.024699     2.29   0.027     0.006613    0.106450
              |
dyslexia#c.iq |
           1  |  -0.071111   0.037227    -1.91   0.063    -0.146350    0.004129
              |
        _cons |   0.871523   0.023299    37.41   0.000     0.824433    0.918613
-------------------------------------------------------------------------------
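One pattern in the numbers above may point to the explanation. Each R coefficient on dyslexia is half the corresponding Stata coefficient (-0.146861 versus -0.293723), and the R intercept equals the Stata constant plus that halved coefficient (0.899596 - 0.146861 = 0.752735). That is what a -1/+1 (sum-to-zero) coding of dyslexia would produce instead of the 0/1 coding used in Stata. I have not confirmed that the dyslexia factor in the betareg ReadingSkills data carries such a contrast, so the following is only a sketch under that assumption. It expects the case data to be in memory already, and dys_pm is a variable name made up for this illustration:

```stata
* Assumption (unconfirmed): R codes dyslexia as -1 (no) / +1 (yes), not 0/1
generate dys_pm = 2*dyslexia - 1    // hypothetical recoded variable: 0 -> -1, 1 -> +1

regress accuracy dys_pm             // dyslexia only
regress accuracy dys_pm iq          // both variables
regress accuracy c.dys_pm##c.iq     // interaction; c. treats dys_pm as continuous
```

If the assumption holds, these three models should give the same estimates and standard errors as the R output, because the recoding is a linear transformation of the regressor: the slope is halved and the constant moves to the midpoint of the two group means. On the R side, contrasts(ReadingSkills$dyslexia) should show which coding is actually in use.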