F-test differences Stata and R

Lisa Rosenberger started a topic F-test differences Stata and R

05 Jul 2016, 02:36

F-test differences Stata and R

This is a cross post from stackexchange, see: http://stats.stackexchange.com/quest...es-stata-and-r

I have a question about what the difference is in how Stata and R compute ANOVAs. I have run exactly the same ANOVA in both softwares, but curiously get a different F-statistics for one of the predictors. I´m not too familiar with Stata, but as far as I understood it, I do a Type 2 SS ANOVA for both.
To understand my output, this is my model:
Outcome variable is a continuous variable called vertrauen (=trust)
predictor 1 is a 2-level factor called trustee in R and Goodguy in Stata
predictor 2 is also a 2 level factor called Group in R and uw in Stata.
This is the R output:

Code:

> m2-lm(vertrauen~trustee*Group,data=RTG.UWD.short.50)
> Anova(m2,type="2")
> Anova Table (Type II tests)

>Response: vertrauen
>              Sum Sq Df F value    Pr(>F)   
>trustee       2.4928  1 24.5497    1.367e-05 *** 
>Group         0.0030  1  0.0292    0.8651     
>trustee:Group 0.1137  1  1.1200    0.2963     
>Residuals     4.0617 40                       
> 
>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
>

This is the Stata output:

Code:

. anova vertrauen uw Goodguy uw#Goodguy
 
Number of obs =         44    R-squared     =  0.3912
Root MSE      =    .318658    Adj R-squared =  0.3455
 
               Source | Partial SS         df         MS        F    Prob>F
           -----------+----------------------------------------------------
                Model |  2.6095358          3   .86984526      8.57  0.0002
                      |
                   uw |  .00296733          1   .00296733      0.03  0.8651
              Goodguy |  1.2981586          1   1.2981586     12.78  0.0009
           uw#Goodguy |  .11373073          1   .11373073      1.12  0.2963
                      |
             Residual |  4.0617062         40   .10154266 
           -----------+----------------------------------------------------
                Total |   6.671242         43   .15514516

As you can see, the F-statistics for the Group (UW) main effect and for the Group (UW) x trustee (Goodguy) interaction are the same, but for the trustee (Goodguy) main effect they differ. In R it´s almost twice as high as in Stata. I tried to change the order of the predictor and the reference levels, but that didn´t change my R output.
Does anyone know what causes the difference in the F-statistic here? I´m really puzzled about it. I expected it to be the same.
Here is the Stata output without the interaction:

Code:

. anova vertrauen uw Goodguy
 
Number of obs =         44    R-squared     =  0.3741
Root MSE      =    .319124    Adj R-squared =  0.3436
 
               Source | Partial SS         df         MS        F    Prob>F
           -----------+----------------------------------------------------
                Model |   2.495805          2   1.2479025     12.25  0.0001
                      |
                   uw |  .00296733          1   .00296733      0.03  0.8653
              Goodguy |  2.4928377          1   2.4928377     24.48  0.0000
                      |
             Residual |   4.175437         41   .10183993 
           -----------+----------------------------------------------------
                Total |   6.671242         43   .15514516

And here is the R output without the interaction:

Code:

> m2.4-lm(vertrauen~trustee+Group,data=RTG.UWD.short.50)
> Anova(m2.4)
Anova Table (Type II tests)
 
Response: vertrauen
          Sum Sq Df F value    Pr(>F)   
trustee   2.4928  1 24.4780 1.328e-05 ***
Group     0.0030  1  0.0291    0.8653   
Residuals 4.1754 41                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

It´s the same, thus it has to do something with how the two softwares incorporate the interaction term.
I also tried to manually compute the interaction term and found something interesting:
Here is the R output:

Code:

RTG.UWD.short.50$interaction-as.numeric(RTG.UWD.short.50$trustee)*as.numeric(RTG.UWD.short.50$Group)
> m2.7 Anova(m2.7)
Anova Table (Type II tests)
 
Response: vertrauen
            Sum Sq Df F value    Pr(>F)   
trustee     1.2982  1 12.7844 0.0009316 ***
Group       0.0030  1  0.0292 0.8651282   
interaction 0.1137  1  1.1200 0.2962617   
Residuals   4.0617 40                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

And here is the Stata output:

Code:

. gen interaction=uw*Goodguy
 
. anova vertrauen uw Goodguy interaction
 
Number of obs =         44    R-squared     =  0.3912
Root MSE      =    .318658    Adj R-squared =  0.3455
 
               Source | Partial SS         df         MS        F    Prob>F
          ------------+----------------------------------------------------
                Model |  2.6095358          3   .86984526      8.57  0.0002
                      |
                   uw |   .0399785          1    .0399785      0.39  0.5339
              Goodguy |  2.3984067          1   2.3984067     23.62  0.0000
          interaction |  .11373073          1   .11373073      1.12  0.2963
                      |
             Residual |  4.0617062         40   .10154266 
          ------------+----------------------------------------------------
                Total |   6.671242         43   .15514516

Thus it seems that there is a difference in how R/ Stata computes the interactions. The R output of the manually computed interaction matches the automatically computed interaction output in Stata.
And finally the descriptives from R:

Code:

> describe(RTG.UWD.short.50$vertrauen)
RTG.UWD.short.50$vertrauen
      n missing  unique    Info    Mean  
     44       0      43       1  0.5046
> describe(RTG.UWD.short.50$Group)
RTG.UWD.short.50$Group
      n missing  unique
     44       0       2
 
1 (34, 77%), 2 (10, 23%)
> describe(RTG.UWD.short.50$trustee)
RTG.UWD.short.50$trustee
      n missing  unique
     44       0       2
 
bad (22, 50%), good (22, 50%)

and from Stata:

Code:

. sum vertrauen uw Goodguy
 
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   vertrauen |         44    .5045969    .3938847    .000998          1
          uw |         44    .2272727    .4239151          0          1
     Goodguy |         44          .5    .5057805          0          1

Tags: anova, F-Statistics, stata

Marcos Almeida replied

06 Jul 2016, 08:57
Great, Lisa. We may also observe that, contrary to the remaining outputs you showed in R (all of them with type II SS), the "correct one" (showed in #8) , I mean, the one whose results are similar to Stata's, underlines it is the type III SS.
Leave a comment:

Lisa Rosenberger replied

06 Jul 2016, 08:50

After reading Marcos´ link again more carefully, I figured out what the problem is. It´s not about what kind of SS are computed, but what contrasts are used. The default contrast in R is called contrast treatment, which compares every level with the baseline level and is non-orthogonal. But as Marcos´ link pointed out (which also referred to this very useful mail from John Fox: http://www.mail-archive.com/r-help@s.../msg69781.html) the contrasts must be orthogonal and thus sum to zero when computing type 3 SS. In R this is done by specifying contrast sum. When doing so, I get exactly the same output as the "default" output in Stata (where I don´t specify which SS I want to compute):

Code:

> options(contrasts=c('contr.sum','contr.poly'))
  > m2a<-lm(vertrauen~trustee*Group,data=RTG.UWD.short.50)
			> Anova(m2a,type="3")
			Anova Table (Type III tests)
			 
			Response: vertrauen
			              Sum Sq Df F value    Pr(>F)   
			(Intercept)   7.7042  1 75.8714 8.841e-11 ***
			trustee       1.2982  1 12.7844 0.0009316 ***
			Group         0.0030  1  0.0292 0.8651282   
			trustee:Group 0.1137  1  1.1200 0.2962617   
			Residuals     4.0617 40                     
			---
			Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



 			>

Thus, while Stata is specifying the correct contrast & type of SS by default, R does not - this is a shame, because I don´t think a lot of people bother to check what kind of SS to use and whether their contrast settings fit the type of SS or not...

Thanks again for this very useful link Marcos!

Leave a comment:

Marcos Almeida replied

05 Jul 2016, 05:12

Maybe this will clarify what happens with a non-sequential versus a couple of models under a sequential ANOVA:

Code:

. use http://www.stata-press.com/data/r14/systolic.dta
(Systolic Blood Pressure Data)

. anova systolic drug disease drug#disease

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source | Partial SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                    drug |  2997.4719          3   999.15729      9.05  0.0001
                 disease |  415.87305          2   207.93652      1.88  0.1637
            drug#disease |  707.26626          6   117.87771      1.07  0.3958
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic drug disease drug#disease, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                    drug |  3133.2385          3   1044.4128      9.46  0.0001
                 disease |  418.83374          2   209.41687      1.90  0.1617
            drug#disease |  707.26626          6   117.87771      1.07  0.3958
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic disease drug drug#disease, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                 disease |  488.63938          2   244.31969      2.21  0.1210
                    drug |  3063.4329          3   1021.1443      9.25  0.0001
            drug#disease |  707.26626          6   117.87771      1.07  0.3958
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic drug drug#disease disease, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                    drug |  3133.2385          3   1044.4128      9.46  0.0001
            drug#disease |     882.45          6     147.075      1.33  0.2626
                 disease |     243.65          2     121.825      1.10  0.3405
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic disease drug#disease drug, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                 disease |  488.63938          2   244.31969      2.21  0.1210
            drug#disease |  2830.7412          6    471.7902      4.27  0.0017
                    drug |  939.95789          3    313.3193      2.84  0.0483
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237

Hopefully that helps!

Best,

Marcos

Announcement

F-test differences Stata and R

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment: