F-test differences Stata and R

Lisa Rosenberger

Join Date: Jul 2016
Posts: 3

F-test differences Stata and R

05 Jul 2016, 02:36

This is a cross post from stackexchange, see: http://stats.stackexchange.com/quest...es-stata-and-r

I have a question about what the difference is in how Stata and R compute ANOVAs. I have run exactly the same ANOVA in both softwares, but curiously get a different F-statistics for one of the predictors. I´m not too familiar with Stata, but as far as I understood it, I do a Type 2 SS ANOVA for both.
To understand my output, this is my model:
Outcome variable is a continuous variable called vertrauen (=trust)
predictor 1 is a 2-level factor called trustee in R and Goodguy in Stata
predictor 2 is also a 2 level factor called Group in R and uw in Stata.
This is the R output:

Code:

> m2-lm(vertrauen~trustee*Group,data=RTG.UWD.short.50)
> Anova(m2,type="2")
> Anova Table (Type II tests)

>Response: vertrauen
>              Sum Sq Df F value    Pr(>F)   
>trustee       2.4928  1 24.5497    1.367e-05 *** 
>Group         0.0030  1  0.0292    0.8651     
>trustee:Group 0.1137  1  1.1200    0.2963     
>Residuals     4.0617 40                       
> 
>Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
>

This is the Stata output:

Code:

. anova vertrauen uw Goodguy uw#Goodguy
 
Number of obs =         44    R-squared     =  0.3912
Root MSE      =    .318658    Adj R-squared =  0.3455
 
               Source | Partial SS         df         MS        F    Prob>F
           -----------+----------------------------------------------------
                Model |  2.6095358          3   .86984526      8.57  0.0002
                      |
                   uw |  .00296733          1   .00296733      0.03  0.8651
              Goodguy |  1.2981586          1   1.2981586     12.78  0.0009
           uw#Goodguy |  .11373073          1   .11373073      1.12  0.2963
                      |
             Residual |  4.0617062         40   .10154266 
           -----------+----------------------------------------------------
                Total |   6.671242         43   .15514516

As you can see, the F-statistics for the Group (UW) main effect and for the Group (UW) x trustee (Goodguy) interaction are the same, but for the trustee (Goodguy) main effect they differ. In R it´s almost twice as high as in Stata. I tried to change the order of the predictor and the reference levels, but that didn´t change my R output.
Does anyone know what causes the difference in the F-statistic here? I´m really puzzled about it. I expected it to be the same.
Here is the Stata output without the interaction:

Code:

. anova vertrauen uw Goodguy
 
Number of obs =         44    R-squared     =  0.3741
Root MSE      =    .319124    Adj R-squared =  0.3436
 
               Source | Partial SS         df         MS        F    Prob>F
           -----------+----------------------------------------------------
                Model |   2.495805          2   1.2479025     12.25  0.0001
                      |
                   uw |  .00296733          1   .00296733      0.03  0.8653
              Goodguy |  2.4928377          1   2.4928377     24.48  0.0000
                      |
             Residual |   4.175437         41   .10183993 
           -----------+----------------------------------------------------
                Total |   6.671242         43   .15514516

And here is the R output without the interaction:

Code:

> m2.4-lm(vertrauen~trustee+Group,data=RTG.UWD.short.50)
> Anova(m2.4)
Anova Table (Type II tests)
 
Response: vertrauen
          Sum Sq Df F value    Pr(>F)   
trustee   2.4928  1 24.4780 1.328e-05 ***
Group     0.0030  1  0.0291    0.8653   
Residuals 4.1754 41                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

It´s the same, thus it has to do something with how the two softwares incorporate the interaction term.
I also tried to manually compute the interaction term and found something interesting:
Here is the R output:

Code:

RTG.UWD.short.50$interaction-as.numeric(RTG.UWD.short.50$trustee)*as.numeric(RTG.UWD.short.50$Group)
> m2.7 Anova(m2.7)
Anova Table (Type II tests)
 
Response: vertrauen
            Sum Sq Df F value    Pr(>F)   
trustee     1.2982  1 12.7844 0.0009316 ***
Group       0.0030  1  0.0292 0.8651282   
interaction 0.1137  1  1.1200 0.2962617   
Residuals   4.0617 40                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

And here is the Stata output:

Code:

. gen interaction=uw*Goodguy
 
. anova vertrauen uw Goodguy interaction
 
Number of obs =         44    R-squared     =  0.3912
Root MSE      =    .318658    Adj R-squared =  0.3455
 
               Source | Partial SS         df         MS        F    Prob>F
          ------------+----------------------------------------------------
                Model |  2.6095358          3   .86984526      8.57  0.0002
                      |
                   uw |   .0399785          1    .0399785      0.39  0.5339
              Goodguy |  2.3984067          1   2.3984067     23.62  0.0000
          interaction |  .11373073          1   .11373073      1.12  0.2963
                      |
             Residual |  4.0617062         40   .10154266 
          ------------+----------------------------------------------------
                Total |   6.671242         43   .15514516

Thus it seems that there is a difference in how R/ Stata computes the interactions. The R output of the manually computed interaction matches the automatically computed interaction output in Stata.
And finally the descriptives from R:

Code:

> describe(RTG.UWD.short.50$vertrauen)
RTG.UWD.short.50$vertrauen
      n missing  unique    Info    Mean  
     44       0      43       1  0.5046
> describe(RTG.UWD.short.50$Group)
RTG.UWD.short.50$Group
      n missing  unique
     44       0       2
 
1 (34, 77%), 2 (10, 23%)
> describe(RTG.UWD.short.50$trustee)
RTG.UWD.short.50$trustee
      n missing  unique
     44       0       2
 
bad (22, 50%), good (22, 50%)

and from Stata:

Code:

. sum vertrauen uw Goodguy
 
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   vertrauen |         44    .5045969    .3938847    .000998          1
          uw |         44    .2272727    .4239151          0          1
     Goodguy |         44          .5    .5057805          0          1

Tags: anova, F-Statistics, stata

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

05 Jul 2016, 04:45

Hello LIsa,

Welcome to the Stata Forum.

It seems this issue is related to type II (sequential) sum of squares + unbalanced data + difference in the sequence of commands, I mean, the group variable is under a different sequence, when comparing both models, and that would change the results of a sequential ANOVA under unbalanced data.

You may wish to read this article to get further information.http://goanna.cs.rmit.edu.au/~fscholer/anova.php

Finally, I recommend you repeat the commands in R under the same sequence as you did in Stata, Please, share the results here in the Forum. My best bet: they will be quite similar.

Best,

Marcos

Last edited by Marcos Almeida; 05 Jul 2016, 04:47.

Best regards,

Marcos
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4457
#3

05 Jul 2016, 04:51

If you want SAS Type II sums of squares in Stata, you run anova . . ., sequential twice, reversing the position of the main effects factors, and using the sums of squares, F statistic etc. for the second main effects factor. So, you would

Code:

anova vertrauen Goodguy uw Goodguy#uw, sequential // Use the SS, MS and F from this for uw anova vertrauen uw Goodguy Goodguy#uw, sequential // Use the SS, MS and F from this for Goodguy
Comment

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

05 Jul 2016, 05:12

Maybe this will clarify what happens with a non-sequential versus a couple of models under a sequential ANOVA:

Code:

. use http://www.stata-press.com/data/r14/systolic.dta
(Systolic Blood Pressure Data)

. anova systolic drug disease drug#disease

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source | Partial SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                    drug |  2997.4719          3   999.15729      9.05  0.0001
                 disease |  415.87305          2   207.93652      1.88  0.1637
            drug#disease |  707.26626          6   117.87771      1.07  0.3958
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic drug disease drug#disease, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                    drug |  3133.2385          3   1044.4128      9.46  0.0001
                 disease |  418.83374          2   209.41687      1.90  0.1617
            drug#disease |  707.26626          6   117.87771      1.07  0.3958
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic disease drug drug#disease, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                 disease |  488.63938          2   244.31969      2.21  0.1210
                    drug |  3063.4329          3   1021.1443      9.25  0.0001
            drug#disease |  707.26626          6   117.87771      1.07  0.3958
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic drug drug#disease disease, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                    drug |  3133.2385          3   1044.4128      9.46  0.0001
            drug#disease |     882.45          6     147.075      1.33  0.2626
                 disease |     243.65          2     121.825      1.10  0.3405
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237  

. anova systolic disease drug#disease drug, sequential

                         Number of obs =         58    R-squared     =  0.4560
                         Root MSE      =    10.5096    Adj R-squared =  0.3259

                  Source |    Seq. SS         df         MS        F    Prob>F
            -------------+----------------------------------------------------
                   Model |  4259.3385         11   387.21259      3.51  0.0013
                         |
                 disease |  488.63938          2   244.31969      2.21  0.1210
            drug#disease |  2830.7412          6    471.7902      4.27  0.0017
                    drug |  939.95789          3    313.3193      2.84  0.0483
                         |
                Residual |  5080.8167         46   110.45254  
            -------------+----------------------------------------------------
                   Total |  9340.1552         57   163.86237

Hopefully that helps!

Best,

Marcos

Best regards,

Marcos

Comment

Lisa Rosenberger

Join Date: Jul 2016
Posts: 3

06 Jul 2016, 02:35

Thanks Markos & Joseph for your input!
When I read Markos´ first comment, I thought it made a lot of sense that the sequence of the factors would affect my F-values, after all I compute sequential SS. But, unfortunatelly, changing the order of the factors does not change my R output:

Code:

  > m2b<-lm(vertrauen~Group*trustee,data=RTG.UWD.short.50)
			> Anova(m2b,type="2")
			Anova Table (Type II tests)
			 
			Response: vertrauen
			                    Sum Sq Df F value    Pr(>F)   
			Group             0.0030  1  0.0292    0.8651   
			trustee            2.4928  1  24.5497 1.367e-05 ***
			Group:trustee  0.1137  1  1.1200    0.2963   
			Residuals        4.0617 40                     
			---
			Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



 			>

But then I looked at Markos link and figured that type 2 SS are actually not sequential SS, but it always takes the other main effect into account:

Type I, also called "sequential" sum of squares:

SS(A) for factor A.
SS(B | A) for factor B.
SS(AB | B, A) for interaction AB.
This tests the main effect of factor A, followed by the main effect of factor B after the main effect of A, followed by the interaction effect AB after the main effects.
Because of the sequential nature and the fact that the two main factors are tested in a particular order, this type of sums of squares will give different results for unbalanced data depending on which main effect is considered first.

Type II:

SS(A | B) for factor A.
SS(B | A) for factor B.
This type tests for each main effect after the other main effect.

Thus it makes actually sense that the output doesn´t change, or am I confused now?

Do you know what kind of SS are computed in Stata as a default, i.e. if I don´t specify that I want sequential SS (like I did in my original post)?

Comment

Joseph Coveney

Join Date: Apr 2014

Posts: 4457
#6

06 Jul 2016, 02:44

Originally posted by Lisa Rosenberger View Post

Do you know what kind of SS are computed in Stata as a default, i.e. if I don´t specify that I want sequential SS (like I did in my original post)?

From the Stata user's manual entry for anova: "If you are familiar with SAS, the sums of squares and the F statistic reported by Stata correspond to SAS type III sums of squares." ([R] Base Reference, anova, Technical Note, p. 24)
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

06 Jul 2016, 04:53

With regards to type II Sum of Squares, as underlined in the text I shared:

Computationally, this is equivalent to running a type I analysis with different orders of the factors, and taking the appropriate output.

Best regards,

Marcos
Comment

Lisa Rosenberger

Join Date: Jul 2016
Posts: 3

06 Jul 2016, 08:50

After reading Marcos´ link again more carefully, I figured out what the problem is. It´s not about what kind of SS are computed, but what contrasts are used. The default contrast in R is called contrast treatment, which compares every level with the baseline level and is non-orthogonal. But as Marcos´ link pointed out (which also referred to this very useful mail from John Fox: http://www.mail-archive.com/r-help@s.../msg69781.html) the contrasts must be orthogonal and thus sum to zero when computing type 3 SS. In R this is done by specifying contrast sum. When doing so, I get exactly the same output as the "default" output in Stata (where I don´t specify which SS I want to compute):

Code:

> options(contrasts=c('contr.sum','contr.poly'))
  > m2a<-lm(vertrauen~trustee*Group,data=RTG.UWD.short.50)
			> Anova(m2a,type="3")
			Anova Table (Type III tests)
			 
			Response: vertrauen
			              Sum Sq Df F value    Pr(>F)   
			(Intercept)   7.7042  1 75.8714 8.841e-11 ***
			trustee       1.2982  1 12.7844 0.0009316 ***
			Group         0.0030  1  0.0292 0.8651282   
			trustee:Group 0.1137  1  1.1200 0.2962617   
			Residuals     4.0617 40                     
			---
			Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



 			>

Thus, while Stata is specifying the correct contrast & type of SS by default, R does not - this is a shame, because I don´t think a lot of people bother to check what kind of SS to use and whether their contrast settings fit the type of SS or not...

Thanks again for this very useful link Marcos!

Comment

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

06 Jul 2016, 08:57

Great, Lisa. We may also observe that, contrary to the remaining outputs you showed in R (all of them with type II SS), the "correct one" (showed in #8) , I mean, the one whose results are similar to Stata's, underlines it is the type III SS.

Best regards,

Marcos
Comment

Announcement

F-test differences Stata and R

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment