Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Lisa Rosenberger
    started a topic F-test differences Stata and R

    F-test differences Stata and R

    This is a cross post from stackexchange, see: http://stats.stackexchange.com/quest...es-stata-and-r

    I have a question about what the difference is in how Stata and R compute ANOVAs. I have run exactly the same ANOVA in both softwares, but curiously get a different F-statistics for one of the predictors. I´m not too familiar with Stata, but as far as I understood it, I do a Type 2 SS ANOVA for both.
    To understand my output, this is my model:
    Outcome variable is a continuous variable called vertrauen (=trust)
    predictor 1 is a 2-level factor called trustee in R and Goodguy in Stata
    predictor 2 is also a 2 level factor called Group in R and uw in Stata.
    This is the R output:
    Code:
    > m2-lm(vertrauen~trustee*Group,data=RTG.UWD.short.50)
    > Anova(m2,type="2")
    > Anova Table (Type II tests)
    
    >Response: vertrauen
    >              Sum Sq Df F value    Pr(>F)   
    >trustee       2.4928  1 24.5497    1.367e-05 *** 
    >Group         0.0030  1  0.0292    0.8651     
    >trustee:Group 0.1137  1  1.1200    0.2963     
    >Residuals     4.0617 40                       
    > 
    >Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    >
    This is the Stata output:
    Code:
    . anova vertrauen uw Goodguy uw#Goodguy
     
    Number of obs =         44    R-squared     =  0.3912
    Root MSE      =    .318658    Adj R-squared =  0.3455
     
                   Source | Partial SS         df         MS        F    Prob>F
               -----------+----------------------------------------------------
                    Model |  2.6095358          3   .86984526      8.57  0.0002
                          |
                       uw |  .00296733          1   .00296733      0.03  0.8651
                  Goodguy |  1.2981586          1   1.2981586     12.78  0.0009
               uw#Goodguy |  .11373073          1   .11373073      1.12  0.2963
                          |
                 Residual |  4.0617062         40   .10154266 
               -----------+----------------------------------------------------
                    Total |   6.671242         43   .15514516
    As you can see, the F-statistics for the Group (UW) main effect and for the Group (UW) x trustee (Goodguy) interaction are the same, but for the trustee (Goodguy) main effect they differ. In R it´s almost twice as high as in Stata. I tried to change the order of the predictor and the reference levels, but that didn´t change my R output.
    Does anyone know what causes the difference in the F-statistic here? I´m really puzzled about it. I expected it to be the same.
    Here is the Stata output without the interaction:
    Code:
    . anova vertrauen uw Goodguy
     
    Number of obs =         44    R-squared     =  0.3741
    Root MSE      =    .319124    Adj R-squared =  0.3436
     
                   Source | Partial SS         df         MS        F    Prob>F
               -----------+----------------------------------------------------
                    Model |   2.495805          2   1.2479025     12.25  0.0001
                          |
                       uw |  .00296733          1   .00296733      0.03  0.8653
                  Goodguy |  2.4928377          1   2.4928377     24.48  0.0000
                          |
                 Residual |   4.175437         41   .10183993 
               -----------+----------------------------------------------------
                    Total |   6.671242         43   .15514516
    And here is the R output without the interaction:
    Code:
    > m2.4-lm(vertrauen~trustee+Group,data=RTG.UWD.short.50)
    > Anova(m2.4)
    Anova Table (Type II tests)
     
    Response: vertrauen
              Sum Sq Df F value    Pr(>F)   
    trustee   2.4928  1 24.4780 1.328e-05 ***
    Group     0.0030  1  0.0291    0.8653   
    Residuals 4.1754 41                     
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    >
    It´s the same, thus it has to do something with how the two softwares incorporate the interaction term.
    I also tried to manually compute the interaction term and found something interesting:
    Here is the R output:
    Code:
    RTG.UWD.short.50$interaction-as.numeric(RTG.UWD.short.50$trustee)*as.numeric(RTG.UWD.short.50$Group)
    > m2.7 Anova(m2.7)
    Anova Table (Type II tests)
     
    Response: vertrauen
                Sum Sq Df F value    Pr(>F)   
    trustee     1.2982  1 12.7844 0.0009316 ***
    Group       0.0030  1  0.0292 0.8651282   
    interaction 0.1137  1  1.1200 0.2962617   
    Residuals   4.0617 40                     
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    >
    And here is the Stata output:
    Code:
    . gen interaction=uw*Goodguy
     
    . anova vertrauen uw Goodguy interaction
     
    Number of obs =         44    R-squared     =  0.3912
    Root MSE      =    .318658    Adj R-squared =  0.3455
     
                   Source | Partial SS         df         MS        F    Prob>F
              ------------+----------------------------------------------------
                    Model |  2.6095358          3   .86984526      8.57  0.0002
                          |
                       uw |   .0399785          1    .0399785      0.39  0.5339
                  Goodguy |  2.3984067          1   2.3984067     23.62  0.0000
              interaction |  .11373073          1   .11373073      1.12  0.2963
                          |
                 Residual |  4.0617062         40   .10154266 
              ------------+----------------------------------------------------
                    Total |   6.671242         43   .15514516
    Thus it seems that there is a difference in how R/ Stata computes the interactions. The R output of the manually computed interaction matches the automatically computed interaction output in Stata.
    And finally the descriptives from R:
    Code:
    > describe(RTG.UWD.short.50$vertrauen)
    RTG.UWD.short.50$vertrauen
          n missing  unique    Info    Mean  
         44       0      43       1  0.5046
    > describe(RTG.UWD.short.50$Group)
    RTG.UWD.short.50$Group
          n missing  unique
         44       0       2
     
    1 (34, 77%), 2 (10, 23%)
    > describe(RTG.UWD.short.50$trustee)
    RTG.UWD.short.50$trustee
          n missing  unique
         44       0       2
     
    bad (22, 50%), good (22, 50%)
    and from Stata:
    Code:
    . sum vertrauen uw Goodguy
     
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
       vertrauen |         44    .5045969    .3938847    .000998          1
              uw |         44    .2272727    .4239151          0          1
         Goodguy |         44          .5    .5057805          0          1

  • Marcos Almeida
    replied
    Great, Lisa. We may also observe that, contrary to the remaining outputs you showed in R (all of them with type II SS), the "correct one" (showed in #8) , I mean, the one whose results are similar to Stata's, underlines it is the type III SS.

    Leave a comment:


  • Lisa Rosenberger
    replied
    After reading Marcos´ link again more carefully, I figured out what the problem is. It´s not about what kind of SS are computed, but what contrasts are used. The default contrast in R is called contrast treatment, which compares every level with the baseline level and is non-orthogonal. But as Marcos´ link pointed out (which also referred to this very useful mail from John Fox: http://www.mail-archive.com/r-help@s.../msg69781.html) the contrasts must be orthogonal and thus sum to zero when computing type 3 SS. In R this is done by specifying contrast sum. When doing so, I get exactly the same output as the "default" output in Stata (where I don´t specify which SS I want to compute):

    Code:
    > options(contrasts=c('contr.sum','contr.poly'))
      
    > m2a<-lm(vertrauen~trustee*Group,data=RTG.UWD.short.50) > Anova(m2a,type="3") Anova Table (Type III tests) Response: vertrauen Sum Sq Df F value Pr(>F) (Intercept) 7.7042 1 75.8714 8.841e-11 *** trustee 1.2982 1 12.7844 0.0009316 *** Group 0.0030 1 0.0292 0.8651282 trustee:Group 0.1137 1 1.1200 0.2962617 Residuals 4.0617 40 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    >
    Thus, while Stata is specifying the correct contrast & type of SS by default, R does not - this is a shame, because I don´t think a lot of people bother to check what kind of SS to use and whether their contrast settings fit the type of SS or not...

    Thanks again for this very useful link Marcos!

    Leave a comment:


  • Marcos Almeida
    replied
    With regards to type II Sum of Squares, as underlined in the text I shared:

    Computationally, this is equivalent to running a type I analysis with different orders of the factors, and taking the appropriate output.

    Leave a comment:


  • Joseph Coveney
    replied
    Originally posted by Lisa Rosenberger View Post
    Do you know what kind of SS are computed in Stata as a default, i.e. if I don´t specify that I want sequential SS (like I did in my original post)?
    From the Stata user's manual entry for anova: "If you are familiar with SAS, the sums of squares and the F statistic reported by Stata correspond to SAS type III sums of squares." ([R] Base Reference, anova, Technical Note, p. 24)

    Leave a comment:


  • Lisa Rosenberger
    replied
    Thanks Markos & Joseph for your input!
    When I read Markos´ first comment, I thought it made a lot of sense that the sequence of the factors would affect my F-values, after all I compute sequential SS. But, unfortunatelly, changing the order of the factors does not change my R output:

    Code:
      
    > m2b<-lm(vertrauen~Group*trustee,data=RTG.UWD.short.50) > Anova(m2b,type="2") Anova Table (Type II tests) Response: vertrauen Sum Sq Df F value Pr(>F) Group 0.0030 1 0.0292 0.8651 trustee 2.4928 1 24.5497 1.367e-05 *** Group:trustee 0.1137 1 1.1200 0.2963 Residuals 4.0617 40 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    >
    But then I looked at Markos link and figured that type 2 SS are actually not sequential SS, but it always takes the other main effect into account:


    Type I, also called "sequential" sum of squares:
    • SS(A) for factor A.
    • SS(B | A) for factor B.
    • SS(AB | B, A) for interaction AB.
    • This tests the main effect of factor A, followed by the main effect of factor B after the main effect of A, followed by the interaction effect AB after the main effects.
    • Because of the sequential nature and the fact that the two main factors are tested in a particular order, this type of sums of squares will give different results for unbalanced data depending on which main effect is considered first.
    Type II:
    • SS(A | B) for factor A.
    • SS(B | A) for factor B.
    • This type tests for each main effect after the other main effect.
    Thus it makes actually sense that the output doesn´t change, or am I confused now?

    Do you know what kind of SS are computed in Stata as a default, i.e. if I don´t specify that I want sequential SS (like I did in my original post)?

    Leave a comment:


  • Marcos Almeida
    replied
    Maybe this will clarify what happens with a non-sequential versus a couple of models under a sequential ANOVA:

    Code:
    . use http://www.stata-press.com/data/r14/systolic.dta
    (Systolic Blood Pressure Data)
    
    . anova systolic drug disease drug#disease
    
                             Number of obs =         58    R-squared     =  0.4560
                             Root MSE      =    10.5096    Adj R-squared =  0.3259
    
                      Source | Partial SS         df         MS        F    Prob>F
                -------------+----------------------------------------------------
                       Model |  4259.3385         11   387.21259      3.51  0.0013
                             |
                        drug |  2997.4719          3   999.15729      9.05  0.0001
                     disease |  415.87305          2   207.93652      1.88  0.1637
                drug#disease |  707.26626          6   117.87771      1.07  0.3958
                             |
                    Residual |  5080.8167         46   110.45254  
                -------------+----------------------------------------------------
                       Total |  9340.1552         57   163.86237  
    
    . anova systolic drug disease drug#disease, sequential
    
                             Number of obs =         58    R-squared     =  0.4560
                             Root MSE      =    10.5096    Adj R-squared =  0.3259
    
                      Source |    Seq. SS         df         MS        F    Prob>F
                -------------+----------------------------------------------------
                       Model |  4259.3385         11   387.21259      3.51  0.0013
                             |
                        drug |  3133.2385          3   1044.4128      9.46  0.0001
                     disease |  418.83374          2   209.41687      1.90  0.1617
                drug#disease |  707.26626          6   117.87771      1.07  0.3958
                             |
                    Residual |  5080.8167         46   110.45254  
                -------------+----------------------------------------------------
                       Total |  9340.1552         57   163.86237  
    
    . anova systolic disease drug drug#disease, sequential
    
                             Number of obs =         58    R-squared     =  0.4560
                             Root MSE      =    10.5096    Adj R-squared =  0.3259
    
                      Source |    Seq. SS         df         MS        F    Prob>F
                -------------+----------------------------------------------------
                       Model |  4259.3385         11   387.21259      3.51  0.0013
                             |
                     disease |  488.63938          2   244.31969      2.21  0.1210
                        drug |  3063.4329          3   1021.1443      9.25  0.0001
                drug#disease |  707.26626          6   117.87771      1.07  0.3958
                             |
                    Residual |  5080.8167         46   110.45254  
                -------------+----------------------------------------------------
                       Total |  9340.1552         57   163.86237  
    
    . anova systolic drug drug#disease disease, sequential
    
                             Number of obs =         58    R-squared     =  0.4560
                             Root MSE      =    10.5096    Adj R-squared =  0.3259
    
                      Source |    Seq. SS         df         MS        F    Prob>F
                -------------+----------------------------------------------------
                       Model |  4259.3385         11   387.21259      3.51  0.0013
                             |
                        drug |  3133.2385          3   1044.4128      9.46  0.0001
                drug#disease |     882.45          6     147.075      1.33  0.2626
                     disease |     243.65          2     121.825      1.10  0.3405
                             |
                    Residual |  5080.8167         46   110.45254  
                -------------+----------------------------------------------------
                       Total |  9340.1552         57   163.86237  
    
    . anova systolic disease drug#disease drug, sequential
    
                             Number of obs =         58    R-squared     =  0.4560
                             Root MSE      =    10.5096    Adj R-squared =  0.3259
    
                      Source |    Seq. SS         df         MS        F    Prob>F
                -------------+----------------------------------------------------
                       Model |  4259.3385         11   387.21259      3.51  0.0013
                             |
                     disease |  488.63938          2   244.31969      2.21  0.1210
                drug#disease |  2830.7412          6    471.7902      4.27  0.0017
                        drug |  939.95789          3    313.3193      2.84  0.0483
                             |
                    Residual |  5080.8167         46   110.45254  
                -------------+----------------------------------------------------
                       Total |  9340.1552         57   163.86237

    Hopefully that helps!

    Best,

    Marcos

    Leave a comment:


  • Joseph Coveney
    replied
    If you want SAS Type II sums of squares in Stata, you run anova . . ., sequential twice, reversing the position of the main effects factors, and using the sums of squares, F statistic etc. for the second main effects factor. So, you would
    Code:
    anova vertrauen Goodguy uw Goodguy#uw, sequential // Use the SS, MS and F from this for uw
    anova vertrauen uw Goodguy Goodguy#uw, sequential // Use the SS, MS and F from this for Goodguy

    Leave a comment:


  • Marcos Almeida
    replied
    Hello LIsa,

    Welcome to the Stata Forum.

    It seems this issue is related to type II (sequential) sum of squares + unbalanced data + difference in the sequence of commands, I mean, the group variable is under a different sequence, when comparing both models, and that would change the results of a sequential ANOVA under unbalanced data.

    You may wish to read this article to get further information.http://goanna.cs.rmit.edu.au/~fscholer/anova.php

    Finally, I recommend you repeat the commands in R under the same sequence as you did in Stata, Please, share the results here in the Forum. My best bet: they will be quite similar.

    Best,

    Marcos
    Last edited by Marcos Almeida; 05 Jul 2016, 05:47.

    Leave a comment:

Working...
X