  • R2 negatively correlated with F-stat of same set of variables

    I am playing around with FE regression models, including vs. excluding FEs. My data is at the physician-hospital-month level. My outcome variable measures how much money a given physician spends in a given hospital in a given month.

    First, I include FEs of year/month and hospital (named cnes below). [I also include a categorical variable for physician age.] The R2 is around 90% and the F-statistic of a set of variables (which I intend to use later as instruments) is around 52.

    When I drop the hospital FE, my R2 decreases massively to around 3%, which tells us that the hospital is a strong determinant of physicians' recorded costs. To my surprise, the F-statistic of the same set of variables moves massively in the other direction: it is almost 700.

    Is it normal for R2 to be negatively correlated with the F-stat? I read here that the relationship is expected to be positive: https://stats.stackexchange.com/ques...e%20non%2Dzero.


    Code:
    . reghdfe avg_peer_val iv_age iv_fem iv_uni pat_fem pat_age, absorb(ym cnes age_int) vce(cluster pf_cpfid)
    (dropped 79 singleton observations)
    (MWFE estimator converged in 6 iterations)
    
    HDFE Linear regression                            Number of obs   =  7,148,919
    Absorbing 3 HDFE groups                           F(   5, 141296) =      39.71
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.8983
                                                      Adj R-squared   =     0.8982
                                                      Within R-sq.    =     0.0001
    Number of clusters (pf_cpfid) =    141,297        Root MSE        =   679.9987
    
                             (Std. Err. adjusted for 141,297 clusters in pf_cpfid)
    ------------------------------------------------------------------------------
                 |               Robust
    avg_peer_val |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          iv_age |   -1.87275   .2679516    -6.99   0.000     -2.39793    -1.34757
          iv_fem |  -75.88635   8.992532    -8.44   0.000    -93.51154   -58.26115
          iv_uni |   18.55822   7.547507     2.46   0.014     3.765247    33.35118
         pat_fem |  -7.576075   1.116572    -6.79   0.000    -9.764535   -5.387615
         pat_age |  -.0107603   .0209083    -0.51   0.607    -.0517401    .0302195
           _cons |   2243.674   12.12901   184.98   0.000     2219.902    2267.447
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
              ym |        90           0          90     |
            cnes |      4726           1        4725     |
         age_int |        17           1          16    ?|
    -----------------------------------------------------+
    ? = number of redundant parameters may be higher
    
    . test iv_age iv_fem iv_uni
    
     ( 1)  iv_age = 0
     ( 2)  iv_fem = 0
     ( 3)  iv_uni = 0
    
           F(  3,141296) =   52.74
                Prob > F =    0.0000
    
    . 
    end of do-file
    
    . do "C:\Users\Paula\AppData\Local\Temp\STD33cc_000000.tmp"
    
    . reghdfe avg_peer_val iv_age iv_fem iv_uni pat_fem pat_age, absorb(ym age_int) vce(cluster pf_cpfid)
    (MWFE estimator converged in 4 iterations)
    
    HDFE Linear regression                            Number of obs   =  7,148,998
    Absorbing 2 HDFE groups                           F(   5, 141344) =    1173.01
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.0330
                                                      Adj R-squared   =     0.0330
                                                      Within R-sq.    =     0.0260
    Number of clusters (pf_cpfid) =    141,345        Root MSE        =  2096.0537
    
                             (Std. Err. adjusted for 141,345 clusters in pf_cpfid)
    ------------------------------------------------------------------------------
                 |               Robust
    avg_peer_val |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          iv_age |  -31.59436   1.221342   -25.87   0.000    -33.98816   -29.20055
          iv_fem |    411.207   43.82209     9.38   0.000     325.3165    497.0974
          iv_uni |   738.8013   27.60647    26.76   0.000     684.6931    792.9094
         pat_fem |  -386.0751   8.452463   -45.68   0.000    -402.6417   -369.5084
         pat_age |   10.79922   .2144286    50.36   0.000     10.37895     11.2195
           _cons |   2902.447   62.06236    46.77   0.000     2780.806    3024.088
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
              ym |        90           0          90     |
         age_int |        17           1          16     |
    -----------------------------------------------------+
    
    . test iv_age iv_fem iv_uni // when excluding hospital FE, R2 decreases by a lot and F-stat increases
    
     ( 1)  iv_age = 0
     ( 2)  iv_fem = 0
     ( 3)  iv_uni = 0
    
           F(  3,141344) =  698.31
                Prob > F =    0.0000

  • #2
    It seems that the variables iv_age, iv_fem, and iv_uni have very different distributions in the different hospitals. When you include the hospital effects, those fixed effects absorb much of the variation associated with those three variables, and the hospital fixed effects also explain a large proportion of the outcome variance. When you remove the hospital effects from the model, that explained variance is lost from the model, hence the large drop in R2. But, at the same time, these three variables, iv_age, iv_fem, and iv_uni, now pick up some of the information that was previously attributed to the hospital effects. Notice that the coefficients of these variables are now much larger in magnitude than they were in the model that included fixed effects. Their standard errors are larger too, but the coefficients have grown by much more, so the F-statistic has grown with them.

    It is not uncommon to see this kind of thing happen. It highlights the importance of focusing on regression coefficients first and foremost, and interpreting test statistics in light of the coefficients. Frankly, your sample is so large that test statistics are almost meaningless anyway, even for people who take significance testing seriously.
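
    A minimal simulation sketch of this mechanism, using made-up data rather than the data from post #1. The variable names (hosp, h, x, y), the sample sizes, and every coefficient below are assumptions chosen only for illustration. The regressor x is sorted across hospitals, and the hospital effect dominates the outcome, so dropping the hospital FE lowers R2 while the F-statistic on x rises.

```stata
clear all
set seed 12345

* Hypothetical panel: 200 hospitals with 50 observations each
set obs 200
gen hosp = _n
gen h = rnormal()                // unobserved hospital effect (assumed)
expand 50

* x varies mostly within hospital but is shifted by the hospital effect
gen x = 0.5*h + rnormal()

* Small true effect of x; the hospital effect drives most of y
gen y = 0.1*x + 10*h + rnormal(0, 3)

* Share of the variance of x explained by hospital dummies alone
areg x, absorb(hosp)
scalar share_x = e(r2)

* Model with hospital FE
areg y x, absorb(hosp) vce(cluster hosp)
scalar r2_fe = e(r2)
test x
scalar F_fe = r(F)

* Model without hospital FE: x now proxies for the omitted hospital effect
regress y x, vce(cluster hosp)
scalar r2_nofe = e(r2)
test x
scalar F_nofe = r(F)

display "Share of var(x) explained by hospital: " %5.3f share_x
display "With hospital FE:    R2 = " %5.3f r2_fe   "   F = " %9.2f F_fe
display "Without hospital FE: R2 = " %5.3f r2_nofe "   F = " %9.2f F_nofe
```

    In this sketch the coefficient on x without hospital FE mixes the small direct effect with the hospital effect it proxies for. On real data, the analogous first check would be something like areg iv_age, absorb(cnes), to see how much of each variable's variation lies between hospitals.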



    • #3
      Thanks a million, Clyde Schechter! It makes a lot of sense, thank you for the clarification!

      I am still not clear about your very first and very last point.

      "It seems that the variables iv_age, iv_fem, and iv_uni have very different distributions in the different hospitals."
      I am not sure I understood this. These variables are supposed to be strongly correlated with hospital FE, right?

      "Frankly, your sample is so large that test statistics are almost meaningless anyway, even for people who take significance testing seriously."
      Could you recommend further reading on this? I will use instrumental variables, which is why I care so much about the F-stat. Should we treat F-stats that are very high (around 700) as a little suspicious?

      Thanks again!



      • #4
        here are a couple of cites to get you started:

        Lin, M., et al. (2013), "Too big to fail: large samples and the p-value problem," Information Systems Research, 24(4): 906-917
        Callegaro, A., et al. (2019), "A note on tests for relevant differences with extremely large sample sizes," Biometrical Journal, 61(1): 162-165

        but note that there is lots of lit on this and lots of knowledge about the effect of sample size on p-values



        • #5
          "It seems that the variables iv_age, iv_fem, and iv_uni have very different distributions in the different hospitals."

          "I am not sure I understood this. These variables are supposed to be strongly correlated with hospital FE, right?"

          I'm not sure what you are asking here. While I can guess what iv_age and iv_fem might be, I don't have a guess what iv_uni is, and my guesses about the first two may be wrong. Whether they are "supposed" to be strongly correlated with hospital FE would depend on what they actually are and on facts about the hospitals in your sample, information that is not available to me. Whether they are supposed to be strongly correlated with hospital FE or not, what I'm saying is that the results you show suggest that they are strongly correlated with the hospital FE, and it is this association that leads to the rising F in the face of falling R2.



          • #6
            Paula:
            as an aside to Clyde's helpful reply, I would only add that the within R-sq (the one that you should monitor when you deal with the -fe- estimator in panel data regression) is actually increasing (and remarkably so, from 0.0001 to 0.0260) in your second regression.
            Kind regards,
            Carlo
            (Stata 19.0)
