  • Panel data quality with REGHDFE

    Hi all, I am relatively new to Stata and to panel data. I have a dataset covering 26 states, distributed across five socioeconomic macro-regions, from 2004 to 2015. The panel is strongly balanced and, since I have multilevel time-variant fixed effects, I ran the reghdfe command. My goal is to confirm the association between the dependent variable (imrr) and my independent variables. I clustered on the interaction of a factor variable (idh_f), macro-region (mr_id), and year. My results seem fine, but I am a little concerned about the loss of degrees of freedom (119) and about the quality of the estimated parameters and of the model overall. I believe the Root MSE is acceptable and the R-squared and adjusted R-squared look fine as well. As far as I know I used a fair number of clusters (120), but I am not sure whether this loss of degrees of freedom affects the quality of the model.
    Could anyone help me evaluate my model?


    . reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id) vce(cluster idh_f#mr_id#year)
    (MWFE estimator converged in 3 iterations)

    HDFE Linear regression Number of obs = 312
    Absorbing 2 HDFE groups F( 8, 119) = 58.12
    Statistics robust to heteroskedasticity Prob > F = 0.0000
    R-squared = 0.8063
    Adj R-squared = 0.7979
    Within R-sq. = 0.6538
    Number of clusters (idh_f#mr_id#year) = 120
    Root MSE = 1.6719

    (Std. Err. adjusted for 120 clusters in idh_f#mr_id#year)
    ------------------------------------------------------------------------------
    | Robust
    imrr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    occ_1 | -.213045 .0614771 -3.47 0.001 -.3347758 -.0913143
    pib | -.0000948 .0000188 -5.04 0.000 -.000132 -.0000576
    pbf | -.0844561 .0278709 -3.03 0.003 -.1396433 -.029269
    gi | -.1085079 .0285647 -3.80 0.000 -.1650688 -.0519469
    ta | .2166834 .1631563 1.33 0.187 -.1063824 .5397491
    tf | 4.887433 .4666264 10.47 0.000 3.963466 5.8114
    prenat | -1.33e-06 5.71e-07 -2.33 0.021 -2.46e-06 -2.00e-07
    int_sinv | .0000319 7.36e-06 4.34 0.000 .0000174 .0000465
    _cons | 38.29141 6.988698 5.48 0.000 24.45309 52.12973
    ------------------------------------------------------------------------------

    Absorbed degrees of freedom:
    -----------------------------------------------------+
    Absorbed FE | Categories - Redundant = Num. Coefs |
    -------------+---------------------------------------|
    idh_f | 2 0 2 |
    mr_id | 5 1 4 |
    -----------------------------------------------------+


    Thanks all.

    Alexandre Bugelli

  • #2
    The loss of degrees of freedom comes from the use of cluster robust standard errors. If you remove the clustering, the degrees of freedom will return to the number of observations minus the number of predictors minus 1. But then your standard errors are based on assuming homoscedasticity and independence within clusters--which may be dubious assumptions. You could re-run the model omitting the clustering and see what happens. The coefficient estimates will be identical. The standard errors will change. But if they are almost the same, you could then use the unclustered results, if, for some reason, you feel more comfortable with a larger number of degrees of freedom. To be honest, I can't think of any real reason to care about the number of degrees of freedom here, but apparently you do.
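
    For concreteness, a minimal sketch of that comparison, assuming the variable names from post #1:
    Code:
    reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, ///
        absorb(idh_f mr_id) vce(cluster idh_f#mr_id#year)
    estimates store clustered

    reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id)
    estimates store unclustered

    * Compare coefficients and standard errors side by side
    estimates table clustered unclustered, b(%9.4f) se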

    As for your R2, it's beyond fine. It's fantastic for socio-economic variables. If anything, it's so good that some people may wonder if you faked the data! What were you expecting? If you are analyzing data from physics experiments you have a right to sneer at R2 = 0.95 even. But socio-economic variables are very noisy, and R2 = 0.65 is amazingly good, bordering on too good to be true.

    The RMSE cannot be judged without the context of the variance of the outcome variable. But the judgment with that context is nothing more or less than R2, which, as already noted, is excellent in this context.
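
    As a small illustration of that point, assuming the model from post #1 has just been run (reghdfe stores the root MSE in e(rmse)):
    Code:
    * Compare the RMSE with the spread of the outcome variable
    summarize imrr
    display "RMSE as a fraction of SD(imrr) = " %5.3f e(rmse)/r(sd)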



    • #3
      Thanks Clyde for your answer.
      Perhaps I am a little concerned because someone told me that 312 observations may be too small a sample for so many independent variables, especially, as you said, with socioeconomic variables. I chose reghdfe mainly because of the large socioeconomic disparities among states and regions. Indeed, I spent a long time searching for good-quality variables: I ran many -xtreg- FE models and the results were very fuzzy, with inverted signs, high p-values, and so on. I then changed the conception of some critical variables according to my proposal; for example, I had some variables expressed as rates over the same population base and replaced them with absolute values, since I figured that what matters most is how the variables vary with respect to the dependent variable. Fortunately the results improved and the variables now reach the quality level my research needs.
      I ran the model without clustering, as you suggested (great idea, thank you), and here are the results. As you mentioned, we should view "too good to be true" results with some suspicion, but hopefully hard work ends well. I suppose the loss of degrees of freedom is not critical in my case; like you, I was just suspicious about the statistics. I think the unclustered model is more "fitted" in statistical terms, but the clustered one is more in line with the theoretical framework of my study.

      Here are the results.
      Any suggestions for testing the model, besides -test-?
      Thank you once again for your reply.

      . reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id)
      (dropped 2 singleton observations)
      (MWFE estimator converged in 3 iterations)

      HDFE Linear regression Number of obs = 307
      Absorbing 2 HDFE groups F( 8, 293) = 69.86
      Prob > F = 0.0000
      R-squared = 0.8061
      Adj R-squared = 0.7975
      Within R-sq. = 0.6560
      Root MSE = 1.6749

      ------------------------------------------------------------------------------
      imrr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      occ_1 | -.219474 .0608284 -3.61 0.000 -.33919 -.099758
      pib | -.0000875 .0000245 -3.57 0.000 -.0001358 -.0000392
      pbf | -.0826775 .0208863 -3.96 0.000 -.1237836 -.0415714
      gi | -.0996322 .0275957 -3.61 0.000 -.1539432 -.0453213
      ta | .2385579 .1342139 1.78 0.077 -.0255875 .5027033
      tf | 4.996489 .4951117 10.09 0.000 4.022063 5.970915
      prenat | -1.36e-06 6.49e-07 -2.09 0.037 -2.64e-06 -8.13e-08
      int_sinv | .0000321 6.50e-06 4.94 0.000 .0000193 .0000449
      _cons | 37.70095 6.935863 5.44 0.000 24.05053 51.35138
      ------------------------------------------------------------------------------

      Absorbed degrees of freedom:
      -----------------------------------------------------+
      Absorbed FE | Categories - Redundant = Num. Coefs |
      -------------+---------------------------------------|
      idh_f | 2 0 2 |
      mr_id | 5 1 4 |
      -----------------------------------------------------+



      • #4
        Well, I agree that 312 observations is cutting it a bit close. At 8 predictors that's a bit under 40 observations per predictor. Not optimal, and perhaps a bit skimpy, but not at a level where serious overfitting of the noise looms large.



        • #5
          Sorry, just to correct my last post: here is the model run without clustering, this time with heteroskedasticity-robust standard errors (vce(robust)).

          Thank you again.

          reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id) vce(robust) summ
          (dropped 2 singleton observations)
          (MWFE estimator converged in 3 iterations)

          HDFE Linear regression Number of obs = 307
          Absorbing 2 HDFE groups F( 8, 293) = 56.84
          Prob > F = 0.0000
          R-squared = 0.8061
          Adj R-squared = 0.7975
          Within R-sq. = 0.6560
          Root MSE = 1.6749

          ------------------------------------------------------------------------------
          | Robust
          imrr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
          -------------+----------------------------------------------------------------
          occ_1 | -.219474 .0700025 -3.14 0.002 -.3572455 -.0817026
          pib | -.0000875 .0000213 -4.10 0.000 -.0001295 -.0000455
          pbf | -.0826775 .0258893 -3.19 0.002 -.1336301 -.0317249
          gi | -.0996322 .0273527 -3.64 0.000 -.1534649 -.0457996
          ta | .2385579 .1435229 1.66 0.098 -.0439085 .5210243
          tf | 4.996489 .5192523 9.62 0.000 3.974552 6.018426
          prenat | -1.36e-06 5.97e-07 -2.28 0.024 -2.53e-06 -1.84e-07
          int_sinv | .0000321 6.67e-06 4.82 0.000 .000019 .0000453
          _cons | 37.70095 7.749275 4.87 0.000 22.44966 52.95225
          ------------------------------------------------------------------------------

          Absorbed degrees of freedom:
          -----------------------------------------------------+
          Absorbed FE | Categories - Redundant = Num. Coefs |
          -------------+---------------------------------------|
          idh_f | 2 0 2 |
          mr_id | 5 1 4 |
          -----------------------------------------------------+

          Regression Summary Statistics:
          -----------------------------------------------
          Variable | mean min max
          -------------+---------------------------------
          imrr | 18.54111 11.355 28.84
          occ_1 | 92.4486 84.571 96.992
          pib | 14306.01 2933.35 40608.7
          pbf | 23.99055 2.08 48.25
          gi | 77.31922 66 91
          ta | 4.79355 1.36 11.18
          tf | 2.088046 1.55 3.33
          prenat | 335640.2 11519 1300000
          int_sinv | 29521.43 798 125395
          -----------------------------------------------



          • #6
            Hi Clyde,

            Sorry to come back to this subject. I revised my dataset and variables and ran into a problem when modeling new code for new variables.
            Just as a reminder: imrr is a health outcome, and the regressors are "occ_1" (unemployment rate lagged one year), "pib" (GDP), "policy_cover" (a social policy coverage), "tf" (fertility rate) and "resc" (an educational indicator), all socioeconomic variables; plus "med" and "enf" (numbers of health professionals) and "adms" (hospital admissions), which are health services indicators. I have 26 states (panels), nested in 5 macro-socioeconomic regions, over 12 years, giving 312 observations. I get statistically significant estimates for almost all parameters except "pib" (GDP), "med" and "enf", all calculated per thousand inhabitants (using the population of each state/panel). I dropped IDH (Human Development Index) from the model, since it is already nested in the socioeconomic macro-regions ("mr_id"). (The mr_id's are distributed as follows: 1 = North = 7 states; 2 = North-east = 9 states; 3 = South-east = 4 states; 4 = South = 3 states; 5 = Center-west = 3 states, so the clusters are unequally distributed.) I keep the model nested at the state (id) and year level, with -xtset id year-, and cluster on the interaction of macro-region and year.

            I have two questions, maybe you could waste a little of your time helping me.

            I suspect there is some collinearity among the variables. That is not a big deal, since reghdfe identifies and drops all collinear variables. However, I predicted the residuals after running the code below:

            CODE: reghdfe imrr occ_1 pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)

            I found that the residuals are not normally distributed, so I applied a log transformation to the health indicator, the dependent variable "imrr", as in the code below:
            CODE: reghdfe limrr occ_1 pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)

            1. Is it correct to apply a log transformation with reghdfe, given that this command already accounts for heteroskedasticity and correlation?

            2. As unemployment (or employment) rates have a cumulative effect over time, is it correct to use the "L." lag operator with occ_1 (unemployment rate) to estimate the effect of structural unemployment (more than one period of unemployment), as in the code below:

            CODE: reghdfe limrr L1.c.occ#L2.c.occ#L3.c.occ pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)

            Or maybe there is another option, since I already have lagged unemployment variables (occ_1, occ_2 and occ_3: lagged 1, 2 and 3 years)?

            Thanks in advance

            Alexandre Bugelli



            • #7
              I have two questions, maybe you could waste a little of your time helping me.
              I won't enumerate here the many benefits I derive from my Statalist activities, but suffice it to say that when I see a post that responding to would be a waste of my time, I just pass it by.

              1. Is it correct to apply a log transformation with reghdfe, given that this command already accounts for heteroskedasticity and correlation?
              There are two different issues raised by your question in the context of your post. The first is whether it is appropriate to log-transform your outcome variable in response to the non-normality of residuals. To that my answer is: no. Normality of residuals is a sufficient, but not necessary condition for valid t-, z-, and F- statistic based inferences in regression. In a sufficiently large sample (and I would judge yours to be large enough for this purpose) the central limit theorem kicks in and makes the various components of those test statistics that ought to be normal actually be (asymptotically) normal. So the log transformation is unnecessary for this purpose. There remains, however, the modeling issue: does the regression model provide a better fit to the data if the outcome variable is log transformed? To decide that you have to explore the results of both models in your data. I would explore plots of predicted vs observed outcomes with both models and see which looks better. If the logged outcome is better fit, then go with it. Otherwise not.
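
              A minimal sketch of that comparison, using the variable names from post #6 (limrr is assumed to be ln(imrr), and the generated variable names are hypothetical); reghdfe's residuals() option stores observed minus fitted, so fitted values can be recovered directly:
              Code:
              gen limrr = ln(imrr)

              * Level model: residuals() saves imrr minus its fitted value
              reghdfe imrr occ_1 pib policy_cover tf resc med enf adms, ///
                  absorb(mr_id) vce(cluster mr_id#year) residuals(res_level)
              gen fit_level = imrr - res_level

              * Log model
              reghdfe limrr occ_1 pib policy_cover tf resc med enf adms, ///
                  absorb(mr_id) vce(cluster mr_id#year) residuals(res_log)
              gen fit_log = limrr - res_log

              * Predicted vs observed plots, one per model
              scatter imrr fit_level, name(g_level, replace)
              scatter limrr fit_log, name(g_log, replace)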

              2. As unemployment (or employment) rates have a cumulative effect over time, is it correct to use the "L." lag operator with occ_1 (unemployment rate) to estimate the effect of structural unemployment (more than one period of unemployment), as in the code below:

              CODE: reghdfe limrr L1.c.occ#L2.c.occ#L3.c.occ pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)
              I would not hesitate to say that it is fine to add a single lag (i.e. any one of L1.occ, L2.occ, or L3.occ) to the model. But when you start involving multiple lags of the same variable, you may introduce serial correlation into the error structure, and I am not certain that using cluster-robust standard errors deals effectively with that. This kind of thing really doesn't come up in my line of work, so I have never delved into the issue in depth. It is common in finance and econometrics, however, and I hope that somebody from one of those disciplines will respond to this question, as I am not confident of my answer here.
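
              For reference, a sketch of the single-lag version, assuming the panel has been declared with -xtset id year- as in post #6 and using the poster's variable names:
              Code:
              xtset id year
              * A single lag of unemployment via the lag operator
              reghdfe limrr L1.occ pib policy_cover tf resc med enf adms, ///
                  absorb(mr_id) vce(cluster mr_id#year)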



              • #8
                Thank you once more, Clyde, for your attention and your comments.
                I understand your points on both questions. Especially for the second one, I think you are right about the risk of introducing serial correlation into the error structure. My purpose is simply to infer and analyze the possible association between the health outcome and the socioeconomic variables.
                Thanks, again.



                • #9
                  Hi Clyde,
                  I hope you and yours are doing well through this pandemic crisis.



                  • #10
                    I hope everyone here is doing well during this pandemic crisis.
                    I wonder if someone could help me understand when to use the -areg- vs -reghdfe- commands, as they produce exactly the same results?



                    • #11
                      Alexandre:
                      the main difference is that -areg-, unlike the community-contributed programme -reghdfe-, does not support absorbing more than one variable:
                      Code:
                      . use "https://www.stata-press.com/data/r16/nlswork.dta"
                      (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
                      
                      . xtset idcode year
                             panel variable:  idcode (unbalanced)
                              time variable:  year, 68 to 88, but with gaps
                                      delta:  1 unit
                      
                      . reghdfe ln_wage wks_ue , abs(idcode year)
                      (dropped 716 singleton observations)
                      (converged in 9 iterations)
                      
                      HDFE Linear regression                            Number of obs   =     22,114
                      Absorbing 2 HDFE groups                           F(   1,  18170) =      18.72
                                                                        Prob > F        =     0.0000
                                                                        R-squared       =     0.6380
                                                                        Adj R-squared   =     0.5594
                                                                        Within R-sq.    =     0.0010
                                                                        Root MSE        =     0.3083
                      
                      ------------------------------------------------------------------------------
                           ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                            wks_ue |  -.0014705   .0003399    -4.33   0.000    -.0021368   -.0008043
                      -------------+----------------------------------------------------------------
                          Absorbed |    F(3942, 18170) =      7.944   0.000             (Joint test)
                      ------------------------------------------------------------------------------
                      
                      Absorbed degrees of freedom:
                      ---------------------------------------------------------------+
                       Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     |
                      -------------+-------------------------------------------------|
                            idcode |         3929            3929              0     |
                              year |           14              15              1     |
                      ---------------------------------------------------------------+
                      
                       . areg ln_wage wks_ue, abs(idcode year) vce(cluster idcode)
                      absorb():  too many variables specified
                      r(103);
                      
                      .
                      Moreover, -areg-, unlike -reghdfe-, was not specifically developed for panel data regression.
                      For more details, I would consider the -areg- entry in the Stata .pdf manual.
                      Kind regards,
                      Carlo
                      (Stata 19.0)



                      • #12
                        Hi Carlo, thank you for the explanation. As I understand it, reghdfe is better suited to multiway panel data. In my research I need to absorb only year and use vce(cluster) for macro-region interacted with year.
                        I have a model that is a variant of a basic model I ran with 4 levels, but I decided to drop one socioeconomic level and use stratified income instead.
                        Here is my code:


                        Attached Files



                        • #13
                          Sorry, I forgot to explain: mr is the socioeconomic macro-region, nmr is a health outcome, and the other variables are mainly socioeconomic and health factors.

                          Regards.

                          Alexandre



                          • #14
                            Alexandre:
                            the code you shared seems in line with what you're after.
                            In addition, 60 clusters are enough for invoking non-default standard errors (obviously, during your Stata session you typed -cluster- instead of -clsuter-).
                            As an aside, for the future, please use CODE delimiters to share what you typed and what Stata gave you back. Thanks.
                            Kind regards,
                            Carlo
                            (Stata 19.0)



                            • #15
                              Hi Carlo. Here is the right code: reghdfe nmr occ rgdp bfpcov fr eda lbpre twc tsw, absorb(year) vce(cluster mr#year). I just replaced mr_year with mr#year and got the same results.
                              Best regards, and thank you so much.
