
  • Generate residuals for all observations (outside e(sample)) using reghdfe

    I’m seeing that when I run:
    reghdfe dep_var if treatID == 0, absorb(id year) resid(resid_all)
    the stored residuals (resid_all) are only non-missing for the treatID == 0 observations (i.e., those actually used in the regression). All treatID == 1 rows show missing.

    Because I restrict the estimation to the control subsample, reghdfe doesn’t compute or cache FE-adjusted predictions or residuals for the excluded (treated) group. However, for my two-stage DID I need residuals (or fitted values) across the entire panel, not just the sample that was regressed on.
    I cannot simply switch to regress + predict because my dataset is enormous (millions of obs) and I have dozens of high-dimensional fixed effects—regress cannot handle that many FEs efficiently.
    Has anyone discovered a workaround or option—within reghdfe or via a small manual step—that allows one to generate residuals for every observation, including those outside of e(sample)? Thanks!

  • #2
    "regress cannot handle that many FEs efficiently"
    Yes it can, if you are using the current version of Stata. -regress- now has an -absorb()- option.

    Of course, if you just do a regression on the entire sample, you will get different results from what you got on the treatID == 0 sample. Now, I suspect that the motivation for doing that subset analysis in the first place is that you believe there is, or may well be, a difference in what you are estimating between the treatID == 0 group and the complementary group. So why not do
    Code:
    reghdfe dep_var i.treatID, absorb(id year) resid(resid_all)
    That will reflect any difference across the treatID groups. Or, if what you are really after is a model in which the difference between the treatID groups is deliberately ignored (constrained to be 0), -reghdfe dep_var, absorb(id year) resid(resid_all)- will do that.


    • #3
      Hi Clyde, thanks for your reply.
      I’m using Kyle Butts’s heterogeneity-robust two-stage DID estimator (did2s) on a massive panel with dozens of high-dimensional fixed effects. I’m following his “Large Datasets or Many Fixed Effects” workflow (https://github.com/kylebutts/did2s_stata):
      1. Stage 1: Regress y on all fixed effects using only the control group (treat == 0), then predict residuals for every observation in the full sample.
      2. Stage 2: Regress those full-sample residuals on the treatment indicator (treat).
      It’s critical that I residualize exclusively on the control subsample but still obtain residuals for all units before moving to Stage 2. Since I’m on Stata 16 MP and don’t have the built-in regress, absorb() option, I’ll try upgrading to Stata 18 and see if that resolves the issue—any other tips or workarounds would be greatly appreciated!


      • #4
        Dear Meng Zhang,

        See if the following trick works.

        Code:
        clear all
        sysuse auto
        reghdfe price mpg if rep78!=., a(fe=foreign)
        predict xb
        qui egen double pair_FE=max(fe), by(foreign)
        g yhat=xb+pair_FE
        reg price mpg i.foreign if rep78!=.
        predict yhat1
        su y*
        Best wishes,

        Joao
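
        A minimal sketch of why the trick above yields residuals for every observation, assuming reghdfe is installed. The names fe_all, resid_all, and resid_chk are illustrative and do not come from the thread. reghdfe fills the saved fixed effect only inside e(sample), so the group-wise maximum copies each estimate to the excluded rows of the same group:

        ```stata
        clear all
        sysuse auto
        * Estimate on the subsample only; fe is filled in for e(sample) rows
        reghdfe price mpg if rep78 != ., absorb(fe = foreign)
        predict double xb, xb
        * Copy each group's estimated effect to the rows outside e(sample)
        egen double fe_all = max(fe), by(foreign)
        * Residual for every observation, inside or outside the estimation sample
        generate double resid_all = price - xb - fe_all
        * Cross-check against the dummy-variable regression on the same subsample
        regress price mpg i.foreign if rep78 != .
        predict double resid_chk, residuals
        summarize resid_all resid_chk
        ```

        The same pattern should extend to several absorbed dimensions by saving one effect per dimension and spreading each with its own egen ..., by() call. A group with no observations in the estimation sample has no estimated effect, so its residuals stay missing.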


        • #5
          Originally posted by Joao Santos Silva
          Dear Meng Zhang,

          See if the following trick works.

          Code:
          clear all
          sysuse auto
          reghdfe price mpg if rep78!=., a(fe=foreign)
          predict xb
          qui egen double pair_FE=max(fe), by(foreign)
          g yhat=xb+pair_FE
          reg price mpg i.foreign if rep78!=.
          predict yhat1
          su y*
          Best wishes,

          Joao
          Dear Joao,

          Thank you for your rapid and insightful reply—it works beautifully. I also extended your approach to cases with multiple fixed effects, and it functions as expected. Below is the code I used:

          Code:
          use https://github.com/kylebutts/did2s_stata/raw/main/data/df_het.dta, clear
          egen unique_id = group(state unit)

          capture program drop did2s_est

          program did2s_est, rclass
              version 13.0
              * Stage 1: estimate the fixed effects on the control observations only
              reghdfe dep_var if treat == 0, absorb(new_id year, savefe)
              cap drop xb pair_FE1 pair_FE2
              predict xb
              * Spread each estimated fixed effect to the rows outside e(sample)
              qui egen double pair_FE1 = max(__hdfe1__), by(new_id)
              qui egen double pair_FE2 = max(__hdfe2__), by(year)
              * Residualize the outcome for every observation
              tempvar dep_var_resid
              gen `dep_var_resid' = dep_var - xb - pair_FE1 - pair_FE2
              * Stage 2: regress the full-sample residuals on the treatment indicator
              regress `dep_var_resid' ib0.treat, nocons
          end

          xtset unique_id year
          sort unique_id year
          bootstrap, cluster(state) idcluster(new_id) group(unique_id) reps(100): did2s_est
          I greatly appreciate your time and guidance on this. Any further suggestions are most welcome.

          Best regards,
          Meng


          • #6
            Dear Meng Zhang,

            Glad it worked; credit should go to Tom Zylkin, who showed me the trick (for ppmlhdfe).

            Best wishes,

            Joao


            • #7
              Dear Joao Santos Silva,

              Could you give me a hand, please?

              I'm analysing the effects of fiscal consolidation episodes on FDI inflows, but I run into a technical problem when I use a lagged variable. Here is my code:

              ppmlhdfe in_Flow_per_r L2.Fisc_r log_GDP_r log_POP_r res_rents_r remit_gdp_r access_elec_r gov_eff_r gdp_growth_r fin_dev_r elec_consump_r , vce(cl iso_r iso_p) absorb(iso_p#iso_r iso_p#year) nolog d keepsingletons separation(none)


              This returns the error:
              not sorted

              But I already did this:
              egen ID=group(iso_r iso_p)
              xtset ID year

              sort iso_p iso_r year


              • #8
                Dear Koko DIBLONI,

                I believe this question was answered elsewhere; please do not post questions multiple times. Anyway, are you sure you want to use the option "separation(none)"?

                Best wishes,

                Joao
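
                For readers who hit the same "not sorted" message as in #7: one possible cause, assumed here rather than confirmed in the thread, is that the sort issued after xtset leaves the data in an order other than panel-then-time, which time-series operators such as L2. require. A toy sketch with made-up data (x and lag_x are hypothetical names):

                ```stata
                clear
                input str3 iso_r str3 iso_p year x
                "AAA" "ZZZ" 2000 1
                "AAA" "ZZZ" 2001 2
                "BBB" "YYY" 2000 3
                "BBB" "YYY" 2001 4
                end
                egen ID = group(iso_r iso_p)
                xtset ID year
                * Re-sorting by another key breaks the order that xtset relies on
                sort iso_p iso_r year
                capture noisily generate lag_x = L.x
                capture drop lag_x
                * Restoring the panel-then-time order lets the lag operator work
                sort ID year
                generate lag_x = L.x
                list ID year x lag_x, sepby(ID)
                ```

                If this is indeed the cause, running sort ID year as the last step before the estimation command would be the corresponding fix.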


                • #9
                  Sorry for my mistake with the post.

                  Regarding the use of "separation(none)", I may have misunderstood its purpose. I'm still new to working with "ppmlhdfe". If you could kindly explain the issue with this option, that would be really helpful.

                  Best regards,
                  Koko


                  • #10
                    Dear Koko DIBLONI,

                    That option negates the main advantage of ppmlhdfe, which is to ensure that the estimates exist. Please check the help file.

                    Best wishes,

                    Joao
