  • Generate residuals for all observations (outside e(sample)) using reghdfe

    I’m seeing that when I run:
    reghdfe dep_var if treatID == 0, absorb(id year) resid(resid_all)
    the stored residuals (resid_all) are only non-missing for the treatID == 0 observations (i.e., those actually used in the regression). All treatID == 1 rows show missing.

    Because I restrict the estimation to the control subsample, reghdfe doesn’t compute or cache FE-adjusted predictions or residuals for the excluded (treated) group. However, for my two-stage DID I need residuals (or fitted values) across the entire panel, not just the sample that was regressed on.
    I cannot simply switch to regress + predict because my dataset is enormous (millions of obs) and I have dozens of high-dimensional fixed effects—regress cannot handle that many FEs efficiently.
    Has anyone discovered a workaround or option—within reghdfe or via a small manual step—that allows one to generate residuals for every observation, including those outside of e(sample)? Thanks!

  • #2
    “regress cannot handle that many FEs efficiently”
    Yes it can, if you are using the current version of Stata. -regress- now has an -absorb()- option.

    Of course, if you just do a regression on the entire sample, you will get different results from what you got on the treatID == 0 sample. Now, I suspect that the motivation for doing that subset analysis in the first place is that you believe there is, or may well be, a difference in what you are estimating between the treatID == 0 group and the complementary group. So why not do
    Code:
    reghdfe dep_var i.treatID, absorb(id year) resid(resid_all)
    That will reflect any difference across the treatID groups. Or, if what you are really after is a model in which the difference between the treatID groups is deliberately ignored (constrained to be 0), -reghdfe dep_var, absorb(id year) resid(resid_all)- will do that.

    • #3
      Hi Clyde, thanks for your reply.
      I’m using Kyle Butts’s heterogeneity-robust two-stage DID estimator (did2s) on a massive panel with dozens of high-dimensional fixed effects. I’m following his “Large Datasets or Many Fixed Effects” workflow (https://github.com/kylebutts/did2s_stata):
      1. Stage 1: Regress y on all fixed effects using only the control observations (treat == 0), then predict residuals for every observation in the full sample.
      2. Stage 2: Regress those full-sample residuals on the treatment indicator (treat).
      It’s critical that I residualize exclusively on the control subsample but still obtain residuals for all units before moving to Stage 2. I’m on Stata 16 MP, which doesn’t have the built-in absorb() option for regress, so I’ll try upgrading to Stata 18 and see if that resolves the issue. Any other tips or workarounds would be greatly appreciated!
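
      For reference, the two stages described above can be written out as follows. The notation (\mu_i, \lambda_t, \tau, D_{it}) is illustrative and not taken from the did2s documentation, and the sketch assumes just one unit effect and one time effect.

      ```latex
      \begin{align*}
      \text{Model:}\quad & y_{it} = \mu_i + \lambda_t + \tau D_{it} + \varepsilon_{it} \\
      \text{Stage 1:}\quad & (\hat\mu, \hat\lambda) = \arg\min_{\mu,\lambda}
          \sum_{(i,t):\, D_{it}=0} \left( y_{it} - \mu_i - \lambda_t \right)^2 \\
      \text{Residualize:}\quad & \tilde y_{it} = y_{it} - \hat\mu_i - \hat\lambda_t
          \quad \text{for all } (i,t) \\
      \text{Stage 2:}\quad & \tilde y_{it} = \tau D_{it} + u_{it}
      \end{align*}
      ```

      The residualizing step needs \hat\mu_i and \hat\lambda_t on treated rows as well, which is why residuals stored only for e(sample) are not enough.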

      • #4
        Dear Meng Zhang,

        See if the following trick works.

        Code:
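        clear all
        sysuse auto
        * Fit on the rep78!=. subsample only; save the foreign fixed effect as fe
        reghdfe price mpg if rep78!=., a(fe=foreign)
        * Linear prediction, not including the absorbed fixed effect
        predict xb
        * fe is missing outside e(sample); copy each group's estimate to all of its rows
        qui egen double pair_FE=max(fe), by(foreign)
        g yhat=xb+pair_FE
        * Benchmark: the same model by OLS with a foreign dummy
        reg price mpg i.foreign if rep78!=.
        predict yhat1
        su y*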
        Best wishes,

        Joao
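
        A possible adaptation of the trick above to the original question (residuals rather than fitted values), as a minimal sketch. It assumes reghdfe is installed from SSC, and the names xb_all, fe_all, and resid_all are made up for illustration. A group that never appears in the estimation sample would have no estimated fixed effect, so its residuals would stay missing; the final count is there to flag that case.

        ```stata
        * Sketch: residuals for every row from a model fit on a subsample
        sysuse auto, clear
        reghdfe price mpg if rep78!=., a(fe=foreign)
        * Linear prediction for all rows, without the absorbed effect
        predict xb_all
        * fe is missing outside e(sample); spread each group's estimate
        egen double fe_all = max(fe), by(foreign)
        generate double resid_all = price - xb_all - fe_all
        * Rows whose group never entered the estimation sample stay missing
        count if missing(resid_all)
        ```

        With several absorbed effects, the same spreading step would be repeated once per saved effect, each by() its own grouping variable.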

        • #5
          Dear Joao,

          Thank you for your rapid and insightful reply—it works beautifully. I also extended your approach to cases with multiple fixed effects, and it functions as expected. Below is the code I used:

          Code:
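          * Example data from the did2s repository
          use https://github.com/kylebutts/did2s_stata/raw/main/data/df_het.dta, clear
          egen unique_id = group(state unit)

          capture program drop did2s_est

          program did2s_est, rclass
              version 13.0
              * Stage 1: fixed effects estimated on untreated observations only.
              * new_id is created by the idcluster() option of bootstrap below.
              reghdfe dep_var if treat == 0, absorb(new_id year, savefe)
              cap drop xb pair_FE1 pair_FE2
              predict xb
              * Saved effects are missing outside e(sample); copy them within each group
              qui egen double pair_FE1=max(__hdfe1__), by(new_id)
              qui egen double pair_FE2=max(__hdfe2__), by(year)
              tempvar dep_var_resid
              gen double `dep_var_resid' = dep_var - xb - pair_FE1 - pair_FE2
              * Stage 2: full-sample residuals on the treatment indicator
              regress `dep_var_resid' ib0.treat, nocons
          end

          xtset unique_id year
          sort unique_id year
          bootstrap, cluster(state) idcluster(new_id) group(unique_id) reps(100): did2s_est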
          I greatly appreciate your time and guidance on this. Any further suggestions are most welcome.

          Best regards,
          Meng

          • #6
            Dear Meng Zhang,

            Glad it worked. Credit should go to Tom Zylkin, who showed me the trick (for ppmlhdfe).

            Best wishes,

            Joao
