Generate residuals for all observations (outside e(sample)) using reghdfe

Meng Zhang

Join Date: May 2016

Posts: 23
#1

23 Jun 2025, 22:04

I’m seeing that when I run:
reghdfe dep_var if treatID == 0, absorb(id year) resid(resid_all)
the stored residuals (resid_all) are only non-missing for the treatID == 0 observations (i.e., those actually used in the regression). All treatID == 1 rows show missing.

Because I restrict the estimation to the control subsample, reghdfe doesn’t compute or cache FE-adjusted predictions or residuals for the excluded (treated) group. However, for my two-stage DID I need residuals (or fitted values) across the entire panel, not just the sample that was regressed on.
I cannot simply switch to regress + predict because my dataset is enormous (millions of obs) and I have dozens of high-dimensional fixed effects—regress cannot handle that many FEs efficiently.
Has anyone discovered a workaround or option—within reghdfe or via a small manual step—that allows one to generate residuals for every observation, including those outside of e(sample)? Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#2

23 Jun 2025, 22:38

"regress cannot handle that many FEs efficiently"

Yes it can, if you are using the current version of Stata. -regress- now has an -absorb()- option.

Of course, if you just do a regression on the entire sample, you will get different results from what you got on the treatID == 0 sample. Now, I suspect that the motivation for doing that subset analysis in the first place is because you believe there is, or may well be, a difference in what you are estimating between the treatID == 0 and the complementary group. So why not do

Code:

reghdfe dep_var i.treatID, absorb(id year) resid(resid_all)

That will reflect any difference across the treatID groups. Or, if what you are really after is a model in which the difference between the treatID groups is deliberately ignored (constrained to be 0), -reghdfe dep_var, absorb(id year) resid(resid_all)- will do that.
Meng Zhang

Join Date: May 2016

Posts: 23
#3

24 Jun 2025, 00:22

Hi Clyde, thanks for your reply.
I’m using Kyle Butts’s heterogeneity-robust two-stage DID estimator (did2s) on a massive panel with dozens of high-dimensional fixed effects. I’m following his “Large Datasets or Many Fixed Effects” workflow (https://github.com/kylebutts/did2s_stata):
Stage 1: Regress y on all fixed effects using only the control group (treat == 0), then predict residuals for every observation in the full sample.

Stage 2: Regress those full-sample residuals on the treatment indicator (treat).

It’s critical that I residualize exclusively on the control subsample but still obtain residuals for all units before moving to Stage 2. Since I’m on Stata 16 MP and don’t have the -absorb()- option of the built-in -regress-, I’ll try upgrading to Stata 18 and see if that resolves the issue. Any other tips or workarounds would be greatly appreciated!

Joao Santos Silva

Join Date: Apr 2014
Posts: 3022
#4

24 Jun 2025, 02:18

Dear Meng Zhang,

See if the following trick works.

Code:

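clear all
sysuse auto
* Fit on the subsample and save the estimated fixed effect as fe
reghdfe price mpg if rep78!=., a(fe=foreign)
* Linear prediction, excluding the absorbed effect
predict xb
* fe is missing outside e(sample); copy each group's value to all of its rows
qui egen double pair_FE=max(fe), by(foreign)
g yhat=xb+pair_FE
* Benchmark: the same model with an explicit dummy
reg price mpg i.foreign if rep78!=.
predict yhat1
* yhat and yhat1 should agree
su y*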
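
For the original question, the same idea gives residuals rather than fitted values. The saved fixed effect is missing outside e(sample), so egen's max() copies each group's estimate to every row of that group, and the residual is the outcome minus xb minus the copied effect. A minimal sketch on the auto data, where fe_all and resid_all are illustrative names and it is assumed (as the trick above relies on) that the xb prediction excludes the absorbed effect:

```stata
* Sketch: residuals for all rows from a reghdfe fit on a subsample
clear all
sysuse auto
reghdfe price mpg if rep78 != ., absorb(fe = foreign)  // fe is missing outside e(sample)
predict double xb, xb                                   // linear prediction without the absorbed effect
egen double fe_all = max(fe), by(foreign)               // spread each group's estimate to all its rows
generate double resid_all = price - xb - fe_all         // defined in and out of sample
count if missing(resid_all)                             // rows whose group never entered the regression
```

A group with no observation in the estimation sample (or one dropped as a singleton) has no estimated effect to copy, so its residuals stay missing.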

Best wishes,

Joao


Meng Zhang

Join Date: May 2016
Posts: 23
#5

24 Jun 2025, 20:30

Dear Joao,

Thank you for your rapid and insightful reply—it works beautifully. I also extended your approach to cases with multiple fixed effects, and it functions as expected. Below is the code I used:

Code:

use https://github.com/kylebutts/did2s_stata/raw/main/data/df_het.dta, clear
egen unique_id = group(state unit)

capture program drop did2s_est

program did2s_est, rclass
    version 13.0
    * Stage 1: fit the fixed effects on untreated observations only.
    * new_id is created by the idcluster(new_id) option of the bootstrap call below.
    reghdfe dep_var if treat == 0, absorb(new_id year, savefe)
    cap drop xb pair_FE1 pair_FE2
    predict xb
    * Saved effects are missing outside e(sample); copy them to every row of each group.
    qui egen double pair_FE1 = max(__hdfe1__), by(new_id)
    qui egen double pair_FE2 = max(__hdfe2__), by(year)
    tempvar dep_var_resid
    gen `dep_var_resid' = dep_var - xb - pair_FE1 - pair_FE2
    * Stage 2: regress the full-sample residuals on treatment.
    regress `dep_var_resid' ib0.treat, nocons
end

xtset unique_id year
sort unique_id year
bootstrap, cluster(state) idcluster(new_id) group(unique_id) reps(100): did2s_est
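
One way to check a construction like this with several absorbed effects is to compare the hand-built residuals with the ones reghdfe stores in-sample. The sketch below assumes the same df_het data; fe_unit, fe_year, r_in, r_all, and gap are illustrative names, and naming the saved effects inside absorb() stands in for savefe. It absorbs unique_id rather than new_id so that it runs outside bootstrap.

```stata
* Sketch: hand-built residuals vs. reghdfe's own, with two absorbed effects
use https://github.com/kylebutts/did2s_stata/raw/main/data/df_het.dta, clear
egen unique_id = group(state unit)
reghdfe dep_var if treat == 0, absorb(fe_unit = unique_id fe_year = year) residuals(r_in)
predict double xb, xb
egen double fe_unit_all = max(fe_unit), by(unique_id)   // unit effect on every row of the unit
egen double fe_year_all = max(fe_year), by(year)        // year effect on every row of the year
generate double r_all = dep_var - xb - fe_unit_all - fe_year_all
generate double gap = abs(r_all - r_in) if e(sample)    // in-sample discrepancy
summarize gap                                           // should be numerically negligible
count if missing(r_all)                                 // rows lacking a control-sample estimate
```

If the gap is not close to zero, or r_all is missing for treated rows, the reconstruction should not be passed on to the second stage.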

I greatly appreciate your time and guidance on this. Any further suggestions are most welcome.

Best regards,
Meng


Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#6

24 Jun 2025, 22:04

Dear Meng Zhang,

Glad it worked. Credit should go to Tom Zylkin, who showed me the trick (for ppmlhdfe).

Best wishes,

Joao
Koko DIBLONI

Join Date: Jul 2025

Posts: 12
#7

08 Jul 2025, 15:55

Dear Joao Santos Silva,

Could you give me a hand, please?

I'm analysing the effects of fiscal consolidation episodes on FDI inflows, but I run into a technical problem when I use a lagged variable. Here is my code:

ppmlhdfe in_Flow_per_r L2.Fisc_r log_GDP_r log_POP_r res_rents_r remit_gdp_r access_elec_r gov_eff_r gdp_growth_r fin_dev_r elec_consump_r , vce(cl iso_r iso_p) absorb(iso_p#iso_r iso_p#year) nolog d keepsingletons separation(none)

Stata returns:
not sorted

But I already did this:
egen ID=group(iso_r iso_p)
xtset ID year

sort iso_p iso_r year
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#8

08 Jul 2025, 23:03

Dear Koko DIBLONI,

I believe this question was answered elsewhere; please do not post questions multiple times. Anyway, are you sure you want to use the option "separation(none)"?
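
On the "not sorted" message: time-series operators such as L2. need the data in the sort order that xtset established, so re-sorting on other keys afterwards is one likely cause. A minimal sketch on the grunfeld example data (not the data in question; lag_bad and lag_ok are illustrative names):

```stata
* Sketch: re-sorting after xtset can break time-series operators
webuse grunfeld, clear
xtset company year
sort year company                                    // no longer in panel/time order
capture noisily generate double lag_bad = L2.invest  // expected to stop with "not sorted"
sort company year                                    // restore panel/time order
generate double lag_ok = L2.invest                   // lags now resolve within each company
```

In the code posted above, that would mean dropping the final sort, or replacing it with sort ID year, before calling ppmlhdfe.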

Best wishes,

Joao
Koko DIBLONI

Join Date: Jul 2025

Posts: 12
#9

09 Jul 2025, 02:52

Sorry for my mistake with the post.

Regarding the use of "separation(none)", I may have misunderstood its purpose. I'm still new to working with ppmlhdfe. If you could kindly explain the issue with this option, that would be really helpful.

Best regards,
Koko
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#10

09 Jul 2025, 11:24

Dear Koko DIBLONI,

That option negates the main advantage of ppmlhdfe, which is to ensure that the estimates exist. Please check the help file.

Best wishes,

Joao