Generate residuals for all observations (outside e(sample)) using reghdfe

Meng Zhang

Join Date: May 2016

Posts: 21
#1


23 Jun 2025, 22:04

I’m seeing that when I run:
reghdfe dep_var if treatID == 0, absorb(id year) resid(resid_all)
the stored residuals (resid_all) are only non-missing for the treatID == 0 observations (i.e., those actually used in the regression). All treatID == 1 rows show missing.

Because I restrict the estimation to the control subsample, reghdfe doesn’t compute or cache FE-adjusted predictions or residuals for the excluded (treated) group. However, for my two-stage DID I need residuals (or fitted values) across the entire panel, not just the sample that was regressed on.
I cannot simply switch to regress + predict because my dataset is enormous (millions of obs) and I have dozens of high-dimensional fixed effects—regress cannot handle that many FEs efficiently.
Has anyone discovered a workaround or option—within reghdfe or via a small manual step—that allows one to generate residuals for every observation, including those outside of e(sample)? Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#2

23 Jun 2025, 22:38

“regress cannot handle that many FEs efficiently”

Yes it can, if you are using the current version of Stata. -regress- now has an -absorb()- option.

Of course, if you just do a regression on the entire sample, you will get different results from what you got on the treatID == 0 sample. Now, I suspect that the motivation for doing that subset analysis in the first place is because you believe there is, or may well be, a difference in what you are estimating between the treatID == 0 and the complementary group. So why not do

Code:

reghdfe dep_var i.treatID, absorb(id year) resid(resid_all)

That will reflect any difference across the treatID groups. Or, if what you are really after is a model in which the difference between the treatID groups is deliberately ignored (constrained to be 0), -reghdfe dep_var, absorb(id year) resid(resid_all)- will do that.
Meng Zhang

Join Date: May 2016

Posts: 21
#3

Yesterday, 00:22

Hi Clyde, thanks for your reply.
I’m using Kyle Butts’s heterogeneity-robust two-stage DID estimator (did2s) on a massive panel with dozens of high-dimensional fixed effects. I’m following his “Large Datasets or Many Fixed Effects” workflow (https://github.com/kylebutts/did2s_stata):
Stage 1: Regress y on all fixed effects using only the control group (treat == 0), then predict residuals for every observation in the full sample.

Stage 2: Regress those full-sample residuals on the treatment indicator (treat).

It’s critical that I residualize exclusively on the control subsample but still obtain residuals for all units before moving to Stage 2. Since I’m on Stata 16 MP and don’t have the built-in regress, absorb() option, I’ll try upgrading to Stata 18 and see if that resolves the issue—any other tips or workarounds would be greatly appreciated!

Joao Santos Silva

Join Date: Apr 2014
Posts: 3010

Yesterday, 02:18

Dear Meng Zhang,

See if the following trick works.

Code:

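clear all
sysuse auto
* Estimate on a subsample and save the estimated fixed effect as fe
reghdfe price mpg if rep78!=., a(fe=foreign)
* Linear prediction without the fixed effect; defined for all observations
predict xb
* Copy each group's estimated FE to observations outside e(sample)
qui egen double pair_FE=max(fe), by(foreign)
g yhat=xb+pair_FE
* Check against the equivalent dummy-variable regression
reg price mpg i.foreign if rep78!=.
predict yhat1
su y*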

Best wishes,

Joao
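
A possible extension of the trick above, from fitted values to residuals, is sketched here on the same auto data. The names r_in, fe_all, r_all and diff are made up for illustration; residuals() is reghdfe's own option for in-sample residuals and is used only as a cross-check.

```stata
clear all
sysuse auto

* Estimate on a subsample, saving the FE estimates and reghdfe's own residuals
reghdfe price mpg if rep78 != ., absorb(fe = foreign) residuals(r_in)

* Linear prediction without the FE; available for every observation
predict double xb, xb

* Copy each group's estimated FE to the rows left out of the estimation
egen double fe_all = max(fe), by(foreign)

* Residuals for the full data set, including rows outside e(sample)
generate double r_all = price - xb - fe_all

* Cross-check: in-sample, r_all should agree with r_in up to rounding
generate double diff = abs(r_all - r_in) if e(sample)
summarize diff

* Rows that reghdfe left missing but that now have a residual
count if missing(r_in) & !missing(r_all)
```

If the reconstruction is right, diff should be zero up to rounding, and the final count reports the observations that were excluded from the estimation but now have a residual.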


Meng Zhang

Join Date: May 2016
Posts: 21

Yesterday, 20:30


Dear Joao,

Thank you for your rapid and insightful reply—it works beautifully. I also extended your approach to cases with multiple fixed effects, and it functions as expected. Below is the code I used:

Code:

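use https://github.com/kylebutts/did2s_stata/raw/main/data/df_het.dta, clear
egen unique_id = group(state unit)

capture program drop did2s_est

program did2s_est, rclass
    version 13.0
    * Stage 1: estimate the fixed effects on untreated observations only.
    * new_id is created by the idcluster() option of the bootstrap call below.
    reghdfe dep_var if treat == 0, absorb(new_id year, savefe)
    * Drop leftovers one at a time so a missing variable does not block the rest
    capture drop xb
    capture drop pair_FE1
    capture drop pair_FE2
    predict double xb
    * Spread each estimated FE to every observation in the same group
    qui egen double pair_FE1 = max(__hdfe1__), by(new_id)
    qui egen double pair_FE2 = max(__hdfe2__), by(year)
    * Residuals for the full sample, treated observations included
    tempvar dep_var_resid
    gen double `dep_var_resid' = dep_var - xb - pair_FE1 - pair_FE2
    * Stage 2: regress the residuals on the treatment indicator
    regress `dep_var_resid' ib0.treat, nocons
end

xtset unique_id year
sort unique_id year
bootstrap, cluster(state) idcluster(new_id) group(unique_id) reps(100): did2s_est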

I greatly appreciate your time and guidance on this. Any further suggestions are most welcome.

Best regards,
Meng
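
One caveat with this approach: a unit or year with no untreated observations has no first-stage FE estimate, so egen's max() returns missing there and those rows drop out of the second stage. A rough check, run once on the full data outside the bootstrap, might look like the sketch below. It assumes unique_id is the intended unit identifier; fe_unit, fe_year, fe_unit_all, fe_year_all, r_in, r_all and diff are illustrative names.

```stata
use https://github.com/kylebutts/did2s_stata/raw/main/data/df_het.dta, clear
egen unique_id = group(state unit)

* First stage on untreated observations, with named FE variables
reghdfe dep_var if treat == 0, absorb(fe_unit = unique_id fe_year = year) residuals(r_in)
predict double xb, xb

* Spread the FE estimates across the whole panel
egen double fe_unit_all = max(fe_unit), by(unique_id)
egen double fe_year_all = max(fe_year), by(year)
generate double r_all = dep_var - xb - fe_unit_all - fe_year_all

* Rows that still lack a residual (FE level never seen among the controls)
count if missing(r_all)

* In-sample agreement with reghdfe's own residuals
generate double diff = abs(r_all - r_in) if e(sample)
summarize diff
```

A nonzero count means some observations will be silently excluded from Stage 2, which is worth knowing before bootstrapping.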


Joao Santos Silva

Join Date: Apr 2014

Posts: 3010
#6

Yesterday, 22:04

Dear Meng Zhang,

Glad it worked; credit should go to Tom Zylkin, who showed me the trick (for ppmlhdfe).

Best wishes,

Joao