how to create equal sample size for all the regressions

Helen Chang

Join Date: Apr 2018

Posts: 104
#1

23 Mar 2020, 21:39

Hi, I have panel data and need to run several regressions that involve different dependent and independent variables. How do I make sure that I have the same sample size across all these regression models?
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#2

23 Mar 2020, 22:00

Run each of the regressions and, after each one, create a variable that flags inclusion in that regression's estimation sample. A loop something like this:

Code:

forvalues i = 1/10 { // OR HOWEVER MANY REGRESSIONS THERE ARE
    regression `i'
    gen sample`i' = e(sample)
}

Then identify those observations that are common to all of the regressions:

Code:

egen common_sample = rowmin(sample*)

Now rerun all the regressions, restricted to the common sample:

Code:

forvalues i = 1/10 {
    regression `i' if common_sample
}
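
To make the three steps concrete, here is a minimal sketch using the auto data set that ships with Stata. The three specifications (spec1 to spec3) are hypothetical stand-ins for the placeholder regressions above, chosen only because rep78 has missing values and therefore shrinks one of the samples; they are not from the original question.

Code:

sysuse auto, clear

* Hypothetical specifications standing in for "regression 1" to "regression 3"
local spec1 price mpg
local spec2 price mpg rep78
local spec3 price weight foreign

* Step 1: run each regression and flag its estimation sample
forvalues i = 1/3 {
    quietly regress `spec`i''
    gen byte sample`i' = e(sample)
}

* Step 2: an observation is in the common sample only if every flag is 1
egen byte common_sample = rowmin(sample1 sample2 sample3)

* Step 3: rerun every regression on the common sample
forvalues i = 1/3 {
    regress `spec`i'' if common_sample
}

With -regress-, each rerun should then report the same number of observations.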

Note: if there is a nesting relationship among these regressions, so that you can predict in advance which regression has the smallest sample, and every other regression contains that sample, then you don't have to run them all the first time to identify the common sample: just identify the sample of the most deeply nested one (i.e. the one with the most predictor variables).
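
When such a nesting holds, the shortcut might look like the sketch below. It again uses the auto data set with made-up specifications; the assumption, flagged in the comments, is that the last model's variables include those of every other model.

Code:

sysuse auto, clear

* Assumption: this model contains every variable used by the other models,
* so its estimation sample is contained in all of theirs
quietly regress price mpg weight rep78
gen byte common_sample = e(sample)

* All models, including the most restrictive one, on the same observations
regress price mpg if common_sample
regress price mpg weight if common_sample
regress price mpg weight rep78 if common_sample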
Anne-Claire Jo

Join Date: Feb 2021

Posts: 162
#3

02 Jun 2025, 09:01

Hello Clyde Schechter, sorry to bring up an old post, but I have a similar question.
I am running several regressions (reghdfe) with several FE (ab, vce) on cross-country panel data.
As a robustness check, I want to restrict these regressions to their common sample.
But I am having trouble restricting the sample (i.e., I am not sure how to identify it).

For instance, I'm doing something like:

Code:

loc varlist depvar1 depvar2 depvar3 depvar4
foreach var of local varlist {
    reghdfe `var' indep1 indep2 indep3 indep4, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep2 indep3 indep4, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep3 indep4, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep3 indep4, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep4 indep5, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep4 indep5, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
}

So, depending on the regression, the sample (i.e., the number of observations) differs, and I would like to restrict all of them to the common sample.
Could #2 be implemented in a similar way? If so, I would be very grateful to understand how to apply it!
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#4

02 Jun 2025, 11:03

It would look like this:

Code:

loc varlist depvar1 depvar2 depvar3 depvar4
foreach var of local varlist {
    quietly {
        reghdfe `var' indep1 indep2 indep3 indep4, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
        gen byte sample1 = e(sample)
        reghdfe `var' indep1 indep2 indep3 indep4, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
        gen byte sample2 = e(sample)
        reghdfe `var' indep1 indep3 indep4, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
        gen byte sample3 = e(sample)
        reghdfe `var' indep1 indep3 indep4, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
        gen byte sample4 = e(sample)
        reghdfe `var' indep1 indep4 indep5, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
        gen byte sample5 = e(sample)
        reghdfe `var' indep1 indep4 indep5, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
        gen byte sample6 = e(sample)
        egen combined_sample = rowmin(sample*)
    }
    // NOTE: THE -if- QUALIFIER MUST COME BEFORE THE COMMA THAT INTRODUCES THE OPTIONS
    reghdfe `var' indep1 indep2 indep3 indep4 if combined_sample, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep2 indep3 indep4 if combined_sample, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep3 indep4 if combined_sample, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep3 indep4 if combined_sample, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep4 indep5 if combined_sample, a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)
    reghdfe `var' indep1 indep4 indep5 if combined_sample, a(i.country#i.year i.sector#i.year) vce(cluster i.country#i.sector)
    drop sample* combined_sample
}

Notes:

I assume that by common sample here you mean that you want a separate common sample for the 6 regressions with each dependent variable, not a single common sample for all 24 regressions.

In principle, this can be done more efficiently. Your first and fifth regressions, if I have read them all carefully enough, are going to be the most restrictive. This is because the first involves all the variables, other than indep5, that appear anywhere, the fifth brings indep5 into the picture (thereby providing the maximum risk for omissions due to missing variable values), and both the first and fifth use the triple interaction absorption (thereby providing the maximum risk for omissions due to singleton or empty groups). So I think anything that is in the sample for regressions 1 and 5 will be in the sample for all of them, and in theory only variables sample1 and sample5 are needed. Nevertheless, I recommend doing it as written with all 6. Unless you have a huge data set, -reghdfe- is very fast. And my "if I have read them all carefully enough" is doing a lot of work here. More to the point, from your recent series of posts, I understand that this project is a work in progress. If the regressions are subsequently changed, you would have to re-think through which are the most sample-restrictive ones and then change the code. By contrast, calculating all 6 samples as shown, if you subsequently change the regressions, the code will still work without any other modification.
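
If you are curious whether regressions 1 and 5 really do determine the common sample, a check along these lines could be placed inside the loop, after combined_sample is created and before the -drop- command. It is only a sketch: the variable name two_sample is made up for this illustration, and it relies on sample1, sample5, and combined_sample existing as in the code above.

Code:

* Intersection of just the two samples conjectured to be most restrictive
egen byte two_sample = rowmin(sample1 sample5)

* Count observations where the two-sample shortcut disagrees with all six
count if two_sample != combined_sample
display "Observations where the shortcut disagrees: " r(N)

drop two_sample

A count of zero supports the shortcut for the current specifications only; it says nothing about later revisions of the regressions.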

More generally, with an ongoing project, when it is possible to anticipate subsequent additions or modifications to the original approach, it is often a good idea to write code that will accommodate those changes gracefully. Making the computer do more work to save human work is almost always the right move. (That was not true when I was starting out and use of computers far less powerful than a smartphone cost $900 per hour and programmers were typically paid less than $100 per week.)

Anne-Claire Jo

Join Date: Feb 2021
Posts: 162

03 Jun 2025, 01:41

Thanks so much for your reply Clyde!

Originally posted by Clyde Schechter

Notes:

I assume that by common sample here you mean that you want a separate common sample for the 6 regressions with each dependent variable, not a single common sample for all 24 regressions.

I actually meant a single common sample for all the regressions!

Some additional information on my actual code: my initial code looks like this:

Code:

* First Model

foreach var of local varlist {
foreach c of local country {
        capture reghdfe `var' i.quantile labor if country == `c' [aw=weightvar], a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)

            if c(rc) == 0 {
        
                levelsof country if e(sample), clean local(countrylist)
                local n_countries = `r(r)'
                levelsof ind if e(sample)
                local n_inds = `r(r)'

                outreg2 using "reg.xls", replace ctitle(Model 1) label ///    
                addtext(Fixed effects, C-I-Y, Country List, `countrylist', Countries, `n_countries', Industries, `n_inds')
            }
            else if !inlist(c(rc), 2000, 2001) {
                display as error `"Unexpected regression error: var = `var', country = `c'"'
            }

* Second model
        capture reghdfe `var' i.quantile labor if country == `c' [aw=weightvar], a(i.country#i.sector i.country#i.year) vce(cluster i.country#i.sector)

            if c(rc) == 0 {
        
                levelsof country if e(sample), clean local(countrylist)
                local n_countries = `r(r)'
                levelsof ind if e(sample)
                local n_inds = `r(r)'

                outreg2 using "reg.xls", append ctitle(Model 2) label ///    
                addtext(Fixed effects, C-I-Y, Country List, `countrylist', Countries, `n_countries', Industries, `n_inds')
            }
            else if !inlist(c(rc), 2000, 2001) {
                display as error `"Unexpected regression error: var = `var', country = `c'"'
            }

* Third model
        capture reghdfe `var' i.quantile labor productivity if country == `c' [aw=weightvar], a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)

            if c(rc) == 0 {
        
                levelsof country if e(sample), clean local(countrylist)
                local n_countries = `r(r)'
                levelsof ind if e(sample)
                local n_inds = `r(r)'

                outreg2 using "reg.xls", append ctitle(Model 3) label ///    
                addtext(Fixed effects, C-I-Y, Country List, `countrylist', Countries, `n_countries', Industries, `n_inds')
            }
            else if !inlist(c(rc), 2000, 2001) {
                display as error `"Unexpected regression error: var = `var', country = `c'"'
            }

** .....etc (it continues)
}
}

This is part of my code, but basically it is quite repetitive, as in #3. I already have something like if e(sample) in quite a lot of the commands that follow each regression. Is it then suitable to add "gen byte sample5 = e(sample)" right after each cap reghdfe, and then re-run the same regressions as above?

Last edited by Anne-Claire Jo; 03 Jun 2025, 01:51.


Clyde Schechter

Join Date: Apr 2014
Posts: 30089

03 Jun 2025, 08:28

Well, this calls for a little more efficiency. I still think it is not a good idea to depend on my perception that models 1 and 5 are the most restrictive: I might be wrong, and even if I'm right, that could change over the course of your project. But we can improve things a little bit:

Code:

gen byte common_sample = 1
* First Model


foreach var of local varlist {
foreach c of local country {
        capture reghdfe `var' i.quantile labor if country == `c' [aw=weightvar], a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)

            if c(rc) == 0 {
        
                levelsof country if e(sample), clean local(countrylist)
                local n_countries = `r(r)'
                levelsof ind if e(sample)
                local n_inds = `r(r)'
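                // UPDATE ONLY THIS COUNTRY'S OBSERVATIONS: e(sample) IS 0 FOR EVERY OTHER COUNTRY
                replace common_sample = min(common_sample, e(sample)) if country == `c'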
                

                outreg2 using "reg.xls", replace ctitle(Model 1) label ///    
                addtext(Fixed effects, C-I-Y, Country List, `countrylist', Countries, `n_countries', Industries, `n_inds')
            }
            else if !inlist(c(rc), 2000, 2001) {
                display as error `"Unexpected regression error: var = `var', country = `c'"'
            }

* Second model
        capture reghdfe `var' i.quantile labor if country == `c' [aw=weightvar], a(i.country#i.sector i.country#i.year) vce(cluster i.country#i.sector)

            if c(rc) == 0 {
        
                levelsof country if e(sample), clean local(countrylist)
                local n_countries = `r(r)'
                levelsof ind if e(sample)
                local n_inds = `r(r)'
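                // UPDATE ONLY THIS COUNTRY'S OBSERVATIONS: e(sample) IS 0 FOR EVERY OTHER COUNTRY
                replace common_sample = min(common_sample, e(sample)) if country == `c'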
                

                outreg2 using "reg.xls", append ctitle(Model 2) label ///    
                addtext(Fixed effects, C-I-Y, Country List, `countrylist', Countries, `n_countries', Industries, `n_inds')
            }
            else if !inlist(c(rc), 2000, 2001) {
                display as error `"Unexpected regression error: var = `var', country = `c'"'
            }

* Third model
        capture reghdfe `var' i.quantile labor productivity if country == `c' [aw=weightvar], a(i.country#i.sector#i.year) vce(cluster i.country#i.sector)

            if c(rc) == 0 {
        
                levelsof country if e(sample), clean local(countrylist)
                local n_countries = `r(r)'
                levelsof ind if e(sample)
                local n_inds = `r(r)'
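                // UPDATE ONLY THIS COUNTRY'S OBSERVATIONS: e(sample) IS 0 FOR EVERY OTHER COUNTRY
                replace common_sample = min(common_sample, e(sample)) if country == `c'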
                
                outreg2 using "reg.xls", append ctitle(Model 3) label ///    
                addtext(Fixed effects, C-I-Y, Country List, `countrylist', Countries, `n_countries', Industries, `n_inds')
            }
            else if !inlist(c(rc), 2000, 2001) {
                display as error `"Unexpected regression error: var = `var', country = `c'"'
            }

** .....etc (it continues)
    }
}

keep if common_sample
//    NOW REPRODUCE THE ENTIRE REGRESSION DOUBLE LOOP HERE, EXCEPT THAT YOU CAN DELETE ALL OF THE
//    -replace common_sample = min(common_sample, e(sample))- commands, AND YOU CAN REMOVE
//    -if e(sample)- FROM ANY COMMAND THAT HAS IT.

This code just keeps track of the common sample on the fly instead of creating 24 variables to clutter up the data set. And once that common sample is identified, the code drops all observations outside it, and then you re-run a simplified version of the loops.

Good luck with doing this. Even with just modest amounts of missing data scattered around the data set in a haphazard way, I fear that the common sample will prove to be very small and the results in the common sample may not be very useful. But there's no way to know that until we see it.
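
One way to gauge this before committing is to inspect the flag just before the -keep if common_sample- line. The sketch below assumes the common_sample and country variables from the code above and uses only standard commands.

Code:

* How many observations survive in total?
count if common_sample

* How are the surviving observations distributed across countries?
tabulate country common_sample

If whole countries show no surviving observations, that is a sign the common-sample restriction may be too severe to be informative.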
