Bootstrapping standard errors for two-stage program combining cross-sectional and panel data

Christian Rhind

Join Date: Oct 2016
Posts: 6

09 Dec 2022, 17:09

I have two samples from different populations from which I am conducting a two-stage estimation procedure. These samples share two common variables but have different structures: sample 1 is cross-sectional, sample 2 is a panel from which I am conducting my main analysis.

Sample 1 contains a variable of interest that is not available in sample 2. My solution is to impute values for the variable missing in sample 2 using estimates obtained from sample 1 for the variables common to both samples. For example, suppose sample 1 contains variables X Y Z and sample 2 contains W Y Z. My proposed procedure is thus:

1. From sample 1 regress X on Y and Z using OLS to obtain the marginal effects of Y and Z on X.
2. Use the marginal effects obtained from step 1 to generate predicted values X_hat in sample 2.
3. From sample 2 regress W on X_hat using fixed effects.

Since X_hat is a generated regressor I am attempting to bootstrap the standard errors for the estimates obtained from step 3.
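
To make the notation concrete, the three steps can be written as below. The coefficient symbols a, b, c, the unit fixed effect \alpha_i and the error terms e and u are my own labels, used only for illustration; j indexes observations in sample 1, and (i, t) indexes unit and period in sample 2.

```latex
\begin{align}
\text{Step 1 (sample 1):}\quad & X_j = a + b\,Y_j + c\,Z_j + e_j \\
\text{Step 2 (sample 2):}\quad & \hat{X}_{it} = \hat{a} + \hat{b}\,Y_{it} + \hat{c}\,Z_{it} \\
\text{Step 3 (sample 2):}\quad & W_{it} = \alpha_i + \beta\,\hat{X}_{it} + u_{it}
\end{align}
```

Because \hat{a}, \hat{b} and \hat{c} are themselves estimates, the conventional standard error for \beta in step 3 ignores their sampling variability, which is what the bootstrap is meant to account for.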

To illustrate I have constructed the following 2 samples from Stata's 'auto' dataset.

Sample 1: X = weight, Y = mpg, Z = headroom

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(weight mpg) float headroom
4330 14   4
3900 14 3.5
4290 21   3
2110 29 2.5
3690 16   4
3180 22 3.5
3220 22   2
2750 24   2
3430 19 3.5
2120 30   2
3600 18   4
3600 16   4
3740 17 4.5
1800 28 1.5
2650 21   2
4840 12 3.5
4720 12 2.5
3830 14 3.5
2580 22   3
4060 14 3.5
3720 15 3.5
3370 18   3
4130 14   3
2830 20 3.5
4060 21   4
3310 19   2
3300 19 4.5
3690 18   4
3370 19 4.5
2730 24   2
4030 16 3.5
3260 28   2
1800 34 2.5
2200 25   4
2520 26 1.5
3330 18   5
3700 18   4
3470 18 1.5
3210 19   2
3200 19 3.5
3420 19 3.5
2690 24   2
2830 17   3
2070 23 2.5
2650 25 2.5
2370 23 1.5
2020 35   2
2280 24 2.5
2750 21 2.5
2130 21 2.5
2240 25   3
1760 28 2.5
1980 30 3.5
3420 14 3.5
1830 26   3
2050 35 2.5
2410 18 2.5
2200 31   3
2670 18   2
2160 23 2.5
2040 41   3
1930 25   3
1990 25   2
3170 17 2.5
end

Sample 2: W = price, Y = mpg, Z = headroom

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id t price mpg headroom)
 1 1      4099        22       2.5
 1 2 4194.3037 26.966696  3.341133
 2 1      4749        17         3
 2 2  4754.272 16.204895  3.812516
 3 1      3799        22         3
 3 2  3813.818  24.62638  3.474947
 4 1      4816        20       4.5
 4 2 4755.9487  17.92939    4.2269
 5 1      7827        15         4
 5 2  7899.692 18.484156 4.0665107
 6 1      5788        18         4
 6 2   5788.18  14.68257 3.4874845
 7 1      4453        26         3
 7 2  4367.883 26.290855 2.0571663
 8 1      5189        20         2
 8 2  5236.714  17.27138  2.454724
 9 1     10372        16       3.5
 9 2   10386.5 16.689259  2.510926
10 1      4082        19       3.5
10 2 4024.6116 14.151664  4.370308
end

My proposed bootstrapping procedure is then:

Code:

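use sample1.dta, clear
capture program drop example
program define example, eclass

    * Step 1: OLS on the cross-section currently in memory
    qui reg weight mpg headroom

    * Step 2: load the panel and predict from the step-1 coefficients
    use sample2.dta, clear
    capture drop weight_hat
    predict weight_hat

    * Step 3: fixed-effects regression on the panel
    xtset newid t
    xtreg price weight_hat, fe

    exit
end
xtset, clear
bootstrap, reps(10) seed(1) cluster(id) idcluster(newid): example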

When I run this I get the error:

variable id not found
(error in option cluster())

This is presumably because sample 1 does not contain the variable 'id.'

I am not sure how to resolve this issue, or whether there is a better way of doing it. Any help would be greatly appreciated.

FernandoRios

Join Date: Apr 2014

Posts: 2464
#2

12 Dec 2022, 05:58

Hi Christian
I'm not sure you need to do a bootstrap here (although it's certainly an option).
So I have a few suggestions that may help you address your problem.
1) Pool the data together, with an indicator of "sample".
This will allow you to "easily" apply the bootstrap or other imputation methods without having to jump from one sample to the other.
2) If doing the bootstrap:
You will have to do the bootstrap using the sample indicator as strata, and you may need to create an ID and TIME for the cross-section data. It doesn't matter how you do this, as long as your IDs identify each cross-section observation independently and differ from the IDs in your panel.
The year should also fall within the panel data years.
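
A rough sketch of how the pooled file could be built, assuming the two files are named sample1.dta and sample2.dta as in your post. The indicator name sample_dummy matches the one used in the code further down this reply. The 1000 offset for the cross-section IDs and the choice t = 1 are arbitrary assumptions: they only need to avoid clashing with the panel IDs and to stay inside the panel periods.

```stata
* Sketch only: the offset 1000 and t = 1 are illustrative assumptions
use sample1.dta, clear
generate byte sample_dummy = 0      // 0 = cross-section
generate id = 1000 + _n             // assumes all panel ids are below 1000
generate t  = 1                     // any period that exists in the panel

append using sample2.dta
replace sample_dummy = 1 if missing(sample_dummy)   // 1 = panel

isid id t                           // stops with an error if id-t pairs are not unique
save sample_pool.dta, replace
```

After this, weight is missing for the panel rows and price is missing for the cross-section rows, which is what lets one command work on either sample through an if condition.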

Under this assumption, you could do the imputation as follows

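use sample_pool, clear
* assume sample_dummy to be the indicator (0 = cross-section, 1 = panel)
* assume the data are already xtset
* the first stage
reg x1 z1 z2 z3 if sample_dummy==0
capture drop x1_hat
predict x1_hat
* your second step (use re instead of fe for random effects)
xtreg y x1_hat z1 z2 if sample_dummy==1, fe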

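And of course, bootstrap it all.
Just remember, the first stage needs at least one variable that is left out of the second stage (z3 here). Otherwise x1_hat is an exact linear combination of the second-stage regressors and you will have a problem of perfect multicollinearity (if, for example, I added z1 z2 z3 to the second-stage model).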
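
For the auto-based example in your post, a wrapper along these lines might work. It is a sketch, not something I have run. It assumes a pooled file sample_pool.dta that contains sample_dummy (0 = cross-section, 1 = panel), id and t; the program name twostage and reps(200) are made up for illustration. The nodrop option is there because bootstrap otherwise drops observations outside e(sample) before resampling, which here would be the whole cross-section.

```stata
capture program drop twostage
program define twostage, eclass
    * first stage on the resampled cross-section rows
    quietly regress weight mpg headroom if sample_dummy == 0
    capture drop weight_hat
    quietly predict double weight_hat, xb

    * second stage on the resampled panel rows
    xtset newid t
    xtreg price weight_hat if sample_dummy == 1, fe
end

use sample_pool.dta, clear
generate newid = id     // so that newid also exists for the run on the original data
bootstrap _b, reps(200) seed(1) strata(sample_dummy) ///
    cluster(id) idcluster(newid) nodrop: twostage
```

Resampling whole id clusters within each stratum keeps the two periods of a panel unit together and keeps the size of each sample fixed across replications.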

3) Another option. Something else you could do is to just apply multiple imputation. Look into "help mi". This will create multiple imputed values for x1 in your panel data, which you can then use for analysis.
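
A minimal sketch of the mi route with the variables from your example, assuming a pooled file sample_pool.dta in which weight is observed for the cross-section rows and missing for the panel rows. The number of imputations, add(20), and the seed are arbitrary choices. Note that the imputation model cannot include price, since price is not observed in the cross-section, so treat this as a starting point rather than a finished analysis.

```stata
use sample_pool.dta, clear
mi set mlong
mi register imputed weight

* impute weight for the panel rows from mpg and headroom
mi impute regress weight mpg headroom, add(20) rseed(1)

* declare the panel structure to mi, then combine the FE estimates across imputations
mi xtset id t
mi estimate: xtreg price weight, fe
```

Here mi estimate combines the results across imputations using Rubin's rules, so the reported standard errors reflect the uncertainty in the imputed weight.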

Hope this helps
Fernando