Bootstrap? regression - Taking random group mean out of sample and regress

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2406
#16

31 May 2023, 07:34

Hi Nick, thanks for providing some more detail. I guess you can do that, but it seems to me like you might be approaching this problem backwards. Taking small samples of data and fitting models to them seems like a way to (possibly poorly) approximate the model you would get by fitting to the entire dataset. If you instead had a single model fit to the full dataset, you could then produce post-estimation predictions for any number of people with any combination of covariates after the fact, and do your post-processing with those results (e.g., take the average opinion and look at its distribution).
Nick Bertel

Join Date: Mar 2023

Posts: 27
#17

31 May 2023, 15:35

Thank you Leonardo, could you maybe explain that in simpler terms? I get that you would first run the model on every single opinion for every ID-year combination. But I'm not sure how to make a group of people after the fact. If possible, could you maybe give a small example?

Last edited by Nick Bertel; 31 May 2023, 16:11.

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2406

#18

31 May 2023, 20:11

Let's use the NLSW 88 dataset as an example.

First, we'll start by trying to apply the idea you described in #15. We'll take a small, random sample of the dataset and fit a regression model that tries to predict hourly wage based solely on college graduate status. This is an arbitrarily simple model for illustrative purposes.

The code below drops observations that are missing wage or college graduate status, then draws 100 random 5% samples of the remaining data and fits the regression to each one.

Code:

set seed 18
webuse nlsw88, clear

keep if !missing(collgrad, wage)

cap program drop samplereg
program samplereg
  syntax , pct(int)
 
  preserve
  sample `pct'
  reg wage i.collgrad
  restore
end

* Run several regressions on subsamples, then average the coefficients.
preserve
simulate _b, reps(100) nodots : samplereg, pct(5)
list in 1/5
mean _sim_2 _b_cons
restore

* Regression on the whole dataset
reg wage i.collgrad, nohead

Relevant results:

Code:

* Run several regressions on subsamples, then average the coefficients.
. mean _sim_2 _b_cons

Mean estimation                            Number of obs = 100

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
      _sim_2 |   3.419975   .1180181      3.185801    3.654148
     _b_cons |   6.975366    .058489      6.859311    7.091421
--------------------------------------------------------------

* Regression on the whole dataset
. reg wage i.collgrad, nohead
-------------------------------------------------------------------------------
         wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
     collgrad |
College grad  |   3.615502   .2753268    13.13   0.000      3.07558    4.155424
        _cons |   6.910561   .1339984    51.57   0.000     6.647788    7.173335
-------------------------------------------------------------------------------

Notice that the first part of the output shows averages of the regression coefficients from those 100 models. The mean coefficient for college graduates (-_sim_2-) and the mean constant (-_b_cons-, the expected wage for non-graduates) approximate the values from the regression on the whole dataset. As you increase the number of replications and the size of each sample, the average coefficients get closer to the "true" value (here, the value obtained by fitting one model to the entire dataset).
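If you want to see that last point for yourself, here is a rough sketch that reruns the same simulation with more replications and two different sample sizes. The choices of 500 replications and of 5% versus 25% samples are my own, purely for illustration, and I am assuming -simulate- names the saved coefficients -_sim_2- and -_b_cons- as it did in the output above.

```stata
set seed 18
webuse nlsw88, clear
keep if !missing(collgrad, wage)

* Same helper as before: regress on a random pct% subsample.
cap program drop samplereg
program samplereg
    syntax , pct(int)
    preserve
    sample `pct'
    reg wage i.collgrad
    restore
end

* Illustrative settings (assumptions): 500 replications, 5% vs 25% samples.
foreach p in 5 25 {
    preserve
    simulate _b, reps(500) nodots : samplereg, pct(`p')
    display "Subsample size: `p'% of the data"
    * The SD column shows how much the coefficients vary across replications.
    summarize _sim_2 _b_cons
    restore
}
```

I would expect the means to sit close to the full-data coefficients in both runs, with a smaller standard deviation across replications for the larger samples.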

------------------

Switching gears now. What if we just start with a regression model fit to the entire dataset?
I'll fit a model to predict wage, this time based on age, marital status, and college graduate status.

Code:

. webuse nlsw88, clear
. reg wage age i.collgrad i.married, nohead
-------------------------------------------------------------------------------
         wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
          age |  -.0656183   .0382239    -1.72   0.086    -.1405762    .0093397
              |
     collgrad |
College grad  |   3.615111   .2750165    13.15   0.000     3.075797    4.154424
              |
      married |
     Married  |  -.5125942   .2439234    -2.10   0.036    -.9909336   -.0342548
        _cons |   9.808917   1.513613     6.48   0.000     6.840687    12.77715
-------------------------------------------------------------------------------

. est store MyModel

We can use this model to predict the wage of people in the dataset (in-sample prediction).
Here I use -predict- to compute the model-based mean wage for every individual.

Code:

. predict pred_wage, xb

. list age collgrad married wage pred_wage in 1/5, nolabel

     +-------------------------------------------------+
     | age   collgrad   married       wage   pred_wage |
     |-------------------------------------------------|
  1. |  37          0         0   11.73913   7.3810416 |
  2. |  37          0         0   6.400963   7.3810416 |
  3. |  42          0         0   5.016723   7.0529503 |
  4. |  43          1         1   9.033813   10.089849 |
  5. |  42          0         1   8.083731   6.5403561 |
     +-------------------------------------------------+

But you're not limited to making predictions on observed individuals. You can use out-of-sample or hypothetical data with the existing coefficients to make new predictions. For example:

Code:

mkf New
cwf New
input int age byte(collgrad married)
39 0 1
40 1 1
44 1 0
end

est restore MyModel
predict pred_wage, xb
list

Results

Code:

. list

     +--------------------------------------+
     | age   collgrad   married   pred_wage |
     |--------------------------------------|
  1. |  39          0         1   6.7372109 |
  2. |  40          1         1   10.286703 |
  3. |  44          1         0   10.536824 |
     +--------------------------------------+
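To make it clear that these predictions are nothing more than the stored coefficients applied to new covariate values, here is a small sketch that rebuilds the first row of the table above by hand (age 39, not a college graduate, married). It refits the same model so that it runs on its own.

```stata
webuse nlsw88, clear
quietly reg wage age i.collgrad i.married

* Linear prediction by hand for: age = 39, collgrad = 0, married = 1
display _b[_cons] + _b[age]*39 + _b[1.collgrad]*0 + _b[1.married]*1
```

This should come out at roughly 6.74, matching the -pred_wage- value that -predict- gave for that row.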

So I think in your case, you could work out whatever groups of ID-Year you like, predict the response, take the average, and store the result. Do this several times to get the distribution of results you were looking for.
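Here is a minimal sketch of that loop, sticking with the NLSW 88 data since I don't have yours. Everything specific in it is an assumption for illustration: the program name -groupmean-, groups formed as simple random draws of 20 people, and 200 repetitions. You would swap in your own model and your own rule for forming ID-Year groups.

```stata
set seed 18
webuse nlsw88, clear

* Fit one model to the full dataset and predict for everyone once.
quietly reg wage age i.collgrad i.married
predict pred_wage, xb

* Hypothetical helper: pick n people at random and return the mean
* of their predicted wage. Sorting on a random number avoids dropping data.
cap program drop groupmean
program groupmean, rclass
    syntax , n(integer)
    tempvar u
    generate double `u' = runiform()
    sort `u'
    summarize pred_wage in 1/`n', meanonly
    return scalar avg = r(mean)
end

* Repeat 200 times; the data in memory become one row per random group.
simulate avg = r(avg), reps(200) nodots : groupmean, n(20)
summarize avg
```

After -simulate- finishes, the variable -avg- holds the distribution of group averages, so you can follow up with -histogram avg- or whatever summary you need.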


Nick Bertel

Join Date: Mar 2023

Posts: 27
#19

31 May 2023, 21:52

Thank you for your great effort! Will definitely take this into consideration!