  • Bootstrap? regression - Taking random group mean out of sample and regress

    Dear all,

    I have opinions of different people (person) on different subjects (id) for different years (year). I want to take 3 random people for every subject and year, take the mean of their (value), and use that mean value as the DV in a regression with yearly fixed effects. I want to do that 1000 times with random groups of 3 each time and then somehow display the average regression.

    So basically doing this 1000 times:
    xtset year
    xtreg meanvaluefrom3randompersonsperyear x1 x2 x3, fe robust

    I am new to Stata and know how to label random observations by year and subject, as a kind person explained that to me here a few weeks ago. But there must be a better way than doing that 1000 times by hand. I am also not sure how to combine the results of the 1000 regressions.
    Here's the data. It is reconstructed, and the real data are obviously much more complicated. x1, x2, x3 are characteristics of the subjects per year.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(id person) int year float(value x1 x2) byte x3
    1  1 2019   3 1.5   2  3
    1  2 2019   2 1.5   2  3
    1  3 2019   3 1.5   2  3
    1  4 2018   2   2 2.5  3
    1  5 2018   2   2 2.5  3
    1  6 2018   3   2 2.5  3
    1  7 2018   4   2 2.5  3
    1  8 2018   2   2 2.5  3
    1  9 2018   3   2 2.5  3
    2 10 2019   6  10   8 12
    2 11 2019   8  10   8 12
    2 12 2019   9  10   8 12
    2 13 2019   8  10   8 12
    2 14 2019   8  10   8 12
    2 15 2018   3   8   7 10
    2 16 2018   2   8   7 10
    2 17 2018 2.5   8   7 10
    2 18 2018   3   8   7 10
    2 19 2018   2   8   7 10
    2 20 2018   2   8   7 10
    2 21 2018 1.5   8   7 10
    2 22 2018   2   8   7 10
    3 23 2020   5  15   3  7
    3 24 2020   4  15   3  7
    3 25 2020   5  15   3  7
    3 26 2020   4  15   3  7
    3 27 2019   6  12   2  7
    3 28 2019   6  12   2  7
    3 29 2019   7  12   2  7
    4 30 2017   3  12   3  8
    4 31 2017   6  12   3  8
    4 32 2017   3  12   3  8
    4 33 2017   4  12   3  8
    4 34 2017   6  12   3  8
    end
    Last edited by Nick Bertel; 27 May 2023, 15:36.

  • #2
    Well, there are some problems with what you are trying to do.

    First, with only four years, it makes no sense to use -robust- in your -xtreg, fe-. -robust- is taken by Stata to mean -vce(cluster year)- and with only four clusters, these standard errors are not valid.

    Second, you are setting the dependent variable of your regression to be a constant within each year: the mean of three randomly selected values of value. But with a dependent variable that is constant within each "panel" (year), the coefficient of every regressor is always zero, so there is no point in doing the regressions. The only thing that changes from one replication to another is the constant term.
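
    As a quick illustration of that point (a sketch only; dv_const below is just any variable that is constant within year and is not part of the proposed solution):
    Code:
    bysort year: egen double dv_const = mean(value)   // constant within year
    xtset year
    * the within transformation wipes out dv_const, so every slope is zero
    xtreg dv_const x1 x2 x3, fe
    drop dv_const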

    The following code does what you ask for (except for the robust standard errors), but as you can see, the results are completely useless.
    Code:
    capture program drop one_iteration
    program define one_iteration, eclass
        // put the observations within each year in random order
        gen double shuffle = runiform()
        // dv = mean of value over the 3 randomly chosen people in each year
        by year (shuffle), sort: egen dv = mean(cond(_n <= 3, value, .))
        list shuffle dv in 1/10   // debugging peek at the data; can be removed
        xtset year
        xtreg dv x1 x2 x3, fe
        drop dv shuffle
        exit
    end
    
    isid year person, sort
    
    tempfile results
    set seed 1234
    simulate _b _se, reps(1000) dots(100) saving(`results'): one_iteration
    
    use `results', clear
    
    browse
    If I have misunderstood what you are looking for, do post back and explain.


    • #3
      Hi Clyde! Thank you so much!

      I guess generating a new dataset to show my problem wasn't the best idea. It does seem to work for me, as my dataset has over 100k observations across more than 40 years, but I have some additional questions:

      1) Could you explain the command "list shuffle dv in 1/10" and what it means?
      2) The command "isid year person, sort" doesn't work in my real sample, as persons give multiple opinions in multiple years about multiple IDs. But I just left it out, since it does not seem to have a purpose?
      3) I think it should be "by id year (shuffle), sort: egen dv = mean(cond(_n <= 3, value, .))", since I want the mean for every combination of id and year, right?


      4) I would also like to winsorize dv, have the standard deviation of "value" within the group that builds the mean, and also winsorize that standard deviation. I would do it like this, right?

      Code:
      capture program drop one_iteration
      program define one_iteration, eclass
          gen double shuffle = runiform()
          by id year (shuffle), sort: egen dv = mean(cond(_n <= 3, value, .))
          by id year (shuffle), sort: egen stdev = sd(cond(_n <= 3, value, .))
          * -winsor- is community-contributed (ssc install winsor)
          winsor dv, gen(dv1) p(0.01)
          winsor stdev, gen(stdev1) p(0.01)
          list shuffle dv stdev in 1/10
          xtset id year
          xtreg dv1 x1 stdev1 x2 x3, fe
          drop dv stdev dv1 stdev1 shuffle
          exit
      end
      5) Is there a way to get the p-values for the regression as well as the R2-within? Also a way to display the results like I would get when just doing a normal xtreg regression?
      6) When I want to use x1 in combination with dv, for example, I could just use, within the "program" command,

      Code:
       gen xyz=dv-x1
      xtreg xyz x2 x3 stdev1, fe
      for example, and it should work, right? Also, x1 is the same for every id-year combination. Sorry for my many questions, and thanks again!
      Last edited by Nick Bertel; 27 May 2023, 19:14.


      • #4
        1) Could you explain the command "list shuffle dv in 1/10" and what it means?
        When I was developing the code, I wanted to make sure that the random selection was working as I intended. So this enabled me to take a peek at the data. I intended to remove that, and other development/debugging commands, but apparently overlooked that one. It does not contribute to the actual solution and you can remove it.

        2) The command "isid year person, sort" doesn't work in my real sample, as persons give multiple opinions in multiple years about multiple IDs. But I just left it out, since it does not seem to have a purpose?
        It seems to me it does have a purpose and that your problem is ill-posed if you have multiple opinions in multiple years about multiple IDs. You said you wanted your dv to be the mean of three randomly selected people's responses in each year. If a person can have multiple responses in a year, then which of that person's responses should be used in calculating that mean of 3 people? I guess this reflects that I simply don't understand your problem the way you intended it. If you are getting good results with the changes you made, that may just mean that I have not grasped what you meant in your original ask. But I do urge you to carefully check a subset of your results by hand to make sure that they are really not wrong, not just not obviously wrong.

        3) I think it should be "by id year (shuffle), sort: egen dv = mean(cond(_n <= 3, value, .))", since I want the mean for every combination of id and year, right?
        Again, I don't believe I have understood your problem correctly. If you said you want the mean from every combination of id and year in #1, I missed it.

        4) I would also like to winsorize dv, have the standard deviation of "value" within the group that builds the mean, and also winsorize that standard deviation. I would do it like this, right?
        I suppose so.

        5) Is there a way to get the p-values for the regression as well as the R2-within? Also a way to display the results like I would get when just doing a normal xtreg regression?
        Code:
        simulate _b _se e(r2_w) e(p), reps(1000) dots(100) saving(`results'): one_iteration
        The header output from -simulate- will inform you what variable names in the results correspond to within-r2 and the p-value.

        6) When I want to use x1 in combination with dv, for example, I could just use, within the "program" command,

        Code:
        gen xyz=dv-x1
        xtreg xyz x2 x3 stdev1, fe
        What you propose is syntactically legal. Whether it is a sensible analysis to run, I can't say. And as I don't understand what you mean by "use x1 in combination with dv," I also can't say whether it implements your intent or not.



        • #5
          Thanks again!
          Persons can have different opinions on different IDs for different years, but one person never has multiple opinions about the same ID and year. So with my modifications I get good results.


          Regarding the p-values: I would like to have the p-values for every variable, but with your code I only get one p-value, which is 0 for every regression.

          Regarding the output: How would I go about combining the different regressions into one regression in Stata, like I would get when doing a bootstrap regression like so:
          Code:
          bootstrap, reps(1000) size(3) strata(id year) : regress dv x1 x2 x3

          Thank you so much! You have already helped me a lot!


          • #6
            The p-values you get from the code in #5 are the model p-values, those associated with the F-statistic for the entire regression. You didn't specify whether you wanted that or the p-values for the individual coefficients, so I guessed what you meant. Evidently, I guessed wrong. For each of the simulated regressions, you can calculate the p-values from the b and se outputs. For each variable, the t-statistic is b/se. The corresponding p-value is 2*normal(-abs(t)).
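
            A minimal sketch of that calculation, assuming the results file saved by -simulate- in #2 and the default variable names it creates for _b and _se (something like _b_x1 and _se_x1; check with -describe-). Run it in the same do-file so the `results' tempfile still exists, or give -saving()- a permanent filename.
            Code:
            use `results', clear
            foreach v in x1 x2 x3 {
                gen double t_`v' = _b_`v'/_se_`v'
                gen double p_`v' = 2*normal(-abs(t_`v'))
            }
            summarize p_*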

            The "boostrap" estimates of the coefficients are the means of the 1,000 replicated coefficients. The "bootstrap" standard errors of the coefficients are the standard deviations of the 1,000 replicated coefficients. (The bootstrap standard errors are not calculated from the individual regressions' standard errors.) t-statistics, confidence intervals and p-values would then be calculated based on the "bootstrap" estimates of the coefficients and the "bootstrap" standard errors using the same approach as outlined above for the individual replications.

            Of course, what you are doing is not actually bootstrapping, so I cannot vouch for the statistical properties of the inferential statistics you are calculating here, nor can I suggest alternative calculations whose statistical properties could be verified. So you are using this entire approach at your own risk; I do not know what the results actually mean.


            Last edited by Clyde Schechter; 28 May 2023, 12:27.


            • #7
              Thanks!
              To get to one regression that is the average of the simulated regressions, I would take the average of the coefficients and the average of the standard errors for each variable over the 1000 regressions.
              With those I would then calculate the average p-value. Does that make sense, or can one not do that? Also, could I take the average of the R2-within as well?


              • #8
                Well, you can calculate anything you want. The question is whether it means anything. Here is how you would do it with ordinary bootstrapping:

                Averaging the coefficients makes sense and can be reasonably interpreted as an estimate of the "true" coefficient.

                Averaging standard errors does not make sense at all. The resulting statistic has no meaningful interpretation that I am aware of. The standard error of a coefficient for the "average" simulated regression is the standard deviation of the simulated coefficients. A 95% confidence interval for the average regression coefficient is from the 2.5th to the 97.5th percentile of the simulated coefficients. (It is not calculated by averaging the confidence limits of the simulated regressions.)
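
                A sketch of that percentile interval for one coefficient (x1 here), again assuming the saved -simulate- results contain a variable named _b_x1:
                Code:
                use `results', clear
                _pctile _b_x1, percentiles(2.5 97.5)
                display as text "95% CI for x1: [" %9.4f r(r1) ", " %9.4f r(r2) "]"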

                As I do not, in my own work, pay much if any attention to p-values, I don't recall how bootstrap p-values are calculated, but I think they are done by applying the same formula I gave you for the p-values of the individual regressions that I set out in #6 above. I am not 100% certain that is correct, but I believe it is. I can say with complete confidence that averaging the p-values of the individual regressions is definitely wrong.

                Given the way R2 statistics are themselves calculated, I do not see how averaging them could result in a statistic that would be appropriately considered to be an R2 for the average regression. I think that in real bootstrapping, the R2 statistics are not bootstrapped but are simply reported from the regression carried out on the original data. I really don't know this for sure, however. Moreover, in your original data, the dependent variable you are using for your "bootstrap" does not even exist, so I don't know if there is anything at all you could do for this one.

                And, again, I want to emphasize that I am not entirely confident that your procedure is sufficiently analogous to an actual bootstrap to warrant any of this.
                Last edited by Clyde Schechter; 28 May 2023, 18:57.


                • #9
                  Hmm, for my work I am not that much interested in the individual regressions or variables when doing them 1000 times, although they are interesting and helpful.
                  The ultimate goal is to have one regression that encompasses the many regressions, since I have that randomness factor when constructing the dependent variable.
                  Is there maybe a better way to do it?

                  Maybe I should restate my problem:
                  I have opinions of persons on IDs for different years. One person only has one opinion on an individual ID per year, but one person can have opinions on multiple IDs in a year. The individual person is not relevant to my problem.
                  I want to take 3 random persons' opinions on an ID in a year and take their mean opinion as a dependent variable, to see how they perform and how the other variables of the ID influence them in a hypothetical group.
                  Obviously, the regression coefficients, p-values, R2, etc. change with each iteration, since the 3 people are taken at random every time.

                  So to get a result that is somewhat consistent over a larger number of iterations, I thought just simulating the regression 1000 times with random groups each time would get me close to that. Ideally, I'd want a "normal" regression table as an outcome. Maybe you have another idea?


                  Last edited by Nick Bertel; 28 May 2023, 19:50.


                  • #10
                    I think simulating the regression, instead of bootstrapping, like you did in your example, is the right way. I might just be too stubborn about how I want to display the results. I'd just like a way to display the results of the 1000 regressions that includes the coefficients, the p-values and the R2, so that I can confidently state, for example, that x1 is statistically significantly negatively correlated with the mean opinion of the random group of 3. Just averaging the coefficients is not enough for me, since I want to know about their statistical significance as well.
                    Maybe you have an idea or example literature on how to display the results?

                    Thank you!


                    • #11
                      Please re-read #8. I have indicated there how you can get the summary coefficients, standard errors for them, and from those two statistics calculate a p-value in a way that mimics what -bootstrap- does. The only statistic that I do not see a way to get in this way is the R2.

                      I have also cautioned you, and I will do so again, that the validity of this approach is not established. It isn't clear to me what you are actually estimating when you use the mean of 3 people's responses instead of just using all of the original response data. Since I do not perceive what the estimand is, I cannot vouch for the relationship of the estimator to the estimand. And while there is a good theoretical basis, as well as empirical support, for the way -bootstrap- works, it is not obvious to me whether or not that same support carries over to this bootstrap-ish procedure. There are things that cannot be estimated with bootstrap, and whatever your estimand is, it may be among them, or not.

                      If anybody else following this thread has ideas on this, I encourage him or her to contribute.


                      • #12
                        I thought about my problem again:

                        Couldn't I just avoid all my troubles and generate 1000 new observations for each id-year combination? With each new observation I'd take 3 random people for each id-year combination, take the mean and standard deviation of "value" for these 3 people, and keep all the other variables the same, as they are the same for every id-year combination.
                        With these newly generated observations I could then just do a normal regression. Am I missing something? My observations would then be in the tens of millions. Can Stata deal with that amount of data?
                        And most importantly, how would I do that in Stata?
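
                        A minimal sketch of one way the above could be coded (hypothetical: the program name one_draw and the variables m_value, sd_value and draw are made up here, and, per the caveats in #8 and #11, the statistical validity of the approach is not established):
                        Code:
                        capture program drop one_draw
                        program define one_draw
                            // put the opinions within each id-year group in random order
                            gen double shuffle = runiform()
                            // mean and SD of value over the 3 randomly chosen people
                            by id year (shuffle), sort: egen double m_value = mean(cond(_n <= 3, value, .))
                            by id year (shuffle), sort: egen double sd_value = sd(cond(_n <= 3, value, .))
                            // keep one pseudo-observation per id-year combination
                            by id year (shuffle): gen byte first = (_n == 1)
                            keep if first
                            keep id year m_value sd_value x1 x2 x3
                        end
                        
                        tempfile original stacked
                        save `original'
                        set seed 1234
                        forvalues i = 1/1000 {
                            use `original', clear
                            quietly one_draw
                            gen int draw = `i'
                            if `i' > 1 append using `stacked'
                            quietly save `stacked', replace
                        }
                        use `stacked', clear
                        * then run the single pooled regression of your choice on m_value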


                        • #13
                          This was cross-posted at https://www.reddit.com/r/stata/comme...p_regressions/

                          The folks at Reddit can look after themselves, I guess, but I note that there it is given as a rule that you tell people about cross-postings.

                          Here it is an explicit request. https://www.statalist.org/forums/help#crossposting


                          • #14
                            Originally posted by Nick Bertel View Post
                            With these newly generated observations I could then just do a normal regression. Am I missing something? My observations would then be in the tens of millions. Can Stata deal with that amount of data?
                            Stata has limits (see -help limits-) for the number of observations, but even the "smallest" tier, StataBE, allows 2,147,483,619 observations, so no worries there.

                            But let's back up once again to the question you seem to be ignoring. Clyde has already issued several warnings that the validity of your approach is not established, nor is it clear what the purpose of your proposal is. Can you explain to us what you want to do in plain English? *What* purpose do you think these random samples serve?

                            Lastly, why can't you simply estimate one regression model for all opinions, topics and years?


                            • #15
                              Sorry Nick, I didn't know about the rule. I will make sure to clarify that in the future.

                              @Leonardo: My research is explicitly about forming groups of people to get their consensus opinion. The groups have to be the same size, but some ID-year combinations have more opinions than others. Some have 20, others only have 2. So I want to set a floor, for example 3 opinions per group. I can't just form the groups once, as that would ignore all the other opinions for all ID-year combinations, and I only want one group per ID-year combination per regression run, as the ID-year combinations with more opinions would be overrepresented otherwise.
                              So to address that problem I'd want to do multiple regressions with different random groups, to see the effects of the different variables on the group opinions and to make sure they are consistent, no matter the composition of the group.
                              I'm not sure how to combine the results of the several regressions, as I only have experience with "normal" regressions and their output in Stata.
                              Last edited by Nick Bertel; 30 May 2023, 17:09.
