How to randomly assign individuals to different groups (other than the actual group)

Samuel Zhang

Join Date: Nov 2015

Posts: 10
#1

12 Feb 2025, 04:19

Hello, I have a panel dataset that records individual-level health across different months of the same year. Individuals live in different counties, each county has a different level of pollution, and the number of people differs across counties.

So, my outcome variable is health, the explanatory variable is pollution. My main regression examines the impact of pollution on (individual's) health.

Now, I want to randomly assign individuals to different counties (other than their actual county), where pollution differs from that of their actual county. I want to examine whether the results are robust.

How can I achieve this?
Mike Lacy

Join Date: Apr 2014

Posts: 2409
#2

12 Feb 2025, 07:55

To give a helpful answer to your question, I need more information.

1. I would guess that you want to repeatedly reassign ("shuffle") the county values, and perform some estimation command for each such assignment. Is this correct? If so, what estimation would that be? And, if so, what about your data makes conventional statistical procedures undesirable for your situation?

2. If #1 is correct, do you want confidence intervals, or p-values?

3. Do you want to reassign the county values in such a way that an individual stays in the same (reassigned) county for every month? Presuming that, do you want county values reassigned based on "month 1," "month N", or what?

4. Please show an example of your data. In this regard, see item 12.2 of the Statalist FAQ that newer participants on Statalist are asked to read.

I suspect that your data structure and goal will not fit nicely with either -permute- or -bootstrap-, although those might be a good starting point for your consideration. A better command might be the more versatile user-written command -ritest- (-search ritest-).
George Ford

Join Date: Aug 2014

Posts: 3126
#3

12 Feb 2025, 08:33

Depends on whether you never want the county to be the same.

I'd think you'd want to assign randomly without restraint (imposing the null). For that, use shufflevar. It will maintain the sample sizes by county, since it just shuffles the county identifier. It creates a new variable so as not to meddle with the data. Set a loop and shuffle repeated and store the estimates in some way.

Crude example. There are other ways to store results. Could also move R to a new frame so as not to impact the original data.

Code:

sysuse auto, clear
reg price weight i.foreign , r
scalar beta = _b[weight]
matrix R = J(100,2,.)
matrix colnames R = coef reject
forv i = 1/100 {
    quietly {
        shufflevar foreign
        reg price weight i.foreign_shuffled , r
        matrix R[`i',1] = r(table)[1,1]
        matrix R[`i',2] = r(table)[4,1] < 0.05
    }
}
svmat R , n(col)
summ coef reject
qui gunique make if coef > beta
di "Permuted beta > beta = " r(J)/100
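For comparison, the same exercise can be sketched with Stata's built-in -permute- prefix, which reshuffles the named variable and re-runs the command without needing shufflevar or gunique. This is only a minimal sketch, assuming the auto data and the same statistic (the coefficient on weight) as in the loop above; the seed value is arbitrary.

```stata
sysuse auto, clear

* Shuffle foreign 100 times and collect the coefficient on weight each time.
* permute restores the original data when it finishes.
permute foreign _b[weight], reps(100) seed(12345): ///
    regress price weight i.foreign, vce(robust)
```

The output table reports how often the permuted coefficient is at least as extreme as the observed one.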
Mike Lacy

Join Date: Apr 2014

Posts: 2409
#4

12 Feb 2025, 13:24

I had thought of suggesting -shufflevar- as well, but -search shufflevar- no longer finds it.
George Ford

Join Date: Aug 2014

Posts: 3126
#5

12 Feb 2025, 13:27

https://ideas.repec.org/c/boc/bocode/s457116.html

Samuel Zhang

Join Date: Nov 2015
Posts: 10
#6

12 Feb 2025, 16:03

Thank you Mike. See my sample data below. Note that individuals stay in the same county (countyid) across different months.

The regression that I run is this:

reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)

Now I want to randomly assign individuals to other counties where pollution is different. Would ritest help in this case? Could you suggest code for this, or for another approach?

id	countyid	month	pollution	health	age (years)
1	1	1	100	9	20
1	1	2	80	8	20
2	1	1	100	8	22
2	1	2	80	7	22
2	1	3	70	5	22
3	2	1	50	6	19
3	2	2	60	8	19
3	2	3	30	7	19
4	2	1	50	6	18
4	2	2	60	9	18
5	3	1	50	8	19
5	3	2	80	8	19
6	4	1	60	9	16
6	4	2	60	8	16
6	4	3	70	6	16
7	2	1	50	6	18
7	2	2	60	5	18
7	2	3	30	4	18
8	3	1	50	8	17
8	3	2	80	9	17


Samuel Zhang

Join Date: Nov 2015

Posts: 10
#7

12 Feb 2025, 16:05

Thank you George. I just shared my sample dataset above. Could you kindly share your further thoughts/comments/suggestions, please?
George Ford

Join Date: Aug 2014

Posts: 3126
#8

12 Feb 2025, 16:45

Why not shuffle the pollution variable?
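A minimal sketch of what shuffling pollution within month does, using only built-in commands and the sample rows from #6. The names u, k, and placebo_pollution are made up for illustration, and this is a single draw; a real test would repeat it many times and re-run the regression each time.

```stata
clear
input id countyid month pollution health age
1 1 1 100 9 20
1 1 2 80 8 20
2 1 1 100 8 22
2 1 2 80 7 22
2 1 3 70 5 22
3 2 1 50 6 19
3 2 2 60 8 19
3 2 3 30 7 19
4 2 1 50 6 18
4 2 2 60 9 18
5 3 1 50 8 19
5 3 2 80 8 19
6 4 1 60 9 16
6 4 2 60 8 16
6 4 3 70 6 16
7 2 1 50 6 18
7 2 2 60 5 18
7 2 3 30 4 18
8 3 1 50 8 17
8 3 2 80 9 17
end

set seed 12345
gen double u = runiform()

* k is a random rank 1.._N within each month
bysort month (u): gen long k = _n

* Under by:, subscripts count within the by-group, so pollution[k] picks the
* k-th observation (in id order) of the same month
bysort month (id): gen placebo_pollution = pollution[k]

list id month pollution placebo_pollution, sepby(month)
```

Each month keeps exactly the same set of pollution values; only who receives which value changes.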

George Ford

Join Date: Aug 2014
Posts: 3126
#9

12 Feb 2025, 16:50

Code:

ritest pollution _b[pollution] , r(500) : reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)

** I think this permutes so values are chosen only from the same month.
ritest pollution _b[pollution] , r(500) strata(month) : reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)
ritest pollution _b[pollution]/_se[pollution] , r(500) strata(month) : reghdfe health pollution control, absorb(id countyid month) vce(cluster countyid)


George Ford

Join Date: Aug 2014

Posts: 3126
#10

12 Feb 2025, 17:00

If id doesn't change county, then you can drop countyid from absorb (reghdfe tells you if it is redundant).

Samuel Zhang

Join Date: Nov 2015
Posts: 10

#11

12 Feb 2025, 17:16

Hi George, thank you very much. How do you make sense of strata and cluster in this case? Should county be cluster or strata?


George Ford

Join Date: Aug 2014

Posts: 3126
#12

13 Feb 2025, 08:10

If you care about the time-series properties, then use strata.

Strata: Values are only permuted within each stratum (month in your case). This preserves the time structure of your data.
Cluster: The entire cluster is permuted as a unit, which would keep all observations for an individual/group together.
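
For the request in #1 (move each individual to a county other than their own, and keep them there in every month), a by-hand sketch under stated assumptions follows. It uses only built-in commands and the sample rows from #6; the names placebo_county, placebo_pollution, u, and pick are hypothetical. County labels are shuffled across individuals, so county sizes are preserved, and a draw is rejected and redrawn if anyone keeps their own county. areg stands in for reghdfe here so that nothing needs installing.

```stata
clear
input id countyid month pollution health age
1 1 1 100 9 20
1 1 2 80 8 20
2 1 1 100 8 22
2 1 2 80 7 22
2 1 3 70 5 22
3 2 1 50 6 19
3 2 2 60 8 19
3 2 3 30 7 19
4 2 1 50 6 18
4 2 2 60 9 18
5 3 1 50 8 19
5 3 2 80 8 19
6 4 1 60 9 16
6 4 2 60 8 16
6 4 3 70 6 16
7 2 1 50 6 18
7 2 2 60 5 18
7 2 3 30 4 18
8 3 1 50 8 17
8 3 2 80 9 17
end

* County-by-month pollution lookup (pollution is constant within county-month here)
preserve
    collapse (mean) pollution, by(countyid month)
    rename (countyid pollution) (placebo_county placebo_pollution)
    tempfile lookup
    save `lookup'
restore

* One row per individual; shuffle county labels until nobody keeps their own
preserve
    keep id countyid
    duplicates drop
    set seed 12345
    local ok 0
    while !`ok' {
        capture drop u pick placebo_county
        gen double u = runiform()
        sort u
        gen long pick = _n                      // random permutation of 1.._N
        sort id
        gen placebo_county = countyid[pick]     // county of a randomly picked person
        count if placebo_county == countyid
        local ok = (r(N) == 0)
    }
    keep id placebo_county
    tempfile assign
    save `assign'
restore

* Attach the placebo county, then that county's pollution in the same month
merge m:1 id using `assign', nogenerate
merge m:1 placebo_county month using `lookup', keep(master match) nogenerate

* Placebo regression; rows with no pollution record for that county-month drop out
areg health placebo_pollution i.month, absorb(id)
```

To turn this into a test, wrap the shuffle, merge, and regression steps in a loop and store _b[placebo_pollution] from each draw, as in the shufflevar example in #3. Run it as a do-file, since the tempfile names are local macros.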
