draw random sample from current dataset in order to generate fake peers

John Li

Join Date: Aug 2023

Posts: 2
#1

23 Aug 2023, 00:17

I am seeking assistance in conducting a falsification test involving fake peers' body mass index (BMI) using Stata. Specifically, I am interested in defining peers for a child (identified by pid) as children within 2 years of age in the same community (cid). Furthermore, I would like to establish fake peers as children within 2 years of age from a different community but within the same county (countyid).

My plan involves randomly selecting a sample of children aged within two years from another community but within the same county, calculating the average BMI, and then assigning this average as the fake peers' BMI for a given child. While I have some code to generate peers' average BMI, my challenge lies in creating the average BMI for the fake peers.

Any guidance or code suggestions on generating the average BMI for the fake peers would be greatly appreciated. Thank you in advance for your assistance!

rangestat (mean) peerbmi = bmi, by(cid) interval(age -2 2) excludeself

Here is a sample dataset that illustrates the structure of the data I am working with:
clear
input pid cid countyid age bmi
110011104 118300 1 17 .
110011104 118300 1 15 20.28123
110011104 118300 1 10 13.71742
110015520 118300 1 3 .
110018552 118300 1 2 16.609
110020104 118300 1 15 15.24158
110020112 118300 1 10 .
110022104 118300 1 15 19.48696
110022104 118300 1 7 13.88889
110026104 118300 1 11 26.89618
110041104 118300 1 16 20.0155
110043104 118300 1 31 .
110043104 118300 1 14 15.14889
110047104 118300 1 13 22.59814
110051520 118300 1 3 15.61849
110052504 118300 1 2 20.0692
110062552 118300 1 0 .
110071104 118300 1 7 14.87603
110079104 118300 1 17 17.31341
110083400 118300 1 5 13.19444
110092104 118400 1 14 16.56301
110108552 118400 1 4 17.23356
110113400 118400 1 4 .
110124400 118400 1 6 16.52892
110126520 118400 1 2 15.94991
110129552 118400 1 0 18.05556
110157520 118400 1 6 22.32005
110339504 118400 1 4 15.08153
110661568 118400 1 . .
110689552 118400 1 4 15
110717504 118400 1 3 15
111230552 118400 1 1 18.05556
111308552 118400 1 1 21.42857
111997552 118400 1 1 .
111997552 118400 1 0 .
112255568 118400 1 3 25.5102
112323568 118400 1 1 .
112323568 118400 1 0 .
112416552 118400 1 4 24.69136
112416552 118400 1 1 17.83591
end
format pid %11.0g

Clyde Schechter

Join Date: Apr 2014
Posts: 30164
#2

23 Aug 2023, 11:17

A little algebra to the rescue:

Code:

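* Count non-missing bmi (not pid), so that count*mean recovers the sum of bmi
* even when some children in the window have missing bmi.
rangestat (mean) peerbmi = bmi (count) peer_count = bmi, ///
    by(cid) interval(age -2 2) excludeself

rangestat (mean) countybmi = bmi (count) county_count = bmi, ///
    by(countyid) interval(age -2 2) excludeself

* County total minus own-community total, divided by the number of
* age-matched children in the other communities of the county.
gen fakepeerbmi ///
    = (county_count*countybmi - peer_count*peerbmi)/(county_count - peer_count)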
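
Spelled out, the algebra behind that last line appears to be the following. The symbols S (sum of non-missing bmi in the age window, excluding the index observation) and n (the matching count) are introduced here for illustration only and are not in the original post. The identity holds only if each count is the number of non-missing bmi values that entered the corresponding mean.

```latex
\begin{align*}
\mathrm{countybmi}_i &= \frac{S^{\mathrm{county}}_i}{n^{\mathrm{county}}_i},
\qquad
\mathrm{peerbmi}_i = \frac{S^{\mathrm{peer}}_i}{n^{\mathrm{peer}}_i}, \\[4pt]
\mathrm{fakepeerbmi}_i
  &= \frac{S^{\mathrm{county}}_i - S^{\mathrm{peer}}_i}
          {n^{\mathrm{county}}_i - n^{\mathrm{peer}}_i}
   = \frac{n^{\mathrm{county}}_i \,\mathrm{countybmi}_i
           - n^{\mathrm{peer}}_i \,\mathrm{peerbmi}_i}
          {n^{\mathrm{county}}_i - n^{\mathrm{peer}}_i}.
\end{align*}
```

Because every community belongs to exactly one county, subtracting the own-community sum and count from the county-wide ones leaves exactly the age-matched children in the other communities of the same county.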

John Li

#3

24 Aug 2023, 18:24

It works, thank you, Clyde. How about if we want to randomly draw a sample consisting of children within 2 years of age from another community in the same county, average it to generate fake peers, and then repeat it 100 times?
Clyde Schechter

#4

25 Aug 2023, 09:07

It is not clear to me how you want to draw the random samples. Random samples of what size? With or without replacement? If without replacement, what will you do if some pid's do not have a sufficient number of age-matched peers in other communities in the county?

Regardless of the answers to those questions, I should point out that this approach will be vastly more computationally and memory intensive than what you outlined in #1, and if your data set is very large you may be biting off more than you can chew here. I also don't see why it would produce anything better than the approach in #1, though I won't claim to have thoroughly thought this point through. But, if you want to proceed, do post back with answers to the questions in the first paragraph.
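
Purely to illustrate what such a procedure might look like, here is a minimal sketch under assumed answers to those questions; none of these choices come from the thread. It assumes samples of size 5 drawn without replacement, uses all available candidates when fewer than 5 exist, and runs 100 replications. The names obs_id, cid_c, age_c, bmi_c and fakebmi1 through fakebmi100 are hypothetical. The joinby step forms every within-county pair, which is exactly the memory-hungry part.

```stata
* Assumed, not from the thread: sample size 5, without replacement, 100 reps
set seed 20230825
local k 5
local reps 100

gen long obs_id = _n                 // identifies each index observation
tempfile index
save `index'

* Candidate pool: observations with usable age and bmi
keep countyid cid age bmi
drop if missing(age, bmi)
rename (cid age bmi) (cid_c age_c bmi_c)

* All index-candidate pairs within a county
joinby countyid using `index'

* Keep candidates from another community and within 2 years of age
keep if cid_c != cid & abs(age_c - age) <= 2

forvalues r = 1/`reps' {
    gen double u = runiform()
    * The k smallest random numbers within each index observation
    * form a random sample without replacement
    bysort obs_id (u): gen byte pick = _n <= `k'
    egen double fakebmi`r' = mean(cond(pick, bmi_c, .)), by(obs_id)
    drop u pick
}

* Return to one row per index observation and restore the original data
bysort obs_id: keep if _n == 1
keep obs_id fakebmi*
merge 1:1 obs_id using `index', nogenerate
```

Index observations with no eligible candidates (or with missing age) come back from the merge with missing values in all of the fakebmi variables.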

I should also note that the use of age-matching within 2 years raises some problems. In your example data, you are looking at children age 2 to 17. BMI grows very rapidly with age during this early phase of life. It is a bit of a stretch to say that a 10 year old is a BMI-peer to either an 8 year old or a 12 year old. Now, you are averaging over the entire peer pool, which somewhat mitigates that. But at the low end of the age range (2-4) the candidate peers will be mostly older than the index pid, and at the high end (15-17) they will be mostly younger, so that your peer groups will not be good matches to the index pid's.

Last edited by Clyde Schechter; 25 Aug 2023, 09:15.