  • draw random sample from current dataset in order to generate fake peers

    I am seeking assistance in conducting a falsification test involving fake peers' body mass index (BMI) using Stata. Specifically, I am interested in defining peers for a child (identified by pid) as children within 2 years of age in the same community (cid). Furthermore, I would like to establish fake peers as children within 2 years of age from a different community but within the same county (countyid).

    My plan involves randomly selecting a sample of children aged within two years from another community but within the same county, calculating the average BMI, and then assigning this average as the fake peers' BMI for a given child. While I have some code to generate peers' average BMI, my challenge lies in creating the average BMI for the fake peers.

    Any guidance or code suggestions on generating the average BMI for the fake peers would be greatly appreciated. Thank you in advance for your assistance!

    rangestat (mean) peerbmi =bmi, by(cid) interval(age -2 2) excl

    Here is a sample dataset that illustrates the structure of the data I am working with:
    clear
    input pid cid countyid age bmi
    110011104 118300 1 17 .
    110011104 118300 1 15 20.28123
    110011104 118300 1 10 13.71742
    110015520 118300 1 3 .
    110018552 118300 1 2 16.609
    110020104 118300 1 15 15.24158
    110020112 118300 1 10 .
    110022104 118300 1 15 19.48696
    110022104 118300 1 7 13.88889
    110026104 118300 1 11 26.89618
    110041104 118300 1 16 20.0155
    110043104 118300 1 31 .
    110043104 118300 1 14 15.14889
    110047104 118300 1 13 22.59814
    110051520 118300 1 3 15.61849
    110052504 118300 1 2 20.0692
    110062552 118300 1 0 .
    110071104 118300 1 7 14.87603
    110079104 118300 1 17 17.31341
    110083400 118300 1 5 13.19444
    110092104 118400 1 14 16.56301
    110108552 118400 1 4 17.23356
    110113400 118400 1 4 .
    110124400 118400 1 6 16.52892
    110126520 118400 1 2 15.94991
    110129552 118400 1 0 18.05556
    110157520 118400 1 6 22.32005
    110339504 118400 1 4 15.08153
    110661568 118400 1 . .
    110689552 118400 1 4 15
    110717504 118400 1 3 15
    111230552 118400 1 1 18.05556
    111308552 118400 1 1 21.42857
    111997552 118400 1 1 .
    111997552 118400 1 0 .
    112255568 118400 1 3 25.5102
    112323568 118400 1 1 .
    112323568 118400 1 0 .
    112416552 118400 1 4 24.69136
    112416552 118400 1 1 17.83591
    end
    format pid %11.0g


  • #2
    A little algebra to the rescue:
    Code:
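    * Age-matched peers in the same community. Count non-missing bmi (not pid)
    * so that count*mean reproduces the sum of the values behind each mean.
    rangestat (mean) peerbmi = bmi (count) peer_count = bmi, ///
        by(cid) interval(age -2 2) excludeself
    
    * Age-matched children anywhere in the same county.
    rangestat (mean) countybmi = bmi (count) county_count = bmi, ///
        by(countyid) interval(age -2 2) excludeself
        
    * County total minus own-community total, over the matching difference in counts.
    * The result is missing whenever peerbmi or countybmi is missing.
    gen fakepeerbmi ///
        = (county_count*countybmi - peer_count*peerbmi)/(county_count - peer_count)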
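
    The algebra behind the last line can be spelled out as follows. The symbols are shorthand introduced here, not names from the code: N_c and \bar{b}_c stand for county_count and countybmi, N_p and \bar{b}_p for peer_count and peerbmi. A mean times its count is a sum, so subtracting the own-community sum from the county sum leaves the sum over age-matched children in the other communities of the county:

    \[
    \bar{b}_{\mathrm{fake}}
      = \frac{N_c\,\bar{b}_c - N_p\,\bar{b}_p}{N_c - N_p}
    \]

    The identity only holds when each count is the number of non-missing BMI values that entered the corresponding mean, and it is undefined when N_c = N_p, that is, when no age-matched child with a BMI exists elsewhere in the county.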


    • #3
      It works, thank you, Clyde. How about if we want to randomly draw a sample consisting of children within 2 years of age from another community in the same county, average it to generate fake peers, and then repeat it 100 times?


      • #4
        It is not clear to me how you want to draw the random samples. Random samples of what size? With or without replacement? If without replacement, what will you do if some pid's do not have a sufficient number of age-matched peers in other communities in the county?

        Regardless of the answers to those questions, I should point out that this approach will be vastly more computationally and memory intensive than what you outlined in #1, and if your data set is very large you may be biting off more than you can chew here. I also don't see why it would produce anything better than the approach in #1, though I won't claim to have thoroughly thought this point through. But, if you want to proceed, do post back with answers to the questions in the first paragraph.
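
        Pending those answers, here is a minimal sketch of one possible reading, not a settled design. Everything specific in it is an assumption: each replicate keeps every observation with probability 0.5 (so the sample size is random and the draw is without replacement), the same draw is shared by all index children within a replicate, and the seed, the 0.5, and the variable names fakepeerbmi1 through fakepeerbmi100 are placeholders. It reuses the sum-and-count subtraction from #2 on the sampled values and requires rangestat from SSC.

        ```stata
        * Sketch only: Bernoulli subsampling with an assumed retention probability.
        set seed 12345              // assumed seed, for reproducibility
        local reps 100
        local p 0.5                 // assumed retention probability

        forvalues r = 1/`reps' {
            * bmi_s is bmi for sampled observations and missing otherwise
            gen double bmi_s = bmi if runiform() < `p'

            * Sums and counts of sampled, age-matched children in the county ...
            rangestat (sum) c_sum = bmi_s (count) c_n = bmi_s, ///
                by(countyid) interval(age -2 2) excludeself

            * ... and in the child's own community
            rangestat (sum) p_sum = bmi_s (count) p_n = bmi_s, ///
                by(cid) interval(age -2 2) excludeself

            * Treat "no sampled peers in own community" as zero, not missing
            replace p_sum = 0 if missing(p_sum)
            replace p_n   = 0 if missing(p_n)

            * Mean over sampled children in other communities of the county;
            * division by zero yields missing in Stata
            gen double fakepeerbmi`r' = (c_sum - p_sum)/(c_n - p_n)

            drop bmi_s c_sum c_n p_sum p_n
        }
        ```

        This adds 100 variables and runs rangestat 200 times, which illustrates the computational cost mentioned above. A fixed sample size per index child, or sampling with replacement, would need a different approach, such as pairing each child with its candidate fake peers explicitly.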

        I should also note that the use of age-matching within 2 years raises some problems. In your example data, you are looking at children age 2 to 17. BMI grows very rapidly with age during this early phase of life. It is a bit of a stretch to say that a 10 year old is a BMI-peer to either an 8 year old or a 12 year old. Now, you are averaging over the entire peer pool, which somewhat mitigates that. But at the low end of the age range (2-4) the candidate peers will be mostly older than the index pid, and at the high end (15-17) they will be mostly younger, so that your peer groups will not be good matches to the index pid's.
        Last edited by Clyde Schechter; 25 Aug 2023, 09:15.
