  • Random allocation of observations

    Dear Community.

    I have about five million observations (men in their 40s).

    I'd like to randomly assign these 5 million people to 10 groups according to the distribution below.
    In other words, each observation should be allocated to one of groups 1 to 10, and the overall group proportions should match the table.

    group proportion
    1 0.07
    2 0.19
    3 0.16
    4 0.12
    5 0.21
    6 0.01
    7 0.05
    8 0.04
    9 0.1
    10 0.05
    (total 1.00)


    In addition, my full data set has about 20 million people, and the above task should be performed separately within each gender*age group.


    Thanks in advance.


    Best regards,
    Yunsun

  • #2
    This can be done by randomly ordering your observations, and applying the -egen- function -cut- to the random positions within each age and sex group.

    Code:
    // Simulate your data at smaller scale
    clear
    set seed 494676
    set obs 500000
    gen long id = _n
    gen agesex = ceil(runiform()^1.5 * 10)
    // end data
    //
    gen double rand = runiform()
    sort agesex rand
    by agesex: gen fracpos = _n/_N  // random place w/in group
    // Cumulate your distribution into cutpoints; the top is 1.01 so that fracpos == 1 is caught
    local atlist = "0.0, 0.07, 0.26, 0.42, 0.54, 0.75, 0.76, 0.81, 0.85, 0.95, 1.01"
    egen group = cut(fracpos), at(`atlist') icodes  // icodes numbers the groups 0-9; add 1 for 1-10
    // check it out
    tab group agesex, col
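
    If typing the cumulative cutpoints by hand feels error-prone, one possible approach is to build the list from the proportions given in the question. This is only a sketch: the macro names props, cum, and cut are made up for illustration, and the top value of 1.01 is kept for the same reason as above.

    ```stata
    // Proportions for groups 1-10, as listed in the question
    local props "0.07 0.19 0.16 0.12 0.21 0.01 0.05 0.04 0.10 0.05"
    local k : word count `props'
    local cum = 0
    local atlist "0"
    // Accumulate the first k-1 proportions into interior cutpoints
    forvalues j = 1/`=`k' - 1' {
        local p : word `j' of `props'
        local cum = `cum' + `p'
        local cut : display %4.2f `cum'
        local atlist "`atlist', `cut'"
    }
    // Close the last interval above 1 so that fracpos == 1 is included
    local atlist "`atlist', 1.01"
    display "`atlist'"
    ```

    The resulting macro can then be passed to at(`atlist') exactly as in the code above.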



    • #3
      Hi Mike,

      Thanks a lot for your kind answer. This is very helpful!



      • #4
        Dear Community,

        I have one more question.

        What should I do if each agesex group (the variable defined in Mike's answer) has a different group distribution?



        Thanks in advance.


        Best regards,
        Yunsun



        • #5
          My understanding is that you just want a different set of cutpoints, that is, a different "atlist", for each age and sex group. In that case, I'd code the age/sex groups with the -group()- function of the -egen- command (not to be confused with the -group()- option of the -cut- function; above I used -at()- and -icodes-). This will give age/sex groups numbered from 1 up to however many there are. You presumably know what distribution you want for each of those groups. I haven't tried this out on any data, but I think the following should work:

          Code:
          // Cutpoint lists for each age/sex group, 1 to 10 (or however many you have).
          // I just made up two for an example; each list needs 11 ascending cutpoints for 10 groups.
          local atlist1 = "0.0, 0.07, 0.26, 0.42, 0.54, 0.75, 0.76, 0.81, 0.85, 0.95, 1.01"
          local atlist2 = "0.0, 0.08, 0.30, 0.36, 0.47, 0.59, 0.77, 0.79, 0.83, 0.88, 1.01"
          // local atlist3 = ...
          // ....
          // local atlist10 = ...
          // (define atlist3 through atlist10 the same way before running the loop)
          gen group = .
          forval i = 1/10 {
              egen temp = cut(fracpos) if agesex == `i', at(`atlist`i'') icodes
              replace group = temp if agesex == `i'
              drop temp
          }
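
          The loop assumes that agesex is already numbered 1, 2, ... and that fracpos has been computed within each agesex group. A minimal sketch of that preparation follows, on simulated data. The variable names sex and agecat are hypothetical stand-ins for whatever age and sex variables your data actually hold.

          ```stata
          // Simulated stand-in data; sex and agecat are hypothetical names
          clear
          set seed 494676
          set obs 1000
          gen byte sex = 1 + (runiform() < 0.5)      // 1 or 2
          gen byte agecat = ceil(runiform() * 5)     // 1 to 5
          // Number the age/sex combinations consecutively from 1
          egen agesex = group(agecat sex)
          // Random position within each age/sex group, as in the earlier code
          gen double rand = runiform()
          sort agesex rand
          by agesex: gen double fracpos = _n/_N
          tab agesex
          ```

          After this, the per-group cutpoint lists and the loop above can be applied as written.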
