  • Creating composite variable

    Suppose I have the following variables:

    Code:
    . set obs 10000
    obs was 0, now 10000
    
    . gen var1 = rbinomial(1,0.446)
    
    . gen var2 = rbinomial(1,0.339)
    
    . gen var3 = rbinomial(1,0.142)
    
    . gen var4 = rbinomial(1,0.073)
    
    
    . tab var1
    
           var1 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      5,543       55.43       55.43
              1 |      4,457       44.57      100.00
    ------------+-----------------------------------
          Total |     10,000      100.00
    
    . tab var2
    
           var2 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      6,608       66.08       66.08
              1 |      3,392       33.92      100.00
    ------------+-----------------------------------
          Total |     10,000      100.00
    
    . tab var3
    
           var3 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      8,572       85.72       85.72
              1 |      1,428       14.28      100.00
    ------------+-----------------------------------
          Total |     10,000      100.00
    
    . tab var4
    
           var4 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      9,300       93.00       93.00
              1 |        700        7.00      100.00
    ------------+-----------------------------------
          Total |     10,000      100.00
    I would like to create a composite variable var5 that takes the following values:

    1 4,457
    2 3,392
    3 1,428
    4 700

    How can I do that? Combining egen with group(), as suggested in other posts, creates all combinations of the four variables instead.

  • #2
    Your frequencies don't add to 10000. How about this?

    Code:
    . clear
    
    . set obs 10000 
    number of observations (_N) was 0, now 10,000
    
    . gen wanted = cond(_n <= 4460, 1, cond(_n <= 4460 + 3390, 2, cond(_n <= 4460 + 3390 +
    >  1420, 3, 4)))
    
    . tab wanted 
    
         wanted |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |      4,460       44.60       44.60
              2 |      3,390       33.90       78.50
              3 |      1,420       14.20       92.70
              4 |        730        7.30      100.00
    ------------+-----------------------------------
          Total |     10,000      100.00
    If you want them randomised, just sort them on some random numbers:

    Code:
     
    set seed 2803
    gen double unwanted = runiform()
    sort unwanted
    Compare also the following. Here too, it is good practice to set the seed beforehand and record it.

    Code:
    gen WANTED = . 
    mata: st_store(., "WANTED", rdiscrete(10000, 1, (0.446, 0.339, 0.142, 0.073)))
    Important nuance: the first method guarantees the exact frequencies asked if they are possible; the second does not.
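
    A pure-Stata counterpart to the Mata call can be sketched by cutting a single uniform draw at the cumulative probabilities 0.446, 0.785 and 0.927. This is an illustration under assumptions, not something posted in the thread; the names u and WANTED2 are hypothetical. Like rdiscrete(), it matches the requested probabilities only on average, so the tabulated frequencies will vary from seed to seed.

    Code:
    clear
    set obs 10000
    set seed 2803
    * one uniform draw per observation on [0,1)
    gen double u = runiform()
    * cutoffs are cumulative sums of 0.446, 0.339, 0.142 (the remainder, 0.073, is category 4)
    gen byte WANTED2 = cond(u < 0.446, 1, cond(u < 0.785, 2, cond(u < 0.927, 3, 4)))
    tab WANTED2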



    • #3
      Thanks, both approaches are clever. I tried the first one with if statements instead of the cond() function, and I was getting weird results.

      I suppose the more general question here is how to simulate categorical variables with more than two levels. The first approach requires generating the binary variables first and then using their frequencies to create the composite variable manually, so to speak.

      I like the Mata approach better, as it appears to be a more direct solution to the problem. However, I have one question: does the rdiscrete() function produce results equivalent to those of rbinomial()?



      • #4
        The statement based on cond() certainly could be rewritten as a series of statements using if, but I can't comment on what you got wrong without seeing the code.
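
        For illustration, one possible if-based version is sketched below. It is not the code tried in #3, which was not shown, and wanted2 is a hypothetical name. It assigns the last category first and then overwrites from the widest cutoff down to the narrowest. Running the three replace statements in the opposite order would leave the first 9,270 observations all coded 3, which is one way to get odd results.

        Code:
        clear
        set obs 10000
        * start everyone in the last category
        gen byte wanted2 = 4
        * overwrite from the widest cutoff to the narrowest
        replace wanted2 = 3 if _n <= 4460 + 3390 + 1420
        replace wanted2 = 2 if _n <= 4460 + 3390
        replace wanted2 = 1 if _n <= 4460
        tab wanted2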

        rdiscrete() will produce similar results to rbinomial() if and only if the probabilities supplied are consistent with a binomial distribution. I don't understand what you were trying to do in #1 but you'll know that it wasn't equivalent to generating from a single binomial distribution.
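
        A small sketch of the two-level case, where the two functions do line up: rdiscrete() with probabilities (0.554, 0.446) returns 1 or 2, so subtracting 1 gives a 0/1 variable with P(1) = 0.446, the same distribution as rbinomial(1, 0.446). The variable names b1 and b2 are hypothetical, and the two columns will agree in distribution only, not observation by observation.

        Code:
        clear
        set obs 10000
        set seed 2803
        * Bernoulli draw via rbinomial()
        gen b1 = rbinomial(1, 0.446)
        * same distribution via rdiscrete(): categories 1/2 shifted to 0/1
        gen b2 = .
        mata: st_store(., "b2", rdiscrete(st_nobs(), 1, (0.554, 0.446)) :- 1)
        tab1 b1 b2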
