
  • Problem with dividing a data set randomly but consistently with a random variable

    Dear Statalisters,

    I generated a dataset and want to divide it randomly 70/30. At the beginning of the do-file I use set seed, but the division still changes between runs, so a regression on the 70% gives different estimates every time. How do I get a consistent division based on a random variable?
    The code is below. ZV2 is the random variable I generated, and OOS should divide the data set into OOS=0 (70% of the observations) and OOS=1 (30% of the observations). However, when I summarize my dependent variable, the summary statistics of the subgroups differ from run to run. How can I divide the data set randomly, but with the same observations in the same subgroups every time I run the do-file?
    I really appreciate your help.

    set seed 100
    gen ZV2 = runiform()
    label var ZV2 "random variable"
    xtile OOS = ZV2, nquantiles(10)
    replace OOS = 0 if OOS <=7
    replace OOS = 1 if OOS >7
    label var OOS "indicator"
    global y depvar
    sum $y if OOS==1
    sum $y if OOS==0
    Kind regards


  • #2
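    Is it at all possible that earlier in your program you are doing something that leaves your dataset in a different order each time? With the same seed, runiform() produces the same sequence of numbers, but those numbers are assigned to observations in their current order, so a different order gives a different split. For example, if you are sorting your data and there are ties (that is, multiple observations with the same values for the sort key variables), the output of help sort tells us: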

    Without the stable option, the ordering of observations with equal values of varlist is randomized.
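
    If that is what is happening, one possible fix is to put the data in a fully determined order before drawing the random numbers. The lines below are only a sketch under an assumption: id is a hypothetical variable that uniquely identifies each observation and is not in your posted code, so substitute whatever key does that in your data.

    ```stata
    * ASSUMPTION: id is a hypothetical variable that uniquely identifies
    * each observation; replace it with your own identifier(s).
    isid id                      // stops with an error if id is not unique
    sort id                      // unique key, so there are no ties to randomize

    set seed 100
    generate double ZV2 = runiform()
    label var ZV2 "random variable"

    * Order by the random draw and flag the last 30% as out-of-sample.
    sort ZV2
    generate byte OOS = _n > 0.7 * _N
    label var OOS "indicator"
    ```

    Because the order is fixed by a unique key before runiform() is called, each observation should receive the same draw on every run, and OOS should therefore mark the same observations. If an earlier sort on a key with ties has to stay in the do-file, adding the stable option to that sort keeps tied observations in their existing relative order rather than shuffling them.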