Problem with dividing a data set randomly but consistent with a random variable

Steffen Pluetzke

Join Date: Mar 2019

Posts: 20
#1

08 Jul 2019, 11:43

Dear Statalisters,

I generated a dataset and want to randomly divide it 70/30. At the beginning of the do-file I use the command set seed, but somehow the division changes from run to run, so a regression I run on the 70% gives different estimates every time. How do I get a consistent division based on a random variable?
The code is below. ZV2 is the random variable I generated, and OOS should divide the data set into OOS=0 (70% of the observations) and OOS=1 (30% of the observations). However, when I summarize my dependent variable, the summary statistics of the subgroups are different from run to run. How can I divide the data set randomly, but with the same observations in the same subgroups every time I run the do-file?
I really appreciate your help.

Code:

set seed 100
gen ZV2 = runiform()
label var ZV2 "random variable"
xtile OOS = ZV2, nquantiles(10)
replace OOS = 0 if OOS <=7
replace OOS = 1 if OOS >7
label var OOS "indicator"
global y depvar
sum $y if OOS==1
sum $y if OOS==0

Kind regards

Steffen
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

08 Jul 2019, 18:35

Is it at all possible that earlier in your program you are doing something that leaves your dataset in a different order? For example, if you are sorting your data and there are ties (that is, multiple observations with the same values for the sort key variables), the output of help sort tells us:

"Without the stable option, the ordering of observations with equal values of varlist is randomized."

Because runiform() assigns its values to the observations in their current order, a different order means each observation receives a different ZV2, even with the same seed.
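One way to guard against this, offered only as a minimal sketch: put the data in a fully determined order before drawing the random numbers. The variable id below is hypothetical (it does not appear in the original post) and stands for any variable, or set of variables, that uniquely identifies the observations. The rest follows the 70/30 rule from the code in #1.

```stata
* ASSUMPTION: "id" is a hypothetical variable that uniquely identifies
* observations; substitute whatever key your dataset actually has.
* isid stops with an error if id is not unique, so ties cannot slip through.
isid id
* With a unique sort key there are no ties, so the order is reproducible.
sort id

set seed 100
gen double ZV2 = runiform()
label var ZV2 "random variable"

* Same split as in #1: deciles 1-7 -> OOS = 0, deciles 8-10 -> OOS = 1
xtile decile = ZV2, nquantiles(10)
gen byte OOS = decile > 7
label var OOS "indicator"

* Check the resulting group sizes
tab OOS
```

If no unique identifier is available, adding the stable option to any earlier sort on a key with ties should also keep the order from changing between runs, provided the data enter that sort in the same order each time.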