A question on randomization

Victor Smith

Join Date: Feb 2018

Posts: 85
#1

03 Sep 2019, 22:06

I wonder if there is a command that can do the following:

1. I have a list of 7 numbers, say (2,5,7,8,11,13,14).
2. I want to generate a variable called "v1" with 100 observations, where each observation is randomly assigned one of the 7 numbers in #1. In other words, the 100 observations can only take the values (2,5,7,8,11,13,14).

Is it possible to generate such a variable in Stata?
Tags: data, Generate, random, randomization

Clyde Schechter

Join Date: Apr 2014
Posts: 30100
#2

03 Sep 2019, 22:50

Code:

//  CREATE A DATA SET WITH THE 7 NUMBERS
clear*
input int (pick x)
1 2
2 5
3 7
4 8
5 11
6 13
7 14
end
tempfile 7numbers
save `7numbers'

//  CREATE THE DESIRED DATA SET
clear
set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
set obs 100
gen pick = runiformint(1, 7)
gen sort_order = _n
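//  keep(match) DROPS ANY NUMBER THAT HAPPENED NOT TO BE DRAWN,
//  SO THE RESULT ALWAYS HAS EXACTLY 100 OBSERVATIONS
merge m:1 pick using `7numbers', assert(match using) keep(match) nogenerate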
drop pick
sort sort_order

Victor Smith

Join Date: Feb 2018
Posts: 85
#3

04 Sep 2019, 23:17

Awesome! The solution is so beautiful.

Could you help me one more time?

I created the sample dataset below. For each observation (row), I would like to randomly assign the value of var1, var2, or var3 to x. Is this possible in Stata?

Code:

clear all
input int (var1 var2 var3 x)
1 2 5 .
2 5 2 .
3 7 6 .
4 9 7 .
5 11 9 .
6 13 11 .
7 14 34 .
4 9 7 . 
3 7 6 . 
7 14 34 .
end

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

05 Sep 2019, 08:02

Code:

clear all
input int (var1 var2 var3 x)
1 2 5 .
2 5 2 .
3 7 6 .
4 9 7 .
5 11 9 .
6 13 11 .
7 14 34 .
4 9 7 .
3 7 6 .
7 14 34 .
end

set seed 5678
gen int pick = runiformint(1, 3)
forvalues i = 1/3 {
    replace x = var`i' if pick == `i'
}

Note: If the real situation has a substantially larger number of variables, I would approach it differently, but for picking one out of three, I think this is simplest and best.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#5

05 Sep 2019, 09:26

Here are a couple of other solutions to Victor's original question--not better, just different. I'll bet other people can suggest several other reasonable solutions.

Code:

clear
set obs 100
set seed 185434
// #1
mat urn = (2,5,7,8,11,13,14)
gen x = urn[1,runiformint(1,7)]
// #2
local urn = "2 5 7 8 11 13 14"
replace x = real(word("`urn'", runiformint(1,7)))
Victor Smith

Join Date: Feb 2018

Posts: 85
#6

06 Sep 2019, 06:36

Hi Clyde, thanks again for your beautiful solution.

If there are many variables with different names, how would you do it?
Victor Smith

Join Date: Feb 2018

Posts: 85
#7

06 Sep 2019, 06:40

Thanks Mike, for the diversity of solutions. Really enjoyed reading your solutions.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

06 Sep 2019, 08:44

Here's the general approach, illustrated with the built-in auto.dta.

Code:

clear*
sysuse auto

//  CREATE A LOCAL MACRO LISTING
//  THE VARIABLES THAT VALUES WILL BE
//  RANDOMLY SELECTED FROM
ds price-mpg headroom-gear_ratio
local sources `r(varlist)'

//  SET RANDOM NUMBER SEED
set seed 9101112

//  GIVE SOURCE VARIABLES NAMES WITH
//  A COMMON PREFIX AND CREATE AN OBS IDENTIFIER
//  SO WE CAN RESHAPE
rename (`sources') s_=
gen long obs_no = _n

//  GO LONG
reshape long s_, i(obs_no) j(vname) string

//  SORT EACH OBS VARIABLES INTO RANDOM ORDER
//  AND SELECT THE FIRST FOR NEW VARIABLE x
gen double shuffle = runiform()
by obs_no (shuffle), sort: gen x = s_[1]

//  RESTORE ORIGINAL DATA LAYOUT AND VARIABLE NAMES
drop shuffle
reshape wide
rename s_* *

Establishing a local macro with the desired variable names can be tricky if the names are completely unsystematic and the variables are scattered haphazardly around the data set. Worst case scenario you have to just list them all out, but usually, as here, the use of some wildcards can accomplish it more economically.

Also, the creation of new variable names ahead of the -reshape long- command can be tricky. Since Stata limits variable names to 32 characters, if any of the source variable names are already more than 30 characters, what I've done here won't work. So sometimes this method requires some ad hoc renaming of variables to work.

As a rule of thumb I would say that if your problem involves a large number of variables but the number of observations is modest (and the variable names are not too difficult to work with) this is the approach I prefer. But if the number of variables is small, or if the data set contains a large number of observations (which makes the sorting and -reshape-ing very slow) then I would stick with the approach in #4.
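
For the large-data case, the -replace- loop from #4 can also be stretched to an arbitrary variable list without any -reshape-. What follows is only a minimal sketch, assuming every source variable is numeric; the names pick, n, and i are illustrative and do not come from the posts above.

```stata
clear*
sysuse auto

// LIST THE SOURCE VARIABLES AND COUNT THEM
ds price-mpg headroom-gear_ratio
local sources `r(varlist)'
local n : word count `sources'

set seed 9101112

// DRAW, FOR EACH OBSERVATION, THE POSITION OF THE VARIABLE TO COPY
gen int pick = runiformint(1, `n')
gen double x = .

// WALK THE LIST, COPYING THE i-TH VARIABLE WHERE IT WAS PICKED
local i = 0
foreach v of local sources {
    local ++i
    replace x = `v' if pick == `i'
}
drop pick
```

This makes one pass over the data per source variable and needs no sorting or renaming, so the 32-character name limit is not an issue here.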
Victor Smith

Join Date: Feb 2018

Posts: 85
#9

08 Sep 2019, 21:36

I am just learning and exploring the randomization functions in Stata. Thanks again for helping me with my random questions.