Simulating from an empirical distribution

Luca Gagliardone

Join Date: Oct 2016

Posts: 54
#1

Simulating from an empirical distribution

09 Mar 2017, 04:23

Dear all,

I am trying to make simulations of realisations of a random variable associated with probabilities I found empirically.
To be clear, suppose I have a random variable which can take values {1,2,3,4,5} with probabilities respectively {.2,.3,0,.1,.4}.
What I am trying to do is to randomly select one entry of the vector (1,2,3,4,5) where each entry has probability to be chose (respectively) taken from the vector (.2,.3,0,.1,.4).
How can I do?

Thanks in advance for the help!
Best
Luca Gagliardone
Tags: empirical distribution, random generator, simulation

Stephen Jenkins

Join Date: Apr 2014
Posts: 1435

09 Mar 2017, 06:47

Something like the following?

Code:

. set obs 10000000
number of observations (_N) was 0, now 10,000,000

.
. /*
> values {1,2,3,4,5} with probabilities respectively {.2,.3,0,.1,.4}.
> */
.
. set seed 123456789

.
. ge double r = runiform()

. su r

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           r | 10,000,000     .500094    .2886607   7.23e-08   .9999999

. ge v = .
(10,000,000 missing values generated)

. replace v = 1 if inrange(r, 0, .2)
(1,998,376 real changes made)

. replace v = 2 if inrange(r, `=.2 + 1e-9', .5)
(3,000,913 real changes made)

. replace v = 4 if inrange(r, `=.5 + 1e-9', .6)
(999,015 real changes made)

. replace v = 5 if inrange(r, `=.6 + 1e-9', 1)
(4,001,696 real changes made)

. ta v

          v |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |  1,998,376       19.98       19.98
          2 |  3,000,913       30.01       49.99
          4 |    999,015        9.99       59.98
          5 |  4,001,696       40.02      100.00
------------+-----------------------------------
      Total | 10,000,000      100.00

Comment

Luca Gagliardone

Join Date: Oct 2016

Posts: 54
#3

09 Mar 2017, 07:26

Dear Stephen,
thanks for the help.

You are definitely going in the right direction, but I am looking for a method to select one value of the random variable, in order to be able to make a "Monte-carlo"-style simulation using my empirically estimated distribution.

For example, using the code above, the last step should be choosing one specific observation of the variable "v" selecting randomly: we will get then one value with approximately the right frequencies.
Also, if I need to estimate the sample mean of my simulations (to approximate the sample mean of the random variable v) in a Monte-carlo fashion, I imagine it is incredibly inefficient to simulate thousands of variables "v"s with the approach above, choose randomly one realisation each and then average over them. There exists another way?
Comment
Luca Gagliardone

Join Date: Oct 2016

Posts: 54
#4

09 Mar 2017, 07:38

To be more clear, the best possible result would be to have a function that works this way:

1) Takes as inputs 3 things:
a) a vector of realisations.
b) a vector of probabilities.
where the two vectors have the same length.
c) an integer n.
2) Gives back a vector of length n where each entry is:
a) one and only one realisation from the vector of realisations.
b) selected randomly using as a "weight" the vector of probabilities.
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#5

09 Mar 2017, 07:42

Sorry, but your specification is confusing to me. Do you have a dataset full of observations or not, and you want to allocate values of v to each obs randomly? It's confusing (to me) for you to refer to getting "one value with approximately the right frequencies." One value? Per Obs? Right frequencies over what sample -- over all the replications?

Whatever, the basic "trick" is likely to involve using the same machinery, i.e. based on comparisons of some number with values generated from a random uniform distribution.

If you only have one obs in each replication, then I suspect that you need, within each iteration, to use the if prefix (not if qualifier) as above: if draw is in a particular range, value is set to something; if in another range, then to another number.

But then, again, aren't the multiple rows in my artificial dataset just like a set of replications (one per row)?

All in all, you can see that I am confused. You're going to have to spell things out in a much more detailed and basic fashion for people like me. (And I have to go off and do other stuff now ...)
Comment
Luca Gagliardone

Join Date: Oct 2016

Posts: 54
#6

09 Mar 2017, 08:38

I apologise for being confusing.
I will try to spell things out better.

I have a a matrix of the form:

5 7 9 10 20

0.1 0 0.3 0.4 0.2

where the first row are the values that the random variable can take and the second row are the associated probabilities.

I am looking for a function that gives me ONE of the values among {5,7,9,10,20} randomly selected according to the associated probabilities.
Hope this is clear now.
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3460
#7

09 Mar 2017, 11:15

http://www.maartenbuis.nl/publications/discrete.html

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Kreshna Gopal (StataCorp)

StataCorp Employee

Join Date: Apr 2014

Posts: 43
#8

09 Mar 2017, 11:22

You can also use Mata function rdiscrete(r, c, p) that returns an r x c matrix containing random variates from the discrete distribution specified by the probabilities in the vector p of length k. The range of the discrete variates is 1, 2, ..., k.

If you are working with matrices in Stata, you can use Mata function st_matrix() to obtain/put matrices from/into Stata.

Using your example matrix (say M), here is code that hopefully produces a vector V that you want (n = 10 and k = 5).

Code:

mat M = (5,7,9,10,20\.2,.3,0,.1,.4) mata: M = st_matrix("M") R = M[1,.] // Realisations in row 1 P = M[2,.] // Probabilities in row 2 n = 10 D = rdiscrete(1,n,P) // n discrete random variates from 1 to k V = R[1, D] // V[i] = R[D[i]], 1 <= i <= n st_matrix("V", V) end mat list V

I hope this helps.

-- Kreshna
Comment

5	7	9	10	20
0.1	0	0.3	0.4	0.2

Announcement

Simulating from an empirical distribution

Comment

Comment

Comment

Comment

Comment

Comment

Comment