Simulated data

Olivia Emma

Join Date: Dec 2022

Posts: 102
#1

29 Mar 2023, 03:23

Hello

Is there code to generate (simulate) data similar to an existing dataset?

Thanks
Joseph Coveney

Join Date: Apr 2014

Posts: 4433
#2

29 Mar 2023, 06:29

I think that there are methods alluded to previously on the List that use quantiles of the original data to generate artificial data that follow these quantiles. In addition, there is a user-written suite of commands on SSC that can also be used for such a purpose.

Code:

search jnsn

will bring it up.

Its use is a two-step procedure. First, you fit a so-called Johnson distribution to your existing data using one of two commands in the suite, either jnsw or jnsn. Then, with the parameters yielded in the first step, you can use the third command in the suite, ajv, to generate artificial data that follow the fitted Johnson distribution and so mimic the original data.
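
The quantile idea mentioned at the start of this post can be sketched with official commands only. This is not the SSC suite and not necessarily the method alluded to on the List; it is a minimal illustration, assuming one continuous variable, that draws values from the linearly interpolated empirical quantile function. The auto dataset shipped with Stata stands in for the real data, and the names pos, lo, and price_sim are hypothetical.

```stata
* Sketch: draw artificial values from the interpolated empirical quantile
* function of one continuous variable (assumption: no missing values).
* auto.dta ships with Stata; pos, lo, and price_sim are illustrative names.
sysuse auto, clear
set seed 2023

sort price                                      // order statistics of the original
generate double pos = 1 + runiform()*(_N - 1)   // random position strictly between 1 and _N
generate long lo = floor(pos)                   // index of the order statistic below pos
* linear interpolation between neighbouring order statistics
generate double price_sim = price[lo] + (pos - lo)*(price[lo + 1] - price[lo])

summarize price price_sim
```

Because every draw lies between two neighbouring observed values, the artificial variable never leaves the observed range. The sketch treats each variable separately, so it says nothing about correlations between variables.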
Dirk Enzmann

Join Date: Apr 2014

Posts: 547
#3

29 Mar 2023, 08:36

I am not exactly sure what you mean by generating (simulating) "similar" data. If you have summary statistics and want to generate datasets having the same summary statistics, it may be that

Code:

h corr2data

(have a look at the "complete PDF manual entry" therein) will do what you want.
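
As a minimal sketch of that use, assuming you only have means, standard deviations, and a correlation matrix, something like the following should work. The numbers and the variable names x and y are made up for illustration.

```stata
* Sketch: build a dataset that reproduces given summary statistics.
* All numbers and the names x and y are hypothetical.
clear
matrix M = (10, 20)              // target means
matrix S = (2, 3)                // target standard deviations
matrix C = (1, .5 \ .5, 1)       // target correlation matrix

corr2data x y, n(200) means(M) sds(S) corr(C) double seed(2023)

summarize x y
correlate x y
```

Unlike -drawnorm-, which samples from a population with these parameters, -corr2data- constructs data whose sample means, standard deviations, and correlations match the requested values.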

Olivia Emma

Join Date: Dec 2022

Posts: 102
#4

29 Mar 2023, 09:48

Thanks

I have data like this:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str41 country double(latitude longitude) float(head urban female age yearofbirth marital education)
"ARE"  -11.7108 43.24825 1 1 0 18 1985 1 4
"ARE" -11.85792 43.39094 0 0 1 18 1985 1 3
"ARE" -11.86671  43.3963 1 0 0 28 1975 1 4
"ARE" -11.86678 43.49342 0 1 1 30 1973 1 4
"ARE" -11.60353 43.36902 1 0 0 18 1985 1 3
"ARE" -11.67971 43.27585 0 0 0 18 1985 1 3
"ARE" -11.69256 43.25417 0 1 0 45 1958 4 1
"ARE" -11.70818  43.2498 1 1 0 18 1985 1 4
"ARE" -11.69469 43.41869 0 0 0 21 1982 1 2
"ARE" -11.87008  43.4938 1 1 1 40 1963 2 1
"ARE" -11.71414 43.42326 0 0 0 60 1943 2 1
"ARE" -11.85266 43.34328 1 0 0 40 1963 2 1
"ARE" -11.60293 43.37102 0 0 1 20 1983 1 2
"ARE" -11.88072 43.43332 0 0 1 36 1967 2 1
"ARE" -11.75045 43.25045 0 0 1 60 1943 2 1
"ARE" -11.69383 43.25383 1 1 0 30 1973 1 7
"ARE" -11.88087 43.43171 0 0 0 18 1985 1 2
"ARE" -11.73903 43.23956 1 1 0 27 1976 1 2
"ARE" -11.91004 43.45481 0 0 0 35 1968 2 1
"ARE" -11.65126 43.28531 0 0 1 27 1976 2 1
"ARE" -11.85213 43.42586 0 0 1 18 1985 2 1
"ARE" -11.69311 43.25381 0 1 1 26 1977 4 1
"ARE" -11.74454 43.24083 0 1 1 18 1985 1 3
"ARE" -11.73498 43.26755 1 0 1 60 1943 5 1
"ARE" -11.91373 43.49727 0 0 1 19 1984 1 2
"ARE" -11.50367 43.38698 1 1 0 39 1964 2 3
"ARE" -11.84924 43.31779 0 0 1 29 1974 2 3
"ARE" -11.85166 43.42765 1 0 1 42 1961 5 1
"ARE" -11.73754 43.25216 0 0 1 59 1944 2 1
"ARE" -11.85891 43.39177 0 0 0 49 1954 2 1
"ARE" -11.69274  43.2537 0 1 1 55 1948 4 1
"ARE" -11.69157  43.2548 1 1 0 38 1965 2 6
"ARE" -11.64753 43.39556 1 0 1 50 1953 4 1
"ARE" -11.88863 43.40805 0 0 0 47 1956 2 2
"ARE" -11.88742 43.40673 1 0 0 62 1941 2 1
"ARE" -11.85351 43.42695 0 0 1 58 1945 5 1
"ARE" -11.71981 43.26424 1 0 0 21 1982 1 4
"ARE" -11.65853  43.2899 0 1 0 23 1980 1 2
"ARE" -11.69486 43.41714 1 0 1 60 1943 4 1
"ARE" -11.71445 43.42279 0 0 0 20 1983 1 2
"ARE" -11.74405 43.24277 0 1 1 35 1968 4 1
"ARE" -11.56513 43.27083 1 1 1 60 1943 4 1
"ARE" -11.66373 43.27118 0 1 1 28 1975 2 2
"ARE"  -11.8516 43.34042 1 0 1 46 1957 2 1
"ARE" -11.70247 43.25217 1 1 0 25 1978 2 3
"ARE" -11.85817 43.44834 1 0 1 28 1975 3 1
"ARE"  -11.6492 43.28407 0 0 1 35 1968 2 2
"ARE" -11.86664 43.49059 0 1 1 19 1984 1 3
"ARE" -11.64689  43.3949 0 0 1 22 1981 2 2
"ARE"  -11.5462 43.38869 0 0 0 30 1973 2 1
"ARE" -11.84815  43.3186 0 0 1 65 1938 2 1
"ARE" -11.73941 43.24987 0 0 1 42 1961 2 1
"ARE" -11.64958  43.2842 1 0 1 50 1953 2 1
"ARE" -11.68033 43.27621 0 0 0 30 1973 1 3
"ARE" -11.71141 43.24925 1 1 0 44 1959 4 1
"ARE" -11.50382 43.38697 1 1 0 40 1963 4 2
"ARE" -11.85045 43.31834 0 0 1 55 1948 2 1
"ARE" -11.64998 43.28463 1 0 1 60 1943 4 1
"ARE" -11.85431 43.34028 0 0 0 23 1980 1 4
"ARE" -11.69753 43.25393 1 1 0 43 1960 2 7
"ARE"  -11.6037 43.36958 0 0 0 45 1958 2 1
"ARE" -11.65626 43.28934 0 1 1 45 1958 2 2
"ARE" -11.60386  43.3701 1 0 1 28 1975 2 1
"ARE" -11.87734 43.40853 0 0 1 33 1970 2 3
"ARE" -11.64988 43.28502 1 0 0 20 1983 1 4
"ARE" -11.56465 43.27067 1 1 0 37 1966 1 2
"ARE" -11.50311 43.38818 0 1 1 20 1983 1 4
"ARE"  -11.7519 43.25198 0 0 0 40 1963 2 1
"ARE" -11.69407 43.41896 1 0 1 22 1981 2 1
"ARE" -11.85205 43.34133 0 0 1 34 1969 2 2
"ARE"  -11.6022 43.37045 1 0 1 30 1973 2 1
"ARE" -11.64773 43.39732 1 0 0 25 1978 2 3
"ARE"  -11.8367 43.31347 0 0 1 63 1940 5 1
"ARE"  -11.5658 43.27178 1 1 1 33 1970 1 5
"ARE" -11.70702 43.25002 0 1 1 49 1954 2 1
"ARE" -11.71207 43.25108 0 1 1 34 1969 2 2
"ARE" -11.80552 43.27995 0 1 0 23 1980 1 2
"ARE"  -11.7499 43.25055 1 0 1 28 1975 2 1
"ARE" -11.73746 43.25236 1 0 1 19 1984 1 3
"ARE" -11.71018 43.25703 0 1 0 21 1982 1 5
"ARE" -11.60445 43.37008 1 0 1 19 1984 2 1
"ARE" -11.73924   43.252 0 0 0 40 1963 2 1
"ARE" -11.69697 43.25345 0 1 0 37 1966 2 2
"ARE" -11.85242  43.4259 0 0 1 33 1970 2 1
"ARE" -11.68076  43.2756 1 0 1 58 1945 5 1
"ARE" -11.67991 43.27591 0 0 1 54 1949 2 1
"ARE" -11.91072 43.45356 1 0 1 63 1940 5 1
"ARE" -11.65724 43.28977 0 1 1 23 1980 2 1
"ARE"  -11.6746  43.2644 1 1 0 40 1963 2 7
"ARE" -11.86532 43.39444 0 0 1 35 1968 4 1
"ARE" -11.65781 43.28765 0 1 0 26 1977 1 1
"ARE" -11.70823 43.24981 1 1 1 30 1973 2 1
"ARE" -11.56432 43.27123 1 1 0 64 1939 2 1
"ARE" -11.85862 43.39246 0 0 0 48 1955 2 1
"ARE" -11.50311 43.38792 0 1 0 42 1961 2 4
"ARE" -11.91133 43.45293 1 0 1 37 1966 2 1
"ARE" -11.71468 43.42195 1 0 0 51 1952 3 1
"ARE" -11.87768 43.40717 1 0 1 28 1975 2 3
"ARE" -11.56494 43.27161 1 1 0 21 1982 1 5
"ARE" -11.71931 43.26345 0 0 1 57 1946 2 1
end

and I want to generate similar simulated data.


Maarten Buis

Join Date: Mar 2014

Posts: 3464
#5

29 Mar 2023, 13:21

Similar in what way?

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Olivia Emma

Join Date: Dec 2022

Posts: 102
#6

29 Mar 2023, 13:43

Similar in the sense that if a variable takes the values 0 or 1, the new variable should also take the values 0 and 1, and it should have a mean, SD, and other properties similar to those in the old data.

Dirk Enzmann

Join Date: Apr 2014

Posts: 547
#7

30 Mar 2023, 07:19

Maarten Buis in #5 is completely right to ask "Similar in what way?" As far as I can see, it is difficult to "simulate" the data such that variables stay binary (0 or 1) if they are binary and stay counts if they are counts (e.g. age and yearofbirth), while at the same time the means, variances, and correlations are the same (or similar).
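
One simple option that does keep binary variables binary and counts as counts is to resample whole observations with replacement, for example with official -bsample-. This is only a sketch under a strong assumption: every "simulated" row is a copy of an observed row, so it may not suit every purpose. The auto dataset shipped with Stata stands in for the real data here.

```stata
* Sketch: resample whole rows with replacement using official -bsample-.
* auto.dta (shipped with Stata) is a stand-in for the real data.
sysuse auto, clear
set seed 2023
bsample                      // draws _N observations with replacement

tabulate foreign             // can only contain the original categories
summarize price mpg rep78
```

Means, standard deviations, and correlations of the resampled data will be similar to, but not identical with, those of the original.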

Without the restrictions that binary data should stay binary and counts should stay counts, an easy way using -corr2data- would be the following (note that country is a constant string variable, and that the correlation of age and yearofbirth is -1):

Code:

. mata: dat = st_data(., .)
. mata: cov = variance(dat[.,2..cols(dat)])
. mata: st_matrix("cov",cov)
. mata: m = mean(dat[.,2..cols(dat)])
. mata: st_matrix("m",m)
.
. qui count
. local n = r(N)
.
. sum

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     country |          0
    latitude |        100   -11.72968    .1097476  -11.91373  -11.50311
   longitude |        100    43.33374    .0768347   43.23956   43.49727
        head |        100         .44    .4988877          0          1
       urban |        100         .37    .4852366          0          1
-------------+---------------------------------------------------------
      female |        100         .56    .4988877          0          1
         age |        100       36.35     14.2541         18         65
 yearofbirth |        100     1966.65     14.2541       1938       1985
     marital |        100        2.13    1.134002          1          5
   education |        100        2.06    1.489492          1          7

. reg longitude-female yearofbirth-education

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(6, 93)        =      3.63
       Model |  .111011126         6  .018501854   Prob > F        =    0.0028
    Residual |   .47344186        93  .005090773   R-squared       =    0.1899
-------------+----------------------------------   Adj R-squared   =    0.1377
       Total |  .584452987        99  .005903566   Root MSE        =    .07135

------------------------------------------------------------------------------
   longitude | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        head |  -.0005422   .0153166    -0.04   0.972    -.0309578    .0298735
       urban |  -.0622105   .0161803    -3.84   0.000    -.0943414   -.0300796
      female |   .0023185   .0160624     0.14   0.886    -.0295783    .0342153
 yearofbirth |    .001136   .0006609     1.72   0.089    -.0001763    .0024484
     marital |    .006061   .0090344     0.67   0.504    -.0118795    .0240014
   education |  -.0035628   .0063164    -0.56   0.574     -.016106    .0089803
       _cons |   41.11595    1.30664    31.47   0.000     38.52122    43.71068
------------------------------------------------------------------------------
.
. clear
. corr2data latitude longitude head urban female age yearofbirth marital ///
>           education, n(`n') cov(cov) means(m) double
(obs 100)
. gen str country = "ARE"
. order country, first
.
. sum

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     country |          0
    latitude |        100   -11.72968    .1097476  -12.01081  -11.45636
   longitude |        100    43.33374    .0768347   43.13199   43.50001
        head |        100         .44    .4988877  -1.031292   1.678765
       urban |        100         .37    .4852366  -.9998499   1.767055
-------------+---------------------------------------------------------
      female |        100         .56    .4988877  -.8446221    2.21753
         age |        100       36.35     14.2541   2.581823   70.94985
 yearofbirth |        100     1966.65     14.2541    1932.05   2000.418
     marital |        100        2.13    1.134002   -.438501   5.031321
   education |        100        2.06    1.489492  -.8918127   5.436786

. reg longitude-female yearofbirth-education

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(6, 93)        =      3.63
       Model |  .111011126         6  .018501854   Prob > F        =    0.0028
    Residual |  .473441859        93  .005090773   R-squared       =    0.1899
-------------+----------------------------------   Adj R-squared   =    0.1377
       Total |  .584452986        99  .005903566   Root MSE        =    .07135

------------------------------------------------------------------------------
   longitude | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        head |  -.0005422   .0153166    -0.04   0.972    -.0309578    .0298735
       urban |  -.0622105   .0161803    -3.84   0.000    -.0943414   -.0300796
      female |   .0023185   .0160624     0.14   0.886    -.0295783    .0342153
 yearofbirth |    .001136   .0006609     1.72   0.089    -.0001763    .0024484
     marital |    .006061   .0090344     0.67   0.504    -.0118795    .0240014
   education |  -.0035628   .0063164    -0.56   0.574     -.016106    .0089803
       _cons |   41.11595    1.30664    31.47   0.000     38.52122    43.71068
------------------------------------------------------------------------------


Maarten Buis

Join Date: Mar 2014

Posts: 3464
#8

31 Mar 2023, 02:05

Originally posted by Olivia Emma:

Similar in the sense that if a variable takes the values 0 or 1, the new variable should also take the values 0 and 1, and it should have a mean, SD, and other properties similar to those in the old data.

The problem is that "and other properties" is not very precise, and details matter. If you want the simulated data to be the same as your observed data in every respect imaginable, then there is only one "simulated" dataset possible: the original data. So you have to give up something, and what you give up drives the choice of method. Dirk's solution is to preserve the means, variances, and covariances but give up on the exact distribution of all variables. That can be fine for applications where you are primarily interested in the variances and covariances (like linear regression), but that does not cover all possible applications.
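
To make the trade-off concrete, here is a minimal sketch of the opposite choice: each variable is drawn independently from its own observed values, so a binary variable stays binary and means and standard deviations stay roughly the same, but the associations between variables are given up. The auto dataset shipped with Stata and the _sim suffix are illustrative assumptions, and runiformint() requires Stata 14 or newer.

```stata
* Sketch: keep each variable's own distribution, give up the correlations.
* auto.dta and the _sim suffix are illustrative choices, not part of the thread.
sysuse auto, clear
set seed 2023
foreach v of varlist foreign rep78 mpg {
    * copy the value from a randomly chosen row of the same column
    generate `v'_sim = `v'[runiformint(1, _N)]
}

summarize foreign foreign_sim rep78 rep78_sim mpg mpg_sim
correlate mpg foreign            // association in the original data
correlate mpg_sim foreign_sim    // expected to be close to zero
```

Whether this or the -corr2data- approach is the better compromise depends on which properties matter for the intended use.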

In short, before we can tell you how to do something, we first need to know what it is that you want to do. A good place to start would be to tell us why you want that simulated dataset. If we know what you want to do with it, we can work out which properties are vital and which properties we can relax. From there we can think about how to do it in Stata.
