  • Simulated data

    Hello

    Is there code to generate (simulate) data similar to an existing dataset?

    thanks

  • #2
    I think that there are methods alluded to previously on the List that use quantiles of the original data to generate artificial data that follow these quantiles. In addition, there is a user-written suite of commands on SSC that can also be used for such a purpose.
    Code:
    search jnsn
    will bring it up.

    Its use is a two-step procedure. First, fit a so-called Johnson distribution to your existing data using one of two commands in the suite, either jnsw or jnsn. Then, with the parameters estimated in the first step, use the third command in the suite, ajv, to generate artificial data that follow the fitted Johnson distribution and so mimic the original data.
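
The quantile idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method from the earlier List posts and not the jnsn suite: it takes a single numeric variable (a made-up five-value age here) and draws new values by linear interpolation between the sorted observed values, which amounts to inverse-CDF sampling from the empirical distribution. The name age_sim and the seed are arbitrary.

```stata
* Sketch: simulate one variable from its empirical quantiles.
* The five ages below are stand-in values, not the full dataset.
clear
set seed 12345
input float age
18
28
30
45
60
end

mata:
x   = sort(st_data(., "age"), 1)         // order statistics of the observed data
n   = rows(x)
u   = runiform(n, 1) :* (n - 1) :+ 1     // random positions in (1, n)
lo  = floor(u)                           // order statistic just below each position
hi  = ceil(u)                            // order statistic just above each position
sim = x[lo] :+ (u - lo) :* (x[hi] - x[lo])   // linear interpolation between them
st_store(., st_addvar("double", "age_sim"), sim)
end

summarize age age_sim
```

Simulated values stay within the observed range, but each variable is treated on its own, so any correlation between variables is not reproduced.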



    • #3
      I am not exactly sure what you mean by "generate(simulate) a similar data". If you have summary statistics and want to generate datasets having the same summary statistics, it may be that
      Code:
      h corr2data
      (have a look at the "complete PDF manual entry" therein) will do what you want.
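
A minimal sketch of that use of corr2data, assuming only summary statistics are available. The means and standard deviations below are rounded illustrative values for two variables, and the correlation of -0.5 is invented purely for the example; substitute your own numbers.

```stata
* Sketch: build a dataset that reproduces given summary statistics.
clear
matrix m  = (36.35, 2.06)            // means of age and education (illustrative)
matrix sd = (14.25, 1.49)            // standard deviations (illustrative)
matrix C  = (1, -0.5 \ -0.5, 1)      // correlation matrix (made-up value)

corr2data age education, n(100) means(m) sds(sd) corr(C)

summarize age education
correlate age education
```

The resulting variables match the supplied means, standard deviations, and correlation, but they are continuous and roughly normal in shape regardless of what the original variables looked like.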



      • #4
        Thanks

        I have data like this:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str41 country double(latitude longitude) float(head urban female age yearofbirth marital education)
        "ARE"  -11.7108 43.24825 1 1 0 18 1985 1 4
        "ARE" -11.85792 43.39094 0 0 1 18 1985 1 3
        "ARE" -11.86671  43.3963 1 0 0 28 1975 1 4
        "ARE" -11.86678 43.49342 0 1 1 30 1973 1 4
        "ARE" -11.60353 43.36902 1 0 0 18 1985 1 3
        "ARE" -11.67971 43.27585 0 0 0 18 1985 1 3
        "ARE" -11.69256 43.25417 0 1 0 45 1958 4 1
        "ARE" -11.70818  43.2498 1 1 0 18 1985 1 4
        "ARE" -11.69469 43.41869 0 0 0 21 1982 1 2
        "ARE" -11.87008  43.4938 1 1 1 40 1963 2 1
        "ARE" -11.71414 43.42326 0 0 0 60 1943 2 1
        "ARE" -11.85266 43.34328 1 0 0 40 1963 2 1
        "ARE" -11.60293 43.37102 0 0 1 20 1983 1 2
        "ARE" -11.88072 43.43332 0 0 1 36 1967 2 1
        "ARE" -11.75045 43.25045 0 0 1 60 1943 2 1
        "ARE" -11.69383 43.25383 1 1 0 30 1973 1 7
        "ARE" -11.88087 43.43171 0 0 0 18 1985 1 2
        "ARE" -11.73903 43.23956 1 1 0 27 1976 1 2
        "ARE" -11.91004 43.45481 0 0 0 35 1968 2 1
        "ARE" -11.65126 43.28531 0 0 1 27 1976 2 1
        "ARE" -11.85213 43.42586 0 0 1 18 1985 2 1
        "ARE" -11.69311 43.25381 0 1 1 26 1977 4 1
        "ARE" -11.74454 43.24083 0 1 1 18 1985 1 3
        "ARE" -11.73498 43.26755 1 0 1 60 1943 5 1
        "ARE" -11.91373 43.49727 0 0 1 19 1984 1 2
        "ARE" -11.50367 43.38698 1 1 0 39 1964 2 3
        "ARE" -11.84924 43.31779 0 0 1 29 1974 2 3
        "ARE" -11.85166 43.42765 1 0 1 42 1961 5 1
        "ARE" -11.73754 43.25216 0 0 1 59 1944 2 1
        "ARE" -11.85891 43.39177 0 0 0 49 1954 2 1
        "ARE" -11.69274  43.2537 0 1 1 55 1948 4 1
        "ARE" -11.69157  43.2548 1 1 0 38 1965 2 6
        "ARE" -11.64753 43.39556 1 0 1 50 1953 4 1
        "ARE" -11.88863 43.40805 0 0 0 47 1956 2 2
        "ARE" -11.88742 43.40673 1 0 0 62 1941 2 1
        "ARE" -11.85351 43.42695 0 0 1 58 1945 5 1
        "ARE" -11.71981 43.26424 1 0 0 21 1982 1 4
        "ARE" -11.65853  43.2899 0 1 0 23 1980 1 2
        "ARE" -11.69486 43.41714 1 0 1 60 1943 4 1
        "ARE" -11.71445 43.42279 0 0 0 20 1983 1 2
        "ARE" -11.74405 43.24277 0 1 1 35 1968 4 1
        "ARE" -11.56513 43.27083 1 1 1 60 1943 4 1
        "ARE" -11.66373 43.27118 0 1 1 28 1975 2 2
        "ARE"  -11.8516 43.34042 1 0 1 46 1957 2 1
        "ARE" -11.70247 43.25217 1 1 0 25 1978 2 3
        "ARE" -11.85817 43.44834 1 0 1 28 1975 3 1
        "ARE"  -11.6492 43.28407 0 0 1 35 1968 2 2
        "ARE" -11.86664 43.49059 0 1 1 19 1984 1 3
        "ARE" -11.64689  43.3949 0 0 1 22 1981 2 2
        "ARE"  -11.5462 43.38869 0 0 0 30 1973 2 1
        "ARE" -11.84815  43.3186 0 0 1 65 1938 2 1
        "ARE" -11.73941 43.24987 0 0 1 42 1961 2 1
        "ARE" -11.64958  43.2842 1 0 1 50 1953 2 1
        "ARE" -11.68033 43.27621 0 0 0 30 1973 1 3
        "ARE" -11.71141 43.24925 1 1 0 44 1959 4 1
        "ARE" -11.50382 43.38697 1 1 0 40 1963 4 2
        "ARE" -11.85045 43.31834 0 0 1 55 1948 2 1
        "ARE" -11.64998 43.28463 1 0 1 60 1943 4 1
        "ARE" -11.85431 43.34028 0 0 0 23 1980 1 4
        "ARE" -11.69753 43.25393 1 1 0 43 1960 2 7
        "ARE"  -11.6037 43.36958 0 0 0 45 1958 2 1
        "ARE" -11.65626 43.28934 0 1 1 45 1958 2 2
        "ARE" -11.60386  43.3701 1 0 1 28 1975 2 1
        "ARE" -11.87734 43.40853 0 0 1 33 1970 2 3
        "ARE" -11.64988 43.28502 1 0 0 20 1983 1 4
        "ARE" -11.56465 43.27067 1 1 0 37 1966 1 2
        "ARE" -11.50311 43.38818 0 1 1 20 1983 1 4
        "ARE"  -11.7519 43.25198 0 0 0 40 1963 2 1
        "ARE" -11.69407 43.41896 1 0 1 22 1981 2 1
        "ARE" -11.85205 43.34133 0 0 1 34 1969 2 2
        "ARE"  -11.6022 43.37045 1 0 1 30 1973 2 1
        "ARE" -11.64773 43.39732 1 0 0 25 1978 2 3
        "ARE"  -11.8367 43.31347 0 0 1 63 1940 5 1
        "ARE"  -11.5658 43.27178 1 1 1 33 1970 1 5
        "ARE" -11.70702 43.25002 0 1 1 49 1954 2 1
        "ARE" -11.71207 43.25108 0 1 1 34 1969 2 2
        "ARE" -11.80552 43.27995 0 1 0 23 1980 1 2
        "ARE"  -11.7499 43.25055 1 0 1 28 1975 2 1
        "ARE" -11.73746 43.25236 1 0 1 19 1984 1 3
        "ARE" -11.71018 43.25703 0 1 0 21 1982 1 5
        "ARE" -11.60445 43.37008 1 0 1 19 1984 2 1
        "ARE" -11.73924   43.252 0 0 0 40 1963 2 1
        "ARE" -11.69697 43.25345 0 1 0 37 1966 2 2
        "ARE" -11.85242  43.4259 0 0 1 33 1970 2 1
        "ARE" -11.68076  43.2756 1 0 1 58 1945 5 1
        "ARE" -11.67991 43.27591 0 0 1 54 1949 2 1
        "ARE" -11.91072 43.45356 1 0 1 63 1940 5 1
        "ARE" -11.65724 43.28977 0 1 1 23 1980 2 1
        "ARE"  -11.6746  43.2644 1 1 0 40 1963 2 7
        "ARE" -11.86532 43.39444 0 0 1 35 1968 4 1
        "ARE" -11.65781 43.28765 0 1 0 26 1977 1 1
        "ARE" -11.70823 43.24981 1 1 1 30 1973 2 1
        "ARE" -11.56432 43.27123 1 1 0 64 1939 2 1
        "ARE" -11.85862 43.39246 0 0 0 48 1955 2 1
        "ARE" -11.50311 43.38792 0 1 0 42 1961 2 4
        "ARE" -11.91133 43.45293 1 0 1 37 1966 2 1
        "ARE" -11.71468 43.42195 1 0 0 51 1952 3 1
        "ARE" -11.87768 43.40717 1 0 1 28 1975 2 3
        "ARE" -11.56494 43.27161 1 1 0 21 1982 1 5
        "ARE" -11.71931 43.26345 0 0 1 57 1946 2 1
        end


        and I want to generate similar simulated data.



        • #5
          Similar in what way?
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            Similar in the sense that if a variable takes the values 0 or 1, the new variable should also take the values 0 and 1, and should have a mean, SD, and other properties similar to those in the old data.



            • #7
              Maarten Buis in #5 is completely right to ask "Similar in what way?" As far as I can see, it is difficult to "simulate" the data such that binary variables (0 or 1) stay binary and counts (e.g. age and yearofbirth) stay counts, while at the same time the means, variances, and correlations are the same (or similar).

              Without the restrictions that binary data should stay binary and counts should stay counts, an easy way using -corr2data- would be the following (note that country is a constant string variable, and that the correlation of age and yearofbirth is -1):
              Code:
              . mata: dat = st_data(., .)
              . mata: cov = variance(dat[.,2..cols(dat)])
              . mata: st_matrix("cov",cov)
              . mata: m = mean(dat[.,2..cols(dat)])
              . mata: st_matrix("m",m)
              .
              . qui count
              . local n = r(N)
              .
              . sum
              
                  Variable |        Obs        Mean    Std. dev.       Min        Max
              -------------+---------------------------------------------------------
                   country |          0
                  latitude |        100   -11.72968    .1097476  -11.91373  -11.50311
                 longitude |        100    43.33374    .0768347   43.23956   43.49727
                      head |        100         .44    .4988877          0          1
                     urban |        100         .37    .4852366          0          1
              -------------+---------------------------------------------------------
                    female |        100         .56    .4988877          0          1
                       age |        100       36.35     14.2541         18         65
               yearofbirth |        100     1966.65     14.2541       1938       1985
                   marital |        100        2.13    1.134002          1          5
                 education |        100        2.06    1.489492          1          7
              
              . reg longitude-female yearofbirth-education
              
                    Source |       SS           df       MS      Number of obs   =       100
              -------------+----------------------------------   F(6, 93)        =      3.63
                     Model |  .111011126         6  .018501854   Prob > F        =    0.0028
                  Residual |   .47344186        93  .005090773   R-squared       =    0.1899
              -------------+----------------------------------   Adj R-squared   =    0.1377
                     Total |  .584452987        99  .005903566   Root MSE        =    .07135
              
              ------------------------------------------------------------------------------
                 longitude | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                      head |  -.0005422   .0153166    -0.04   0.972    -.0309578    .0298735
                     urban |  -.0622105   .0161803    -3.84   0.000    -.0943414   -.0300796
                    female |   .0023185   .0160624     0.14   0.886    -.0295783    .0342153
               yearofbirth |    .001136   .0006609     1.72   0.089    -.0001763    .0024484
                   marital |    .006061   .0090344     0.67   0.504    -.0118795    .0240014
                 education |  -.0035628   .0063164    -0.56   0.574     -.016106    .0089803
                     _cons |   41.11595    1.30664    31.47   0.000     38.52122    43.71068
              ------------------------------------------------------------------------------
              .
              . clear
              . corr2data latitude longitude head urban female age yearofbirth marital ///
              >           education, n(`n') cov(cov) means(m) double
              (obs 100)
              . gen str country = "ARE"
              . order country, first
              .
              . sum
              
                  Variable |        Obs        Mean    Std. dev.       Min        Max
              -------------+---------------------------------------------------------
                   country |          0
                  latitude |        100   -11.72968    .1097476  -12.01081  -11.45636
                 longitude |        100    43.33374    .0768347   43.13199   43.50001
                      head |        100         .44    .4988877  -1.031292   1.678765
                     urban |        100         .37    .4852366  -.9998499   1.767055
              -------------+---------------------------------------------------------
                    female |        100         .56    .4988877  -.8446221    2.21753
                       age |        100       36.35     14.2541   2.581823   70.94985
               yearofbirth |        100     1966.65     14.2541    1932.05   2000.418
                   marital |        100        2.13    1.134002   -.438501   5.031321
                 education |        100        2.06    1.489492  -.8918127   5.436786
              
              . reg longitude-female yearofbirth-education
              
                    Source |       SS           df       MS      Number of obs   =       100
              -------------+----------------------------------   F(6, 93)        =      3.63
                     Model |  .111011126         6  .018501854   Prob > F        =    0.0028
                  Residual |  .473441859        93  .005090773   R-squared       =    0.1899
              -------------+----------------------------------   Adj R-squared   =    0.1377
                     Total |  .584452986        99  .005903566   Root MSE        =    .07135
              
              ------------------------------------------------------------------------------
                 longitude | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                      head |  -.0005422   .0153166    -0.04   0.972    -.0309578    .0298735
                     urban |  -.0622105   .0161803    -3.84   0.000    -.0943414   -.0300796
                    female |   .0023185   .0160624     0.14   0.886    -.0295783    .0342153
               yearofbirth |    .001136   .0006609     1.72   0.089    -.0001763    .0024484
                   marital |    .006061   .0090344     0.67   0.504    -.0118795    .0240014
                 education |  -.0035628   .0063164    -0.56   0.574     -.016106    .0089803
                     _cons |   41.11595    1.30664    31.47   0.000     38.52122    43.71068
              ------------------------------------------------------------------------------
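
If binary variables have to stay binary and integers have to stay integers, one alternative (not what the log above does) is to resample the observed rows with replacement. The sketch below uses bsample and assumes just the first five rows of the posted data as a stand-in. Every simulated row is a copy of an observed row, so variable types and valid combinations of values are preserved, but no new values are created and the means and standard deviations will be similar rather than identical.

```stata
* Sketch: resample observed rows with replacement.
* Only the first five posted observations are used here for brevity.
clear
input byte(head urban female) float age
1 1 0 18
0 0 1 18
1 0 0 28
0 1 1 30
1 0 0 18
end

set seed 2024
bsample                      // replaces the data with _N rows drawn with replacement

summarize head urban female age
```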



              • #8
                Originally posted by Olivia Emma
                Similar in the sense that if a variable takes the values 0 or 1, the new variable should also take the values 0 and 1, and should have a mean, SD, and other properties similar to those in the old data.
                The problem is that "and other properties" is not very precise, and details matter. If you want the simulated data to be the same as your observed data in every respect imaginable, then there is only one "simulated" dataset possible: the original data. So you have to give up something, and what you give up drives the choice of method. Dirk's solution is to preserve the means, variances, and covariances but give up on the exact distribution of each variable. That can be fine for applications where you are primarily interested in the variances and covariances (like linear regression), but it does not cover all possible applications.
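
To illustrate giving up a different property, here is a minimal sketch that keeps three indicator variables binary, with expected means equal to the observed means reported in #7 (.44, .37, and .56), but draws each one independently, so all correlations between them are discarded. The seed is arbitrary, and realized sample means will vary around the targets.

```stata
* Sketch: independent Bernoulli draws that keep variables binary.
clear
set seed 2024
set obs 100

generate byte head   = runiform() < 0.44   // P(head = 1)   = .44
generate byte urban  = runiform() < 0.37   // P(urban = 1)  = .37
generate byte female = runiform() < 0.56   // P(female = 1) = .56

summarize head urban female
```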

                In short, before we can tell you how to do something, we first need to know what it is that you want to do. A good place to start would be to tell us why you want that simulated dataset. If we know what you want to do with it, we can work out which properties are vital and which we can relax. From there we can think about how to do it in Stata.
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------
