
  • Simulating data for logistic regression with categorical variables

    Dear Statalist members,

    I would like to test the effect of sample size on the standard errors of interaction effects. The model results from my survey data show a pattern of interaction effects, but the interaction effects do not reach statistical significance. I am interested in finding out how large the sample size would need to be for the interaction effects to reach statistical significance.

    The idea is to generate a new data set with the same distributions, correlation matrix, and regression coefficients as the real data, but with a larger sample size at which the interaction effects of interest reach statistical significance.

    I may need to consider complex survey design as well.

    I have considered the following options.

    1. The command -corr2data- would have been ideal if it worked well with logistic regression and categorical variables.

    2. Following Buis's discussion (i.e., M. L. Buis (2007), "Stata tip 48: Discrete uses for uniform()"), I was able to simulate a data set for logistic regression with specified distributions, but I failed to replicate the regression coefficients: the coefficients in the simulated data set only approximate those specified. I cannot reproduce the correlation matrix either. (A minimal sketch of the uniform() technique appears after this list.)

    The approach is similar to the one in this post: https://www.stata.com/statalist/arch.../msg00018.html

    3. A vaguer idea is to use probit regression: I may simulate a data set using corr2data and transform the outcome variable using the probit link function. It is a long shot and I have not been able to figure out how to do it yet. (A rough sketch of this idea also appears below.)
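
    A minimal sketch of the uniform() technique from the Stata tip, for a hypothetical three-category variable with population proportions .3/.5/.2 (the cutpoints are the cumulative proportions):
    Code:
    clear
    set obs 10000
    set seed 12345
    generate u = runiform()
    * cumulative cutpoints .3 and .8 yield categories with proportions .3, .5, .2
    generate x = cond(u < .3, 1, cond(u < .8, 2, 3))
    tabulate x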
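
    And a rough sketch of option 3 itself, assuming two standardized continuous regressors with a hypothetical correlation of 0.3 and hypothetical coefficients; corr2data fixes the regressor correlation exactly, and thresholding a latent index with a standard normal error yields a probit outcome:
    Code:
    matrix C = (1, .3 \ .3, 1)
    corr2data x1 x2, n(10000) corr(C) clear
    * latent index plus standard normal error, thresholded at 0 (probit form)
    generate ystar = -0.5 + 0.8*x1 + 0.4*x2 + rnormal()
    generate y = ystar > 0
    probit y x1 x2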


    Any suggestion is appreciated.

    The logistic regression results I wish to simulate:
    Code:
    . svyset psu  [pw=xw], strata(strata) singleunit(scaled)
    
          pweight: xw
              VCE: linearized
      Single unit: scaled
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>
    
    . svy: logit y i.x1##i.x2 i.x3 c.x4
    (running logit on estimation sample)
    
    Survey: Logistic regression
    
    Number of strata   =     1,630                  Number of obs     =      7,355
    Number of PSUs     =     3,232                  Population size   = 7,976.7239
                                                    Design df         =      1,602
                                                    F(  13,   1590)   =      12.36
                                                    Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
                 |             Linearized
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            1.x1 |   .3586926   .1260393     2.85   0.004     .1114734    .6059119
                 |
              x2 |
              3  |   .0491903   .1609868     0.31   0.760    -.2665767    .3649573
              4  |   .1623764   .1383986     1.17   0.241     -.109085    .4338377
              7  |   .0937721   .1003796     0.93   0.350    -.1031171    .2906613
                 |
           x1#x2 |
            1 3  |   .0348744   .2815246     0.12   0.901    -.5173208    .5870697
            1 4  |   .3241205   .2281414     1.42   0.156    -.1233666    .7716076
            1 7  |  -.0562443   .2055027    -0.27   0.784    -.4593267    .3468381
                 |
              x3 |
              1  |    .109137   .1153704     0.95   0.344    -.1171559    .3354298
              2  |   .4191621   .1126151     3.72   0.000     .1982738    .6400505
              3  |   .4574391   .1259686     3.63   0.000     .2103585    .7045197
              4  |   .8478119   .1286161     6.59   0.000     .5955382    1.100085
              5  |   1.051244   .1458429     7.21   0.000     .7651811    1.337307
                 |
              x4 |   .1644617   .0746806     2.20   0.028     .0179798    .3109435
           _cons |    -1.4128   .6050882    -2.33   0.020    -2.599648   -.2259526
    ------------------------------------------------------------------------------
    Note: Variance scaled to handle strata with a single sampling unit.
    I use Stata 15, Windows 64-bit.

    Many thanks.
    Min

  • #2
    If you can't find anything fancy, the code below suggests an approach that may help you approximate what you want.
    Code:
    sysuse auto, clear
    generate highpr = price>7000
    generate int wgt = 1
    forvalues i=1/10 {
        quietly replace wgt = `i'
        quietly logit highpr c.mpg##i.foreign [fw=wgt]
        matrix t = r(table)
        local p = t[rownumb(t,"pvalue"),colnumb(t,"highpr:1.foreign#c.mpg")]
        display %4.0f `i' %9.3f `p'
        }
    Code:
       1    0.444
       2    0.279
       3    0.185
       4    0.126
       5    0.087
       6    0.061
       7    0.043
       8    0.030
       9    0.022
      10    0.015


    • #3
      Originally posted by William Lisowski (post #2; code and output omitted):
      If you can't find anything fancy, the code below suggests an approach that may help you approximate what you want.

      Hi William,

      Many thanks for your reply.

      Sorry, I did not make myself clear. I have no problem extracting the estimates from a logit regression. The problem is that I do not know how to simulate a data set with a specified distribution and specified logistic regression coefficients.

      For example, the sample size in my survey data is 7,355 and the interaction effect is not significant. I would like to know whether the interaction would be significant if the sample were five or ten times as large.

      Thank you.

      Regards,
      Min



      • #4
        An alternative for this ‘problem’ could be a Bayesian approach.
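
          For instance, Stata 15's bayes prefix can wrap the same model. A minimal sketch using the variable names from post #1 (note that the prefix uses Stata's default priors and does not carry over the svyset design):
          Code:
          bayes: logit y i.x1##i.x2 i.x3 c.x4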
        Best regards,

        Marcos



        • #5
          I am not very clear on what the end goal is here.

          For example, would you not achieve your end goal if you use -expand- to, say, make your sample 2, 3, 4, etc. times bigger than now, and then simply re-estimate your model to see what happens with the standard errors?
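
          A minimal sketch of the -expand- idea (the file name and the factor of 5 are hypothetical; the model and svyset settings are those from post #1):
          Code:
          use mysurvey, clear    // hypothetical file, already svyset as in post #1
          expand 5               // each observation now appears 5 times
          svy: logit y i.x1##i.x2 i.x3 c.x4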

          As to your question of how to generate data from a model such as probit or logit, you have at least two complications that I am not sure how to deal with: you have survey data, and you have correlations among regressors that you want to replicate, but your regressors are discrete, so it is not clear whether -corr2data- would work here.

          Absent these complications, it is easy to generate data from a probit or logit using the threshold/index form of the model. Say probit (logit is the same, but with a logistically distributed error):

          Code:
          . clear
          
          . set obs 300
          number of observations (_N) was 0, now 300
          
          . gen x = runiform()
          
          . gen ystar = 1 + .5*x + rnormal()
          
          . gen y = ystar>0
          
          . probit y x, nolog noheader
          ------------------------------------------------------------------------------
                     y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                     x |   .6335745   .3551178     1.78   0.074    -.0624436    1.329593
                 _cons |   .9540644   .1844389     5.17   0.000     .5925708    1.315558
          ------------------------------------------------------------------------------
          
          . clear
          
          . set obs 300000
          number of observations (_N) was 0, now 300,000
          
          . gen x = runiform()
          
          . gen ystar = 1 + .5*x + rnormal()
          
          . gen y = ystar>0
          
          . probit y x, nolog noheader
          ------------------------------------------------------------------------------
                     y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                     x |   .5043405   .0107321    46.99   0.000      .483306    .5253749
                 _cons |   .9968994   .0058136   171.48   0.000     .9855049    1.008294
          ------------------------------------------------------------------------------
          Note that the estimated parameters are the same (up to sampling error) as the population parameters that I used.
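
          A minimal sketch of the logit analogue mentioned above, drawing the logistic error by its inverse CDF, ln(u/(1-u)) for uniform u (everything else as in the probit example):
          Code:
          clear
          set obs 300000
          gen x = runiform()
          gen u = runiform()
          gen ystar = 1 + .5*x + ln(u/(1-u))   // standard logistic error
          gen y = ystar > 0
          logit y x, nolog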



          • #6
            Apparently my code in post #2 was more subtle than I realized, since neither post #3 nor post #5 fully understood it. From post #5:

            For example, would you not achieve your end goal if you use -expand- to, say, make your sample 2, 3, 4, etc. times bigger than now, and then simply re-estimate your model to see what happens with the standard errors?
            That was exactly what my original code did, until I saw a more efficient approach. The key lines in my code are
            Code:
            forvalues i=1/10 {
                quietly replace wgt = `i'
                quietly logit highpr c.mpg##i.foreign [fw=wgt]
            in which the frequency weights have the effect of fitting the model with each observation included multiple times: the first time through the loop, once; then twice; then three times; and so on, exactly as suggested in post #5. For more detail, see the output of help weights; also, remove the quietly from before the logit command and compare the results of the successive regressions.
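
            To see the equivalence concretely, here is a minimal check (auto data as in post #2; a factor of 2 for illustration) that a frequency weight of 2 matches expand 2:
            Code:
            sysuse auto, clear
            generate highpr = price > 7000
            generate byte wgt = 2
            quietly logit highpr c.mpg##i.foreign [fw=wgt]
            estimates store fw2
            expand 2                                 // every observation duplicated
            quietly logit highpr c.mpg##i.foreign
            estimates store exp2
            estimates table fw2 exp2, se             // identical coefficients and SEs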

            So from the output in post #2, we see that the p-value on the interaction term coefficient estimate decreases from 0.444 to 0.279 when the sample is doubled, and to 0.015 when the sample is increased 10-fold.

            Note that this expansion of the original data, either by frequency weights or using expand, produces the desired

            new data set with the same distribution, correlation matrix, regression coefficients as the real data, but a larger sample size



            • #7
              Many thanks for your replies. I have given it more thought and tried a somewhat different approach.
              I am posting my code here in the hope that it helps others who face similar problems. Please correct me if you spot any mistakes in my code or statistics.

              1. Instead of expecting to reproduce the exact logistic regression coefficients, I will use repeated simulations to approximate the specified ones. I think this is similar to what Joro Kolev suggested.

              2. I gave up on reproducing the correlation matrix, as this may not be possible. The code below does not attempt to replicate correlations. (One possible way to induce such correlations is sketched after this list.)

              3. I will give more thought to the frequency-weights approach that William Lisowski suggested.
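
              A sketch of one possible way to induce correlation among the discrete regressors, by thresholding correlated normals (a Gaussian-copula-style construction); the 0.4 correlation is hypothetical, and the cutpoints reuse the cumulative proportions computed in Step 1 below:
              Code:
              clear
              matrix C = (1, .4 \ .4, 1)
              drawnorm z1 z2, n(12000) corr(C)
              * normal() maps each latent normal to a uniform margin
              generate scred  = cond(normal(z1) < .318, 1, cond(normal(z1) < .591, 2, 3))
              generate spared = cond(normal(z2) < .334, 1, cond(normal(z2) < .680, 2, 3))
              tabulate scred spared, chi2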

              To recap, the goal is to see whether the non-significance of the interaction terms is just a matter of sample size.


              Code:
              *****************************
              *** Step 1 save data structure and logistic regression coefficients
              
              . use https://stats.idre.ucla.edu/stat/stata/webbooks/logistic/apilog, clear
              
              . * Set up sample
              . qui logit hiqual i.cred##i.pared i.awards c.api99 , nolog // the model to be reproduced
              
              . gen sample=e(sample)
              
              . keep if sample==1
              (0 observations deleted)
              
              . 
              . * Create sampling weights
              . set seed 21122018
              
              . gen pw = 1+  uniform()
              
              . 
              . * api99
              . sum api99 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                     api99 |   1,200  1796.54305    635.7269   141.4479        315        957
              
              . global api_mean = r(mean)
              
              . di $api_mean 
              635.72688
              
              . global api_sd   = r(sd) // not ideal to use global but local does not work here
              
              . di $api_sd 
              141.44794
              
              . 
              . * award
              . qui tab awards  , gen(award_)
              
              . sum award_1 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                   award_1 |   1,200  1796.54305    .2549022   .4359879          0          1
              
              . global award_col = r(mean)
              
              . 
              . * cred
              . qui tab cred , gen(cred)
              
              . sum cred1 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                     cred1 |   1,200  1796.54305      .31792   .4658624          0          1
              
              . global cred1_col = r(mean)
              
              . di $cred1_col 
              .31792001
              
              . sum cred2 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                     cred2 |   1,200  1796.54305    .2725915   .4454781          0          1
              
              . global cred2_col = r(mean)
              
              . di $cred2_col 
              .27259152
              
              . global cred_cum1 = $cred1_col 
              
              . global cred_cum2 = $cred1_col  + $cred2_col 
              
              . di $cred_cum1 
              .31792001
              
              . di $cred_cum2 
              .59051153
              
              . 
              . * pared
              . qui tab pared, gen(pared)
              
              . sum pared1 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                    pared1 |   1,200  1796.54305    .3344101   .4719805          0          1
              
              . global pared1_col = r(mean)
              
              . di $pared1_col 
              .33441012
              
              . sum pared2 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                    pared2 |   1,200  1796.54305    .3459202   .4758657          0          1
              
              . global pared2_col = r(mean)
              
              . di $pared2_col 
              .34592023
              
              . global pared_cum1 = $pared1_col 
              
              . global pared_cum2 = $pared1_col  + $pared2_col 
              
              . 
              . * Get coefficients
              . logit hiqual i.cred##i.pared i.awards c.api99 [pw=pw], nofvlabel // the model to be reproduced
              
              Iteration 0:   log pseudolikelihood =  -1134.567  
              Iteration 1:   log pseudolikelihood = -368.53697  
              Iteration 2:   log pseudolikelihood = -216.49058  
              Iteration 3:   log pseudolikelihood =  -175.3326  
              Iteration 4:   log pseudolikelihood = -172.95596  
              Iteration 5:   log pseudolikelihood = -172.86153  
              Iteration 6:   log pseudolikelihood = -172.86088  
              Iteration 7:   log pseudolikelihood = -172.86088  
              
              Logistic regression                             Number of obs     =      1,200
                                                              Wald chi2(10)     =     135.45
                                                              Prob > chi2       =     0.0000
              Log pseudolikelihood = -172.86088               Pseudo R2         =     0.8476
              
              ------------------------------------------------------------------------------
                           |               Robust
                    hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                      cred |
                        2  |  -.0734085   1.407255    -0.05   0.958    -2.831578    2.684761
                        3  |  -1.709417   .7936534    -2.15   0.031    -3.264949   -.1538847
                           |
                     pared |
                        2  |  -1.646075   .8916789    -1.85   0.065    -3.393734    .1015837
                        3  |  -.6957868   .9117727    -0.76   0.445    -2.482828    1.091255
                           |
                cred#pared |
                      2 2  |   1.289002   1.665787     0.77   0.439    -1.975881    4.553884
                      2 3  |  -.0436259   1.634133    -0.03   0.979    -3.246467    3.159216
                      3 2  |   2.089298   1.113269     1.88   0.061    -.0926687    4.271264
                      3 3  |   1.846565   1.084369     1.70   0.089    -.2787598     3.97189
                           |
                  1.awards |   2.979602   .4305003     6.92   0.000     2.135837    3.823367
                     api99 |   .0694097    .006752    10.28   0.000     .0561761    .0826434
                     _cons |  -50.75371     4.7891   -10.60   0.000    -60.14018   -41.36725
              ------------------------------------------------------------------------------
              Note: 124 failures and 0 successes completely determined.
              
              . global b_cred2=_b[2.cred]
              
              . di $b_cred2 
              -.07340846
              
              . global b_cred3=_b[3.cred]
              
              . di $b_cred3 
              -1.7094167
              
              . global b_pared2=_b[2.pared]
              
              . di $b_pared2 
              -1.6460749
              
              . global b_pared3=_b[3.pared]
              
              . di $b_pared3 
              -.69578684
              
              . 
              . global b_int22 = _b[2.cred#2.pared]
              
              . di $b_int22 
              1.2890016
              
              . global b_int23 = _b[2.cred#3.pared]
              
              . di $b_int23 
              -.04362591
              
              . global b_int32 = _b[3.cred#2.pared]
              
              . di $b_int32 
              2.0892977
              
              . global b_int33 = _b[3.cred#3.pared]
              
              . di $b_int33 
              1.8465649
              
              . 
              . global b_award = _b[1.awards]
              
              . global b_api = _b[api99]
              
              . global b_con = _b[_cons]
              
              . di $b_con 
              -50.753713
              
              . 
              
              
              
              *****************************
              *** Step 2 Prepare simulation .
              
              . capture program drop mysimu
              
              . program mysimu, rclass
                1. 
              . 
              . drop _all
                2. set obs 12000
                3. gen id = _n
                4. 
              . gen sapi99 = $api_mean  + $api_sd *invnorm(uniform())
                5. gen rand1 = uniform()
                6. gen scred = cond(rand1 < $cred_cum1 , 1, cond(rand1<$cred_cum2 , 2, 3))
                7. qui tab scred, gen(scred)
                8. gen rand2 = uniform()
                9. gen spared = cond(rand2 < $pared_cum1 , 1, cond(rand2<$pared_cum2 , 2, 3))
               10. qui tab spared, gen(spared)
               11. gen rand3 = uniform()
               12. gen saward = cond(rand3 < $award_col , 1, 0)
               13. drop rand?
               14. 
              . gen y  = uniform() < ///
              >         invlogit($b_con  + ///
              >          $b_cred2 *(scred2)  +  $b_cred3 *(scred3)  + ///
              >         $b_pared2 *(spared2) + $b_pared3 *(spared3) + ///
              >         $b_int22 *(scred2*spared2) + $b_int33 *(scred3*spared3) + ///
              >         $b_int23 *(scred2*spared3)  + $b_int32 *(scred3*spared2) + ///
              >         $b_award *saward + $b_api *sapi99)
               15. 
              .  tempvar z1 
               16.  tempvar p1
               17.  
              . qui logit y i.scred2##i.spared2 i.scred3##i.spared3 ///
              >         i.scred2##i.spared3 i.scred3##i.spared2 i.saward  c.sapi99, nolog // the model to be reproduced
               18. 
              .  return scalar coef1=_b[1.scred2#1.spared2] 
               19.  return scalar coef2=_b[1.scred3#1.spared2] 
               20.  return scalar se1=_se[1.scred2#1.spared2]
               21.  return scalar se2=_se[1.scred3#1.spared2]
               22.  gen `z1' =_b[1.scred2#1.spared2]/_se[1.scred2#1.spared2]
               23.  qui sum `z1' 
               24.  return scalar z=r(mean)
               25.  gen `p1' =2*normal(-abs(`z1' ))
               26.  qui sum `p1'
               27.  return scalar p=r(mean)        
               28.         
              . end     
              
              . 
              
              
              
              *****************************
              *** Step 3. Run Simulation
              
              . set seed 21122018
              
              . simulate b1=r(coef1)  b2=r(coef2) z1=r(z) p1=r(p), reps(5000) nodots : mysimu
              
                    command:  mysimu
                         b1:  r(coef1)
                         b2:  r(coef2)
                         z1:  r(z)
                         p1:  r(p)
              
              Simulations (5000)
              ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
              
              . d
              
              Contains data
                obs:         5,000                          simulate: mysimu
               vars:             4                          21 Dec 2018 23:12
               size:        80,000                          
              ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                            storage   display    value
              variable name   type    format     label      variable label
              ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              b1              float   %9.0g                 r(coef1)
              b2              float   %9.0g                 r(coef2)
              z1              float   %9.0g                 r(z)
              p1              float   %9.0g                 r(p)
              ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              Sorted by: 
              
              . su
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                        b1 |      5,000    1.291391    .3262617   .2748145   2.437333
                        b2 |      5,000    2.092637    .2980006   1.053037   3.444387
                        z1 |      5,000    3.966857    .9854072   .8831808   7.123791
                        p1 |      5,000    .0044242    .0200894   1.05e-12   .3771386
              
              .  
              . set seed 21122018
              
              . mysimu
              number of observations (_N) was 0, now 12,000
              
              . return list
              
              scalars:
                                r(p) =  .0003921732131857
                                r(z) =  3.545295476913452
                              r(se2) =  .2956437003887292
                              r(se1) =  .3203618302689425
                            r(coef2) =  2.026182763741815
                            r(coef1) =  1.135777314947391
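
              From the simulate output above, the simulated power at the 5% level is simply the share of replications whose p-value falls below .05; a quick check on the 5,000 saved results (a sketch):
              Code:
              count if p1 < .05
              display "simulated power = " %5.3f r(N)/_N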

              Many thanks.

              I work with Stata 15, Windows 64-bit.

              Min
