
  • Simulating data for logistic regression with categorical variables

    Dear Statalist members,

    I would like to test the effect of sample size on the standard errors of interaction effects. The model results from my survey data show a pattern of interaction effects, but the interaction effects do not reach statistical significance. I am interested in finding out how large the sample size would need to be for the interaction effects to reach statistical significance.

    The idea is to generate a new data set with the same distributions, correlation matrix, and regression coefficients as the real data, but with a larger sample size at which the interaction effects of interest reach statistical significance.

    I may need to consider complex survey design as well.

    I have considered the following options.

    1. The command -corr2data- would have been ideal if it worked well with logistic regression and categorical variables.

    2. Following Buis's discussion (i.e., M. L. Buis (2007), "Stata tip 48: Discrete uses for uniform()"), I was able to simulate a data set for logistic regression with specified distributions, but I failed to replicate the regression coefficients: the coefficients in the simulated data set only approximate those specified. I cannot reproduce the correlation matrix either. (A minimal sketch of the uniform() technique appears after this list.)

    The approach is similar to the one in this post: https://www.stata.com/statalist/arch.../msg00018.html

    3. A vaguer idea is to use probit regression: I may simulate a data set using corr2data and transform the outcome variable using the probit link function. It is a long shot and I have not been able to figure out how to do it yet. (A rough sketch of this idea also appears below.)
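
    A minimal sketch of the uniform() technique from the Stata tip, for a hypothetical three-category variable with population proportions .3/.5/.2 (the cutpoints are the cumulative proportions):
    Code:
    clear
    set obs 10000
    set seed 12345
    generate u = runiform()
    * cumulative cutpoints .3 and .8 yield categories with proportions .3, .5, .2
    generate x = cond(u < .3, 1, cond(u < .8, 2, 3))
    tabulate x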
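
    And a rough sketch of option 3 itself, assuming two standardized continuous regressors with a hypothetical correlation of 0.3 and hypothetical coefficients; corr2data fixes the regressor correlation exactly, and thresholding a latent index with a standard normal error yields a probit outcome:
    Code:
    matrix C = (1, .3 \ .3, 1)
    corr2data x1 x2, n(10000) corr(C) clear
    * latent index plus standard normal error, thresholded at 0 (probit form)
    generate ystar = -0.5 + 0.8*x1 + 0.4*x2 + rnormal()
    generate y = ystar > 0
    probit y x1 x2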


    Any suggestion is appreciated.

    The logistic regression results I wish to simulate:
    Code:
    . svyset psu  [pw=xw], strata(strata) singleunit(scaled)
    
          pweight: xw
              VCE: linearized
      Single unit: scaled
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>
    
    . svy: logit y i.x1##i.x2 i.x3 c.x4
    (running logit on estimation sample)
    
    Survey: Logistic regression
    
    Number of strata   =     1,630                  Number of obs     =      7,355
    Number of PSUs     =     3,232                  Population size   = 7,976.7239
                                                    Design df         =      1,602
                                                    F(  13,   1590)   =      12.36
                                                    Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
                 |             Linearized
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            1.x1 |   .3586926   .1260393     2.85   0.004     .1114734    .6059119
                 |
              x2 |
              3  |   .0491903   .1609868     0.31   0.760    -.2665767    .3649573
              4  |   .1623764   .1383986     1.17   0.241     -.109085    .4338377
              7  |   .0937721   .1003796     0.93   0.350    -.1031171    .2906613
                 |
           x1#x2 |
            1 3  |   .0348744   .2815246     0.12   0.901    -.5173208    .5870697
            1 4  |   .3241205   .2281414     1.42   0.156    -.1233666    .7716076
            1 7  |  -.0562443   .2055027    -0.27   0.784    -.4593267    .3468381
                 |
              x3 |
              1  |    .109137   .1153704     0.95   0.344    -.1171559    .3354298
              2  |   .4191621   .1126151     3.72   0.000     .1982738    .6400505
              3  |   .4574391   .1259686     3.63   0.000     .2103585    .7045197
              4  |   .8478119   .1286161     6.59   0.000     .5955382    1.100085
              5  |   1.051244   .1458429     7.21   0.000     .7651811    1.337307
                 |
              x4 |   .1644617   .0746806     2.20   0.028     .0179798    .3109435
           _cons |    -1.4128   .6050882    -2.33   0.020    -2.599648   -.2259526
    ------------------------------------------------------------------------------
    Note: Variance scaled to handle strata with a single sampling unit.
    I use Stata 15, Windows 64-bit.

    Many thanks.
    Min

  • #2
    If you can't find anything fancy, the code below suggests an approach that may help you approximate what you want.
    Code:
    sysuse auto, clear
    generate highpr = price>7000
    generate int wgt = 1
    forvalues i=1/10 {
        quietly replace wgt = `i'
        quietly logit highpr c.mpg##i.foreign [fw=wgt]
        matrix t = r(table)
        local p = t[rownumb(t,"pvalue"),colnumb(t,"highpr:1.foreign#c.mpg")]
        display %4.0f `i' %9.3f `p'
        }
    Code:
       1    0.444
       2    0.279
       3    0.185
       4    0.126
       5    0.087
       6    0.061
       7    0.043
       8    0.030
       9    0.022
      10    0.015


    • #3
      Originally posted by William Lisowski (post #2; code and output omitted):
      If you can't find anything fancy, the code below suggests an approach that may help you approximate what you want.

      Hi William,

      Many thanks for your reply.

      Sorry, I did not make myself clear. I have no problem extracting the estimates from a logit regression. The problem is that I do not know how to simulate a data set with a specified distribution and specified logistic regression coefficients.

      For example, the sample size in my survey data is 7,355 and the interaction effect is not significant. I would like to know whether the interaction would be significant if the sample were five or ten times as large.

      Thank you.

      Regards,
      Min



      • #4
        An alternative for this ‘problem’ could be a Bayesian approach.
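
          For instance, Stata 15's bayes prefix can wrap the same model. A minimal sketch using the variable names from post #1 (note that the prefix uses Stata's default priors and does not carry over the svyset design):
          Code:
          bayes: logit y i.x1##i.x2 i.x3 c.x4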
        Best regards,

        Marcos



        • #5
          I am not very clear on what the end goal is here.

          For example, would you not achieve your end goal if you use -expand- to, say, make your sample 2, 3, 4, etc. times bigger than now, and then simply re-estimate your model to see what happens with the standard errors?
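
          A minimal sketch of the -expand- idea (the file name and the factor of 5 are hypothetical; the model and svyset settings are those from post #1):
          Code:
          use mysurvey, clear    // hypothetical file, already svyset as in post #1
          expand 5               // each observation now appears 5 times
          svy: logit y i.x1##i.x2 i.x3 c.x4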

          As to your question of how to generate data from a model such as probit or logit, you have at least two complications that I am not sure how to deal with: you have survey data, and you have correlations among regressors that you want to replicate, but your regressors are discrete, so it is not clear whether -corr2data- would work here.

          Absent these complications, it is easy to generate data from a probit or logit using the threshold/index form of the model. Say probit (logit is the same, but with a logistically distributed error):

          Code:
          . clear
          
          . set obs 300
          number of observations (_N) was 0, now 300
          
          . gen x = runiform()
          
          . gen ystar = 1 + .5*x + rnormal()
          
          . gen y = ystar>0
          
          . probit y x, nolog noheader
          ------------------------------------------------------------------------------
                     y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                     x |   .6335745   .3551178     1.78   0.074    -.0624436    1.329593
                 _cons |   .9540644   .1844389     5.17   0.000     .5925708    1.315558
          ------------------------------------------------------------------------------
          
          . clear
          
          . set obs 300000
          number of observations (_N) was 0, now 300,000
          
          . gen x = runiform()
          
          . gen ystar = 1 + .5*x + rnormal()
          
          . gen y = ystar>0
          
          . probit y x, nolog noheader
          ------------------------------------------------------------------------------
                     y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                     x |   .5043405   .0107321    46.99   0.000      .483306    .5253749
                 _cons |   .9968994   .0058136   171.48   0.000     .9855049    1.008294
          ------------------------------------------------------------------------------
          Note that the estimated parameters are the same (up to sampling error) as the population parameters that I used.
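
          A minimal sketch of the logit analogue mentioned above, drawing the logistic error by its inverse CDF, ln(u/(1-u)) for uniform u (everything else as in the probit example):
          Code:
          clear
          set obs 300000
          gen x = runiform()
          gen u = runiform()
          gen ystar = 1 + .5*x + ln(u/(1-u))   // standard logistic error
          gen y = ystar > 0
          logit y x, nolog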



          • #6
            Apparently my code in post #2 was more subtle than I realized, since neither post #3 nor post #5 fully understood it. From post #5:

            For example, would you not achieve your end goal if you use -expand- to, say, make your sample 2, 3, 4, etc. times bigger than now, and then simply re-estimate your model to see what happens with the standard errors?
            That was exactly what my original code did, until I saw a more efficient approach. The key lines in my code are
            Code:
            forvalues i=1/10 {
                quietly replace wgt = `i'
                quietly logit highpr c.mpg##i.foreign [fw=wgt]
            in which the frequency weights have the effect of fitting the model with each observation included multiple times: the first time through the loop, once; then twice; then three times; and so on, exactly as suggested in post #5. For more detail, see the output of help weights; also, remove the quietly from before the logit command and compare the results of the successive regressions.
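
            To see the equivalence concretely, here is a minimal check (auto data as in post #2; a factor of 2 for illustration) that a frequency weight of 2 matches expand 2:
            Code:
            sysuse auto, clear
            generate highpr = price > 7000
            generate byte wgt = 2
            quietly logit highpr c.mpg##i.foreign [fw=wgt]
            estimates store fw2
            expand 2                                 // every observation duplicated
            quietly logit highpr c.mpg##i.foreign
            estimates store exp2
            estimates table fw2 exp2, se             // identical coefficients and SEs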

            So from the output in post #2, we see that the p-value on the interaction term coefficient estimate decreases from 0.444 to 0.279 when the sample is doubled, and to 0.015 when the sample is increased 10-fold.

            Note that this expansion of the original data, either by frequency weights or using expand, produces the desired

            new data set with the same distribution, correlation matrix, regression coefficients as the real data, but a larger sample size



            • #7
              Many thanks for your replies. I have given it more thought and tried a somewhat different approach.
              I am posting my code here in the hope that it helps others who face similar problems. Please correct me if you spot any mistakes in my code or statistics.

              1. Instead of expecting to reproduce the exact logistic regression coefficients, I will use repeated simulations to approximate the specified ones. I think this is similar to what Joro Kolev suggested.

              2. I gave up on reproducing the correlation matrix, as this may not be possible. The code below does not attempt to replicate correlations. (One possible way to induce such correlations is sketched after this list.)

              3. I will give more thought to the frequency-weights approach that William Lisowski suggested.
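
              A sketch of one possible way to induce correlation among the discrete regressors, by thresholding correlated normals (a Gaussian-copula-style construction); the 0.4 correlation is hypothetical, and the cutpoints reuse the cumulative proportions computed in Step 1 below:
              Code:
              clear
              matrix C = (1, .4 \ .4, 1)
              drawnorm z1 z2, n(12000) corr(C)
              * normal() maps each latent normal to a uniform margin
              generate scred  = cond(normal(z1) < .318, 1, cond(normal(z1) < .591, 2, 3))
              generate spared = cond(normal(z2) < .334, 1, cond(normal(z2) < .680, 2, 3))
              tabulate scred spared, chi2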

              To recap, the goal is to see whether the non-significance of the interaction terms is just a matter of sample size.


              Code:
              *****************************
              *** Step 1 save data structure and logistic regression coefficients
              
              . use https://stats.idre.ucla.edu/stat/stata/webbooks/logistic/apilog, clear
              
              . * Set up sample
              . qui logit hiqual i.cred##i.pared i.awards c.api99 , nolog // the model to be reproduced
              
              . gen sample=e(sample)
              
              . keep if sample==1
              (0 observations deleted)
              
              . 
              . * Create sampling weights
              . set seed 21122018
              
              . gen pw = 1+  uniform()
              
              . 
              . * api99
              . sum api99 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                     api99 |   1,200  1796.54305    635.7269   141.4479        315        957
              
              . global api_mean = r(mean)
              
              . di $api_mean 
              635.72688
              
              . global api_sd   = r(sd) // not ideal to use global but local does not work here
              
              . di $api_sd 
              141.44794
              
              . 
              . * award
              . qui tab awards  , gen(award_)
              
              . sum award_1 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                   award_1 |   1,200  1796.54305    .2549022   .4359879          0          1
              
              . global award_col = r(mean)
              
              . 
              . * cred
              . qui tab cred , gen(cred)
              
              . sum cred1 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                     cred1 |   1,200  1796.54305      .31792   .4658624          0          1
              
              . global cred1_col = r(mean)
              
              . di $cred1_col 
              .31792001
              
              . sum cred2 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                     cred2 |   1,200  1796.54305    .2725915   .4454781          0          1
              
              . global cred2_col = r(mean)
              
              . di $cred2_col 
              .27259152
              
              . global cred_cum1 = $cred1_col 
              
              . global cred_cum2 = $cred1_col  + $cred2_col 
              
              . di $cred_cum1 
              .31792001
              
              . di $cred_cum2 
              .59051153
              
              . 
              . * pared
              . qui tab pared, gen(pared)
              
              . sum pared1 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                    pared1 |   1,200  1796.54305    .3344101   .4719805          0          1
              
              . global pared1_col = r(mean)
              
              . di $pared1_col 
              .33441012
              
              . sum pared2 [aw=pw]
              
                  Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
              -------------+-----------------------------------------------------------------
                    pared2 |   1,200  1796.54305    .3459202   .4758657          0          1
              
              . global pared2_col = r(mean)
              
              . di $pared2_col 
              .34592023
              
              . global pared_cum1 = $pared1_col 
              
              . global pared_cum2 = $pared1_col  + $pared2_col 
              
              . 
              . * Get coefficients
              . logit hiqual i.cred##i.pared i.awards c.api99 [pw=pw], nofvlabel // the model to be reproduced
              
              Iteration 0:   log pseudolikelihood =  -1134.567  
              Iteration 1:   log pseudolikelihood = -368.53697  
              Iteration 2:   log pseudolikelihood = -216.49058  
              Iteration 3:   log pseudolikelihood =  -175.3326  
              Iteration 4:   log pseudolikelihood = -172.95596  
              Iteration 5:   log pseudolikelihood = -172.86153  
              Iteration 6:   log pseudolikelihood = -172.86088  
              Iteration 7:   log pseudolikelihood = -172.86088  
              
              Logistic regression                             Number of obs     =      1,200
                                                              Wald chi2(10)     =     135.45
                                                              Prob > chi2       =     0.0000
              Log pseudolikelihood = -172.86088               Pseudo R2         =     0.8476
              
              ------------------------------------------------------------------------------
                           |               Robust
                    hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                      cred |
                        2  |  -.0734085   1.407255    -0.05   0.958    -2.831578    2.684761
                        3  |  -1.709417   .7936534    -2.15   0.031    -3.264949   -.1538847
                           |
                     pared |
                        2  |  -1.646075   .8916789    -1.85   0.065    -3.393734    .1015837
                        3  |  -.6957868   .9117727    -0.76   0.445    -2.482828    1.091255
                           |
                cred#pared |
                      2 2  |   1.289002   1.665787     0.77   0.439    -1.975881    4.553884
                      2 3  |  -.0436259   1.634133    -0.03   0.979    -3.246467    3.159216
                      3 2  |   2.089298   1.113269     1.88   0.061    -.0926687    4.271264
                      3 3  |   1.846565   1.084369     1.70   0.089    -.2787598     3.97189
                           |
                  1.awards |   2.979602   .4305003     6.92   0.000     2.135837    3.823367
                     api99 |   .0694097    .006752    10.28   0.000     .0561761    .0826434
                     _cons |  -50.75371     4.7891   -10.60   0.000    -60.14018   -41.36725
              ------------------------------------------------------------------------------
              Note: 124 failures and 0 successes completely determined.
              
              . global b_cred2=_b[2.cred]
              
              . di $b_cred2 
              -.07340846
              
              . global b_cred3=_b[3.cred]
              
              . di $b_cred3 
              -1.7094167
              
              . global b_pared2=_b[2.pared]
              
              . di $b_pared2 
              -1.6460749
              
              . global b_pared3=_b[3.pared]
              
              . di $b_pared3 
              -.69578684
              
              . 
              . global b_int22 = _b[2.cred#2.pared]
              
              . di $b_int22 
              1.2890016
              
              . global b_int23 = _b[2.cred#3.pared]
              
              . di $b_int23 
              -.04362591
              
              . global b_int32 = _b[3.cred#2.pared]
              
              . di $b_int32 
              2.0892977
              
              . global b_int33 = _b[3.cred#3.pared]
              
              . di $b_int33 
              1.8465649
              
              . 
              . global b_award = _b[1.awards]
              
              . global b_api = _b[api99]
              
              . global b_con = _b[_cons]
              
              . di $b_con 
              -50.753713
              
              . 
              
              
              
              *****************************
              *** Step 2 Prepare simulation .
              
              . capture program drop mysimu
              
              . program mysimu, rclass
                1. 
              . 
              . drop _all
                2. set obs 12000
                3. gen id = _n
                4. 
              . gen sapi99 = $api_mean  + $api_sd *invnorm(uniform())
                5. gen rand1 = uniform()
                6. gen scred = cond(rand1 < $cred_cum1 , 1, cond(rand1<$cred_cum2 , 2, 3))
                7. qui tab scred, gen(scred)
                8. gen rand2 = uniform()
                9. gen spared = cond(rand2 < $pared_cum1 , 1, cond(rand2<$pared_cum2 , 2, 3))
               10. qui tab spared, gen(spared)
               11. gen rand3 = uniform()
               12. gen saward = cond(rand3 < $award_col , 1, 0)
               13. drop rand?
               14. 
              . gen y  = uniform() < ///
              >         invlogit($b_con  + ///
              >          $b_cred2 *(scred2)  +  $b_cred3 *(scred3)  + ///
              >         $b_pared2 *(spared2) + $b_pared3 *(spared3) + ///
              >         $b_int22 *(scred2*spared2) + $b_int33 *(scred3*spared3) + ///
              >         $b_int23 *(scred2*spared3)  + $b_int32 *(scred3*spared2) + ///
              >         $b_award *saward + $b_api *sapi99)
               15. 
              .  tempvar z1 
               16.  tempvar p1
               17.  
              . qui logit y i.scred2##i.spared2 i.scred3##i.spared3 ///
              >         i.scred2##i.spared3 i.scred3##i.spared2 i.saward  c.sapi99, nolog // the model to be reproduced
               18. 
              .  return scalar coef1=_b[1.scred2#1.spared2] 
               19.  return scalar coef2=_b[1.scred3#1.spared2] 
               20.  return scalar se1=_se[1.scred2#1.spared2]
               21.  return scalar se2=_se[1.scred3#1.spared2]
               22.  gen `z1' =_b[1.scred2#1.spared2]/_se[1.scred2#1.spared2]
               23.  qui sum `z1' 
               24.  return scalar z=r(mean)
               25.  gen `p1' =2*normal(-abs(`z1' ))
               26.  qui sum `p1'
               27.  return scalar p=r(mean)        
               28.         
              . end     
              
              . 
              
              
              
              *****************************
              *** Step 3. Run Simulation
              
              . set seed 21122018
              
              . simulate b1=r(coef1)  b2=r(coef2) z1=r(z) p1=r(p), reps(5000) nodots : mysimu
              
                    command:  mysimu
                         b1:  r(coef1)
                         b2:  r(coef2)
                         z1:  r(z)
                         p1:  r(p)
              
              Simulations (5000)
              ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
              
              . d
              
              Contains data
                obs:         5,000                          simulate: mysimu
               vars:             4                          21 Dec 2018 23:12
               size:        80,000                          
              ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                            storage   display    value
              variable name   type    format     label      variable label
              ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              b1              float   %9.0g                 r(coef1)
              b2              float   %9.0g                 r(coef2)
              z1              float   %9.0g                 r(z)
              p1              float   %9.0g                 r(p)
              ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              Sorted by: 
              
              . su
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                        b1 |      5,000    1.291391    .3262617   .2748145   2.437333
                        b2 |      5,000    2.092637    .2980006   1.053037   3.444387
                        z1 |      5,000    3.966857    .9854072   .8831808   7.123791
                        p1 |      5,000    .0044242    .0200894   1.05e-12   .3771386
              
              .  
              . set seed 21122018
              
              . mysimu
              number of observations (_N) was 0, now 12,000
              
              . return list
              
              scalars:
                                r(p) =  .0003921732131857
                                r(z) =  3.545295476913452
                              r(se2) =  .2956437003887292
                              r(se1) =  .3203618302689425
                            r(coef2) =  2.026182763741815
                            r(coef1) =  1.135777314947391
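
              From the simulate output above, the simulated power at the 5% level is simply the share of replications whose p-value falls below .05; a quick check on the 5,000 saved results (a sketch):
              Code:
              count if p1 < .05
              display "simulated power = " %5.3f r(N)/_N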

              Many thanks.

              I work with Stata 15, Windows 64-bit.

              Min
