Simulating data for fracreg

Richard Williams

Join Date: Apr 2014
Posts: 5043

Simulating data for fracreg

25 Jun 2015, 14:53

The new fracreg command in Stata 14 lets you estimate logit models for fractional models, e.g. models where the dependent variable is a proportion that varies between 0 and 1. I am trying to simulate data for fracreg but I seem to be doing something wrong. No matter what seed I use, the fracreg estimates are attenuated, i.e. less that I set them up to be. I generally don't have such problems when I do simulations -- so I don't know if I am doing something wrong or if estimates are inherently attenuated or what. Here is a simple example:

Code:

    clear all
    set obs 1000
    set rng kiss32
    set seed 125
    gen x1 = rnormal()
    * e1 has a standard logistic distribution
    gen e1 = rlogistic()
    local b1 = 1
    local b0 = 1
    
    gen ystar = `b0' + `b1'*x1 + e1
    reg ystar x1
    gen yprob = invlogit(ystar)
    fracreg logit yprob x1, nolog

Here is selected output:

Code:

.         gen ystar = `b0' + `b1'*x1 + e1

.         reg ystar x1

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(1, 998)       =    328.64
       Model |    961.8941         1    961.8941   Prob > F        =    0.0000
    Residual |  2921.02103       998  2.92687478   R-squared       =    0.2477
-------------+----------------------------------   Adj R-squared   =    0.2470
       Total |  3882.91513       999  3.88680193   Root MSE        =    1.7108

------------------------------------------------------------------------------
       ystar |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .990443   .0546346    18.13   0.000     .8832311    1.097655
       _cons |   .9569887   .0541051    17.69   0.000     .8508158    1.063162
------------------------------------------------------------------------------

.         gen yprob = invlogit(ystar)

.         fracreg logit yprob x1, nolog


Fractional logistic regression                  Number of obs     =      1,000
                                                Wald chi2(1)      =     301.46
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -602.71384               Pseudo R2         =     0.0751

------------------------------------------------------------------------------
             |               Robust
       yprob |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .7081039   .0407833    17.36   0.000     .6281701    .7880376
       _cons |   .6438433   .0384121    16.76   0.000      .568557    .7191295
------------------------------------------------------------------------------

As you can see, regress gives me numbers like I would expect. But fracreg gives estimates about a third smaller than I should have gotten. It doesn't seem to be a sampling fluke, as I get similar results with other seeds.

Is my approach inherently flawed? Am I generating yprob and/or ystar wrong?

-------------------------------------------
Richard Williams
Professor Emeritus of Sociology
University of Notre Dame
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/

Tags: None

Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015
Posts: 223

25 Jun 2015, 16:15

Dear Richard,

This is a great question. I think the problem lies in the interpretation of the coefficients of the model. fracreg is a model for a conditional expectation of the form:

\begin{equation*}
E\left(Y|X\right) = G\left(X\beta\right)
\end{equation*}

In your case you have

\begin{equation*}
Y = G\left(X\beta + e \right)
\end{equation*}

You now have to take an expected value with respect to the unobserved component and this will change the coefficients. I will show this to you using a conditional expectation from the normal distribution. In my example I know that the coefficients are of the form:

\begin{equation*}
\frac{\beta}{\sqrt{(1+\sigma^2)}}
\end{equation*}

Code:

    clear all
    set obs 10000
    set rng kiss32
    set seed 125
    gen x1 = rnormal()
    gen e1 = rnormal()
    local b1 = .8
    local b0 = 1
    
    gen ystar = `b0' + `b1'*x1 + e1
    reg ystar x1
    gen yprob = normal(ystar)
    fracreg probit yprob x1, nolog
    
    display  `b1'/sqrt(2)
    display  `b0'/sqrt(2)

In sum, the coefficients are the result of integrating out the unobserved component. The resutl is

Code:

.     clear all

.     set obs 10000
number of observations (_N) was 0, now 10,000

.     set rng kiss32

.     set seed 125

.     gen x1 = rnormal()

.     gen e1 = rnormal()

.     local b1 = .8

.     local b0 = 1

.     
.     gen ystar = `b0' + `b1'*x1 + e1

.     reg ystar x1

      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(1, 9998)      =   6401.38
       Model |  6419.66689         1  6419.66689   Prob > F        =    0.0000
    Residual |  10026.5576     9,998  1.00285633   R-squared       =    0.3903
-------------+----------------------------------   Adj R-squared   =    0.3903
       Total |  16446.2245     9,999  1.64478693   Root MSE        =    1.0014

------------------------------------------------------------------------------
       ystar |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .8030816   .0100374    80.01   0.000     .7834062     .822757
       _cons |   1.015502   .0100151   101.40   0.000     .9958699    1.035133
------------------------------------------------------------------------------

.     gen yprob = normal(ystar)

.     fracreg probit yprob x1, nolog


Fractional probit regression                    Number of obs     =     10,000
                                                Wald chi2(1)      =    4459.31
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -5054.2293               Pseudo R2         =     0.1248

------------------------------------------------------------------------------
             |               Robust
       yprob |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5672843   .0084951    66.78   0.000     .5506343    .5839344
       _cons |   .7167636    .007919    90.51   0.000     .7012426    .7322845
------------------------------------------------------------------------------

.     
.     display  `b1'/sqrt(2)
.56568542

.     display  `b0'/sqrt(2)
.70710678

Last edited by Enrique Pinzon (StataCorp); 25 Jun 2015, 16:18.

Comment

Richard Williams

Join Date: Apr 2014

Posts: 5043
#3

25 Jun 2015, 19:02

Fantastic! Thanks.A few qs:
Basically, it sounds like my simulations were ok, but I just didn't know it? i.e. the attenuation that seemed to be occurring was actually correct and to be expected?

How did you know that was the formula for the form of the coefficients? I am guessing it is a formula everyone in the world knows except me. I just tried setting the SD of e1 to 2 and then using sqrt(5) in the display statements, and it seemed to work.

I am also guessing the formula for logit is the same, except you use the scale s rather than the standard deviation

You did things like display `b1'/sqrt(2), Would it make sense to instead do estimated_b1 * sqrt(2)? Wouldn't that get me the estimate of the "true" Beta?

Here are some more complicated simulations for both probit and logit. As far as I can tell, they are working right. I am using the defaults for rnormal and rlogistic but in future simulations I may change those.

Code:

* probit clear all set obs 10000 set rng kiss32 set seed 200 gen x1 = rnormal() gen x2 = rnormal() gen x3 = rnormal() gen e1 = rnormal(0,1) local b1 = .7 local b2 = .5 local b3 = .2 local b0 = 1 gen ystar = `b0' + `b1'*x1 + `b2'*x2 + `b3'*x3 + e1 gen yprob = normal(ystar) fracreg probit yprob x1 x2 x3, nolog display `b1'/sqrt(2) display `b2'/sqrt(2) display `b3'/sqrt(2) display `b0'/sqrt(2) * logit clear all set obs 10000 set rng kiss32 set seed 125 gen x1 = rnormal() gen x2 = rnormal() gen x3 = rnormal() gen e1 = rlogistic(0,1) local b1 = .7 local b2 = .5 local b3 = .2 local b0 = 1 gen ystar = `b0' + `b1'*x1 + `b2'*x2 + `b3'*x3 + e1 gen yprob = invlogit(ystar) fracreg logit yprob x1 x2 x3, nolog display `b1'/sqrt(2) display `b2'/sqrt(2) display `b3'/sqrt(2) display `b0'/sqrt(2)

Last edited by Richard Williams; 25 Jun 2015, 19:04.

-------------------------------------------
Richard Williams
Professor Emeritus of Sociology
University of Notre Dame
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1480
#4

26 Jun 2015, 03:22

How did you know that was the formula for the form of the coefficients? I am guessing it is a formula everyone in the world knows except me. I just tried setting the SD of e1 to 2 and then using sqrt(5) in the display statements, and it seemed to work.

I am also guessing the formula for logit is the same, except you use the scale s rather than the standard deviation

Rich: AFAIK Enrique's "formula" derives from evaluating the expectation that he sets out in the line above. Evaluation of that formula involves an integral and a distributional assumption and it so happens that, with a normal distribution, the "integrating out" leads to the closed form expression that he shows. My understanding is that there is no neat closed form expression if the distributional assumption is changed to logistic. [The "integrating out" has to be done numerically. However, my guess is that it would, as you conjecture, lead to similar results.] Enrique's formula is perhaps familiar to some readers because it is relevant in the panel data case to the comparison of coefficients from (a) pooled probit model versus (b) random effects (random intercept) probit model. The coefficients are not directly comparable for the same sorts of reasons as Enrique mentions in your context. [It's another version of the difference in interpretation between Population Average (marginal) models and cluster-specific (conditional) models.] I think the formulae are somewhere in either Greene's Econometric Analysis or Wooldridge's Econometric Analysis of Cross Section and Panel Data
1 like
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3494
#5

26 Jun 2015, 03:30

Here is how I would start with simulating data for fracreg:

Code:

clear all set seed 123456 program define sim drop _all set obs 500 gen x = rnormal() gen mean = invlogit(-1+1.25*x) local phi = 5 gen a = mean*`phi' gen b = (1-mean)*`phi' gen y = rbeta(a,b) fracreg logit y x end simulate _b _se , reps(20000) : sim

Since there is now no need to integrate over unobserved heterogeneity, the point estimates from fracreg should correspond with the parameters chosen in the simulation.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5043
#6

26 Jun 2015, 08:22

Thanks Stephen and Maarten. I am not totally sure why Maarten's approach works but I will play around with it.

-------------------------------------------
Richard Williams
Professor Emeritus of Sociology
University of Notre Dame
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5043
#7

26 Jun 2015, 08:59

Maarten, here is a variation of your code:

Code:

drop _all set obs 500 version 13.1 set seed 123456 gen x = rnormal() gen mean = invlogit(1+1.5*x) local phi = 5 gen a = mean*`phi' gen b = (1-mean)*`phi' gen y = rbeta(a,b) fracreg logit y x sum

In this particular case, y winds up having 23 missing values. The number of missing varies depending on the parameters I specify, e.g. with your original values of -1 and 1.25 I only had 2 missing cases. Is this an inherent problem with the approach, or is there a workaround?

-------------------------------------------
Richard Williams
Professor Emeritus of Sociology
University of Notre Dame
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/
Comment

Richard Williams

Join Date: Apr 2014
Posts: 5043

26 Jun 2015, 09:15

Is there anything wrong with doing the simulations this way? It seems to get me the parameters I want for the actual fracreg.

Code:

    * probit
    clear all
    set obs 1000
    set rng kiss32
    set seed 200
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen x3 = rnormal()
    gen e1 = rnormal(0,1)
    local adjust = sqrt(2)
    local b1 = .7 * `adjust'
    local b2 = .5 * `adjust'
    local b3 = .2 * `adjust'
    local b0 = 1 * `adjust'
    
    gen ystar = `b0' + `b1'*x1 + `b2'*x2 + `b3'*x3 + e1
    gen yprob = normal(ystar)
    fracreg probit yprob x1 x2 x3, nolog

Code:

Fractional probit regression                    Number of obs     =      1,000
                                                Wald chi2(3)      =     775.96
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -399.69606               Pseudo R2         =     0.2391

------------------------------------------------------------------------------
             |               Robust
       yprob |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .7005894   .0305856    22.91   0.000     .6406428    .7605361
          x2 |   .5119914   .0265617    19.28   0.000     .4599314    .5640514
          x3 |    .193765   .0261246     7.42   0.000     .1425618    .2449682
       _cons |   1.020026   .0274727    37.13   0.000     .9661809    1.073872
------------------------------------------------------------------------------

Comment

Richard Williams

Join Date: Apr 2014

Posts: 5043
#9

26 Jun 2015, 11:42

Rich: AFAIK Enrique's "formula" derives from evaluating the expectation that he sets out in the line above. Evaluation of that formula involves an integral and a distributional assumption and it so happens that, with a normal distribution, the "integrating out" leads to the closed form expression that he shows. My understanding is that there is no neat closed form expression if the distributional assumption is changed to logistic. [The "integrating out" has to be done numerically. However, my guess is that it would, as you conjecture, lead to similar results.]

I have been running some simulations, and it appears that, in the case of logit, using the procedure I suggested may result in very slight underestimates of the parameters. But if so it may not be enough to worry about if you prefer to use logit.

-------------------------------------------
Richard Williams
Professor Emeritus of Sociology
University of Notre Dame
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/
Comment
Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015

Posts: 223
#10

26 Jun 2015, 12:08

Hello Richard,

Most of the intuition that you have from the logit and probit framework translates to the fracreg framework. Specifically, that the results in terms of effects are very similar. Simulations that focus on the effects, using margins for example, should verify this. Also, I think that the way you are generating the data in the post where you rescale your coefficients is a legitimate way of doing it.

Last edited by Enrique Pinzon (StataCorp); 26 Jun 2015, 12:11.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5043
#11

26 Jun 2015, 13:27

Originally posted by Enrique Pinzon (StataCorp) View Post

Hello Richard,

Most of the intuition that you have from the logit and probit framework translates to the fracreg framework. Specifically, that the results in terms of effects are very similar. Simulations that focus on the effects, using margins for example, should verify this. Also, I think that the way you are generating the data in the post where you rescale your coefficients is a legitimate way of doing it.

Thanks Enrique. I will quote you on this. ;-) You and the others have been an enormous help. I am going to play around with Maarten's approach in post 5 and my approach in post 8. If both seem to work equally well I may just go with whichever one is easiest to explain.

-------------------------------------------
Richard Williams
Professor Emeritus of Sociology
University of Notre Dame
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3494
#12

28 Jun 2015, 05:30

Originally posted by Richard Williams View Post

In this particular case, y winds up having 23 missing values. The number of missing varies depending on the parameters I specify, e.g. with your original values of -1 and 1.25 I only had 2 missing cases. Is this an inherent problem with the approach, or is there a workaround?

This has to do with rbeta working for only a specific range of values for a and b. Theoretically a and b could be any positive number, but as implemented it only allows values in the range 0.05-1e5 for a and 0.15 - 1e5 for b. Especially the lower bounds can easily bite.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3494
#13

29 Jun 2015, 01:33

Here is another way of simulating data for fracreg that does not require an adjustment:

Code:

clear all program define sim drop _all set obs 200 gen x = rnormal() gen p = invlogit(-1+.5*x) gen y = rbinomial(1440, p) // 1440 = # minutes / day replace y = y/1440 fracreg logit y x end simulate b=_b[x] se=_se[x], reps(1000) : sim

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5043
#14

29 Jun 2015, 19:06

Thanks again Maarten. Not sure why it works but I will try it.

Going back to the approaches where I was using an adjustment -- perhaps it is just a matter of aesthetics, but I wonder if it would make more sense not to make an adjustment. The unadjusted Beta is the parameter that generated the data, after all, so why not use it?

Semi-related, fracreg pretty much forces you to use robust standard errors. But if I know for a fact (or am 99% certain) that a logit or probit distribution generated the data, would it be ok to use non-robust errors? The manual says

By contrast, if the conditional mean of the model is the same as the conditional mean of a probit but the model is not a probit, the point estimates are consistent, but the standard errors are not asymptotically efficient.

To me that implies that if I generated the data myself using logit or probit then I don't need robust standard errors. But maybe that is wrong. fracreg doesn't seem to have an option for non-robust.

-------------------------------------------
Richard Williams
Professor Emeritus of Sociology
University of Notre Dame
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3494
#15

30 Jun 2015, 02:30

Originally posted by Richard Williams View Post

Not sure why it works but I will try it.

The difference is that there is no error term within the link function. This is similar to the discussion about unobserved heterogeneity in logistic regression: Because the onobserved heterogeneity is unobserved the model in effect averages over it, and since the logit function is a non-linear transformation this averageing changes the effect. In case of the logit transformation it is guaranteed to reduce the effect. Since in my simulation there was no error term withing the link function, there was no averaging, so no need for an adjustment.

Originally posted by Richard Williams View Post

Going back to the approaches where I was using an adjustment -- perhaps it is just a matter of aesthetics, but I wonder if it would make more sense not to make an adjustment. The unadjusted Beta is the parameter that generated the data, after all, so why not use it?

That depends on what the purpose of the study it and how you think these proportions were generated. If you think the error term is real, then doing this with or without adjustment is like the difference between a population averaged model and individual model in multilevel logistic regression. Both model can be valid, but answer different questions. Alternatively if you think that my last simulation describes the data generating process best, then you obviously would not want to adjust as there is nothing to adjust.

Originally posted by Richard Williams View Post

Semi-related, fracreg pretty much forces you to use robust standard errors. But if I know for a fact (or am 99% certain) that a logit or probit distribution generated the data, would it be ok to use non-robust errors?

The logit and probit parts are link functions not distributions. In fact the whole point of fracreg is that you don't specify a distribution at all, but use quasi-maximum likelihood to just estimate effects on the conditional mean and nothing else. So there is no way around using robust standard errors when using fracreg. The alternative would be to specify a distribution for the proportion, which is what betareg does.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment

Announcement

Simulating data for fracreg

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment