Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Simulating data for fracreg

    The new fracreg command in Stata 14 lets you estimate logit models for fractional models, e.g. models where the dependent variable is a proportion that varies between 0 and 1. I am trying to simulate data for fracreg but I seem to be doing something wrong. No matter what seed I use, the fracreg estimates are attenuated, i.e. less that I set them up to be. I generally don't have such problems when I do simulations -- so I don't know if I am doing something wrong or if estimates are inherently attenuated or what. Here is a simple example:

    Code:
        clear all
        set obs 1000
        set rng kiss32
        set seed 125
        gen x1 = rnormal()
        * e1 has a standard logistic distribution
        gen e1 = rlogistic()
        local b1 = 1
        local b0 = 1
        
        gen ystar = `b0' + `b1'*x1 + e1
        reg ystar x1
        gen yprob = invlogit(ystar)
        fracreg logit yprob x1, nolog
    Here is selected output:

    Code:
    .         gen ystar = `b0' + `b1'*x1 + e1
    
    .         reg ystar x1
    
          Source |       SS           df       MS      Number of obs   =     1,000
    -------------+----------------------------------   F(1, 998)       =    328.64
           Model |    961.8941         1    961.8941   Prob > F        =    0.0000
        Residual |  2921.02103       998  2.92687478   R-squared       =    0.2477
    -------------+----------------------------------   Adj R-squared   =    0.2470
           Total |  3882.91513       999  3.88680193   Root MSE        =    1.7108
    
    ------------------------------------------------------------------------------
           ystar |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |    .990443   .0546346    18.13   0.000     .8832311    1.097655
           _cons |   .9569887   .0541051    17.69   0.000     .8508158    1.063162
    ------------------------------------------------------------------------------
    
    .         gen yprob = invlogit(ystar)
    
    .         fracreg logit yprob x1, nolog
    
    
    Fractional logistic regression                  Number of obs     =      1,000
                                                    Wald chi2(1)      =     301.46
                                                    Prob > chi2       =     0.0000
    Log pseudolikelihood = -602.71384               Pseudo R2         =     0.0751
    
    ------------------------------------------------------------------------------
                 |               Robust
           yprob |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |   .7081039   .0407833    17.36   0.000     .6281701    .7880376
           _cons |   .6438433   .0384121    16.76   0.000      .568557    .7191295
    ------------------------------------------------------------------------------
    As you can see, regress gives me numbers like I would expect. But fracreg gives estimates about a third smaller than I should have gotten. It doesn't seem to be a sampling fluke, as I get similar results with other seeds.

    Is my approach inherently flawed? Am I generating yprob and/or ystar wrong?
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam

  • #2
    Dear Richard,

    This is a great question. I think the problem lies in the interpretation of the coefficients of the model. fracreg is a model for a conditional expectation of the form:

    \begin{equation*}
    E\left(Y|X\right) = G\left(X\beta\right)
    \end{equation*}

    In your case you have

    \begin{equation*}
    Y = G\left(X\beta + e \right)
    \end{equation*}

    You now have to take an expected value with respect to the unobserved component and this will change the coefficients. I will show this to you using a conditional expectation from the normal distribution. In my example I know that the coefficients are of the form:

    \begin{equation*}
    \frac{\beta}{\sqrt{(1+\sigma^2)}}
    \end{equation*}

    Code:
        clear all
        set obs 10000
        set rng kiss32
        set seed 125
        gen x1 = rnormal()
        gen e1 = rnormal()
        local b1 = .8
        local b0 = 1
        
        gen ystar = `b0' + `b1'*x1 + e1
        reg ystar x1
        gen yprob = normal(ystar)
        fracreg probit yprob x1, nolog
        
        display  `b1'/sqrt(2)
        display  `b0'/sqrt(2)
    In sum, the coefficients are the result of integrating out the unobserved component. The resutl is

    Code:
    .     clear all
    
    .     set obs 10000
    number of observations (_N) was 0, now 10,000
    
    .     set rng kiss32
    
    .     set seed 125
    
    .     gen x1 = rnormal()
    
    .     gen e1 = rnormal()
    
    .     local b1 = .8
    
    .     local b0 = 1
    
    .     
    .     gen ystar = `b0' + `b1'*x1 + e1
    
    .     reg ystar x1
    
          Source |       SS           df       MS      Number of obs   =    10,000
    -------------+----------------------------------   F(1, 9998)      =   6401.38
           Model |  6419.66689         1  6419.66689   Prob > F        =    0.0000
        Residual |  10026.5576     9,998  1.00285633   R-squared       =    0.3903
    -------------+----------------------------------   Adj R-squared   =    0.3903
           Total |  16446.2245     9,999  1.64478693   Root MSE        =    1.0014
    
    ------------------------------------------------------------------------------
           ystar |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |   .8030816   .0100374    80.01   0.000     .7834062     .822757
           _cons |   1.015502   .0100151   101.40   0.000     .9958699    1.035133
    ------------------------------------------------------------------------------
    
    .     gen yprob = normal(ystar)
    
    .     fracreg probit yprob x1, nolog
    
    
    Fractional probit regression                    Number of obs     =     10,000
                                                    Wald chi2(1)      =    4459.31
                                                    Prob > chi2       =     0.0000
    Log pseudolikelihood = -5054.2293               Pseudo R2         =     0.1248
    
    ------------------------------------------------------------------------------
                 |               Robust
           yprob |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              x1 |   .5672843   .0084951    66.78   0.000     .5506343    .5839344
           _cons |   .7167636    .007919    90.51   0.000     .7012426    .7322845
    ------------------------------------------------------------------------------
    
    .     
    .     display  `b1'/sqrt(2)
    .56568542
    
    .     display  `b0'/sqrt(2)
    .70710678
    Last edited by Enrique Pinzon (StataCorp); 25 Jun 2015, 16:18.

    Comment


    • #3
      Fantastic! Thanks.A few qs:
      • Basically, it sounds like my simulations were ok, but I just didn't know it? i.e. the attenuation that seemed to be occurring was actually correct and to be expected?
      • How did you know that was the formula for the form of the coefficients? I am guessing it is a formula everyone in the world knows except me. I just tried setting the SD of e1 to 2 and then using sqrt(5) in the display statements, and it seemed to work.
      • I am also guessing the formula for logit is the same, except you use the scale s rather than the standard deviation
      • You did things like display `b1'/sqrt(2), Would it make sense to instead do estimated_b1 * sqrt(2)? Wouldn't that get me the estimate of the "true" Beta?
      Here are some more complicated simulations for both probit and logit. As far as I can tell, they are working right. I am using the defaults for rnormal and rlogistic but in future simulations I may change those.

      Code:
          * probit
          clear all
          set obs 10000
          set rng kiss32
          set seed 200
          gen x1 = rnormal()
          gen x2 = rnormal()
          gen x3 = rnormal()
          gen e1 = rnormal(0,1)
          local b1 = .7
          local b2 = .5
          local b3 = .2
          local b0 = 1
          
          gen ystar = `b0' + `b1'*x1 + `b2'*x2 + `b3'*x3 + e1
          gen yprob = normal(ystar)
          fracreg probit yprob x1 x2 x3, nolog
          
          display  `b1'/sqrt(2)
          display  `b2'/sqrt(2)
          display  `b3'/sqrt(2)
          display  `b0'/sqrt(2)
      
          * logit
          clear all
          set obs 10000
          set rng kiss32
          set seed 125
          gen x1 = rnormal()
          gen x2 = rnormal()
          gen x3 = rnormal()
          gen e1 = rlogistic(0,1)
          local b1 = .7
          local b2 = .5
          local b3 = .2
          local b0 = 1
          
          gen ystar = `b0' + `b1'*x1 + `b2'*x2 + `b3'*x3 + e1
          gen yprob = invlogit(ystar)
          fracreg logit yprob x1 x2 x3, nolog
          
          display  `b1'/sqrt(2)
          display  `b2'/sqrt(2)
          display  `b3'/sqrt(2)
          display  `b0'/sqrt(2)
      Last edited by Richard Williams; 25 Jun 2015, 19:04.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam

      Comment


      • #4
        • How did you know that was the formula for the form of the coefficients? I am guessing it is a formula everyone in the world knows except me. I just tried setting the SD of e1 to 2 and then using sqrt(5) in the display statements, and it seemed to work.
        • I am also guessing the formula for logit is the same, except you use the scale s rather than the standard deviation
        Rich: AFAIK Enrique's "formula" derives from evaluating the expectation that he sets out in the line above. Evaluation of that formula involves an integral and a distributional assumption and it so happens that, with a normal distribution, the "integrating out" leads to the closed form expression that he shows. My understanding is that there is no neat closed form expression if the distributional assumption is changed to logistic. [The "integrating out" has to be done numerically. However, my guess is that it would, as you conjecture, lead to similar results.] Enrique's formula is perhaps familiar to some readers because it is relevant in the panel data case to the comparison of coefficients from (a) pooled probit model versus (b) random effects (random intercept) probit model. The coefficients are not directly comparable for the same sorts of reasons as Enrique mentions in your context. [It's another version of the difference in interpretation between Population Average (marginal) models and cluster-specific (conditional) models.] I think the formulae are somewhere in either Greene's Econometric Analysis or Wooldridge's Econometric Analysis of Cross Section and Panel Data

        Comment


        • #5
          Here is how I would start with simulating data for fracreg:

          Code:
          clear all
          set seed 123456
          
          program define sim
              drop _all
              set obs 500
              gen x = rnormal()
              gen mean = invlogit(-1+1.25*x)
              local phi = 5
              gen a = mean*`phi'
              gen b = (1-mean)*`phi'
              gen y = rbeta(a,b)
              fracreg logit y x
          end
          
          simulate _b _se , reps(20000) : sim


          Since there is now no need to integrate over unobserved heterogeneity, the point estimates from fracreg should correspond with the parameters chosen in the simulation.

          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------

          Comment


          • #6
            Thanks Stephen and Maarten. I am not totally sure why Maarten's approach works but I will play around with it.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam

            Comment


            • #7
              Maarten, here is a variation of your code:

              Code:
                  drop _all
                  set obs 500
                  version 13.1
                  set seed 123456
                  gen x = rnormal()
                  gen mean = invlogit(1+1.5*x)
                  local phi = 5
                  gen a = mean*`phi'
                  gen b = (1-mean)*`phi'
                  gen y = rbeta(a,b)
                  fracreg logit y x
                  sum
              In this particular case, y winds up having 23 missing values. The number of missing varies depending on the parameters I specify, e.g. with your original values of -1 and 1.25 I only had 2 missing cases. Is this an inherent problem with the approach, or is there a workaround?
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam

              Comment


              • #8
                Is there anything wrong with doing the simulations this way? It seems to get me the parameters I want for the actual fracreg.

                Code:
                    * probit
                    clear all
                    set obs 1000
                    set rng kiss32
                    set seed 200
                    gen x1 = rnormal()
                    gen x2 = rnormal()
                    gen x3 = rnormal()
                    gen e1 = rnormal(0,1)
                    local adjust = sqrt(2)
                    local b1 = .7 * `adjust'
                    local b2 = .5 * `adjust'
                    local b3 = .2 * `adjust'
                    local b0 = 1 * `adjust'
                    
                    gen ystar = `b0' + `b1'*x1 + `b2'*x2 + `b3'*x3 + e1
                    gen yprob = normal(ystar)
                    fracreg probit yprob x1 x2 x3, nolog
                Code:
                Fractional probit regression                    Number of obs     =      1,000
                                                                Wald chi2(3)      =     775.96
                                                                Prob > chi2       =     0.0000
                Log pseudolikelihood = -399.69606               Pseudo R2         =     0.2391
                
                ------------------------------------------------------------------------------
                             |               Robust
                       yprob |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                          x1 |   .7005894   .0305856    22.91   0.000     .6406428    .7605361
                          x2 |   .5119914   .0265617    19.28   0.000     .4599314    .5640514
                          x3 |    .193765   .0261246     7.42   0.000     .1425618    .2449682
                       _cons |   1.020026   .0274727    37.13   0.000     .9661809    1.073872
                ------------------------------------------------------------------------------
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam

                Comment


                • #9
                  Rich: AFAIK Enrique's "formula" derives from evaluating the expectation that he sets out in the line above. Evaluation of that formula involves an integral and a distributional assumption and it so happens that, with a normal distribution, the "integrating out" leads to the closed form expression that he shows. My understanding is that there is no neat closed form expression if the distributional assumption is changed to logistic. [The "integrating out" has to be done numerically. However, my guess is that it would, as you conjecture, lead to similar results.]
                  I have been running some simulations, and it appears that, in the case of logit, using the procedure I suggested may result in very slight underestimates of the parameters. But if so it may not be enough to worry about if you prefer to use logit.
                  -------------------------------------------
                  Richard Williams, Notre Dame Dept of Sociology
                  StataNow Version: 19.5 MP (2 processor)

                  EMAIL: [email protected]
                  WWW: https://www3.nd.edu/~rwilliam

                  Comment


                  • #10
                    Hello Richard,

                    Most of the intuition that you have from the logit and probit framework translates to the fracreg framework. Specifically, that the results in terms of effects are very similar. Simulations that focus on the effects, using margins for example, should verify this. Also, I think that the way you are generating the data in the post where you rescale your coefficients is a legitimate way of doing it.
                    Last edited by Enrique Pinzon (StataCorp); 26 Jun 2015, 12:11.

                    Comment


                    • #11
                      Originally posted by Enrique Pinzon (StataCorp) View Post
                      Hello Richard,

                      Most of the intuition that you have from the logit and probit framework translates to the fracreg framework. Specifically, that the results in terms of effects are very similar. Simulations that focus on the effects, using margins for example, should verify this. Also, I think that the way you are generating the data in the post where you rescale your coefficients is a legitimate way of doing it.

                      Thanks Enrique. I will quote you on this. ;-) You and the others have been an enormous help. I am going to play around with Maarten's approach in post 5 and my approach in post 8. If both seem to work equally well I may just go with whichever one is easiest to explain.
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam

                      Comment


                      • #12
                        Originally posted by Richard Williams View Post
                        In this particular case, y winds up having 23 missing values. The number of missing varies depending on the parameters I specify, e.g. with your original values of -1 and 1.25 I only had 2 missing cases. Is this an inherent problem with the approach, or is there a workaround?
                        This has to do with rbeta working for only a specific range of values for a and b. Theoretically a and b could be any positive number, but as implemented it only allows values in the range 0.05-1e5 for a and 0.15 - 1e5 for b. Especially the lower bounds can easily bite.

                        ---------------------------------
                        Maarten L. Buis
                        University of Konstanz
                        Department of history and sociology
                        box 40
                        78457 Konstanz
                        Germany
                        http://www.maartenbuis.nl
                        ---------------------------------

                        Comment


                        • #13
                          Here is another way of simulating data for fracreg that does not require an adjustment:

                          Code:
                          clear all
                          
                          program define sim
                              drop _all
                              set obs 200
                              gen x = rnormal()
                              gen p = invlogit(-1+.5*x)
                              gen y = rbinomial(1440, p) // 1440 = # minutes / day
                              replace y = y/1440
                              fracreg logit y x
                          end
                          
                          simulate b=_b[x] se=_se[x], reps(1000) : sim
                          ---------------------------------
                          Maarten L. Buis
                          University of Konstanz
                          Department of history and sociology
                          box 40
                          78457 Konstanz
                          Germany
                          http://www.maartenbuis.nl
                          ---------------------------------

                          Comment


                          • #14
                            Thanks again Maarten. Not sure why it works but I will try it.

                            Going back to the approaches where I was using an adjustment -- perhaps it is just a matter of aesthetics, but I wonder if it would make more sense not to make an adjustment. The unadjusted Beta is the parameter that generated the data, after all, so why not use it?

                            Semi-related, fracreg pretty much forces you to use robust standard errors. But if I know for a fact (or am 99% certain) that a logit or probit distribution generated the data, would it be ok to use non-robust errors? The manual says

                            By contrast, if the conditional mean of the model is the same as the conditional mean of a probit but the model is not a probit, the point estimates are consistent, but the standard errors are not asymptotically efficient.
                            To me that implies that if I generated the data myself using logit or probit then I don't need robust standard errors. But maybe that is wrong. fracreg doesn't seem to have an option for non-robust.
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 19.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam

                            Comment


                            • #15
                              Originally posted by Richard Williams View Post
                              Not sure why it works but I will try it.
                              The difference is that there is no error term within the link function. This is similar to the discussion about unobserved heterogeneity in logistic regression: Because the onobserved heterogeneity is unobserved the model in effect averages over it, and since the logit function is a non-linear transformation this averageing changes the effect. In case of the logit transformation it is guaranteed to reduce the effect. Since in my simulation there was no error term withing the link function, there was no averaging, so no need for an adjustment.

                              Originally posted by Richard Williams View Post
                              Going back to the approaches where I was using an adjustment -- perhaps it is just a matter of aesthetics, but I wonder if it would make more sense not to make an adjustment. The unadjusted Beta is the parameter that generated the data, after all, so why not use it?
                              That depends on what the purpose of the study it and how you think these proportions were generated. If you think the error term is real, then doing this with or without adjustment is like the difference between a population averaged model and individual model in multilevel logistic regression. Both model can be valid, but answer different questions. Alternatively if you think that my last simulation describes the data generating process best, then you obviously would not want to adjust as there is nothing to adjust.

                              Originally posted by Richard Williams View Post
                              Semi-related, fracreg pretty much forces you to use robust standard errors. But if I know for a fact (or am 99% certain) that a logit or probit distribution generated the data, would it be ok to use non-robust errors?
                              The logit and probit parts are link functions not distributions. In fact the whole point of fracreg is that you don't specify a distribution at all, but use quasi-maximum likelihood to just estimate effects on the conditional mean and nothing else. So there is no way around using robust standard errors when using fracreg. The alternative would be to specify a distribution for the proportion, which is what betareg does.
                              ---------------------------------
                              Maarten L. Buis
                              University of Konstanz
                              Department of history and sociology
                              box 40
                              78457 Konstanz
                              Germany
                              http://www.maartenbuis.nl
                              ---------------------------------

                              Comment

                              Working...
                              X