
  • Inconsistent estimates in an Ordered Choice Model

    I encountered a surprising phenomenon while trying out the `oprobit` and `ologit` commands. Many texts (including the Stata documentation) relate these models to a data-generating process in which a latent continuous variable y* is binned into observed categories y. My understanding was that these models are intended to estimate:
    1) The parameters of the latent continuous variable y*
    2) The cutpoints separating the bins of y* into y
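    For concreteness, here is the latent-variable formulation I have in mind (standard textbook notation, with Φ the standard normal CDF and κ_0 = -∞, κ_J = +∞):

```latex
y^* = x\beta + \varepsilon, \qquad \varepsilon \mid x \sim N(0,1)
y = j \iff \kappa_{j-1} < y^* \le \kappa_j
\Pr(y = j \mid x) = \Phi(\kappa_j - x\beta) - \Phi(\kappa_{j-1} - x\beta)
```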

    The following Stata code produced unusual results:

    Code:
    clear *
    
    local N=999
    set obs `=`N'*3'
    
    gen grp = mod(_n, 3)
    
    gen grp_mean = .
    replace grp_mean = -1 if grp == 0
    replace grp_mean = +0 if grp == 1
    replace grp_mean = +1 if grp == 2
    
    gen x = grp_mean + rnormal()
    
    oprobit grp x
    Code:
    Iteration 0:  Log likelihood =  -3292.541  
    Iteration 1:  Log likelihood = -2566.0424  
    Iteration 2:  Log likelihood = -2562.1137  
    Iteration 3:  Log likelihood = -2562.1062  
    Iteration 4:  Log likelihood = -2562.1062  
    
    Ordered probit regression                              Number of obs =   2,997
                                                           LR chi2(1)    = 1460.87
                                                           Prob > chi2   =  0.0000
    Log likelihood = -2562.1062                            Pseudo R2     =  0.2218
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   .7311482   .0209857    34.84   0.000     .6900169    .7722795
    -------------+----------------------------------------------------------------
           /cut1 |  -.5895484   .0275614                     -.6435677   -.5355291
           /cut2 |   .6169286   .0277134                      .5626113    .6712458
    ------------------------------------------------------------------------------
    The surprising thing was that the cutpoints were not -0.5 and +0.5, as I would have expected. Furthermore, the cutpoints are systematically biased away from, rather than towards, the unconditional mean. To verify this, I ran the above code again with more observations and got the following:
    Code:
    Ordered probit regression                            Number of obs =   299,997
                                                         LR chi2(1)    = 151649.88
                                                         Prob > chi2   =    0.0000
    Log likelihood = -253755.45                          Pseudo R2     =    0.2301
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   .7549289   .0021425   352.35   0.000     .7507297    .7591282
    -------------+----------------------------------------------------------------
           /cut1 |  -.6116182   .0027802                     -.6170672   -.6061691
           /cut2 |   .6096994   .0027812                      .6042484    .6151504
    ------------------------------------------------------------------------------
    I verified that similar results obtain with `ologit` as well. As a final check, I verified that the estimated cutpoints are still incorrect even when using the `offset` option:
    Code:
    . oprobit grp, offset(x)
    
    Iteration 0:  Log likelihood = -272298.18  
    Iteration 1:  Log likelihood = -259882.53  
    Iteration 2:  Log likelihood = -259871.17  
    Iteration 3:  Log likelihood = -259871.17  
    
    Ordered probit regression                              Number of obs = 299,997
    Log likelihood = -259871.17
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |          1  (offset)
    -------------+----------------------------------------------------------------
           /cut1 |  -.6813364    .002864                     -.6869497   -.6757231
           /cut2 |   .6785018   .0028682                      .6728803    .6841234
    ------------------------------------------------------------------------------
    This seems like a straightforward example, so I'm a bit stuck as to what is causing this.
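    To rule out anything Stata-specific, I also reproduced the offset case outside Stata. The sketch below (written from scratch in Python, not taken from Stata's internals) fixes the coefficient at 1, as `offset(x)` does, and maximizes the ordered-probit likelihood over the two cutpoints only; the solution again lands near ±0.68 rather than ±0.5.

```python
# Illustrative sketch of the offset case: beta fixed at 1, only the
# two cutpoints estimated by maximum likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 30_000
grp = np.arange(N) % 3                    # ordinal outcome: 0, 1, 2
x = (grp - 1) + rng.standard_normal(N)    # group means -1, 0, +1 plus N(0,1) noise

def negll(cuts, y, xb):
    """Negative ordered-probit log-likelihood with the index fixed at xb."""
    c1, c2 = cuts
    p0 = norm.cdf(c1 - xb)
    p1 = norm.cdf(c2 - xb) - p0
    p2 = 1.0 - norm.cdf(c2 - xb)
    p = np.choose(y, [p0, p1, p2])        # pick each observation's category probability
    return -np.log(np.clip(p, 1e-300, None)).sum()

res = minimize(negll, x0=[-0.5, 0.5], args=(grp, x), method="Nelder-Mead")
c1_hat, c2_hat = res.x
print(c1_hat, c2_hat)                     # roughly -0.68 and +0.68, not -0.5 and +0.5
```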

  • #2
    My hunch is that you haven't imposed a (conditionally) normal distribution on the latent outcome y*, while -oprobit- assumes your outcome is generated that way. Normality of the distribution of the covariate (x) doesn't come into play.

    Consider an example where normality of f(y*|x) is imposed.

    Code:
    clear *
    
    set seed 123456
    
    local N=9999
    set obs `=`N'*3'
    
    loc c1=-.5
    loc c2=.5
    
    gen grp = mod(_n, 3)
    
    gen grp_mean = .
    replace grp_mean = -1 if grp == 0
    replace grp_mean = +0 if grp == 1
    replace grp_mean = +1 if grp == 2
    
    gen x = grp_mean + rnormal()
    
    gen ystar = x + rnormal()
    gen y=-1*(ystar<`c1') + 0*(ystar>=`c1' & ystar<`c2') + 1*(ystar>=`c2')
    
    oprobit grp x
    oprobit y x
    Results:
    Code:
    . oprobit grp x
    
    Iteration 0:   log likelihood = -32955.073  
    Iteration 1:   log likelihood = -25522.396  
    Iteration 2:   log likelihood = -25473.357  
    Iteration 3:   log likelihood = -25473.256  
    Iteration 4:   log likelihood = -25473.256  
    
    Ordered probit regression                             Number of obs =   29,997
                                                          LR chi2(1)    = 14963.63
                                                          Prob > chi2   =   0.0000
    Log likelihood = -25473.256                           Pseudo R2     =   0.2270
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |    .750116   .0067675   110.84   0.000     .7368521      .76338
    -------------+----------------------------------------------------------------
           /cut1 |  -.6066713   .0087744                     -.6238688   -.5894738
           /cut2 |   .6088155    .008768                      .5916306    .6260005
    ------------------------------------------------------------------------------
    
    . oprobit y x
    
    Iteration 0:   log likelihood =  -32288.09  
    Iteration 1:   log likelihood = -22196.956  
    Iteration 2:   log likelihood = -22179.099  
    Iteration 3:   log likelihood = -22179.055  
    Iteration 4:   log likelihood = -22179.055  
    
    Ordered probit regression                             Number of obs =   29,997
                                                          LR chi2(1)    = 20218.07
                                                          Prob > chi2   =   0.0000
    Log likelihood = -22179.055                           Pseudo R2     =   0.3131
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   .9963435    .008347   119.37   0.000     .9799837    1.012703
    -------------+----------------------------------------------------------------
           /cut1 |  -.4901051   .0091592                     -.5080567   -.4721535
           /cut2 |   .5060214    .009177                      .4880348     .524008
    ------------------------------------------------------------------------------



    • #3
      Thanks for providing a "null" case in which it does recover the parameters. I think there are two mutually consistent explanations for the phenomenon:
      1) As you said, the model assumes that the latent variable is conditionally normal given x in a particular way: the noise is independent of x. In my simulated data, y* was a degenerate point distribution at each group mean, so the true noise was correlated with x, violating that assumption.
      2) In general, the maximum likelihood estimate of a cutpoint is *not* the midpoint between the adjacent group means; it is pushed away from the middle. I got this intuition from differentiating the likelihood function of the ordered probit (the logit case would be the same): there it is clear that the first-order condition for a cutpoint trades off the probability of being in group 1 versus group 2, rather than group 1 versus its complement.
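      To back up point 2, the pushed-out cutpoints are not a small-sample artifact. In the quick check below (sketched in Python rather than Stata; it assumes the coefficient is fixed at 1, as in the offset run above, and equal group weights), I maximize the *expected* ordered-probit log-likelihood under the original DGP, computing the expectation by Gauss-Hermite quadrature. The population-optimal cutpoints come out near ±0.68, not ±0.5.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Gauss-Hermite rule: E[f(x)] for x ~ N(mu, 1) is
# (1/sqrt(pi)) * sum_i w_i * f(mu + sqrt(2) * t_i)
t, w = np.polynomial.hermite.hermgauss(80)
means = np.array([-1.0, 0.0, 1.0])   # the three group means from the DGP

def expected_negll(cuts):
    """Minus the expected per-observation log-likelihood, coefficient fixed at 1."""
    c1, c2 = cuts
    total = 0.0
    for g, mu in enumerate(means):
        x = mu + np.sqrt(2.0) * t    # quadrature nodes for this group's x distribution
        probs = [norm.cdf(c1 - x),
                 norm.cdf(c2 - x) - norm.cdf(c1 - x),
                 1.0 - norm.cdf(c2 - x)]
        p = np.clip(probs[g], 1e-300, None)
        total += (w * np.log(p)).sum() / np.sqrt(np.pi)
    return -total / 3.0              # equal 1/3 weight on each group

res = minimize(expected_negll, x0=[-0.5, 0.5], method="Nelder-Mead")
c1_pop, c2_pop = res.x
print(c1_pop, c2_pop)                # roughly -0.68 and +0.68
```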



      • #4
        P.S. to #2. Among other topics, this recent paper discusses some features of ordered regression models that may not be widely appreciated.

        https://www.sciencedirect.com/scienc...67629624000201



        • #5
          Congratulations on the paper and thanks for sharing it, John Mullahy. Another excellent resource on ordered regression models is Frank Harrell; there is lots of great material on the ordinal model at Frank's hbiostat website.



          • #6
            Originally posted by John Mullahy:
            P.S. to #2. Among other topics this recent paper discusses some features of ordered regression models that may not be widely appreciated.

            https://www.sciencedirect.com/scienc...67629624000201
            Thanks for the reference

