
  • Inconsistent estimates in an Ordered Choice Model

    I encountered a surprising phenomenon while trying out the `oprobit` and `ologit` commands. Many texts (including the Stata documentation) relate these models to a data-generating process in which a latent continuous variable y* is binned into observed categories y. My understanding was that these models are intended to estimate:
    1) The parameters of the latent continuous variable y*
    2) The cutpoints separating the bins of y* into y
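    For concreteness, here is the latent-variable formulation I have in mind (standard textbook notation, with Φ the standard normal CDF and κ_0 = -∞, κ_J = +∞):

```latex
y^* = x\beta + \varepsilon, \qquad \varepsilon \mid x \sim N(0,1)
y = j \iff \kappa_{j-1} < y^* \le \kappa_j
\Pr(y = j \mid x) = \Phi(\kappa_j - x\beta) - \Phi(\kappa_{j-1} - x\beta)
```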

    The following Stata code produced unusual results:

    Code:
    clear *
    
    local N=999
    set obs `=`N'*3'
    
    gen grp = mod(_n, 3)
    
    gen grp_mean = .
    replace grp_mean = -1 if grp == 0
    replace grp_mean = +0 if grp == 1
    replace grp_mean = +1 if grp == 2
    
    gen x = grp_mean + rnormal()
    
    oprobit grp x
    Code:
    Iteration 0:  Log likelihood =  -3292.541  
    Iteration 1:  Log likelihood = -2566.0424  
    Iteration 2:  Log likelihood = -2562.1137  
    Iteration 3:  Log likelihood = -2562.1062  
    Iteration 4:  Log likelihood = -2562.1062  
    
    Ordered probit regression                              Number of obs =   2,997
                                                           LR chi2(1)    = 1460.87
                                                           Prob > chi2   =  0.0000
    Log likelihood = -2562.1062                            Pseudo R2     =  0.2218
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   .7311482   .0209857    34.84   0.000     .6900169    .7722795
    -------------+----------------------------------------------------------------
           /cut1 |  -.5895484   .0275614                     -.6435677   -.5355291
           /cut2 |   .6169286   .0277134                      .5626113    .6712458
    ------------------------------------------------------------------------------
    The surprising thing was that the cutpoints were not -0.5 and +0.5, as I would have expected. Furthermore, the cutpoints are systematically biased away from, rather than towards, the unconditional mean. To verify this, I ran the above code again with more observations and got the following:
    Code:
    Ordered probit regression                            Number of obs =   299,997
                                                         LR chi2(1)    = 151649.88
                                                         Prob > chi2   =    0.0000
    Log likelihood = -253755.45                          Pseudo R2     =    0.2301
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   .7549289   .0021425   352.35   0.000     .7507297    .7591282
    -------------+----------------------------------------------------------------
           /cut1 |  -.6116182   .0027802                     -.6170672   -.6061691
           /cut2 |   .6096994   .0027812                      .6042484    .6151504
    ------------------------------------------------------------------------------
    I verified that similar results obtain with `ologit` as well. As a final check, I verified that the estimated cutpoints are still incorrect even when using the `offset` option:
    Code:
    . oprobit grp, offset(x)
    
    Iteration 0:  Log likelihood = -272298.18  
    Iteration 1:  Log likelihood = -259882.53  
    Iteration 2:  Log likelihood = -259871.17  
    Iteration 3:  Log likelihood = -259871.17  
    
    Ordered probit regression                              Number of obs = 299,997
    Log likelihood = -259871.17
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |          1  (offset)
    -------------+----------------------------------------------------------------
           /cut1 |  -.6813364    .002864                     -.6869497   -.6757231
           /cut2 |   .6785018   .0028682                      .6728803    .6841234
    ------------------------------------------------------------------------------
    This seems like a straightforward example, so I'm a bit stuck as to what is causing this.
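    To rule out anything Stata-specific, I also reproduced the offset case outside Stata. The sketch below (written from scratch in Python, not taken from Stata's internals) fixes the coefficient at 1, as `offset(x)` does, and maximizes the ordered-probit likelihood over the two cutpoints only; the solution again lands near ±0.68 rather than ±0.5.

```python
# Illustrative sketch of the offset case: beta fixed at 1, only the
# two cutpoints estimated by maximum likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 30_000
grp = np.arange(N) % 3                    # ordinal outcome: 0, 1, 2
x = (grp - 1) + rng.standard_normal(N)    # group means -1, 0, +1 plus N(0,1) noise

def negll(cuts, y, xb):
    """Negative ordered-probit log-likelihood with the index fixed at xb."""
    c1, c2 = cuts
    p0 = norm.cdf(c1 - xb)
    p1 = norm.cdf(c2 - xb) - p0
    p2 = 1.0 - norm.cdf(c2 - xb)
    p = np.choose(y, [p0, p1, p2])        # pick each observation's category probability
    return -np.log(np.clip(p, 1e-300, None)).sum()

res = minimize(negll, x0=[-0.5, 0.5], args=(grp, x), method="Nelder-Mead")
c1_hat, c2_hat = res.x
print(c1_hat, c2_hat)                     # roughly -0.68 and +0.68, not -0.5 and +0.5
```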

  • #2
    My hunch is that you haven't imposed a (conditionally) normal distribution on the latent outcome y*, while -oprobit- assumes your outcome is generated that way. Normality of the distribution of the covariate (x) doesn't come into play.

    Consider an example where normality of f(y*|x) is imposed.

    Code:
    clear *
    
    set seed 123456
    
    local N=9999
    set obs `=`N'*3'
    
    loc c1=-.5
    loc c2=.5
    
    gen grp = mod(_n, 3)
    
    gen grp_mean = .
    replace grp_mean = -1 if grp == 0
    replace grp_mean = +0 if grp == 1
    replace grp_mean = +1 if grp == 2
    
    gen x = grp_mean + rnormal()
    
    gen ystar = x + rnormal()
    gen y=-1*(ystar<`c1') + 0*(ystar>=`c1' & ystar<`c2') + 1*(ystar>=`c2')
    
    oprobit grp x
    oprobit y x
    Results:
    Code:
    . oprobit grp x
    
    Iteration 0:   log likelihood = -32955.073  
    Iteration 1:   log likelihood = -25522.396  
    Iteration 2:   log likelihood = -25473.357  
    Iteration 3:   log likelihood = -25473.256  
    Iteration 4:   log likelihood = -25473.256  
    
    Ordered probit regression                             Number of obs =   29,997
                                                          LR chi2(1)    = 14963.63
                                                          Prob > chi2   =   0.0000
    Log likelihood = -25473.256                           Pseudo R2     =   0.2270
    
    ------------------------------------------------------------------------------
             grp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |    .750116   .0067675   110.84   0.000     .7368521      .76338
    -------------+----------------------------------------------------------------
           /cut1 |  -.6066713   .0087744                     -.6238688   -.5894738
           /cut2 |   .6088155    .008768                      .5916306    .6260005
    ------------------------------------------------------------------------------
    
    . oprobit y x
    
    Iteration 0:   log likelihood =  -32288.09  
    Iteration 1:   log likelihood = -22196.956  
    Iteration 2:   log likelihood = -22179.099  
    Iteration 3:   log likelihood = -22179.055  
    Iteration 4:   log likelihood = -22179.055  
    
    Ordered probit regression                             Number of obs =   29,997
                                                          LR chi2(1)    = 20218.07
                                                          Prob > chi2   =   0.0000
    Log likelihood = -22179.055                           Pseudo R2     =   0.3131
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   .9963435    .008347   119.37   0.000     .9799837    1.012703
    -------------+----------------------------------------------------------------
           /cut1 |  -.4901051   .0091592                     -.5080567   -.4721535
           /cut2 |   .5060214    .009177                      .4880348     .524008
    ------------------------------------------------------------------------------



    • #3
      Thanks for providing a "null" case in which it does recover the parameters. I think there are two mutually consistent explanations for the phenomenon:
      1) As you said, the model assumes that the latent variable is conditionally normal given x in a particular way: the noise is independent of x. In my simulated data, y* was a degenerate point distribution at each group mean, so the true noise was correlated with x, violating that assumption.
      2) In general, the maximum likelihood estimate of a cutpoint is *not* the midpoint between the adjacent group means; it is pushed away from the middle. I got this intuition from differentiating the likelihood function of the ordered probit (the logit case would be the same): there it is clear that the first-order condition for a cutpoint trades off the probability of being in group 1 versus group 2, rather than group 1 versus its complement.
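      To back up point 2, the pushed-out cutpoints are not a small-sample artifact. In the quick check below (sketched in Python rather than Stata; it assumes the coefficient is fixed at 1, as in the offset run above, and equal group weights), I maximize the *expected* ordered-probit log-likelihood under the original DGP, computing the expectation by Gauss-Hermite quadrature. The population-optimal cutpoints come out near ±0.68, not ±0.5.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Gauss-Hermite rule: E[f(x)] for x ~ N(mu, 1) is
# (1/sqrt(pi)) * sum_i w_i * f(mu + sqrt(2) * t_i)
t, w = np.polynomial.hermite.hermgauss(80)
means = np.array([-1.0, 0.0, 1.0])   # the three group means from the DGP

def expected_negll(cuts):
    """Minus the expected per-observation log-likelihood, coefficient fixed at 1."""
    c1, c2 = cuts
    total = 0.0
    for g, mu in enumerate(means):
        x = mu + np.sqrt(2.0) * t    # quadrature nodes for this group's x distribution
        probs = [norm.cdf(c1 - x),
                 norm.cdf(c2 - x) - norm.cdf(c1 - x),
                 1.0 - norm.cdf(c2 - x)]
        p = np.clip(probs[g], 1e-300, None)
        total += (w * np.log(p)).sum() / np.sqrt(np.pi)
    return -total / 3.0              # equal 1/3 weight on each group

res = minimize(expected_negll, x0=[-0.5, 0.5], method="Nelder-Mead")
c1_pop, c2_pop = res.x
print(c1_pop, c2_pop)                # roughly -0.68 and +0.68
```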



      • #4
        P.S. to #2. Among other topics, this recent paper discusses some features of ordered regression models that may not be widely appreciated.

        https://www.sciencedirect.com/scienc...67629624000201



        • #5
          Congratulations on the paper and thanks for sharing it, John Mullahy. Another excellent resource on ordered regression models is Frank Harrell; there is lots of great material on the ordinal model at Frank's hbiostat website.



          • #6
            Originally posted by John Mullahy:
            P.S. to #2. Among other topics this recent paper discusses some features of ordered regression models that may not be widely appreciated.

            https://www.sciencedirect.com/scienc...67629624000201
            Thanks for the reference

