Interval regression

Gaston Fernandez

Join Date: Jul 2015
Posts: 27

Interval regression

07 Apr 2020, 14:04

Hello!

I would be glad to hear your opinion on this.

My dependent variable (y_var) counts the number of times a certain event occurs for each observation, ranging from 0 to 11. As you can see below, 54% of my sample has y_var equal to 0:

Code:

tab y_var, m

      y_var |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        111       53.88       53.88
          1 |          5        2.43       56.31
          2 |         31       15.05       71.36
          3 |         11        5.34       76.70
          4 |          7        3.40       80.10
          5 |         18        8.74       88.83
          6 |          8        3.88       92.72
          7 |          3        1.46       94.17
          8 |          2        0.97       95.15
          9 |          5        2.43       97.57
         10 |          2        0.97       98.54
         11 |          3        1.46      100.00
------------+-----------------------------------
      Total |        206      100.00

My independent variable of interest is a categorical variable that counts the number of correct answers on a certain test:

Code:

tab crt, m

      nº of |
    answers |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         56       27.18       27.18
          1 |         40       19.42       46.60
          2 |         46       22.33       68.93
          3 |         64       31.07      100.00
------------+-----------------------------------
      Total |        206      100.00

Naturally, since my dependent variable counts the number of times an individual exhibits the event, I was thinking of exploring this relationship with a negative binomial or a zero-inflated model.

However, I came up with an idea that might allow me to explore this relationship using an interval regression as well. I hope to hear your opinion on this:

I have defined a new dependent variable with 3 categories: category 1, all those with exactly 11 events; category 2, all those with between 1 and 10 events; and category 3, all those with 0 events:

Code:

tab new_yvar, m

   new_yvar |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3        1.46        1.46
          2 |         92       44.66       46.12
          3 |        111       53.88      100.00
------------+-----------------------------------
      Total |        206      100.00

Since I know the cut-off values (i.e. 1, 2, 3, …, 11), I am able to create the lower (y1) and upper (y2) limits of each of these three categories:

Code:

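* Lower (y1) and upper (y2) bounds; a missing value marks an open-ended side
g y1 = .
g y2 = .
* Category 3 (0 events): left-censored at 0
replace y1 = . if new_yvar == 3
replace y2 = 0 if new_yvar == 3
* Category 2 (1 to 10 events): interval-censored
replace y1 = 1 if new_yvar == 2
replace y2 = 10 if new_yvar == 2
* Category 1 (11 events): right-censored at 11
replace y1 = 11 if new_yvar == 1
replace y2 = . if new_yvar == 1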

sum y1 y2      

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          y1 |         95    1.315789     1.75804          1         11
          y2 |        203     4.53202    4.990358          0         10

Finally, I have set up a regression model, and estimated it through an interval regression:

Code:

 intreg y1 y2 i.crt, robust nolog

Interval regression                             Number of obs     =        206
                                                   Uncensored     =          0
                                                   Left-censored  =        111
                                                   Right-censored =          3
                                                   Interval-cens. =         92

                                                Wald chi2(3)      =       5.21
Log pseudolikelihood = -171.53029               Prob > chi2       =     0.1569

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         crt |
          0  |          0  (base)
          1  |  -1.862053   1.273837    -1.46   0.144    -4.358727    .6346216
          2  |  -2.344412   1.287257    -1.82   0.069    -4.867389     .178565
          3  |  -2.226667   1.153804    -1.93   0.054    -4.488082     .034748
             |
       _cons |   1.279063   .8225653     1.55   0.120    -.3331352    2.891262
-------------+----------------------------------------------------------------
    /lnsigma |   1.698164   .0856653    19.82   0.000     1.530263    1.866065
-------------+----------------------------------------------------------------
       sigma |   5.463908   .4680675                      4.619393    6.462817
------------------------------------------------------------------------------

If my exercise is correct, I would be able to interpret the coefficients directly from the regression output; for example, having 3 correct answers on the crt test rather than 0 would, on average, decrease the number of events exhibited by about 2.2.

Does it make sense to treat my dependent variable as an ordered variable in this way? If so, would it be reasonable to compare my interval regression results with those from a count data model?

Any further suggestion is very welcome!

Many thanks!


Maarten Buis

Join Date: Mar 2014

Posts: 3456
#2

07 Apr 2020, 14:16

The number of events is necessarily discrete, so I don't see how an interval regression would make sense. Interval regression is for situations where the variable that you want to measure is continuous, but you happened to use a question with a limited number of answer categories. For example, you asked someone's age, and you gave them the options of saying 21-30, 31-40, etc.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Gaston Fernandez

Join Date: Jul 2015

Posts: 27
#3

08 Apr 2020, 01:47

Thanks for your answer, Maarten.
Comment
Dung Le

Join Date: May 2018

Posts: 120
#4

08 Apr 2020, 10:06

Originally posted by Maarten Buis

The number of events is necessarily discrete, so I don't see how an interval regression would make sense. Interval regression is for situations where the variable that you want to measure is continuous, but you happened to use a question with a limited number of answer categories. For example, you asked someone's age, and you gave them the options of saying 21-30, 31-40, etc.

Hi Maarten,

I also have a question on intreg. Let's take Gaston's data as an example, but instead of the number of events (from 0-3), my case is income (a categorical variable), in which 0 corresponds to income <$5000; 1 corresponds to $5000-$10,000; 2 corresponds to $10k-$15k; and 3 corresponds to $15k-$20k. Should I use intreg? And if so, how do I set up the interval regression in my case? Is it similar to #1?

Thank you.

DL

Last edited by Dung Le; 08 Apr 2020, 10:08.
Comment
Gaston Fernandez

Join Date: Jul 2015

Posts: 27
#5

09 Apr 2020, 05:28

Dung Le, as far as I can tell from Maarten's comment, and since you know the thresholds of your intervals, an interval regression might be suitable.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4987
#6

09 Apr 2020, 05:47

Here are some notes on intreg, which include notes on when it is ok to use it, and how you might modify the model if it doesn't seem to be working well.

https://www3.nd.edu/~rwilliam/xsoc73994/intreg2.pdf

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment

Dung Le

Join Date: May 2018

Posts: 120
#7

09 Apr 2020, 08:20

Thank you, Gaston Fernandez and Richard Williams, for your responses.

I have read through Richard's notes. Is the following correct?

Code:

 tab income

     income |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         69        1.79        1.79
          2 |        702       18.19       19.98
          3 |      1,895       49.11       69.09
          4 |        812       21.04       90.13
          5 |        336        8.71       98.83
          6 |         45        1.17      100.00
------------+-----------------------------------
      Total |      3,859      100.00

* Generate a lower limit var
recode income (1=.) (2=1) (3=2) (4=3) (5=4) (6=5), gen(incl)
(3859 differences between income and incl)

* Generate an upper limit var
recode income (6=.), gen(incu)
(45 differences between income and incu)

intreg y incl incu x1 x2 x3

Thank you

DL

Comment

Richard Williams

Join Date: Apr 2014

Posts: 4987
#8

09 Apr 2020, 08:36

How is income coded? That is not consistent with what you said in #4. I’d be surprised if your coding is right.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Dung Le

Join Date: May 2018

Posts: 120
#9

09 Apr 2020, 09:00

Originally posted by Richard Williams

How is income coded? That is not consistent with what you said in #4. I’d be surprised if your coding is right.

Hi Richard,

Sorry for the confusion. Let me explain how income is coded in #7:

1 corresponds to income <$1000
2 corresponds to income from $1000 to <$5000
3 corresponds to income from $5000 to <$10,000
4 corresponds to income from $10,000 to <$15,000
5 corresponds to income from $15,000 to <$20,000
6 corresponds to income from $20,000 to <$25,000

This income variable is categorical, with the six categories shown above. I think I can use oprobit as an alternative; however, I also want to try interval regression so that I can compare the estimates of the two.

Thank you.
Comment
Gaston Fernandez

Join Date: Jul 2015

Posts: 27
#10

09 Apr 2020, 10:08

Richard Williams thanks for sharing your notes.

Dung Le, I am wondering why you would want to estimate it with oprobit if you know the cutpoints of your ordinal variable?
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4987
#11

09 Apr 2020, 10:17

First off, I stole most of my notes from the Stata manual!

intreg makes assumptions that may be questionable. The Stata manual suggests that you compare results from intreg and oprobit. If the oprobit model fits much better than intreg, then either you shouldn't use intreg or you should modify the intreg model so it fits better, e.g. add an x^2 term.
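A minimal sketch of that comparison, on simulated data (all variable names and numbers here are made up for illustration). It assumes the lowest and highest income categories are treated as open-ended, so that intreg amounts to an oprobit with the cutpoints fixed at the known thresholds; the likelihood-ratio statistic is then computed by hand from the two log likelihoods:

```stata
* Simulated stand-in data: one regressor and a latent income in dollars
clear
set seed 2020
set obs 1000
generate x1 = rnormal()
generate ystar = 9000 + 3000*x1 + 4000*rnormal()

* Six ordered categories with cutpoints at 1000, 5000, 10000, 15000, 20000
generate income = 1 + (ystar >= 1000) + (ystar >= 5000) + (ystar >= 10000) ///
    + (ystar >= 15000) + (ystar >= 20000)

* Bounds in dollars; both extreme categories are left open-ended (assumption)
recode income (1 = .) (2 = 1000) (3 = 5000) (4 = 10000) (5 = 15000) (6 = 20000), generate(incl)
recode income (1 = 1000) (2 = 5000) (3 = 10000) (4 = 15000) (5 = 20000) (6 = .), generate(incu)

* Unrestricted model: five cutpoints are estimated
oprobit income x1
scalar ll_oprobit = e(ll)

* Restricted model: cutpoints fixed, constant and sigma estimated instead
intreg incl incu x1
scalar ll_intreg = e(ll)

* 5 cutpoints versus constant + sigma gives 5 - 2 = 3 degrees of freedom
scalar lr = 2*(ll_oprobit - ll_intreg)
display "LR chi2(3) = " %6.2f lr "   p = " %5.3f chi2tail(3, lr)
```

A large statistic would suggest that the normality and fixed-cutpoint restrictions of intreg do not hold well; with data simulated from a normal latent variable, as here, it should usually be small.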

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Dung Le

Join Date: May 2018

Posts: 120
#12

09 Apr 2020, 10:41

Originally posted by Gaston Fernandez

Richard Williams thanks for sharing your notes.

Dung Le, I am wondering why you would want to estimate it with oprobit if you know the cutpoints of your ordinal variable?

Hi Gaston Fernandez,

The reason I may want to use oprobit is the one explained by Richard Williams. My main concern is whether the code I used to generate incl and incu in #7 is correct.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4987
#13

09 Apr 2020, 10:50

The codes you used in #7 are not correct. The codes should correspond to the endpoints of the intervals, e.g. for income category 2, the lower and upper bounds should be coded 1000 and 5000.
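For the six categories described in #9, the bounds could be built along these lines. This is only a sketch on simulated data: x1 and inc_latent are hypothetical stand-ins, treating the lower bound of category 1 as missing (left-censored) rather than 0 is an assumption, and so is taking $25,000 as the upper bound of category 6.

```stata
* Simulated stand-in data: latent income in dollars, capped below 25000
clear
set seed 12345
set obs 500
generate x1 = rnormal()
generate inc_latent = min(9000 + 3000*x1 + 4000*rnormal(), 24999)

* Income category 1-6, using the cutpoints listed in #9
generate income = 1 + (inc_latent >= 1000) + (inc_latent >= 5000) ///
    + (inc_latent >= 10000) + (inc_latent >= 15000) + (inc_latent >= 20000)

* Lower bound in dollars; missing means open-ended below (left-censored)
recode income (1 = .) (2 = 1000) (3 = 5000) (4 = 10000) (5 = 15000) (6 = 20000), generate(incl)

* Upper bound in dollars; use (6 = .) instead to treat the top category as open-ended
recode income (1 = 1000) (2 = 5000) (3 = 10000) (4 = 15000) (5 = 20000) (6 = 25000), generate(incu)

* intreg takes the two bound variables first, then the regressors
intreg incl incu x1
```

Because the bounds are in dollars, the coefficients are read in dollars of income, which is the main practical gain over category codes.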

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Dung Le

Join Date: May 2018

Posts: 120
#14

09 Apr 2020, 11:16

Originally posted by Richard Williams

The codes you used in #7 are not correct. The codes should correspond to the endpoints of the intervals, e.g. for income category 2, the lower and upper bounds should be coded 1000 and 5000.

Thank you, I get your point.
Comment
Gizem Levent

Join Date: Mar 2019

Posts: 15
#15

09 Apr 2020, 12:13

Hey Gaston, I am curious whether you ever considered analyzing the data using churdle? I have a similar outcome and have been investigating the ZIP, ZINB and churdle options.
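For reference, the count-data candidates mentioned in this thread (negative binomial, ZIP, ZINB) can be fitted and compared on information criteria roughly as follows. This is a sketch on simulated data: the data-generating numbers are arbitrary, and the constant-only inflate() equation is just the simplest choice, not a recommendation.

```stata
* Simulated stand-in data: overdispersed counts with excess zeros
clear
set seed 42
set obs 1000
* crt takes the values 0, 1, 2, 3
generate crt = floor(4*runiform())
* Roughly 40% of observations are structural zeros
generate byte always0 = runiform() < 0.4
* Poisson-gamma mixture gives negative binomial counts for the rest
generate y_var = cond(always0, 0, rpoisson(exp(1.2 - 0.2*crt) * rgamma(2, 0.5)))

* Negative binomial
nbreg y_var i.crt
estimates store nb

* Zero-inflated Poisson, constant-only inflation equation
zip y_var i.crt, inflate(_cons)
estimates store zip

* Zero-inflated negative binomial, constant-only inflation equation
zinb y_var i.crt, inflate(_cons)
estimates store zinb

* Side-by-side log likelihoods, AIC and BIC
estimates stats nb zip zinb
```

Lower AIC and BIC point to the better-fitting specification, though with a sample as small as 206 the differences may not be decisive.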
Comment
