  • Interval regression

    Hello!

    I would be glad to hear your opinion on this.

    My dependent variable (y_var) counts the number of times a certain event occurs for each observation, ranging from 0 to 11. As you can see below, 54% of my sample has y_var equal to 0:
    Code:
    tab y_var, m
    
          y_var |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        111       53.88       53.88
              1 |          5        2.43       56.31
              2 |         31       15.05       71.36
              3 |         11        5.34       76.70
              4 |          7        3.40       80.10
              5 |         18        8.74       88.83
              6 |          8        3.88       92.72
              7 |          3        1.46       94.17
              8 |          2        0.97       95.15
              9 |          5        2.43       97.57
             10 |          2        0.97       98.54
             11 |          3        1.46      100.00
    ------------+-----------------------------------
          Total |        206      100.00
    My independent variable of interest is a categorical variable that counts the number of correct answers on a certain test:
    Code:
    tab crt, m
    
          nº of |
        answers |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         56       27.18       27.18
              1 |         40       19.42       46.60
              2 |         46       22.33       68.93
              3 |         64       31.07      100.00
    ------------+-----------------------------------
          Total |        206      100.00
    Naturally, since my dependent variable counts the number of times that an individual exhibits the event in the data, I was thinking of exploring this relationship using a negative binomial model or a zero-inflated model.
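
    A minimal sketch of the two count models mentioned above, assuming the variable names y_var and crt from the tables. The inflate() specification and the vce(robust) option are assumptions made for illustration, not a recommended specification:
    Code:
    * Negative binomial regression of the event count on the test-score categories
    nbreg y_var i.crt, vce(robust)
    
    * Zero-inflated negative binomial; inflate() lists the predictors of the excess zeros
    * (using i.crt there is an assumption made for illustration)
    zinb y_var i.crt, inflate(i.crt) vce(robust)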

    However, I came up with an idea that might allow me to explore this relationship using an interval regression as well. I hope to hear your opinion on this:

    I have defined a new dependent variable with 3 categories: category 1 contains everyone with exactly 11 events; category 2, everyone with between 1 and 10 events; and category 3, everyone with 0 events:

    Code:
    tab new_yvar, m
    
       new_yvar |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          3        1.46        1.46
              2 |         92       44.66       46.12
              3 |        111       53.88      100.00
    ------------+-----------------------------------
          Total |        206      100.00
    Since I know the cut-off values (i.e. 1, 2, 3, … 11), I am able to create the upper (y2) and lower (y1) limits of each of these three categories:
    Code:
    g y1 = .
    g y2 = .
    replace y1 = . if new_yvar == 3
    replace y2 = 0 if new_yvar == 3
    replace y1 = 1 if new_yvar == 2
    replace y2 = 10 if new_yvar == 2
    replace y1 = 11 if new_yvar == 1
    replace y2 = . if new_yvar == 1
    
    sum y1 y2      
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
              y1 |         95    1.315789     1.75804          1         11
              y2 |        203     4.53202    4.990358          0         10
    Finally, I have set up a regression model, and estimated it through an interval regression:
    Code:
     intreg y1 y2 i.crt, robust nolog
    
    Interval regression                             Number of obs     =        206
                                                       Uncensored     =          0
                                                       Left-censored  =        111
                                                       Right-censored =          3
                                                       Interval-cens. =         92
    
                                                    Wald chi2(3)      =       5.21
    Log pseudolikelihood = -171.53029               Prob > chi2       =     0.1569
    
    ------------------------------------------------------------------------------
                 |               Robust
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             crt |
              0  |          0  (base)
              1  |  -1.862053   1.273837    -1.46   0.144    -4.358727    .6346216
              2  |  -2.344412   1.287257    -1.82   0.069    -4.867389     .178565
              3  |  -2.226667   1.153804    -1.93   0.054    -4.488082     .034748
                 |
           _cons |   1.279063   .8225653     1.55   0.120    -.3331352    2.891262
    -------------+----------------------------------------------------------------
        /lnsigma |   1.698164   .0856653    19.82   0.000     1.530263    1.866065
    -------------+----------------------------------------------------------------
           sigma |   5.463908   .4680675                      4.619393    6.462817
    ------------------------------------------------------------------------------
    If my exercise is correct, I would be able to interpret the coefficients from the regression output directly; for example, having 3 correct answers on the crt test would, on average, decrease the number of events exhibited by about 2.2 relative to having none.

    I am wondering whether the exercise I am proposing, treating my dependent variable as an ordered variable, makes any sense. If it does, would it be reasonable to compare my interval regression results with the results from a model for count data?

    Any further suggestion is very welcome!

    Many thanks!

  • #2
    The number of events is necessarily discrete, so I don't see how an interval regression would make sense. Interval regression is for situations where the variable that you want to measure is continuous, but you happened to use a question with a limited number of answer categories. For example, you asked someone's age, and you gave them the options of saying 21-30, 31-40, etc.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------


    • #3
      Thanks for your answer, Maarten.


      • #4
        Originally posted by Maarten Buis
        The number of events is necessarily discrete, so I don't see how an interval regression would make sense. Interval regression is for situations where the variable that you want to measure is continuous, but you happened to use a question with a limited number of answer categories. For example, you asked someone's age, and you gave them the options of saying 21-30, 31-40, etc.
        Hi Maarten,

        I also have a question on intreg. Let's take Gaston's data as an example. But instead of the number of events (from 0-3), my case is income (a categorical variable), in which 0 corresponds to income <$5000; 1 corresponds to $5000-$10,000; 2 corresponds to $10k-$15k; and 3 corresponds to $15k-$20k. My question is: should I use intreg? If so, how do I run an interval regression in my case, or is it just like #1?

        Thank you.

        DL
        Last edited by Dung Le; 08 Apr 2020, 10:08.


        • #5
          Dung Le, as far as I can tell from Maarten's comment, and since you know your intervals' thresholds, an interval regression might be suitable.


          • #6
            Here are some notes on intreg, which include notes on when it is ok to use it, and how you might modify the model if it doesn't seem to be working well.

            https://www3.nd.edu/~rwilliam/xsoc73994/intreg2.pdf
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            WWW: https://www3.nd.edu/~rwilliam


            • #7
              Thank you, Gaston Fernandez and Richard Williams, for your responses.

              I have read through Richard's notes. Is the following correct?
              Code:
               tab income
              
                   income |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        1 |         69        1.79        1.79
                        2 |        702       18.19       19.98
                        3 |      1,895       49.11       69.09
                        4 |        812       21.04       90.13
                        5 |        336        8.71       98.83
                        6 |         45        1.17      100.00
              ------------+-----------------------------------
                    Total |      3,859      100.00
              
              * Generate a lower limit var
              recode income (1=.) (2=1) (3=2) (4=3) (5=4) (6=5), gen(incl)
              (3859 differences between income and incl)
              
              * Generate an upper limit var
              recode income (6=.), gen(incu)
              (45 differences between income and incu)
              
              intreg y incl incu x1 x2 x3
              Thank you

              DL


              • #8
                How is income coded? That is not consistent with what you said in #4. I’d be surprised if your coding is right.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                WWW: https://www3.nd.edu/~rwilliam


                • #9
                  Originally posted by Richard Williams
                  How is income coded? That is not consistent with what you said in #4. I’d be surprised if your coding is right.
                  Hi Richard,

                  I am sorry for the confusion. Let me explain the income coding in #7:

                  1 corresponds to income <$1000
                  2 corresponds to income from $1000 to <$5000
                  3 corresponds to income from $5000 to <$10,000
                  4 corresponds to income from $10,000 to <$15,000
                  5 corresponds to income from $15,000 to <$20,000
                  6 corresponds to income from $20,000 to <$25,000

                  This income variable is categorical, with the six categories shown above. I think I can use an oprobit regression as an alternative; however, I also want to try an interval regression so that I can compare the estimates of the two.

                  Thank you.


                  • #10
                    Richard Williams thanks for sharing your notes.

                    Dung Le, I am wondering why you would like to estimate it through an oprobit regression if you know the cutpoints of your ordinal variable.


                    • #11
                      First off, I stole most of my notes from the Stata manual!

                      intreg makes assumptions that may be questionable. The Stata manual suggests that you compare results from intreg and oprobit. If the oprobit model fits much better than intreg, then either you shouldn't use intreg or you should modify the intreg model so it fits better, e.g. add an x^2 term.
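
                      A minimal sketch of that comparison, assuming the variables from #7 (income, incl, incu, x1, x2, x3) with incl and incu holding the interval endpoints; the stored-estimate names m_intreg and m_oprobit are hypothetical:
                      Code:
                      * Interval regression on the lower and upper bounds of each income category
                      intreg incl incu x1 x2 x3
                      estimates store m_intreg
                      
                      * Ordered probit on the original category variable, same covariates
                      oprobit income x1 x2 x3
                      estimates store m_oprobit
                      
                      * Side-by-side log likelihoods, AIC and BIC for the two stored models
                      estimates stats m_intreg m_oprobit
                      oprobit estimates its cutpoints freely while intreg fixes them at the known bounds, so oprobit uses more parameters; a much higher oprobit log likelihood is the warning sign described above.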
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      WWW: https://www3.nd.edu/~rwilliam


                      • #12
                        Originally posted by Gaston Fernandez
                        Richard Williams thanks for sharing your notes.

                        Dung Le, I am wondering why you would like to estimate it through an oprobit regression if you know the cutpoints of your ordinal variable.

                        Hi Gaston Fernandez,

                        The reason I may want to use oprobit is the one explained by Richard Williams. My main concern is whether the code I used to generate incl and incu in #7 is correct.


                        • #13
                          The codes you used in #7 are not correct. The codes should correspond to the endpoints of the intervals, e.g. for income category 2, the lower and upper bounds should be coded 1000 and 5000.
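
                          A minimal sketch of that recoding, assuming the category definitions given in #9. Treating category 1 (<$1000) as left-censored with a missing lower bound is an assumption; a lower bound of 0 would be an alternative if income cannot be negative:
                          Code:
                          * Lower bound in dollars; missing marks the open-ended bottom category (assumption)
                          recode income (1=.) (2=1000) (3=5000) (4=10000) (5=15000) (6=20000), gen(incl)
                          
                          * Upper bound in dollars, taken from the category definitions in #9
                          recode income (1=1000) (2=5000) (3=10000) (4=15000) (5=20000) (6=25000), gen(incu)
                          
                          * intreg takes the two bound variables first, followed by the covariates
                          intreg incl incu x1 x2 x3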
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          WWW: https://www3.nd.edu/~rwilliam


                          • #14
                            Originally posted by Richard Williams
                            The codes you used in #7 are not correct. The codes should correspond to the endpoints of the intervals, e.g. for income category 2, the lower and upper bounds should be coded 1000 and 5000.
                            Thank you, I get your point.


                            • #15
                              Hey Gaston, I am curious whether you ever considered evaluating the data using churdle. I have a similar outcome, and I have been investigating the ZIP, ZINB and churdle options.
