
  • Can I use NBREG for a non-negative integer that doesn't represent a count of occurrences?

    Dear all,

    I have a question regarding the appropriateness of using negative binomial regression (NBREG) for a dependent variable that is a non-negative integer but doesn’t represent a count of occurrences.

    My dependent variable is project performance, categorized by the company into about 10 intervals based on project margin (e.g., category one for negative project margin; category two for margins from 0 to 10,000; category three for margins from 10,000 to 50,000; ...; the last category is for margins beyond 1,000,000). Each category represents a different interval range.

    I'm considering both ordered probit and negative binomial models. However, I’ve read that having too many categories can make interpreting coefficients in an ordered probit model challenging. On the other hand, while the negative binomial model is typically used for count variables representing occurrences, I wonder if it can be applied to my dependent variable scenario.

    Are there any papers, books, or posts that support using negative binomial regression for a variable like mine?

    I appreciate your time and effort in answering my question.

  • #2
    OLS will do. See post #11 here from J. Wooldridge.
    https://www.statalist.org/forums/forum/general-stata-discussion/general/1366887-decile-as-dependent-variable-what-should-be-the-right-model
    The problem is the scale of the coefficients: estimated on the 1, 2, 3, ..., 10 scale, they will not match the actual effect sizes. You might try scaling by the mean of the categories, as in the simulation below, though that presents problems at the lowest and highest categories, so you'll have to noodle with that. There are some rules of thumb:
    https://www.scielo.br/j/rsp/a/SFCpXVvpPVWZcMSKtwdkM9s/?lang=en
    Code:
    clear all
    version 18
    set seed 12345                        // for reproducibility
    
    set obs 1000
    g id = _n
    
    * two continuous regressors and a binary treatment
    g x = rnormal(10,3)
    g z = rnormal(10,4)
    g t = runiform() > 0.50
    
    * true model: p = 10 + 1*x - 0.75*z - 5*t + e
    g p = 10 + 1*x - 0.75*z - 5*t + rnormal(0,2)
    g lp = ln(p)                          // unused below; missing where p < 0
    
    * bin the continuous outcome into deciles (official xtile command;
    * the egen xtile() function requires the user-written egenmore package)
    xtile y = p, nq(10)
    scatter y p
    g yf = y/10
    tab yf
    
    * mean of p within each decile, saved in r(Stat1)...r(Stat10)
    tabstat p, by(y) save
    g ys = .
    forv i = 1/10 {
        replace ys = r(Stat`i')[1,1] if y==`i'
    }
    
    * benchmark: the underlying continuous outcome
    reg p x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    * the 1-10 category scale
    reg y x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    * the category means
    reg ys x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    * count models on the category means, for comparison
    nbreg ys x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    poisson ys x z t , r
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)

    • #3
      It seems you actually have data censoring where you know the so-called cut points. In other words, if it were reported, you could use the project margin itself. Are you interested in the effects on the project margin or on the (arbitrarily) defined categories? If the former, then use intreg and specify the upper and lower cutoffs as above. Then you interpret the regression coefficients as if you had observed the margins.
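
      A minimal sketch of that setup, assuming the roughly 10 categories described in #1 sit in a variable cat and using the cutoffs reported there (cat, x1-x3, and the category numbering are placeholders, not the poster's actual data):

      Code:
      * placeholder names: cat is the performance category, x1-x3 the regressors
      g double mlow  = .
      g double mhigh = .
      replace mhigh = 0       if cat==1    // negative margin: left-censored at 0
      replace mlow  = 0       if cat==2    // [0, 10,000)
      replace mhigh = 10000   if cat==2
      replace mlow  = 10000   if cat==3    // [10,000, 50,000)
      replace mhigh = 50000   if cat==3
      * ... remaining interior categories filled in the same way ...
      replace mlow  = 1000000 if cat==10   // beyond 1,000,000: right-censored
      
      intreg mlow mhigh x1 x2 x3

      Leaving mlow missing in the bottom category and mhigh missing in the top one is what tells intreg those observations are left- and right-censored.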

      • #4
        https://stats.oarc.ucla.edu/stata/dae/interval-regression/

        • #5
          Originally posted by Jeff Wooldridge View Post
          It seems you actually have data censoring where you know the so-called cut points. [...]
          Thanks a lot, Jeff, for the quick response!

          I'm studying the effects of team-level experiential diversity on project margin to see if it's positive, negative, or non-significant. So far, my research leans towards a positive effect. Interestingly, this positive effect remains consistent when using both interval regression (intreg) and negative binomial regression (nbreg).

          I've got a couple of questions:
          1. The intreg model assumes normality, but my original dependent variable, as well as its transformed versions (depvar1, depvar2) and their log-transformed forms (log_depvar1, log_depvar2), is not normally distributed. Given this, is intreg still the method of choice?
          2. Just out of curiosity, is it acceptable to use NBREG for a non-negative integer that doesn't exactly represent a count of occurrences?

          • #6
            Originally posted by George Ford View Post
            OLS will do. See post #11 here from J. Wooldridge. [...]
            Thanks a lot, George, for your detailed reply! I have a question about the lowest category: the file the company sent us only indicates "negative contribution margin" for it. If I use a "midpoint" strategy for the categories, is there a good heuristic for deciding how negative to set this one?
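
            To make the question concrete, here is a sketch of the midpoint strategy (cat and the regressors are placeholder names; the value for the open-ended bottom category is exactly the number in question, and the top-category value uses the common heuristic of 1.5 times the interval's lower bound):

            Code:
            * illustrative midpoints; the bottom and top categories are open-ended,
            * so their values are assumptions to vary in sensitivity checks
            g double ymid = .
            replace ymid = -5000    if cat==1    // assumed: "how negative" is the question
            replace ymid = 5000     if cat==2    // midpoint of [0, 10,000)
            replace ymid = 30000    if cat==3    // midpoint of [10,000, 50,000)
            * ... remaining interior categories ...
            replace ymid = 1500000  if cat==10   // 1.5 x the 1,000,000 lower bound
            
            reg ymid x1 x2 x3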

            • #7
              As a general rule, when J. Wooldridge says do X, then with high probability it's a good idea to do X. (I'd have deleted my post after seeing his, were that permitted.)

              I've never used intreg, but it appears to do the trick. Better yet, you don't have to arbitrarily set the lower/upper midpoints.

              Here's a simple demonstration.

              Code:
              clear all
              version 18
              
              * 200 Monte Carlo replications of intreg on interval-censored data;
              * slope estimates are collected in R (no seed is set, so results
              * will vary slightly from run to run)
              matrix R = J(200,3,.)
              
              forv i = 1/200 {
                  quietly {
                      drop _all
                      set obs 5000
                      g x = rchi2(5)
                      g z = rnormal(10,20)
                      g t = runiform() > 0.50
                      * true model: p = 50 + 20*x - 2*z - 10*t + e
                      g p = 50 + 20*x - 2*z - 10*t + rnormal(0,20)
                      *hist p
                      * bin p into known intervals; cut() labels bins by lower bound
                      egen y = cut(p), at(-1000,0,50,100,150,200,250,300,350,400,1000)
                      * bottom bin is left-censored: lower bound set to missing
                      recode y (-1000 = .) , g(ylow)
                      g yhigh = ylow + 50
                      replace yhigh = 0 if ylow==.
                      * top bin is right-censored: upper bound set to missing
                      replace yhigh = . if ylow==400
                      intreg ylow yhigh x z t
                      matrix R[`i',1] = e(b)[1,1]
                      matrix R[`i',2] = e(b)[1,2]
                      matrix R[`i',3] = e(b)[1,3]
                  }
              }
              capture drop R*
              svmat R
              summ R*
              Code:
                  Variable |        Obs        Mean    Std. dev.       Min        Max
              -------------+---------------------------------------------------------
                        R1 |        200    19.98064    .1237382   19.65523   20.29785
                        R2 |        200   -1.998237    .0174618  -2.039773   -1.94177
                        R3 |        200   -9.915153    .7431672  -12.01851  -7.576377

              The means sit right on the true coefficients from the data-generating process (20, -2, -10), so intreg recovers the effects on the underlying margin without any midpoint assumptions.
