
  • Can I use NBREG for a non-negative integer that doesn't represent a count of occurrences?

    Dear all,

    I have a question regarding the appropriateness of using negative binomial regression (NBREG) for a dependent variable that is a non-negative integer but doesn’t represent a count of occurrences.

    My dependent variable is project performance, categorized by the company into about 10 intervals based on project margin (e.g., category one for negative project margin; category two for margins from 0 to 10,000; category three for margins from 10,000 to 50,000; ...; the last category is for margins beyond 1,000,000). Each category represents a different interval range.

    I'm considering both ordered probit and negative binomial models. However, I’ve read that having too many categories can make interpreting coefficients in an ordered probit model challenging. On the other hand, while the negative binomial model is typically used for count variables representing occurrences, I wonder if it can be applied to my dependent variable scenario.

    Are there any papers, books, or posts that support using negative binomial regression for a variable like mine?

    I appreciate your time and effort in answering my question.

  • #2
    OLS will do. See post #11 here from J. Wooldridge.
    https://www.statalist.org/forums/forum/general-stata-discussion/general/1366887-decile-as-dependent-variable-what-should-be-the-right-model
    The problem is the scale of the coefficients: estimated on the 1, 2, 3, ..., 10 scale, they will not match the actual effect sizes. You might try scaling by the mean of the categories, as in the simulation below, though that presents problems at the lowest and highest categories, so you'll have to noodle with that. There are some rules of thumb:
    https://www.scielo.br/j/rsp/a/SFCpXVvpPVWZcMSKtwdkM9s/?lang=en
    Code:
    clear all
    version 18
    set seed 12345                        // for reproducibility
    
    set obs 1000
    g id = _n
    
    * two continuous regressors and a binary treatment
    g x = rnormal(10,3)
    g z = rnormal(10,4)
    g t = runiform() > 0.50
    
    * true model: p = 10 + 1*x - 0.75*z - 5*t + e
    g p = 10 + 1*x - 0.75*z - 5*t + rnormal(0,2)
    g lp = ln(p)                          // unused below; missing where p < 0
    
    * bin the continuous outcome into deciles (official xtile command;
    * the egen xtile() function requires the user-written egenmore package)
    xtile y = p, nq(10)
    scatter y p
    g yf = y/10
    tab yf
    
    * mean of p within each decile, saved in r(Stat1)...r(Stat10)
    tabstat p, by(y) save
    g ys = .
    forv i = 1/10 {
        replace ys = r(Stat`i')[1,1] if y==`i'
    }
    
    * benchmark: the underlying continuous outcome
    reg p x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    * the 1-10 category scale
    reg y x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    * the category means
    reg ys x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    * count models on the category means, for comparison
    nbreg ys x z t
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)
    
    poisson ys x z t , r
    margins, dydx(t)
    margins, dydx(x)
    margins, dydx(z)

    • #3
      It seems you actually have data censoring where you know the so-called cut points. In other words, if it were reported, you could use the project margin itself. Are you interested in the effects on the project margin or on the (arbitrarily) defined categories? If the former, then use intreg and specify the upper and lower cutoffs as above. Then you interpret the regression coefficients as if you had observed the margins.
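
      A minimal sketch of that setup, assuming the roughly 10 categories described in #1 sit in a variable cat and using the cutoffs reported there (cat, x1-x3, and the category numbering are placeholders, not the poster's actual data):

      Code:
      * placeholder names: cat is the performance category, x1-x3 the regressors
      g double mlow  = .
      g double mhigh = .
      replace mhigh = 0       if cat==1    // negative margin: left-censored at 0
      replace mlow  = 0       if cat==2    // [0, 10,000)
      replace mhigh = 10000   if cat==2
      replace mlow  = 10000   if cat==3    // [10,000, 50,000)
      replace mhigh = 50000   if cat==3
      * ... remaining interior categories filled in the same way ...
      replace mlow  = 1000000 if cat==10   // beyond 1,000,000: right-censored
      
      intreg mlow mhigh x1 x2 x3

      Leaving mlow missing in the bottom category and mhigh missing in the top one is what tells intreg those observations are left- and right-censored.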

      • #4
        https://stats.oarc.ucla.edu/stata/dae/interval-regression/

        • #5
          Originally posted by Jeff Wooldridge View Post
          It seems you actually have data censoring where you know the so-called cut points. [...]
          Thanks a lot, Jeff, for the quick response!

          I'm studying the effects of team-level experiential diversity on project margin to see if it's positive, negative, or non-significant. So far, my research leans towards a positive effect. Interestingly, this positive effect remains consistent when using both interval regression (intreg) and negative binomial regression (nbreg).

          I've got a couple of questions:
          1. The intreg model assumes normality, but my original dependent variable, as well as its transformed versions (depvar1, depvar2) and their log-transformed forms (log_depvar1, log_depvar2), is not normally distributed. Given this, is intreg still the method of choice?
          2. Just out of curiosity, is it acceptable to use NBREG for a non-negative integer that doesn't exactly represent a count of occurrences?

          • #6
            Originally posted by George Ford View Post
            OLS will do. See post #11 here from J. Wooldridge. [...]
            Thanks a lot, George, for your detailed reply! I have a question about the lowest category: the file the company sent us only indicates "negative contribution margin" for it. If I use a "midpoint" strategy for the categories, is there a good heuristic for deciding how negative to set this one?
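
            To make the question concrete, here is a sketch of the midpoint strategy (cat and the regressors are placeholder names; the value for the open-ended bottom category is exactly the number in question, and the top-category value uses the common heuristic of 1.5 times the interval's lower bound):

            Code:
            * illustrative midpoints; the bottom and top categories are open-ended,
            * so their values are assumptions to vary in sensitivity checks
            g double ymid = .
            replace ymid = -5000    if cat==1    // assumed: "how negative" is the question
            replace ymid = 5000     if cat==2    // midpoint of [0, 10,000)
            replace ymid = 30000    if cat==3    // midpoint of [10,000, 50,000)
            * ... remaining interior categories ...
            replace ymid = 1500000  if cat==10   // 1.5 x the 1,000,000 lower bound
            
            reg ymid x1 x2 x3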

            • #7
              As a general rule, when J. Wooldridge says do X, then with high probability it's a good idea to do X. (I'd have deleted my post after seeing his, were that permitted.)

              I've never used intreg, but it appears to do the trick. Better yet, you don't have to arbitrarily set the lower/upper midpoints.

              Here's a simple demonstration.

              Code:
              clear all
              version 18
              
              * 200 Monte Carlo replications of intreg on interval-censored data;
              * slope estimates are collected in R (no seed is set, so results
              * will vary slightly from run to run)
              matrix R = J(200,3,.)
              
              forv i = 1/200 {
                  quietly {
                      drop _all
                      set obs 5000
                      g x = rchi2(5)
                      g z = rnormal(10,20)
                      g t = runiform() > 0.50
                      * true model: p = 50 + 20*x - 2*z - 10*t + e
                      g p = 50 + 20*x - 2*z - 10*t + rnormal(0,20)
                      *hist p
                      * bin p into known intervals; cut() labels bins by lower bound
                      egen y = cut(p), at(-1000,0,50,100,150,200,250,300,350,400,1000)
                      * bottom bin is left-censored: lower bound set to missing
                      recode y (-1000 = .) , g(ylow)
                      g yhigh = ylow + 50
                      replace yhigh = 0 if ylow==.
                      * top bin is right-censored: upper bound set to missing
                      replace yhigh = . if ylow==400
                      intreg ylow yhigh x z t
                      matrix R[`i',1] = e(b)[1,1]
                      matrix R[`i',2] = e(b)[1,2]
                      matrix R[`i',3] = e(b)[1,3]
                  }
              }
              capture drop R*
              svmat R
              summ R*
              Code:
                  Variable |        Obs        Mean    Std. dev.       Min        Max
              -------------+---------------------------------------------------------
                        R1 |        200    19.98064    .1237382   19.65523   20.29785
                        R2 |        200   -1.998237    .0174618  -2.039773   -1.94177
                        R3 |        200   -9.915153    .7431672  -12.01851  -7.576377

              The means sit right on the true coefficients from the data-generating process (20, -2, -10), so intreg recovers the effects on the underlying margin without any midpoint assumptions.
