
  • How to incorporate exposure variable in Zero Truncated Negative Binomial Model

    Hello All,

    This is my first time using this platform, so apologies in advance for any faux pas. I am also new to model building and the world of statistics, so please bear with me.
    I am using a ZTNB model where my outcome is the number of waivered physicians at the ZCTA (ZIP Code Tabulation Area) level and my explanatory variables are community characteristics: median income, race, education, marital status, rurality, etc. I already checked the mean = variance assumption, which was not met, hence I am using a negative binomial model; and because my outcome has no zeros, I am using the zero-truncated version.

    However, I am completely confused about the exposure variable that needs to be included in the model to account for differences in scale that may otherwise lead to biased estimation. Please correct me if I am wrong, but I was thinking of using population size as the exposure, because population size will affect the number of waivered physicians available. I am confused about (a) whether this exposure makes sense and (b) how to incorporate the exposure variable in the code. I have read about it a lot and I am really confused: in most forums it was suggested to write "exposure(varname)" after the tnbreg command, but in one piece of Stata documentation I also read to use "vce(cluster varname) nolog" after the tnbreg command. One last method I read about was to enter "offset(log of variable)".

    Therefore, it would be really helpful to understand a little bit about exposure variables and the right way to account for them in a ZTNB model. Additionally, it would be nice to know whether the way to use an exposure variable changes with other count models: Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial.

    I need to submit an abstract for a conference and I am stuck here, so your quick replies can help me a lot.

    Thank You,
    Sadia Jehan
    Last edited by sadia jeha; 25 Dec 2019, 23:39.

  • #2
    Your intuition that you should include the ZCTA's population as an exposure variable in the model is good: after all, if one ZCTA has twice the population of another, all else equal, it would have twice the number of waivered physicians. So you want your results to be understandable as rates per population, rather than as raw counts. This is what -exposure()- and -offset()- do. (Actually, the population of physicians in the ZCTA might be better still, if you have that data.)

    As for how to do it, either -exposure(varname)- or -offset(log of varname)- will do it; they produce the same results. So if you have the population as a variable in the data set, using -exposure(varname)- is simplest. If you have the log of the population as a variable, then use -offset(log of varname)-. If you have both, go with -offset(log of varname)- because, internally, when you specify -exposure(varname)- Stata calculates the logarithm of the variable and uses that as an -offset()-. It works the same way in all of Stata's count variable models.
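
    For concreteness, a sketch of the two equivalent calls. (The variable names physicians, medinc, rural, and population here are placeholders; substitute whatever is in your data set.)
    Code:
    * if you have the raw population, -exposure()- is simplest
    tnbreg physicians medinc i.rural, exposure(population) irr
    
    * equivalently, log it yourself and use -offset()-
    gen lnpop = ln(population)
    tnbreg physicians medinc i.rural, offset(lnpop) irr
    The two regressions will report identical coefficients; the only difference is who computes the logarithm, you or Stata.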

    As for -vce(cluster varname)- and -nolog-, these have nothing to do with adjusting for exposure, and they may or may not be needed. -nolog- does not affect the calculations: it just eliminates from the output the progress report that you would otherwise get from Stata while it searches for the maximum likelihood estimates. -nolog- makes the output shorter and neater, but in most cases you only save a few lines, so little is gained. And if you specify -nolog- and it turns out to be a long calculation, with no progress report you may be left wondering if Stata has somehow crashed.

    As for -vce(cluster varname)-, you would not use population as the variable for that. The variable used for cluster vce would be some grouping variable, chosen because you expect that the outcomes should be correlated within the groups defined by that variable. For example, if the ZCTAs exist within different states, and if different states have practices, procedures, regulations or legislation that affects the number of waivered physicians, then you would expect that ZCTAs within states would be more similar to each other than ZCTAs in different states. In that case -vce(cluster state)- is appropriate.
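
    In that situation the two options simply coexist; for example (again with placeholder variable names, and assuming a state identifier variable exists in the data):
    Code:
    * exposure adjusts the scale; vce(cluster) adjusts the standard errors
    tnbreg physicians medinc i.rural, exposure(population) vce(cluster state)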



    • #3
      Hi all, I am just reading this. Thank you for this discussion. If I may ask, how is the -exposure(population)- option noted above different from just adding a weight, like [pweight=population] to the tnbreg command?



      • #4
        They are entirely different. I'm not even sure where to start to answer this.

        First, perhaps, look at an example and you will see that they can produce distinctly dissimilar results:
        Code:
        webuse airline, clear
        
        poisson injuries i.XYZowned, exposure(n) irr
        
        poisson injuries i.XYZowned [pweight = n], irr
        So what is each of these models actually doing? When you use the -exposure()- option, you are identifying the variable n as the denominator of the incidence rate. Internally, this is accomplished by adding ln(n) to the model and constraining its coefficient to be 1.0. In other words, the model is
        Code:
        ln(E(injuries)) = ln(base incidence rate) + ln(XYZowned IRR)*1.XYZowned + 1*ln(n)
        
        Exponentiate both sides:
        E(injuries) = base incidence rate * XYZowned IRR^1.XYZowned * n
        
        Divide both sides by n:
        E(injuries)/n = base incidence rate * XYZowned IRR^1.XYZowned
        So this model gives an equation for the expected number of injuries per unit of n in terms of the base incidence rate and the IRRs associated with the predictor variables. Note that in this analysis, all observations contribute equally to the regression calculations themselves.
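
        One way to see the "coefficient constrained to 1" mechanics is to enter ln(n) as an ordinary covariate and impose the constraint yourself. This is purely illustrative; you would not normally do this by hand:
        Code:
        webuse airline, clear
        gen ln_n = ln(n)
        constraint 1 ln_n = 1
        poisson injuries i.XYZowned ln_n, constraints(1) irr
        
        * same estimates as:
        poisson injuries i.XYZowned, exposure(n) irr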

        By contrast, if n is used as a pweight, Stata applies complicated formulas so as to upweight or downweight the contribution of each observation to the overall calculations in proportion to the value of n in the observation. Values with large n are weighted more heavily than values with small n in the regression calculations. And in this model there is no possible interpretation of the estimated rates as being per unit of n.
