  • Panel: xtnbreg with exposure options. Margins and predict

    Dear StataList users,

    Let's say that I’m evaluating the effectiveness of a program that aims to reduce the number of students’ interruptions in a classroom. The program has 4 levels. The time subjects stay in the program’s levels differ from one another (not fixed time interval). My strategy is to compare counts of interruptions before and while being in the program using xtnbreg with the exposure option to control for the different time interval (days in the program).


    1. Is my modeling technique/strategy valid?
    Code:
    . xtnbreg event i.time, fe irr exposure(days)
    note: 3 groups (3 obs) dropped because of only one obs per group
    note: 319 groups (676 obs) dropped because of all zero outcomes
    
    Iteration 0:   log likelihood = -730.20997  
    Iteration 1:   log likelihood = -683.11873  
    Iteration 2:   log likelihood = -676.40059  
    Iteration 3:   log likelihood =  -675.1029  
    Iteration 4:   log likelihood =  -674.9261  
    Iteration 5:   log likelihood = -674.91054  
    Iteration 6:   log likelihood = -674.91035  
    Iteration 7:   log likelihood = -674.91035  
    
    Conditional FE negative binomial regression     Number of obs     =      1,409
    Group variable: BOOKCASENUMBER                  Number of groups  =        564
    
                                                    Obs per group:
                                                                  min =          2
                                                                  avg =        2.5
                                                                  max =          5
    
                                                    Wald chi2(4)      =      13.99
    Log likelihood  = -674.91035                    Prob > chi2       =     0.0073
    
    ------------------------------------------------------------------------------
           event |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            time |
              1  |    .921984   .0767269    -0.98   0.329     .7832258    1.085325
              2  |   .7764469   .0750522    -2.62   0.009     .6424416    .9384039
              3  |   .8872854   .0842731    -1.26   0.208     .7365758    1.068831
              4  |   .4831852   .1423992    -2.47   0.014     .2711793    .8609355
                 |
           _cons |   .3672584   .6315017    -0.58   0.560     .0126278    10.68108
        ln(days) |          1  (exposure)
    ------------------------------------------------------------------------------
    Note: _cons estimates baseline incidence rate (conditional on zero random effects).
    
    . 
    end of do-file
    2. I do not want to compare only pre vs. all the program levels, but also interruptions between levels. I did this manually using the -test- postestimation command. Is there a way to compute all the combinations (pairwise comparisons) at once? Perhaps using margins?

    Code:
    test 1.time == 2.time
    test 1.time == 3.time
    test 1.time == 4.time
    test 2.time == 3.time
    test 2.time == 4.time
    test 3.time == 4.time
    3. How can I interpret the results taking into account the exposure option? My attempt: We estimate that the level 2 interruption rate is 1.29 (1/.7764) times larger than the pre-program rate.


    4. After running the regression I ran the -margins- command. How can I interpret the results of the -margins- command? Are these the predicted average interruptions per day? Can I say that, on average, subjects in level 2 had 3.022 interruptions per day, which is .253 fewer interruptions than in the pre-program period? Or are these interruptions over all days in the program?

    Code:
    . margins time
    
    Adjusted predictions                            Number of obs     =      1,409
    Model VCE    : OIM
    
    Expression   : Linear prediction, predict()
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            time |
              0  |   3.275641   1.719503     1.90   0.057     -.094522    6.645804
              1  |   3.194414   1.725336     1.85   0.064    -.1871829    6.576011
              2  |   3.022614    1.73579     1.74   0.082    -.3794707    6.424699
              3  |   3.156053   1.719668     1.84   0.066    -.2144342    6.526539
              4  |   2.548286   1.742859     1.46   0.144    -.8676556    5.964227
    ------------------------------------------------------------------------------
    
    . margins, dydx(time) atmeans
    
    Conditional marginal effects                    Number of obs     =      1,409
    Model VCE    : OIM
    
    Expression   : Linear prediction, predict()
    dy/dx w.r.t. : 1.time 2.time 3.time 4.time
    at           : 0.time          =    .4002839 (mean)
                   1.time          =     .259049 (mean)
                   2.time          =    .1320085 (mean)
                   3.time          =    .1433641 (mean)
                   4.time          =    .0652945 (mean)
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            time |
              1  |  -.0812274   .0832193    -0.98   0.329    -.2443343    .0818796
              2  |   -.253027   .0966611    -2.62   0.009    -.4424793   -.0635748
              3  |  -.1195886   .0949785    -1.26   0.208    -.3057431    .0665659
              4  |  -.7273553   .2947093    -2.47   0.014    -1.304975   -.1497356
    ------------------------------------------------------------------------------
    Note: dy/dx for factor levels is the discrete change from the base level.
    
    . 
    end of do-file
    1. Finally, my predicted values seem a little too high. Am I predicting the total number of interruptions across all days the subject was in the program, or interruptions per day? Why are the predicted counts too high? Perhaps because the fixed-effects estimator drops panels with all-zero outcomes? Or is it because I am not including important predictors in my model? Other thoughts?
    Code:
    . predict  p if e(sample)
    (option xb assumed; linear prediction)
    (679 missing values generated)
    
    . listsome ID time event p, sepby(ID)
    
          +-------------------------------+
          | ID   time   event           p |
          |-------------------------------|
       1. |  1      0       1   -.3085425 |
       2. |  1      2       0    .6911934 |
          |-------------------------------|
       3. |  2      0       1    3.172698 |
       4. |  2      1       0    2.528001 |
          |-------------------------------|
       5. |  3      0       0           . |
       6. |  3      1       0           . |
          |-------------------------------|
    Thank you!

  • #2
    1. I'm not sure you're using -exposure()- correctly here. The idea behind the exposure variable is that it measures the opportunities for the outcome events. So, if this were a study of the number of potholes in the streets, the exposure variable would be the mileage, or perhaps the surface area, of the streets surveyed to find the potholes. If different students in your study were observed for different periods of time, then the duration of observation would be a good exposure variable, allowing you to estimate rates of classroom interruptions per hour of observation. You have used instead the duration of the intervention. I won't say that it's wrong, but neither is it clear that it's right. If your goal is to estimate the rate of interruptions (or the effect on the rate of interruptions) of the program per hour of program participation, that's fine. But it is important, in that circumstance, that the students all were observed for the same amount of time when assessing the number of interruptions.

    2. What you have here is suitable for the stated purpose. You could use -margins time, predict(iru0) pwcompare(effects)- to get everything at once, but the method for calculating standard errors (and p-values) is somewhat different from what -test- uses, so the results will not be the same. They should be similar overall, but not the same.
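
    In concrete terms, that one-step version would be:

    Code:
    margins time, predict(iru0) pwcompare(effects)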

    3. I think you have this wrong. You don't say as much, but I assume that your time variable has 5 levels, the four program components that you refer to, plus a zero value for pre-intervention time. The pre-intervention time period is the base category here. So the coefficient for time = 2, 0.7764 is the ratio of interruption rates in level 2 compared to pre-intervention. So they interrupt only 0.7764 times as often in level 2 as before intervention. There is no reason to take the reciprocal here.

    4. No. Notice in the third line of the -margins- output, it says "Expression : Linear prediction, predict()." This means that the output of this -margins- command is the set of values of xb. (You could also find this information by checking -help xtnbreg postestimation- and clicking on the link to -margins-. There you would see that the default is to do -margins- of xb.) But xb is not what you are interested in. You are apparently interested in rates of interruption. So you need to run -margins time, predict(iru0)-. This will give you predicted interruption rates conditional on the fixed effects being zero.

    As to your final question (renumbered 1 for some reason), the problem you're having arises from the fact that you're not getting predicted interruptions per day. You're getting predicted xb per day. When you add the -predict(iru0)- option, you will get rates of interruptions per day (days of program participation, not days of classroom exposure). Then you will, indeed, also encounter an upward bias due to the exclusion of participants with all-zero interruption results.
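
    If you also want per-observation predictions on the rate scale, the analogous -predict- option (rate here is just an illustrative variable name) would be something like:

    Code:
    predict rate if e(sample), iru0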



    • #3
      Clyde Schechter

      1. I won't say that it's wrong, but neither is it clear that it's right. If your goal is to estimate the rate of interruptions (or the effect on the rate of interruptions) of the program per hour of program participation, that's fine. But it is important, in that circumstance, that the students all were observed for the same amount of time when assessing the number of interruptions.
      Let me clarify a little bit. In my data, one student can be in level 1 for 20 days, another for 15 days, and another for 100 days. Usually when using negative binomial regressions the time interval is fixed, but this is not my case. To fix this, I read that I can use the -exposure()- or -offset()- option to control for different exposure times. My goal is to estimate the rate of interruptions per day.

      3. So the coefficient for time = 2, 0.7764 is the ratio of interruption rates in level 2 compared to pre-intervention. So they interrupt only 0.7764 times as often in level 2 as before intervention.
      I meant to say that the pre-program interruption rate is 1.29 (1/.7764) times larger than level 2's. Is this correct? The problem is that non-technical people have a hard time understanding rate ratios that are lower than 1. That is why I use the reciprocal value.

      4. You are apparently interested in rates of interruption. So you need to run -margins time, predict(iru0)-. This will give you predicted interruption rates conditional on the fixed effects being zero.
      Yes, I am interested in rates of interruption. Are these rates per day (which is my exposure variable)? How can you explain this to a non-technical audience?

      Thank you very much for all the information!



      • #4
        So, with your additional explanation of #1, I believe what you are doing is incorrect. One of the properties of the exposure variable in these models is that the coefficient of (the log of) the variable is constrained to 1. This means that you are certain that twice as many days corresponds to twice as many interruptions, etc. The thing that serves as the exposure variable must be something that varies exactly that way with the outcome. Thus the examples I gave in #2: mileage or area of road surface for potholes. If you have twice as much road for counting potholes, you expect twice as many potholes. If you used the number of days on which you observed the students for classroom interruptions as the exposure, the same reasoning would apply: if you observe them twice as long, you expect to see twice as many interruption events.

        But you are using here the number of days they are in the intervention program. There is no clear or obvious reason why twice as many days in the program would double the number of interruptions. (In fact, if the program is supposed to reduce interruptions, then you would expect something more like the opposite of that.) Your days variable is more like a dosage variable. Now, if the days they spend in the program are exactly the days on which they are observed and have their interruptions counted, then I think we're OK. But you don't say that, and I'm guessing it's not the case.

        Using a dosage variable for an intervention as the exposure here does not seem appropriate. Rather, I would probably enter it as another covariate in the model (and probably as a discrete one, given that it takes on values of 20, 35, and 135 only). Depending on how you think it works, you might also include an interaction between days and program phase (which you are calling time). The exposure variable, if any is needed at all, should be the number of days on which the interruptions were counted, not the duration in the program (unless those are the same thing).
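
        As a sketch of what I mean (days_cat here is a hypothetical categorical recoding of your days variable; adapt the names to your data):

        Code:
        * enter program duration as a covariate rather than as an exposure
        xtnbreg event i.time i.days_cat, fe irr
        * or allow the effect of duration to differ by program phase
        xtnbreg event i.time##i.days_cat, fe irr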

        I meant to say that the Pre the program interruption's rate is 1.29 (1/.7764) times larger than level 2. Is this correct?
        Yes.
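
        If you want Stata to report that reciprocal directly, along with its confidence interval, something along these lines (a sketch) should work after the regression:

        Code:
        lincom -1*2.time, irr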

        Yes, I am interested in rates of interruption. Are these rates per day (which is my exposure variable)? How can you explain this to a non-technical audience?
        Well, yes, they are rates of interruptions per whatever is in the exposure variable. As for explaining it to a non-technical audience, I think that is very difficult. All I would say to a non-technical audience is that the analysis was adjusted for the days of observation, so that the results are presented as interruptions per day, rather than total interruptions over the entire study period. I think any attempt to explain logarithms and constraining a coefficient to 1 will, at best, cause eyes to glaze over.



        • #5
          Thank you so much for the explanation. I really appreciate it.

          Now, if the days they spend in the program are exactly the days on which they are observed and have their interruptions counted, then I think we're OK. But you don't say that, and I'm guessing it's not the case.
          I am sorry for not making this clear. But yes, the days they spend in the program are exactly the days on which they are observed and have their interruptions counted. So in that case the exposure option is suitable.

          I agree with you regarding explaining logarithms and constraining a coefficient to 1 to non-technical audiences; I will keep it simple.

          Thanks again and happy holidays!



          • #6
            Hi
            Can somebody kindly educate me about an appropriate way of calculating the real return to physical and human capital in a country where financial markets are not developed and using the NPV may not work? I have seen some papers estimating Cobb-Douglas production functions and dividing the coefficient of the education (human capital) variable by the median value of education to calculate the rate of return, and the same for physical capital. Thanks very much.
            Best
            Osman



            • #7
              The question in #6 is unrelated to the topic of this thread. Please repost your question in a new thread, and give it an appropriate title.



              • #8
                Thanks Clyde.

