Confidence in multi-level model results?

Julianna Kirschner

Join Date: Oct 2022
Posts: 6

03 Oct 2022, 14:38

Hi, I just took a short course on multi-level modeling and decided to practice on some data I had on hand, but since this is my first time running this kind of model, I am not sure how confident I can be in my results.

My data is repeated measures and very simple as of right now (already in long format). I am looking to see if the implementation of a certain policy affected the number of pending cases in a given county, nested within districts, during a given time period.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str12 county str3 dc_district long pop byte year long(cr_pend    dc_cont_pol)
"Currituck"         "01"    29305     22  1541    0
"Wake"              "10"   1156274  22  40508   0
"Cumberland"   "12"    334660   22  16323   0
"Columbus"      "13"    49307     22   5913    0
"Columbus"      "13"    48355     23   4981    1
"Durham"         "14"     329973  22   8560    0
"Union"            "20B"   247058  23  11769   1
"Watauga"       "24"     53639    22   2020    0
"Transylvania" "29B"   32785    22  1920     0
"Graham"         "30"     7967      22   905     0
end
label values dc_cont_pol yesno
label def yesno 0 "No", modify
label def yesno 1 "Yes", modify

year is displayed as 22 and 23 for the respective fiscal years; for dc_cont_pol, 0 means the county did not have a policy and 1 means it did. I am treating the policy as a "treatment": no county had a policy in year 22 (all 0s), and any county that has submitted one recently has a 1. So it looks like this:

year  policy     N       Mean         SD   Variance
22    No       100    7698.33   11743.41   1.38e+08
23    No        35   5660.171   6675.436   4.46e+07
23    Yes       65   6935.708    12047.6   1.45e+08

This is my code and output:

Code:

. mixed cr_pend pop i.dc_cont_pol i.year || dc_district:, reml dfmethod(repeated) covariance(unstructured)
note: single-variable random-effects specification in dc_district equation; covariance structure set to identity.

Performing EM optimization ...

Performing gradient-based optimization: 
Iteration 0:   log restricted-likelihood = -1854.8598  
Iteration 1:   log restricted-likelihood = -1854.8598  

Computing standard errors ...

Computing degrees of freedom ...

Mixed-effects REML regression                   Number of obs     =        200
Group variable: dc_district                     Number of groups  =         41
                                                Obs per group:
                                                              min =          2
                                                              avg =        4.9
                                                              max =         14
DF method: Repeated                             DF:           min =       0.00
                                                              avg =     147.00
                                                              max =     196.00
                                                F(3,   196.00)    =     136.24
Log restricted-likelihood = -1854.8598          Prob > F          =     0.0000


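------------------------------------------------------------------------------
     cr_pend | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         pop |   .0569237   .0029007    19.62   0.000      .051203    .0626444
             |
 dc_cont_pol |
        Yes  |   673.4643    551.141     1.22   0.223    -413.4637    1760.392
     23.year |  -1701.722   447.0522    -3.81   0.000    -2583.372   -820.0714
       _cons |   2102.219   1183.138     1.78       .            .           .
------------------------------------------------------------------------------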



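------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
dc_district: Identity        |
                  var(_cons) |   4.46e+07   1.04e+07      2.83e+07    7.05e+07
-----------------------------+------------------------------------------------
               var(Residual) |    3578446     405949       2865047     4469481
------------------------------------------------------------------------------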

LR test vs. linear model: chibar2(01) = 204.95        Prob >= chibar2 = 0.0000

. estat icc

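Residual intraclass correlation

------------------------------------------------------------------------------
                       Level |        ICC   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
                 dc_district |   .9257837   .0180008      .8818603    .9542243
------------------------------------------------------------------------------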

My understanding is that in FY 2023 a county has, on average, about 1,702 fewer pending cases than in FY 2022; that each one-unit increase in population is associated with about .06 more pending cases; and that, while not statistically significant, having a continuance policy is actually associated with more pending cases (about 673). Is that right?

Sorry if the interpretation is not accurate or well stated.

Thanks!

Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#2

03 Oct 2022, 15:40

Well, I think your interpretation is a correct reading of the results you show. But I don't agree with the underlying modeling.

I question whether using pop as a covariate in the model this way is a reasonable approach. I don't know what kind of "cases" we are talking about, but in most contexts where that word is used, you might expect the effect of population to be multiplicative rather than additive. That is, if you double the population, you would, all else equal, expect to double the number of cases. Simply putting pop in as a covariate does not do that. So if I'm right about this, a simple linear model is not appropriate to the problem. I'm inclined to think about a Poisson model instead:

Code:

mepoisson cr_pend i.dc_cont_pol i.year, exposure(pop) || dc_district:, irr vce(cluster dc_district)

Here the IRR for 1.dc_cont_pol will give you the relative (proportional) difference in number of cases that is attributable to the intervention. The use of the clustered VCE will protect you from most of the overdispersion that is a common difficulty in using Poisson models. (The coefficient estimates are consistent even in the presence of overdispersion, so only the standard errors need that protection.)
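One way to see the parenthetical point in practice is to fit the model with and without the clustered VCE and compare the two sets of results. This is only a minimal sketch: it assumes the data from #1 are in memory, and the stored-estimate names model_se and robust_se are hypothetical placeholders.

```stata
* Assumes the dataset from #1 is loaded; model_se and robust_se are hypothetical names

* Same model with the default (model-based) standard errors
mepoisson cr_pend i.dc_cont_pol i.year, exposure(pop) || dc_district:, irr
estimates store model_se

* Same model with cluster-robust standard errors
mepoisson cr_pend i.dc_cont_pol i.year, exposure(pop) || dc_district:, irr vce(cluster dc_district)
estimates store robust_se

* Side-by-side coefficients (on the log scale) and their standard errors
estimates table model_se robust_se, b se
```

The point estimates should agree across the two columns; only the standard errors should differ.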

By the way, is there a reason for not having a county level or fixed effect in the model? Is a county composed of dc_districts, or is it the other way around?

Last edited by Clyde Schechter; 03 Oct 2022, 15:52.
Julianna Kirschner

Join Date: Oct 2022

Posts: 6
#3

04 Oct 2022, 07:38

Hi Clyde,

That is more or less true: we expect larger counties to have more pending cases (court cases, by the way). I ran the Poisson model and these are my results.

Code:

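Mixed-effects Poisson regression                Number of obs     =        200
Group variable: dc_district                     Number of groups  =         41

                                                Obs per group:
                                                              min =          2
                                                              avg =        4.9
                                                              max =         14

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(2)      =      35.85
Log pseudolikelihood = -20998.799               Prob > chi2       =     0.0000
                           (Std. err. adjusted for 41 clusters in dc_district)
------------------------------------------------------------------------------
             |               Robust
     cr_pend |        IRR   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
 dc_cont_pol |
        Yes  |   1.140007   .0767152     1.95   0.052     .9991413    1.300732
     23.year |   .7652586     .04363    -4.69   0.000     .6843501    .8557326
       _cons |   .0745037     .00422   -45.85   0.000     .0666752    .0832513
     ln(pop) |          1  (exposure)
-------------+----------------------------------------------------------------
dc_district  |
   var(_cons)|   .1175311   .0314637                      .0695476    .1986203
------------------------------------------------------------------------------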

While I think I understand the interpretation of the dc_cont_pol and year variables, I am struggling to understand how population as an exposure variable changes the coefficients for the others. From what I can gather, it has to do with the amount of "opportunity" that a county may have to rack up more pending cases? And then I am lost by the dc_district coefficient.

I have yet to include county level effects because I wanted to make sure I was a) understanding the model and b) using the model correctly before introducing more variables. I was going to include poverty rates, crime rates, and potentially staffing levels of courts as fixed effects once I had a clearer understanding. And lastly, the dc_districts are composed of counties.
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#4

04 Oct 2022, 11:45

I am struggling to understand how population as an exposure variable changes the coefficients for the others. From what I can gather, it has to do with the amount of "opportunity" that a county may have to rack up more pending cases?

-mepoisson- estimates a model with a log link. That is, the underlying equation is ln(E(y | X)) = b0 + b1*X, where E stands for expected value, and X is a vector of explanatory variables. Let's let X1 denote a vector of all the "regular" explanatory variables, and single out population for special attention here. By specifying -exposure(pop)-, you are telling Stata to use ln(pop), not pop itself, as the variable, and to constrain its coefficient to be 1. So the underlying equation now is ln(E(y | X1, pop)) = b0 + b1*X1 + ln(pop). If we now exponentiate both sides of the equation, and apply the usual law about exponentials, we get

Code:

E(y | X1, pop) = exp[b0 + b1*X1 + ln(pop)]
               = exp[b0 + b1*X1] * exp[ln(pop)]
               = exp[b0 + b1*X1] * pop

and, denoting exp[b0 + b1*X1] by K(X1), this in turn becomes

E(y | X1, pop) = K(X1) * pop

So this model tells us that, given all the variables in X1, i.e. everything but pop, the expected value of y is directly proportional to population.
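As a rough illustration of this algebra, the sketch below refits the model with an explicit offset and then checks the proportionality by hand using the IRRs reported in #3. It assumes the data from #1 are in memory; ln_pop is a hypothetical variable name, the populations of 100,000 and 200,000 are made-up round numbers, and the by-hand figures hold the dc_district random intercept at zero.

```stata
* exposure(pop) is shorthand for an offset of ln(pop) with its coefficient fixed at 1
* ln_pop is a hypothetical variable name
generate double ln_pop = ln(pop)
mepoisson cr_pend i.dc_cont_pol i.year, offset(ln_pop) || dc_district:, irr vce(cluster dc_district)

* By-hand check with the IRRs from #3 (random intercept held at 0)
* FY22, no policy: expected cases = exp(b0) * pop
display 0.0745037 * 100000
display 0.0745037 * 200000      // doubling pop doubles the expected count

* FY23 with a policy: multiply by the year and policy IRRs as well
display 0.0745037 * 0.7652586 * 1.140007 * 100000
```

The refit with offset(ln_pop) should reproduce the exposure(pop) results, and the first two display lines differ by exactly a factor of 2, which is the K(X1) * pop relationship.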

And then I am lost by the dc_district coefficient.

It has no natural explanation. In linear mixed models, these variance components are relatively natural and easy to understand. But when we do logistic, probit, Poisson, or negative binomial models, the variance at the "residual" level is constrained by properties of the distribution itself. For Poisson, in particular, the variance is constrained to equal the mean. The variance attributable to the grouping variable(s) is basically just what's left over. One could, I suppose, contrive some coherent explanation for what this means in each case, but it would be different for every new case. I don't think anybody bothers trying to do that. I suggest just ignoring it.
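The mean-equals-variance constraint for Poisson can be checked with a quick simulation. This is only a sketch on simulated draws: the mean of 5, the seed, and the sample size are arbitrary choices, and nothing here uses the court data.

```stata
* Simulated data only; preserve/restore keeps whatever dataset is in memory intact
preserve
clear
set seed 12345
set obs 100000
generate y = rpoisson(5)        // Poisson draws with mean 5 (arbitrary choice)
summarize y
display "mean = " r(mean) "   variance = " r(Var)
restore
```

The sample mean and sample variance should both come out close to 5, so there is no free residual variance parameter of the kind -mixed- estimates.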

I have yet to include county level effects because I wanted to make sure I was a) understanding the model and b) using the model correctly before introducing more variables. I was going to include poverty rates, crime rates, and potentially staffing levels of courts as fixed effects once I had a clearer understanding.

Understood, and this incremental approach is always wise!

And lastly, the dc_districts are composed of counties.

Thank you for indulging my curiosity.