
  • logistic model for the cluster effect

    Hi,
I would like to assess the effect of GP gender on patients' receipt of care (binary outcome: received = 1, did not receive = 0),
adjusted for characteristics of the patients (age group, gender), the physician (age, gender), and the practice (location, size).
As the outcome is binary, I am considering a logistic model.
The patients are clustered within practices, so the model needs to be adjusted for the clustering effect.
The number of physicians is relatively high (N = 4575).
I used conditional logistic regression grouped by GP ID (clogit DEP_var INDEP_var, group(GP_ID) robust).
The problem is that after running the model I got the following error: "var omitted because of no within-group variance".
So most of the characteristics are dropped, because they do not vary within GPs (e.g., characteristics of the practice or the GP).
The problem is the same with -xtlogit-. The data are cross-sectional, so changes over time were not monitored (no panel data).
What should I do? Can I use random effects here?

    Best, Monika

  • #2
Well, why were you using fixed-effects regression to begin with? You state that your goal is to estimate the effect of GP gender on the patient outcome. But GP gender is, almost always, an invariant property of the GP ID, so its effects are inherently not estimable in a GP-ID fixed-effects model. So if that is your goal, you must use a random-effects model.

While we are discussing this project, are the patients and GPs also nested in practices? Is practice location accounted for in the model? How many practices are in your data? Should you be doing a three-level random-effects model here?

    By the way, if you have a problem of any kind when running -clogit-, don't bother trying to fix it by switching to -xtlogit, fe-. The latter is just a wrapper for the former, so the same problem is absolutely guaranteed to come up again.
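
    A random-effects logistic model along the lines suggested here might be sketched as follows. This is only an illustration: every variable name below is a hypothetical placeholder, not taken from the poster's actual data set, and any string variables would first need to be -encode-d to numeric before use with factor-variable notation.

    Code:
    * Sketch only: two-level random-effects logistic model with patients
    * nested within GPs; all variable names are placeholders.
    melogit received i.gp_gender i.agegroup_patient i.gender_patient ///
        age_GP i.location i.size || GP_ID: , or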



    • #3
      Thank you!!
Data came from solo practices, which means that each practice belongs to only one GP. In our analysis, the number of GPs = the number of practices. So there are 4575 practices, and patients are nested within them. Is this number not too high for a random-effects model?
      The analysis is adjusted for location of practice, as well.



      • #4
        All sounds good. I don't see any reason why you would have a problem with 4,575 random effects. At worst you might have to increase the matrix size to get it running.



        • #5
          Thank you very much!!!
May I ask one more question?
The code for the sample databases is below. The basic data were aggregated at the GP level according to the patients' gender and age. Based on the case numbers we have "de-aggregated" data. It was suggested that I use one of two methods:
1. The first is the above-mentioned logistic regression with a binary outcome. (db_1)
2. The second would be negative binomial regression with a count-data outcome (the number of patients who received the care). I have checked for overdispersion, which showed that NBREG is more appropriate than Poisson. If I choose this model, should I also use a random-effects model here? (db_2)

Which model would be appropriate to use here? The logistic regression or the negative binomial regression model? (The research question is the same: to assess the effect of GP gender on receiving care.)

          db_1 (this is the first 14 records out of 1 300 000)
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str7 GP_ID byte(DEP_VAR gender_GP) str5 agegroup_patient str1 gender_patient byte(age_GP location) str5 size byte I
          "GP_0001" 1 1 "45-49" "N" 61 20 "<1000" .
          "GP_0001" 1 1 "45-49" "N" 61 20 "<1000" .
          "GP_0001" 0 1 "45-49" "N" 61 20 "<1000" .
          "GP_0001" 1 1 "50-54" "N" 61 20 "<1000" .
          "GP_0001" 1 1 "50-54" "N" 61 20 "<1000" .
          "GP_0001" 0 1 "50-54" "N" 61 20 "<1000" .
          "GP_0001" 1 1 "45-49" "F" 61 20 "<1000" .
          "GP_0001" 1 1 "45-49" "F" 61 20 "<1000" .
          "GP_0001" 1 1 "45-49" "F" 61 20 "<1000" .
          "GP_0001" 0 1 "45-49" "F" 61 20 "<1000" .
          "GP_0001" 0 1 "45-49" "F" 61 20 "<1000" .
          "GP_0001" 1 1 "50-54" "F" 61 20 "<1000" .
          "GP_0001" 0 1 "50-54" "F" 61 20 "<1000" .
          "GP_0001" 0 1 "50-54" "F" 61 20 "<1000" .
          end
db_2 (this is the first 4 records out of 18 000)
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str7 GP_ID byte(DEP_VAR gender_GP) str5 agegroup_patient str1 gender_patient byte(age_GP location) str5 size byte I
          "GP_0001" 2 1 "45-49" "N" 61 20 "<1000" .
          "GP_0001" 2 1 "50-54" "N" 61 20 "<1000" .
          "GP_0001" 3 1 "45-49" "F" 61 20 "<1000" .
          "GP_0001" 1 1 "50-54" "F" 61 20 "<1000" .
          end



          • #6
Using a negative binomial (or any other count model) on the second data set will give you misleading results, because the number of patients receiving the treatment will be affected by the total number of patients in the practice (or, worse, in the age-sex aggregate group). You can partially overcome that by using the size of that group as the -exposure()- variable in the negative binomial model, but that isn't, strictly speaking, correct.

            In general modeling count outcomes is best reserved for situations where the different observations represent entities of equal "size" so that the size itself does not distort the model.

            You can, by the way, do the random effects logistic model on aggregated data: you need the aggregated data set to have a variable that tells you the size of the aggregated group each observation represents, and then you use that variable in the -binomial()- option of -melogit-. And, in fact, using aggregated data speeds up the computations considerably, so I recommend this approach if your data set is large.
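
    Fitting the random-effects logistic model on the aggregated data might be sketched like this, using the db_2 variables. The variable n_total, holding the size of each aggregated age-sex group, is hypothetical: it does not appear in the posted excerpt and would have to be added to the data set.

    Code:
    * Sketch, not tested: in db_2, DEP_VAR counts the patients receiving
    * care in each age-sex group; n_total (hypothetical) is the group size.
    * Assumes the string variables have been -encode-d to numeric first.
    melogit DEP_VAR i.gender_GP i.agegroup_patient i.gender_patient ///
        age_GP i.location i.size || GP_ID: , binomial(n_total) or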



            • #7
              Thank you very much! It was a great help!



              • #8
                Dear Clyde,

I used the -melogit- command and now I would like to run a goodness-of-fit test.
I read that the Hosmer-Lemeshow goodness-of-fit statistic may be appropriate here. I tried -estat gof, group()-, but it doesn't work.
Could you help me calculate this?

                Thanks!



                • #9
                  Here's how I do it. Replace variable names by yours. Also, not tested, so beware of typos or similar problems, but this is the gist of it:

                  Code:
                  // YOU MAY WANT TO PRESERVE THE DATA FIRST
                  
                  predict mu, mu
                  local ngroups 10 // OR HOWEVER MANY GROUPS YOU WANT
                  xtile quantile = mu, nq(`ngroups')
                  collapse (sum) predicted = mu observed = outcome_var, by(quantile)
                  list quantile predicted observed, noobs clean
                  graph twoway line predicted observed quantile, sort
                  
//    CALCULATION OF CHI SQUARE STATISTIC
gen chi2 = sum((observed-predicted)^2/predicted)
replace chi2 = chi2[_N]
                  You can choose how many groups you want to have, and if the data set is large, you should choose a large number of groups to get a more fine-grained sense of the model's calibration.

I recommend paying less attention to the summary statistic and more attention to the table and graph. They show you the discrepancy between predicted and observed in each decile (or whatever grouping you chose) of the predicted values. You may observe systematic problems like over-prediction at the ends and under-prediction in the middle, or the reverse, or whatever. That might, in turn, suggest how you could improve the model.

                  The overall chi square statistic has an approximate chi square distribution with degrees of freedom equal to the number of groups - 2. This fact is often used to do a goodness of fit "test." Be aware that in large samples this test can be extremely sensitive to small discrepancies between predicted and observed that are of no practical importance. I rarely do this test.



                  • #10
                    Thank you! It works. I’ve got both the table and the graph, but there is no summary statistic. How can I calculate it?

                    Furthermore, I think that there is no significant difference between observed and predicted values based on the graph. Am I right?

(Anyway, I also computed the chi square statistic, but the p-value is <0.001. This may be because of the extreme sensitivity of the test in large samples, as you mentioned.)
                    Attached Files



                    • #11
                      I’ve got both the table and the graph, but there is no summary statistic. How can I calculate it?
                      It's already calculated by the code in #9: it's in the "variable" chi2 (which is actually a constant).

I agree that, based on the graph, your model has excellent calibration, and it would be hard for me to imagine any practical situation in which greater agreement between model and data would be needed. So, yes, I would write off the p-value as due to the extreme sensitivity of the test in a large sample.



                      • #12
                        Thank you!



                        • #13
                          Dear Clyde,

                          I used the following command in my analyses:
                          Code:
melogit DEP_var INDEP_var1 INDEP_var2 || cluster:, vce(robust) or
I used age and gender as confounders, but I received the criticism that age and gender cannot be used as confounders without supporting literature in my field, and that I should at most check whether these factors are potential effect modifiers. Therefore, I should test whether they are effect modifiers, and then rerun the analyses by gender and/or age group.

In the absence of appropriate literature, how can I check whether these factors are effect modifiers in my model?

                          Thank you in advance!



                          • #14
                            I don't understand the question. Whether age and gender are confounders is one question, whether they are effect modifiers is a completely different question. To determine whether they are confounders, you can simply run the model with and without them, and if there is a meaningful difference in the effects of interest, they are confounders. Otherwise not.

                            To determine whether they are effect modifiers, you would add both age and sex as well as terms for their interactions with the effect(s) of interest and then look at whether the interaction terms have coefficients large enough to matter. If you are going to do this, I strongly urge you to use Stata's factor-variable notation to handle the interaction terms. (See -help fvvarlist- if you are not familiar with factor variable notation.)
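
                            With factor-variable notation, testing age group and patient gender as effect modifiers of GP gender might look like the sketch below. It is not tested, the variable names are taken from the earlier -dataex- excerpts, and the string variables would first need to be -encode-d to numeric.

                            Code:
                            * Sketch: ## expands to main effects plus interaction terms.
                            melogit DEP_VAR i.gender_GP##i.agegroup_patient ///
                                i.gender_GP##i.gender_patient ///
                                age_GP i.location i.size || GP_ID: , or
                            * Joint test of the interaction terms:
                            testparm i.gender_GP#i.agegroup_patient i.gender_GP#i.gender_patient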

