Regarding Poisson with GEE and robust variance estimators

Neha Jain

Join Date: Mar 2023

Posts: 3
#1

27 Mar 2023, 00:14

Dear all,

Hello!

I am from the public health field. My query concerns a paper on infectious diseases that I am working with. I am trying to replicate its results, which the authors say they produced using "Multivariate Poisson Regression using GEEs and Robust standard errors". I tried running a simple Poisson regression for it:

poisson throm ccamal agecat gender_num, vce(r) irr

where throm= a binary outcome variable (actually a dichotomised variable created from a continuous/count variable 'platelets cell count')

ccamal= a categorical predictor variable

agecat and gender_num are categorical confounders for age and sex.

When I ran the above regression, I got results nearly comparable to those of the paper's authors. However, the authors report their results as prevalence ratios (PRs). Can I use the IRRs as PRs?
Since I got nearly comparable results with just a multivariable Poisson regression, I don't know whether I need to use GEE at all and, if so, how.
I tried this command in Stata:

xtgee throm i.ccamal i.agecat i.gender_num, family(poisson) link(log) corr(ind) vce(robust)

This gave me results as coefficients, some of them negative. I cannot figure out whether the GEE they mention is an error or whether they actually used it. If they did, how do I use the GEE results to construct the prevalence ratios? What is the interpretation of a negative GEE coefficient?

When I set the correlation to exchangeable, Stata gave me the output: estimates diverging (correlation > 1)

Also, is the correlation considered in GEE between two variables, like 'ccamal' and 'agecat', or between values of the same variable for different individuals?

I used the equation:
log E(Y_i) = ln(lambda_i) = beta_0 + beta_1-4 inf_cat + beta_5-7 age_cat + beta_8-9 gender_num + error,

where lambda is my binary outcome variable, inf_cat is the exposure variable, and age and gender are confounders.

Please advise whether the equation is correct, or whether it needs to be changed to incorporate GEE.

Lastly, I am confused. My original variable was a raw platelet cell count, which appeared to me to be neither normally nor Poisson distributed. It was then dichotomised into a binary variable, "Thrombocytopenia" (0 = no, 1 = yes). When I looked at the distribution of this binary variable, it appeared to follow a Poisson distribution, as mean = variance. So is this approach correct? Can a binary variable follow a Poisson distribution?

I am attaching the dataset here for your kind reference

Attached Files

Kamau_Malaria_schistosomiasis_final_v2.csv (50.6 KB, 1 view)
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

27 Mar 2023, 12:06

-poisson- with the -irr- option specified gives the results as incidence rate ratios (which you can interpret as prevalence ratios if your outcome variable counts prevalent, rather than incident, cases). -xtgee-, however, gives regression coefficients, which can be negative. To get rate ratios from -xtgee, family(poisson)-, add the -eform- option. These numbers will all be positive and will be comparable to the -poisson- output you have.
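
A minimal sketch of the two commands on the ratio scale, using the variable names from the first post. The identifier cluster_id is a hypothetical stand-in for whatever variable defines the panels or clusters; the thread does not name it.

```stata
* Pooled Poisson with robust standard errors, reported as ratios
poisson throm i.ccamal i.agecat i.gender_num, vce(robust) irr

* GEE needs a panel variable; cluster_id is a hypothetical name
xtset cluster_id

* Same mean model via GEE; eform exponentiates the coefficients
xtgee throm i.ccamal i.agecat i.gender_num, family(poisson) link(log) corr(independent) vce(robust) eform
```

A coefficient b reported by -xtgee- without -eform- corresponds to a ratio of exp(b), so a negative coefficient is simply a ratio below 1.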

-xtgee- is used with longitudinal data. -poisson- generally should not be used in that context. If you have reason to believe, based on strong theory or prior data analysis, that observations of the same entity at different time periods are in fact as independent of each other as observations of different entities, then you can use -poisson- and the results will be more or less the same as those of -xtgee-. But it is uncommon to really know that.

When I set the correlation to exchangeable, Stata gave me the output: estimates diverging (correlation > 1)

This tells you that an exchangeable correlation structure is not really compatible with your observed data. If these are repeated measures data, then that would be surprising, and I would worry that you have important errors in the data itself. The correlations being estimated here, in answer to your next question, are correlations in the regression residuals within individuals.
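
To inspect the within-panel correlation that -xtgee- estimates, one option is -estat wcorrelation- after a fit that converges. A sketch, assuming the data are already -xtset- and that the exchangeable fit converges (which it did not in the case reported here):

```stata
* Assumes the data have already been xtset on the panel variable
xtgee throm i.ccamal i.agecat i.gender_num, family(poisson) link(log) corr(exchangeable) vce(robust) eform

* Display the estimated within-panel working correlation matrix
estat wcorrelation
```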

Lastly, I am confused. My original variable was a raw platelet cell count, which appeared to me to be neither normally nor Poisson distributed. It was then dichotomised into a binary variable, "Thrombocytopenia" (0 = no, 1 = yes). When I looked at the distribution of this binary variable, it appeared to follow a Poisson distribution, as mean = variance. So is this approach correct? Can a binary variable follow a Poisson distribution?

No, a binomial variable cannot have a Poisson distribution. However, if the value of the variable is mostly 0, with a relatively low proportion of 1's, it can be well approximated by a Poisson distribution. Also, Poisson regression is remarkably robust to violation of most of the classical assumptions, and using Poisson regression with binomial variables is fairly common and usually produces results quite close to those of analyses more tailored to binomial outcomes. That said, if you have dichotomized your variable, why not use a regression model that is tailored to that kind of outcome, like logistic regression, or even a linear probability model?
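
A quick way to see both points, sketched with the variable names from the first post. For a 0/1 variable with mean p, the variance is p(1 - p), which is always below p and is close to it only when p is small.

```stata
* Compare the mean of the binary outcome with its variance
* (r(Var) uses an N-1 denominator, so it slightly exceeds p(1-p))
summarize throm
display "mean     = " r(mean)
display "variance = " r(Var)
display "p(1-p)   = " r(mean)*(1 - r(mean))

* Models tailored to a binary outcome
logistic throm i.ccamal i.agecat i.gender_num
regress throm i.ccamal i.agecat i.gender_num, vce(robust)
```

-logistic- reports odds ratios, and the -regress- fit is a linear probability model whose coefficients are differences in prevalence, so neither reproduces prevalence ratios directly.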

I imagine that you created thrombocytopenia by dichotomizing a platelet count variable. The fact that the original platelet counts are not normally distributed is not even remotely a reason to do that. Because statistics is widely taught badly, many people come away with the incorrect impression that linear regression requires a normally distributed outcome variable. That is not true at all. At most, it requires that the residuals of the regression be normally distributed. Even that assumption is not necessary except in small samples. So you might want to reconsider your initial decision to dichotomize and go back to modeling the platelet count itself. Dichotomizing continuous variables is usually a bad idea. See https://www.fharrell.com/post/errmed/#catg for a very readable and compelling explanation of the many problems it creates.

Please read the Forum FAQ for more information about the best ways to post here. In particular, attachments are discouraged. Showing example data is usually a good idea, but the best way to do that is using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
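
For example, a sketch using the variable names from the first post (showing the first 20 observations is an arbitrary choice):

```stata
* Only needed on older installations where dataex is not built in
ssc install dataex

* Produce a copy-and-paste data example of the model variables
dataex throm ccamal agecat gender_num in 1/20
```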

Last edited by Clyde Schechter; 27 Mar 2023, 12:09.
Neha Jain

Join Date: Mar 2023

Posts: 3
#3

21 Apr 2023, 03:53

Dear Prof. Schechter,

Thank you very much for your kind input.

I had to take a sudden break from work due to personal reasons, hence a delayed response from my end.

1) I believe the authors used GEE in the main regression analysis to account for clustering, as they are estimating the prevalence of clinical outcomes at the infection-category level and not at the individual level. Yet using GEE with either an exchangeable correlation (which, as you suggested, may not be apt for these data) or an independent correlation did not give me results any different from a simple GLM Poisson.
Does that mean there is no correlation or clustering at all in these data? However, we are taught that, especially with infectious disease data, the assumption of independence does not hold. I cannot figure out how to fix this.
I tried using an ICC to see whether there is clustering. However, the coefficient is very low, ~0.013. Does this mean that the correlation among observations from one cluster (here, one infection category) is almost negligible?

2) Can you please guide me on how to check whether the residuals of a model are normally distributed? Shall I make a residual plot?

3) Lastly, I am trying to model the count data, the raw platelet count itself. My basic research question is: is there an association between infection category (especially co-infection status) and platelet count? I have included other covariates as well: age, gender, BMI, etc. Can I do a simple multivariable linear regression here or, given that these are count data, should I use only Poisson regression? Is a generalised linear mixed model better suited here, given that I intend to estimate the effects at the infection-category, i.e. cluster, level? My apologies for so many queries.

Please advise.

Regards
Neha
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

21 Apr 2023, 09:41

Well, you never did say how you -xtset- your data. It may well be that your choice of clusters was such that there is independence, or something very close to that, within them. That's a substantive question. If your panel variable (the first variable in your -xtset- command) is one where you should expect strong correlation among the observations, then either something is wrong with your data, or that strong correlation is largely to entirely explained by the explanatory variables in your model.

While I would expect to see fairly strong correlations with an infectious disease outcome in "panels" defined as households, workplaces, or other units where transmission is likely to occur, I would not necessarily expect that for outcomes like thrombocytopenia, which are not directly contagious. And if an infection that predisposes to thrombocytopenia is used as a predictor variable in the model, that variable could easily absorb whatever dependence there is for the thrombocytopenia outcome within those "panels."

If your data is simply repeated observations on the same individuals, so your "panel" variable is a single person, then, for almost any outcome it would be reasonable to start by assuming that there will be appreciable correlation. But it does not always turn out to be so, particularly if the series of repeated observations extends over a long time interval.
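
A sketch of how one might check the current panel setup and gauge within-cluster correlation. Here cluster_id is a hypothetical name for the panel variable; the thread does not say which variable was used.

```stata
* With no arguments, xtset reports how the data are currently declared
* (it exits with an error if no panel variable has been set)
xtset

* Declare the panel variable (cluster_id is a hypothetical name)
xtset cluster_id

* One-way ANOVA estimate of the intraclass correlation of the outcome
loneway throm cluster_id
```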

Can you please guide me on how to check whether the residuals of a model are normally distributed? Shall I make a residual plot?

Well, if you really want to waste your time doing this, calculate the residuals with -predict- and then use the -qnorm- command. (-help qnorm- for details if not familiar.)
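
As a sketch of those two steps after a linear regression, assuming a variable named platelet_count holds the raw platelet counts (a hypothetical name; the thread does not give it):

```stata
* platelet_count is a hypothetical name for the raw platelet variable
regress platelet_count i.ccamal i.agecat i.gender_num

* Store the residuals in a new variable
predict double resid, residuals

* Quantile-normal plot: points close to the reference line
* suggest approximately normal residuals
qnorm resid
```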

But why do you want to do this? Normality of residuals is not expected in Poisson models. If you are thinking about a simple ordinary least squares linear regression, then normality of residuals is important only in very small samples. That is because in sufficiently large samples, the central limit theorem will attract the sampling distribution of the coefficient estimates to a normal distribution, so that normal-theory inference still works properly. And, frankly, there is a catch-22: if the sample is small enough that you can't rely on the central limit theorem for this, it is probably too small for an adequately powered test of normality of residuals! You probably just shouldn't even use such a sample for serious purposes. Unless your residual distribution is truly pathological, with high skewness or multimodality, if your sample size is 50 or more, the central limit theorem will work for you. Even with N = 30 you should be good in nearly all circumstances.
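
One way to explore this claim is a small simulation. The sketch below is illustrative only: the program name skewsim, the true coefficients, the chi-squared error distribution, and the seed are all arbitrary assumptions. It replaces the data in memory, so save your work first.

```stata
* Simulation sketch: strongly skewed errors, N = 50 per replication
capture program drop skewsim
program define skewsim, rclass
    drop _all
    set obs 50
    generate x = rnormal()
    * Centered chi-squared(1) errors are strongly right-skewed
    generate y = 1 + 0.5*x + (rchi2(1) - 1)
    regress y x
    return scalar b = _b[x]
end

set seed 12345
simulate b = r(b), reps(2000) nodots: skewsim

* Sampling distribution of the slope estimate across replications
summarize b
qnorm b
```

The quantile-normal plot of b shows how close the sampling distribution of the slope is to normal even though the errors themselves are far from it.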

Lastly, I am trying to model the count data, the raw platelet count itself. My basic research question is: is there an association between infection category (especially co-infection status) and platelet count? I have included other covariates as well: age, gender, BMI, etc. Can I do a simple multivariable linear regression here or, given that these are count data, should I use only Poisson regression? Is a generalised linear mixed model better suited here, given that I intend to estimate the effects at the infection-category, i.e. cluster, level? My apologies for so many queries.

Platelet counts are count data, but the numbers being counted are large. So there is hardly any distinction between a Poisson distribution and a normal distribution if you repeatedly sample platelet counts on the same person under the same conditions. I cannot tell from what you have disclosed whether this analysis will be based on a single observation per person or whether you have repeated measures. Either way, you will probably find the main difference between using linear regression and using a Poisson model is that in a linear regression the effect of the categorical predictor variable is to add a certain amount to the platelet count, whereas in a Poisson model the effect is to multiply it by a certain amount. So you should consider which of those sounds more appropriate to the real world data generating process. If you are using repeated observations on the same people, then you should use a multi-level model, either one of the -xt- commands or one of the -me- commands. If you have just a single observation per person, use ordinary -regress- or -poisson-.
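
The options above might be sketched as follows. The names platelet_count and person_id are hypothetical, and the -vce(robust)- on -poisson- is an added assumption rather than something stated in this thread.

```stata
* platelet_count and person_id are hypothetical variable names

* Single observation per person: additive effects
regress platelet_count i.ccamal i.agecat i.gender_num

* Single observation per person: multiplicative effects
poisson platelet_count i.ccamal i.agecat i.gender_num, vce(robust) irr

* Repeated observations per person: random intercept for each person
mixed platelet_count i.ccamal i.agecat i.gender_num || person_id:
mepoisson platelet_count i.ccamal i.agecat i.gender_num || person_id:, irr
```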