An "epidemiological" opinion on the use of Poisson regression for a Covid-19 project

Gianfranco Di Gennaro

Join Date: Oct 2020

Posts: 140
#1

17 Aug 2021, 12:18

Dear all, thank you in advance for your attention.

If you have time, I have an epidemiological question.

I have a dataset of about 300 records: the list of admissions to an emergency room.
For each admission I have the date of admission and other covariates.

I would like to compare the post-Covid period (say, from March 9, 2020 to June 30, 2021) with a comparable pre-Covid period (say, from March 9, 2018 to June 30, 2019).
I would like to compare the two periods in terms of the number and characteristics of hospitalizations.
I use a binary variable "prepostcovid" to identify the period.

I wonder whether you think it is reasonable to use a Poisson regression.

In practice, for the outcome I would create a "count" variable that is always equal to 1, that is, the number of hospitalizations for each patient (I have no repeated hospitalizations).

Then in Stata I run:
poisson count i.prepostcovid covariate1 covariate2 etc.

and get the IRR.

I know, of course, that I have to check all the assumptions (overdispersion and so on).

The question is: do you think it makes sense to speak of "incidence", given that I am considering only the admissions and not the people who were not hospitalized?
Tags: covid, incidence, poisson
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

17 Aug 2021, 12:47

This is not going to work the way I think you want it to. Since you have no repeat admissions for any patient, you will simply discover what you already know: the incidence rate of admissions per patient is 1 in both periods, and the IRR is 1.
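
To see why, here is a minimal sketch on simulated data. The 300 rows and the even split between the two periods are made up for illustration and are not the poster's data. Because count is 1 on every row, the mean count is 1 in both periods, so the fitted IRR should come out as 1 whatever the split:

```stata
* Toy data (hypothetical): 300 admissions, one row each, two periods
clear
set obs 300
gen byte prepostcovid = _n > 150    // first 150 rows "pre", remaining 150 "post"
gen byte count = 1                  // every admission contributes exactly one event
poisson count i.prepostcovid, irr   // the outcome never varies, so there is nothing to compare
```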

What might make more sense is to estimate the incidence of admissions per unit of time. This is only feasible if you have a variable that gives the date of each admission. Assuming you do, and that it's called adm_date, and that it's a real Stata internal format date variable, then you could, for example, look at the rate of admissions per month as follows:

Code:

gen int mdate = mofd(adm_date)
format mdate %tm
gen long obs_no = _n
collapse (count) admissions = obs_no (first) prepostcovid, by(mdate)
* Days in the month: first day of the next month minus first day of this month
gen int month_duration = dofm(mdate+1) - dofm(mdate)
* Adjust for the different durations of different months
poisson admissions i.prepostcovid, irr exposure(month_duration)

Note that I do not include your covariates here. You said you want to compare the characteristics of admissions in the two periods, so you should be doing analyses in which these are the outcome variables, and prepostcovid is the predictor. As you do not say anything specific about those characteristics, I cannot advise you what kind of models might be appropriate for this purpose.

There is nothing magical about months here. You could choose to look per quarter or per week instead. But with a total of about 300 admissions scattered over a bit more than 900 days, you are averaging about one admission every 3 days, or roughly 10 per month. Aggregating to the monthly level will smooth out some of the noise but still leave you with a sufficient number of observations (months) to do a simple regression like this one. If you use a shorter aggregating interval like week, you will have a larger sample size, but the data will be noisier.
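
The same recipe carries over to other aggregation intervals. As a rough sketch for quarters, assuming one row per admission with a Stata daily date adm_date and the indicator prepostcovid in memory (qdate and quarter_duration are names introduced here for illustration only):

```stata
* Assumes admission-level data in memory: adm_date (daily date), prepostcovid (0/1)
gen int qdate = qofd(adm_date)                              // quarterly date of each admission
format qdate %tq
gen long obs_no = _n
collapse (count) admissions = obs_no (first) prepostcovid, by(qdate)
gen int quarter_duration = dofq(qdate + 1) - dofq(qdate)    // days in each quarter
poisson admissions i.prepostcovid, irr exposure(quarter_duration)
```

With the two windows described in #1, this would leave only about a dozen quarterly observations, which shows the trade-off between smoothing and sample size.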

Bear in mind that since you are looking only at people who are being admitted, you are not getting any estimate of the incidence of admissions in any particular population of people. Rather you are getting the incidence of arrivals at your facility. There is nothing wrong with that, as long as you understand what it is.
Gianfranco Di Gennaro

Join Date: Oct 2020

Posts: 140
#3

17 Aug 2021, 15:17

Thank you very much, Dr Schechter, for your quick and thorough reply.
It will be very useful to me.
Of course, I have the date of each hospitalization. In the code I posted I forgot to include the exposure. It would have been something like:
gen daysfrombegin = admission date - start date of the period
The model code would have been:
poisson count i.prepostcovid covariate1 covariate2 etc., exposure(daysfrombegin)

Concerning this point: "Note that I do not include your covariates here. You said you want to compare the characteristics of admissions in the two periods, so you should be doing analyses in which these are the outcome variables, and prepostcovid is the predictor. As you do not say anything specific about those characteristics, I cannot advise you what kind of models might be appropriate for this purpose."

My variables describe the demographic and clinical characteristics of the patients, such as gender, age, comorbidities, etc.

I don't know whether it makes sense to include them in the model you suggested with your code.

I tried fitting a logistic model with the binary prepostcovid variable as the dependent variable and the patient characteristics as predictors. However, the results are not easy to interpret.

But I don't want to take up any more of your time; you have been really very kind.
Thank you!
Gianfranco
Gianfranco Di Gennaro

Join Date: Oct 2020

Posts: 140
#4

28 Sep 2021, 02:16

Dear Clyde Schechter, I ran your script for the problem discussed above. However, the IRR I get is simply the ratio of the total number of hospitalizations in the two periods.
I therefore understand that what has been estimated is not a ratio of monthly incidences but a simple overall ratio of the two incidences over the entire period.

I wonder whether I did something wrong.
Thanks as always. I hope I am not bothering you.
Gianfranco
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

28 Sep 2021, 10:43

"I wonder if I did something wrong or not."

I wonder, too. But there is no way to tell from a brief narrative. To troubleshoot requires showing the exact code that you ran and an example of the data you ran it on, as well as the output you got from Stata (including any messages).

(Use -dataex- to show the data example.) If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
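
As a minimal sketch, a -dataex- call for this problem might look like the following. The variable names are the ones that appear in the data example posted in this thread; substitute your own:

```stata
* Only needed on older Stata versions where -dataex- is not already installed:
* ssc install dataex

* Writes a copy-and-paste data example for the listed variables to the Results window
dataex Sex Age datein prepost
```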
George Ford

Join Date: Aug 2014

Posts: 3153
#6

28 Sep 2021, 11:01

If you use xtpoisson, then use "robust" or "cluster" and you don't have to worry about overdispersion. See Wooldridge, Journal of Econometrics (1999).
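
A hedged sketch of what that could look like here. -xtpoisson- is meant for panel data declared with -xtset-; the monthly counts discussed above form a single series, so the closest analogue is plain -poisson- with robust standard errors. The variable names assume the collapsed monthly data from #2 are in memory:

```stata
* Assumes collapsed monthly data in memory: admissions, prepostcovid, month_duration
poisson admissions i.prepostcovid, irr exposure(month_duration)
estat gof      // deviance and Pearson goodness-of-fit tests; small p-values hint at overdispersion

* Same point estimates, but standard errors that do not rely on variance = mean
poisson admissions i.prepostcovid, irr exposure(month_duration) vce(robust)
```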

Gianfranco Di Gennaro

Join Date: Oct 2020
Posts: 140

05 Oct 2021, 06:00

Dear Clyde Schechter,
here is the -dataex- output and, of course, thanks again.

The code I ran is:
gen int mdate = mofd(datein)
format mdate %tm
gen long obs_no = _n
collapse (count) admissions = obs_no (first) prepost, by(mdate)
gen int month_duration = (dofm(mdate+1) - dofm(mdate)) + 1
poisson admissions i.prepost, irr exposure(month_duration)

In the dataset you can see:
"prepost" (0 = pre-Covid, 1 = post-Covid)
"datein" (the date of hospital admission)

Then I have Sex and Age (what if I want to adjust for these variables?).
Thank you again.
Gianfranco

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Sex byte Age int datein float prepost
"F" 14 21292 0
"F" 15 21358 0
"F" 16 21303 0
"F" 14 21262 0
"F" 14 21283 0
"F" 16 21522 0
"F" 17 21287 0
"F" 17 21343 0
"F" 17 21516 0
"F" 17 21516 0
"F" 15 21309 0
"F" 14 21366 0
"F" 13 21308 0
"F" 16 21260 0
"F" 16 21345 0
"F"  6 21410 0
"F" 12 21525 0
"F" 15 21505 0
"F" 12 21407 0
"F" 12 21340 0
"F" 10 21286 0
"F" 13 21502 0
"F" 17 21601 0
"F" 10 21656 0
"F" 12 21680 0
"F" 12 21690 0
"F" 12 21705 0
"F" 12 21653 0
"F" 13 21651 0
"F" 15 21704 0
"M" 15 21356 0
"M" 11 21257 0
"M" 13 21534 0
"M" 17 21287 0
"M" 16 21314 0
"M" 14 21472 0
"M" 10 21406 0
"M" 13 21324 0
"M" 14 21395 0
"M" 15 21448 0
"M" 13 21530 0
"M" 17 21493 0
"M" 14 21286 0
"M"  8 21576 0
"M" 15 21703 0
"M" 16 21582 0
"M" 10 21676 0
"M" 13 21574 0
"M" 15 21605 0
"F" 14 21991 1
"F" 10 22056 1
"F" 10 22075 1
"F" 17 22091 1
"F" 12 22107 1
"F" 15 22159 1
"F" 16 22173 1
"F" 16 22188 1
"F" 16 22191 1
"M" 16 21987 1
"M" 15 21988 1
"M" 14 22019 1
"M" 17 22020 1
"M" 15 22046 1
"M" 17 22047 1
"M" 14 22056 1
"M" 10 22200 1
"F" 16 22203 1
"F" 15 22205 1
"M" 15 22214 1
"M" 14 22216 1
"F" 16 22232 1
"M" 11 22256 1
"F" 16 22281 1
"F" 14 22284 1
"M" 14 22288 1
"F" 15 22307 1
"F" 16 22308 1
"M" 16 22320 1
"F" 14 22333 1
"M" 10 22333 1
"F" 14 22336 1
"F" 13 22343 1
"F" 14 22350 1
"F" 16 22356 1
"F" 15 22361 1
"M" 15 22361 1
"F" 14 22364 1
"M" 16 22365 1
"F" 15 22368 1
"F" 15 22369 1
"M" 17 22371 1
"M" 15 22383 1
"M" 14 22384 1
"F" 16 22385 1
"M" 15 22385 1
"F" 15 22390 1
"F" 15 22392 1
"F" 13 22404 1
"F" 15 22410 1
"M" 16 22417 1
end
format %tdnn/dd/CCYY datein


Clyde Schechter

Join Date: Apr 2014
Posts: 30117

05 Oct 2021, 09:18

I cannot reproduce the problem you report in #4 with your data and code.

Code:

. poisson admissions i.prepost, irr exposure(month_duration)

Iteration 0:   log likelihood = -58.549461  
Iteration 1:   log likelihood = -58.549461  

Poisson regression                                      Number of obs =     29
                                                        LR chi2(1)    =   0.29
                                                        Prob > chi2   = 0.5889
Log likelihood = -58.549461                             Pseudo R2     = 0.0025

------------------------------------------------------------------------------
  admissions |        IRR   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   1.prepost |   1.114147   .2228739     0.54   0.589     .7527798    1.648985
       _cons |    .104034    .014862   -15.84   0.000     .0786276    .1376497
ln(month_~n) |          1  (exposure)
------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.

This contrasts with the raw ratio of total admissions:

Code:

. tabstat admission, by(prepost) statistics(sum)

Summary for variables: admissions
Group variable: prepost ((first) prepost)

 prepost |       Sum
---------+----------
       0 |        49
       1 |        51
---------+----------
   Total |       100
--------------------

. display 51/49
1.0408163

So you are getting the adjustment for month duration you want.
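
For anyone who wants to check this by hand, here is a small sketch. It assumes the collapsed monthly data from the code above (admissions, prepost, month_duration) are in memory; rate_per_day is a name introduced here for illustration. The IRR is the ratio of the two crude rates per day, and the rate for prepost = 0 should agree with _cons in the -poisson- output:

```stata
* Assumes collapsed monthly data in memory: admissions, prepost, month_duration
preserve
collapse (sum) admissions month_duration, by(prepost)   // totals per period, sorted by prepost
gen double rate_per_day = admissions / month_duration   // crude admissions per day of exposure
list prepost admissions month_duration rate_per_day, noobs
display "IRR by hand = " rate_per_day[2] / rate_per_day[1]
restore
```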

I don't understand what you mean about adjusting for age or sex in this context. For that, you would need statistics about the age and sex distributions of the underlying population from which the admissions are drawn in each month (or at least in the pre- and post-Covid eras.) You have only the age and sex of the admissions themselves, which is probably different.

If you want age group- and sex-specific incidence rates for admissions, which is a different matter, then you need to settle on some categorization of age for the purpose, create an agegroup variable, and then collapse your original data -by(mdate sex agegroup)-. Then you would run your -poisson- regression separately in each stratum: -by sex agegroup, sort: poisson ...-.
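
A rough sketch of those steps, assuming the original admission-level data (Sex, Age, datein, prepost) are in memory. The age cut-points are purely hypothetical, and prepost is added to the by() list so that it survives the collapse:

```stata
* Hypothetical age categories; choose clinically meaningful cut-points instead
recode Age (min/9 = 1 "0-9") (10/13 = 2 "10-13") (14/max = 3 "14+"), gen(agegroup)

gen int mdate = mofd(datein)
format mdate %tm
gen long obs_no = _n
collapse (count) admissions = obs_no, by(mdate prepost Sex agegroup)
gen int month_duration = dofm(mdate + 1) - dofm(mdate)   // days in each month

* One Poisson model per sex-by-age-group stratum
by Sex agegroup, sort: poisson admissions i.prepost, irr exposure(month_duration)
```

With a small dataset some strata will be sparse, and a stratum with too few rows may stop the -by:- run with an error. Note also that stratum-months with no admissions do not appear after the collapse.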


Gianfranco Di Gennaro

Join Date: Oct 2020

Posts: 140
#9

06 Oct 2021, 01:52

Thank you, Clyde Schechter.
But this is incredible: when I run the Poisson regression on the entire dataset, I get exactly the same value as the ratio of the raw numbers.
I can't understand it.
Thanks again
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

06 Oct 2021, 09:20

Well, it may just be that coincidentally the distribution of the events over the months of different durations is such that the duration-corrected incidence rate ratio is equal to the raw ratio. Look, there isn't all that much variation in month duration. Roughly equal numbers of months are 30 and 31 days, and the worst case scenario of 28 days only happens in one month each year. So I wouldn't expect the duration correction to be dramatic in any case. That it might coincidentally turn out to have no effect in some data isn't totally astonishing.
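
Writing the estimator out makes this less surprising. For a Poisson model with a single binary predictor and an exposure term, the fitted IRR is the ratio of the two crude rates. This is a standard result, given here as a sketch; the symbols $Y_g$ (total admissions in period $g$) and $T_g$ (total exposure days over the months of period $g$ that appear in the collapsed data) are introduced for illustration:

```latex
\widehat{\mathrm{IRR}}
  = \frac{Y_1 / T_1}{Y_0 / T_0}
  = \frac{Y_1}{Y_0} \cdot \frac{T_0}{T_1},
\qquad\text{so}\qquad
T_1 = T_0 \;\Longrightarrow\; \widehat{\mathrm{IRR}} = \frac{Y_1}{Y_0}.
```

If the two windows cover the same calendar months (March through June of the following year, as described in #1) and every month has at least one admission, then $T_1 = T_0$ and the duration-corrected ratio equals the raw ratio exactly. In the 100-record example, months with no admissions drop out of the collapse (29 monthly observations remain), so the exposures differ and the estimate of 1.114 departs from 51/49 ≈ 1.041.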
Gianfranco Di Gennaro

Join Date: Oct 2020

Posts: 140
#11

07 Jun 2022, 09:28

Dear Clyde Schechter, I thank you in advance.
I have a question in response to your suggestion above:
https://www.statalist.org/forums/for...42#post1623742

Last year you told me that a reasonable idea was to estimate the monthly IRR.
I wonder, with the same dataset, how it would be possible to estimate the effect of covariates (e.g. age and sex) on the monthly IRR, and whether these effects change between pre- and post-Covid.
It's an interaction term, I know.
However, what is not clear to me is how to organize the dataset. Obviously, when I collapse the data (as you suggested) I lose all my covariates.

PS: I'm aware it's a very old post and I apologize for that.

Gianfranco
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#12

07 Jun 2022, 10:00

"Obviously, when I collapse data (as you suggested) I lose all my covariates."

You don't lose the covariates if you include them in your collapse command. At least, you won't lose discrete ones. Age might be a problem. But perhaps not. In the example data you show, age ranges only between 6 and 17, and there are only 10 distinct values. If the rest of your data is similar, you can:

Code:

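gen int mdate = mofd(datein)
format mdate %tm
gen long obs_no = _n
collapse (count) admissions = obs_no, by(mdate prepost Age Sex)
* Days in the month: first day of the next month minus first day of this month
gen int month_duration = dofm(mdate+1) - dofm(mdate)
* Sex is a string variable; encode it so it can be used as a factor variable
encode Sex, gen(sex)
poisson admissions i.prepost##(i.sex c.Age), irr exposure(month_duration)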

I'm not confident that treating Age as a linear continuous variable is appropriate. Most conditions show rather non-linear age trends during childhood, so it might make more sense to break Age into some age categories and use those instead. Or use a cubic spline for age (-help mkspline-).
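
Both alternatives can be sketched as follows. This assumes the collapsed data from the code above (admissions, prepost, sex, Age, month_duration) are in memory. The age cut-points, the choice of 3 knots, and the names agegroup and age_rcs are illustrative assumptions, not recommendations:

```stata
* Option 1: age categories (hypothetical cut-points)
recode Age (min/9 = 1 "0-9") (10/13 = 2 "10-13") (14/max = 3 "14+"), gen(agegroup)
poisson admissions i.prepost##(i.sex i.agegroup), irr exposure(month_duration)
testparm i.prepost#i.agegroup    // joint test: does the age pattern differ between periods?

* Option 2: restricted cubic spline; 3 knots create age_rcs1 and age_rcs2
mkspline age_rcs = Age, cubic nknots(3)
poisson admissions i.prepost##(i.sex c.age_rcs1 c.age_rcs2), irr exposure(month_duration)
```

Either way, the prepost-by-covariate interaction terms are the ones that address whether the covariate effects changed between the two periods.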
