We have claims data for hundreds of millions of people and want to look at how emergency department (ED) visit diagnoses changed across time periods. We want to report the percent change from baseline in the rate of ED visits with diagnosis_i per 1,000 person-months of insurance enrollment at time_j.
With the related problem of case mix (the percent of ED visits with diagnosis_i at time_j), the calculations were easier because there are far fewer ED visits than people with coverage. With our computing resources (which are fixed), we can just manage an ED visit-level dataset, but we don't have enough RAM for a person-month dataset. The grad student I'm working with was using a Poisson model for case mix with standard errors clustered at the patient level. We were discussing how to manage the person-month calculations. My suggestion was a Poisson model where the dependent variable is the total number of ED visits with diagnosis_i at time_j, with person_months_coverage_j as the exposure variable. This approach can't account for repeated observations of the same people, but my guess was that the impact would be minimal because most of the diagnoses we're looking at are unlikely to recur in the same patient over our study period. Still, I wanted to check that this wouldn't be a huge issue, so I created a very simple simulation comparing the standard errors from the summary-dataset approach vs. the person-level analysis. I simulated two conditions with different correlation structures: one that's rare and can't recur in later periods (appendicitis) and one that's more common and more likely to occur at time_j if it occurred at time_(j-1) [MI/heart attack].
I used three approaches to the calculation in my little simulation: individual-level Poisson with clustered standard errors, population-averaged/GEE Poisson, and the summary analysis with no correction of the standard errors. The results agree to the third decimal place, so I'm happy with the summary approach, but the grad student is concerned about a regression with 3 observations and 3 estimated coefficients (a constant plus two time-period IRRs), which goes against all the rules we teach in econometrics. Given the results, I suspect the effective sample size in the summary regression is similar to that in the individual-level analyses, but I have no idea how to explain it. I asked two outstanding econometricians on Twitter, but neither had any suggestions.
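Here is one way to explain the equivalence (a sketch, assuming independent person-months and a saturated set of time dummies, which is our setup). Write the individual-level Poisson log-likelihood with binary outcomes y_ij and time dummies; it collapses to a function of the cell totals alone:

sum_ij [ y_ij*(b0 + b_j) - exp(b0 + b_j) ] = sum_j [ Y_j*(b0 + b_j) - N_j*exp(b0 + b_j) ]

where Y_j = total events at time_j and N_j = person-months at time_j. Up to an additive constant that doesn't involve the parameters, that is exactly the log-likelihood of the summary model Y_j ~ Poisson(N_j*exp(b0 + b_j)). So (Y_j, N_j) are sufficient statistics: the two regressions have the same MLE and the same Fisher information, hence identical model-based standard errors. Concretely, the model-based SE of the log IRR for period j is approximately sqrt(1/Y_1 + 1/Y_j), so the effective sample size is the number of events, not the 3 rows of the summary dataset. Clustering can only change things through within-person correlation of events, which is negligible for rare or non-recurring conditions.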
Note on code: I used Ben Jann's eststo and esttab to display results
Code:
clear
set obs 3000000
gen id=_n
gen t=3
expand t
bysort id : gen time=_n
drop t

// Rare condition that cannot recur (appendicitis)
gen appendicitis=0
replace appendicitis=1 if runiform()<.0006 & time==1
replace appendicitis=1 if runiform()<.0009 & time==2
replace appendicitis=1 if runiform()<.0003 & time==3
xtset id time
replace appendicitis=0 if L.appendicitis==1
replace appendicitis=0 if L2.appendicitis==1

// Common condition that is more likely to recur after a prior event (MI)
gen mi=0
replace mi=1 if runiform()<.01 & time==1
replace mi=1 if runiform()<.013 & time==2
replace mi=1 if runiform()<.011 & time==3
replace mi=1 if L.mi==1 & runiform()<.15

// Appendicitis: clustered Poisson, GEE, and summary-dataset Poisson
poisson appendicitis i.time, vce(cluster id) irr
eststo a_cluster
xtgee appendicitis i.time, family(poisson) link(log) corr(exchangeable) vce(robust) eform
eststo a_gee
preserve
contract time appendicitis
reshape wide _freq, i(time) j(appendicitis)
gen exposure=_freq0+_freq1
rename _freq1 appendicitis
poisson appendicitis i.time, exposure(exposure) irr
eststo a_summary
restore

// MI: same three approaches
poisson mi i.time, vce(cluster id) irr
eststo m_cluster
xtgee mi i.time, family(poisson) link(log) corr(exchangeable) vce(robust) eform
eststo m_gee
preserve
contract time mi
reshape wide _freq, i(time) j(mi)
gen exposure=_freq0+_freq1
rename _freq1 mi
poisson mi i.time, exposure(exposure) irr
eststo m_summary
restore

esttab a_cluster a_gee a_summary, b(3) ci(3) eform mtitles drop(1.time)
esttab m_cluster m_gee m_summary, b(3) ci(3) eform mtitles drop(1.time)