Hi all,
I am using a large database to study length of hospital stay in patients with a rare disease. I have data on 4,326 patients who had a total of 16,428 hospitalizations over the study period. Many of the patients have frequent hospitalizations, so there are multiple observations (hospitalizations) per patient. I want to perform a simple linear regression analysis in which my independent variable is "median household income" (dichotomized as above or below the median) and my dependent variable is "length of hospital stay". I would like to cluster the standard errors in this regression by patient.
I used the following code to perform linear regression:
"regress length_of_stay ib2.hhi_abovebelow, vce(cluster patient_id)"
Where:
- ib2.hhi_abovebelow = Household income (above and below median)
- patient_id = unique patient identifier
The coefficient on household income is statistically significant, and the standard errors are adjusted for the 4,326 clusters (each cluster representing a patient).
Problem: After looking at the total cohort (16,428 hospitalizations in 4,326 patients), I used the same command ("regress length_of_stay ib2.hhi_abovebelow, vce(cluster patient_id)") to look only at the first hospitalization per patient, after dropping all subsequent hospitalizations. In this case, I had 4,326 patients with 4,326 hospitalizations. The output from this run is exactly the same as the output I got for the total cohort.
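To make the comparison concrete, here is a minimal sketch of how the two runs could be set up and checked side by side. It is not necessarily the exact code I ran: admit_date is a hypothetical admission-date variable used to identify each patient's first hospitalization, and the display lines simply report how many observations and clusters each regression used, via the stored results e(N) and e(N_clust).

```stata
* Sketch only: admit_date is a hypothetical admission-date variable
* used to pick each patient's first hospitalization.

* Run 1: full cohort, standard errors clustered by patient
regress length_of_stay ib2.hhi_abovebelow, vce(cluster patient_id)
display "observations: " e(N) "   clusters: " e(N_clust)

* Run 2: first hospitalization only
* preserve/restore keeps the full data set intact afterwards
preserve
bysort patient_id (admit_date): keep if _n == 1
count
regress length_of_stay ib2.hhi_abovebelow, vce(cluster patient_id)
display "observations: " e(N) "   clusters: " e(N_clust)
restore
```

If the drop took effect and no values are missing, the number of observations should differ between the two runs (16,428 versus 4,326) while the number of clusters stays at 4,326 in both.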
Questions:
- When I use the vce(cluster) option, is this simply telling Stata to include one observation (hospitalization) per cluster (patient) and drop all additional observations within that cluster?
- Is there a better approach I should use to perform linear regression that accounts for the fact that the same patient will have multiple hospitalizations?
Thanks in advance.
Apologies for not using dataex - please let me know if it will be useful for this query.