Clustering in Panel Dataset

Rohan Kundra

Join Date: Jun 2025

Posts: 2
#1


29 Jun 2025, 06:23

Hello,

I have an unbalanced panel dataset with 187,805 observations from healthcare surveys taken roughly every two years over a 16-year period in 15 European countries.

I am doing research on two dependent variables:

Code:

doctor visits in the previous 12 months (0 to 98 visits)

hospital stays in the previous 12 months (yes or no)

Following are my independent variables:

Code:

age (65 and above years of age)

gender (1 = female)

education (low, medium, high)

household income (6 categories)

disease level (none, 1-2 diseases, 3-4 diseases, 5 or more diseases)

depression level (scale ranging from 0 to 12 - lowest to highest depression)

lagged physical activities - using values of previous survey (more than once a week, once a week, once a month, never)

Since doctor visits is a count variable that is right-skewed and overdispersed, I proceeded with the following negbin model:

Code:

xtnbreg doctor_visits age i.female i.education i.hhincome i.disease_level depression_level i.physical_activities

However, I am unable to cluster at the individual level: adding vce(cluster id) fails for both fixed and random effects. Stata says vcetype cluster not allowed.
Also, the fixed effects negbin model estimates the gender variable, which in fixed effects regression shouldn't happen. I checked, no one in the dataset changed their gender. So, it is time-invariant.

I read in this forum that the negbin model isn't recommended most of the time. So, I tried to run a regression with xtpoisson. Clustering works for random effects, but not for fixed effects.

Now, coming to the second dependent variable, I am using the following command for a fixed effects logistic regression. However, I am still stuck at clustering at the individual level. This command works for random effects, but not for fixed effects. Stata says vcetype cluster not allowed.

Code:

xtlogit hospitalised age i.female i.education i.hhincome i.disease_level depression_level i.physical_activities

My plan was to use clustering for the above individual models, then do a Hausman test on the models without clusters, which, by the way, prefers fixed effects.

I would be really grateful for advice on the above clustering issue.

Also, how should I proceed further if a variable category is significant in both FE and RE, but has opposite signs?

Thanks in advance.

Last edited by Rohan Kundra; 29 Jun 2025, 06:31.
George Ford

Join Date: Aug 2014

Posts: 3152
#2

30 Jun 2025, 16:48

Is the data xtset on individuals?

As long as you use some type of robust standard error, you can use Poisson rather than negbin.

You might try ppmlhdfe, which can handle high-dimensional fixed effects. xtpoisson might get bogged down if id==patient.
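
As a rough sketch of the two routes mentioned above: the variable names are taken from the first post, and it is an assumption that the community-contributed ppmlhdfe (with its dependencies ftools and reghdfe) has been installed from SSC. The `///` line continuations only work in a do-file.

```stata
* Assumes the data are already declared as a panel: xtset id year

* Fixed-effects Poisson with cluster-robust standard errors.
* i.female is left out because it is time-invariant and cannot be
* identified from within-person variation.
xtpoisson doctor_visits age i.education i.hhincome i.disease_level ///
    depression_level i.physical_activities, fe vce(robust)

* The same kind of model via ppmlhdfe, absorbing the person effects
* and clustering the standard errors on the person identifier.
ppmlhdfe doctor_visits age i.education i.hhincome i.disease_level ///
    depression_level i.physical_activities, absorb(id) vce(cluster id)
```

The point estimates from the two commands should be very close; the standard errors can differ slightly because of different small-sample adjustments and because the commands do not drop exactly the same uninformative observations.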
George Ford

Join Date: Aug 2014

Posts: 3152
#3

30 Jun 2025, 16:50

An LPM might be better for the second model. It should give you nearly the same marginal effects (though not predictions). If predictions are of interest, I think if you convert age to a categorical variable it should predict on the unit interval.
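
A minimal sketch of such an LPM, assuming the variable names from the first post and data that are already xtset on id. The name p_hat is made up for illustration.

```stata
* Linear probability model with person fixed effects;
* standard errors clustered on the panel identifier.
xtreg hospitalised age i.education i.hhincome i.disease_level ///
    depression_level i.physical_activities, fe vce(cluster id)

* Fitted probabilities including the estimated person effect.
predict double p_hat, xbu

* Count fitted values that fall outside the unit interval.
count if e(sample) & !missing(p_hat) & (p_hat < 0 | p_hat > 1)
```

If the count is large, the predictions (though not necessarily the marginal effects) deserve some caution.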
Rohan Kundra

Join Date: Jun 2025

Posts: 2
#4

01 Jul 2025, 11:47

George Ford Thank you for your response!

Yes, the panel data is structured using the individual ID:

Code:

xtset id year


I was able to run the following regression:

Code:

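xtpoisson doctor_visits age i.female i.education i.hhincome i.disease_level depression_level i.physical_activities, fe vce(robust)
xtpoisson doctor_visits age i.female i.education i.hhincome i.disease_level depression_level i.physical_activities, re vce(robust)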

The Stata output states "Std. err. adjusted for clustering on id". So, I guess Stata internally adjusts for clustering when vce(robust) is specified in xtpoisson, which I assume is the same as vce(cluster id)?

As far as the coefficients are concerned, the main independent variables (age, disease level and depression level) are highly significant in all negbin and poisson models. Other variable categories like education, hhincome and physical activities differ in significance depending on the model.

Now coming to the second model with hospitalised, the only cause of concern is that the variable gender behaves counter-intuitively in both xtlogit and xtreg with random effects: females have lower odds of being hospitalised than males.

The same goes for the never category of physical activities - "Never" is associated with lower log-odds of hospitalisation in both the fixed effects models, but behaves as expected (+ sign) in random effects.

Thanks in advance.

Last edited by Rohan Kundra; 01 Jul 2025, 11:58.
George Ford

Join Date: Aug 2014

Posts: 3152
#5

02 Jul 2025, 15:26

I think robust = cluster in all the xt commands.

You might try ppmlhdfe. Probably faster but should give you pretty much the same results.

You might think about xtreg with the cre option for the second model, which mixes FE and RE. The changing signs are worrisome.
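
In case the cre option is not available in the installed version of Stata, the same idea can be approximated by hand with the Mundlak device: add the person means of the time-varying regressors to the RE model. This is only a sketch under assumptions: variable names are from the first post, the categorical variables are assumed to have non-negative integer codes, and insample and the m_* names are made up for illustration.

```stata
* Mark the estimation sample so the person means use the same rows as the model.
quietly xtreg hospitalised age i.female i.education i.hhincome ///
    i.disease_level depression_level i.physical_activities, re
generate byte insample = e(sample)

* Person means of the continuous time-varying regressors.
foreach v of varlist age depression_level {
    bysort id: egen double m_`v' = mean(cond(insample, `v', .))
}

* Person means of each category indicator.
foreach v of varlist education hhincome disease_level physical_activities {
    quietly levelsof `v' if insample, local(levels)
    foreach l of local levels {
        bysort id: egen double m_`v'_`l' = mean(cond(insample, `v' == `l', .))
    }
}

* RE model augmented with the person means; redundant means are
* dropped automatically as collinear.
xtreg hospitalised age i.female i.education i.hhincome i.disease_level ///
    depression_level i.physical_activities m_* if insample, re vce(cluster id)

* Joint test of the mean terms: a cluster-robust analogue of the Hausman test.
testparm m_*
```

The coefficients on the time-varying regressors in the augmented model should be close to the FE estimates, while female stays in the model. A rejection in the testparm step points toward FE, in the same spirit as the Hausman test but with clustered standard errors.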

I'd think about doing a lot of tabulations across the variables in the second model, including age. The coefficients are ceteris paribus, and one of the variables may be having a strong influence and making things look a little funky. You could also start with either female or activity only, then add variables to see if the sign flips. Then give the data a good look.
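
One possible way to organise those checks, sketched with the variable names from the first post. The stored-estimate names m1 to m3 are arbitrary, and the order in which variables are added is only an example.

```stata
* Raw cross-tabulations of the outcome and the suspicious regressors.
tabulate female hospitalised, row
tabulate physical_activities hospitalised, row
tabulate female disease_level, row

* Build up the RE logit step by step and watch the sign on female.
quietly xtlogit hospitalised i.female, re
estimates store m1
quietly xtlogit hospitalised i.female age, re
estimates store m2
quietly xtlogit hospitalised i.female age i.disease_level depression_level, re
estimates store m3

* Side-by-side coefficients and standard errors for the three models.
estimates table m1 m2 m3, b(%9.3f) se
```

The step at which the coefficient on female changes sign shows which added variable is driving the result.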