
  • Problem with clustering

Dear Statalisters,

I have encountered a problem with clustering, and I hope to receive your advice on it.
I am examining the effect of a policy on spousal earnings using two-way FE (occupation and time) with a continuous treatment. Since the policy affects both spouses at the same time, my equation looks like this:

log(earnings of wife) = a1*eligibility of husband + a2*eligibility of wife + a3*occupation of husband + a4*occupation of wife + a5*year + other controls

Clustering at the level of the husband's occupation and clustering at the level of the wife's occupation give me the same estimate but very different standard errors.
Since my main interest is a1, should I cluster at the husband's occupation? Is it okay to do so?

Alternatively, I am thinking about creating a composite category variable based on the occupations of both husband and wife in the year before treatment, following the guide of Nick Cox (https://www.stata.com/statalist/arch.../msg00095.html):
Code:
egen both_occ = group(occ_2016_hus occ_2016_wf), label
Do you think clustering at "both_occ" is more appropriate than clustering at either the husband's or the wife's occupation?
Another reason for considering the composite cluster "both_occ" is that if I cluster at either the husband's or the wife's occupation, the total number of clusters is below 50, and I have read that a rule of thumb is to have at least 50 clusters.

    Thanks in advance.
    Last edited by June Le; 14 Sep 2023, 07:24.

  • #2
June:
as far as I can tell from your description, you're actually investigating couples.
Therefore, I would give:
Code:
egen both_occ = group(occ_2016_hus occ_2016_wf), label
a shot as the cluster variable.
However, if you have a panel dataset, it remains to be seen whether the aforementioned -both_occ- clashes with the way your dataset was -xtset-.
That said, if you have at least 30 clusters, I think you can safely use clustered standard errors.
I'd also take a look at the literature in your research field and see what others did in the past when presented with the same research design/goals.
    Kind regards,
    Carlo
    (Stata 19.0)
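Putting the suggestion above together, a minimal sketch might look as follows (assuming the variable names from the -dataex- excerpt in #3, and using -xtreg, fe- with the composite occupation pair as the cluster variable; the individual FE absorb any time-invariant occupation effects):
Code:
* build the composite cluster from the 2016 occupations of both spouses
egen both_occ = group(occ_2016_hus occ_2016_wf), label
* declare the panel and run the FE regression, clustering on the occupation pair
xtset id16 year
xtreg lincome_year eligibility_2016_hus eligibility_2016_wf i.year, fe vce(cluster both_occ)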



    • #3
      Dear Carlo,
      Thank you for your kind suggestion.

Please find herewith my data sample:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input double(hid16 id16 hid18 id18) float(occ_2016_hus occ_2016_wf eligibility_2016_hus eligibility_2016_wf year hours lhours lincome_year)
      314101 31410101 301413 30141301 18 18   .7405373     .7405373 2016   60 4.0943446  14.01436
      314101 31410101 301413 30141301 18 18   .1552688     .1552688 2018 72.5 4.2835865 14.472904
      314101 31410102 301413 30141302 18 18   .7405373     .7405373 2016   60 4.0943446 14.166167
      314101 31410102 301413 30141302 18 18   .1552688     .1552688 2018 72.5 4.2835865 14.472904
      314009 31400901 301417 30141701 74 92  .03204585   .009122826 2016   44   3.78419 14.457364
      314009 31400901 301417 30141701 74 92          0 3.785157e-06 2018   44   3.78419 14.696048
      314009 31400902 301417 30141702 74 92  .03204585   .009122826 2016   44   3.78419  14.31931
      314009 31400902 301417 30141702 74 92          0 3.785157e-06 2018   44   3.78419 14.300533
      505006 50500601 500504 50050401 22 22   .1749318     .1749318 2016   52 3.9512436  13.77469
      505006 50500601 500504 50050401 22 22 .003453914   .003453914 2018 49.5  3.901973 14.131978
      505006 50500602 500504 50050402 22 22   .1749318     .1749318 2016   52 3.9512436  13.96393
      505006 50500602 500504 50050402 22 22 .003453914   .003453914 2018   48  3.871201  14.27038
      505015 50501501 500509 50050901 28 18   .3909481     .7405373 2016   56 4.0253515         .
      505015 50501501 500509 50050901 28 18  .02927846     .1552688 2018   70  4.248495         .
      505015 50501502 500509 50050902 28 18   .3909481     .7405373 2016   63 4.1431346  13.77469
      505015 50501502 500509 50050902 28 18  .02927846     .1552688 2018   84 4.4308167         .
      505107 50510701 500514 50051401 93 15         .2    .14916484 2016   50  3.912023         .
      505107 50510701 500514 50051401 93 15       .014   .025445513 2018   50  3.912023         .
      505107 50510702 500514 50051402 93 15         .2    .14916484 2016   39 3.6635616         .
      505107 50510702 500514 50051402 93 15       .014   .025445513 2018 6.25 1.8325815         .
      end

My dataset was -xtset- as follows:
Code:
xtset id16 year

If I use either the husband's or the wife's occupation, there are about 30 (sometimes 29) clusters.
If I use clusterid, there are 185 clusters.
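For reference, one way to count the number of distinct clusters is the following sketch (assuming the composite variable is named clusterid):
Code:
* tag one observation per cluster, then count the tags
egen cluster_tag = tag(clusterid)
count if cluster_tag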

      I hope to receive your comments on my case.
      Many thanks.



      • #4
        June:
the difference in the number of clusters presumably depends on the combinations of wife's and husband's occupations, which may change over time.
        What if you cluster on -hid16-?
        Kind regards,
        Carlo
        (Stata 19.0)
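Clustering on the household identifier, as suggested above, could be sketched as (assuming the variable names from the -dataex- excerpt in #3):
Code:
* cluster at the 2016 household level instead of the occupation pair
xtreg lincome_year eligibility_2016_hus eligibility_2016_wf i.year, fe vce(cluster hid16)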



        • #5
          Dear Carlo,

I created the clusterid based on the wife's and husband's initial occupations, i.e. those in 2016. The eligibility (treatment) variable was also based on initial occupation.
My equation includes individual FE, occupation FE, and year FE.
I am not sure, and am really concerned about, which level of clustering is appropriate in my case.
I thought standard errors should be clustered at the level of treatment assignment, which in my case is the occupation level. There is also heterogeneity in my treatment effect.
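With individual, occupation, and year FE, one possible sketch uses the community-contributed -reghdfe- command (installable via ssc install reghdfe; variable names assumed from the -dataex- excerpt in #3):
Code:
* absorb individual, husband's-occupation, and year fixed effects,
* clustering at the composite occupation-pair level
reghdfe lincome_year eligibility_2016_hus eligibility_2016_wf, absorb(id16 occ_2016_hus year) vce(cluster clusterid)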

          I would like to hear your suggestion on that.

          Many thanks,



          • #6
            June:
            I would stick with the -clusterid- that you created.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              Dear Carlo,

Thank you so much for your kind suggestion.
