Clustering standard errors and adding variable as covariate?

Cat Santos

Join Date: Dec 2019

Posts: 62
#1

01 Jun 2024, 04:48

Dear Statalist Community,

I have a somewhat silly question, but I hope you can help me clarify it. When running a regression discontinuity model (or any other model, actually), I am clustering my standard errors by the population of the place of the interview to account for within-group correlation:

Code:

rdrobust dv running, c(0) all covs(male y1 y2 y3) vce(cluster pop)

However, my question is: do I also have to add my clustering variable as a covariate? Why/why not?

Thank you very much!

Best,
Cat
George Ford

Join Date: Aug 2014

Posts: 3135
#2

02 Jun 2024, 12:24

Is population fixed and unique within the ID?
Cat Santos

Join Date: Dec 2019

Posts: 62
#3

03 Jun 2024, 03:13

Dear George,

Thank you for your reply.

The data I'm using are a little rough because they come from surveys carried out in the 1990s, so this variable categorizes, from 1 to 6, the population of the place/town where the interview was carried out (basically town size). Apart from country, this is the only regional variable I have. I asked because I don't think controlling for this variable will help: given my identification strategy, it is post-treatment. However, because of the way the data were gathered (targeting certain people in each of these "town sizes"), I believe I should cluster my SEs on this variable, following Abadie et al. 2022. But I'm still unsure whether to also include it as a covariate.

Code:

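. tab pop

population of place of interview |      Freq.     Percent        Cum.
---------------------------------+-----------------------------------
                         capital |      5,604       27.27       27.27
                 100,000-500,000 |      3,190       15.52       42.79
                  50,000-100,000 |      1,831        8.91       51.70
                   20,000-50,000 |      1,573        7.65       59.35
                    5,000-20,000 |      2,194       10.68       70.03
                  rural, village |      6,159       29.97      100.00
---------------------------------+-----------------------------------
                           Total |     20,551      100.00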

Code:

. label list pop
pop:
           1 capital
           2 100,000-500,000
           3 50,000-100,000
           4 20,000-50,000
           5 5,000-20,000
           6 rural, village

Once again, thank you so much for your help!

Last edited by Cat Santos; 03 Jun 2024, 03:22.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17700
#4

03 Jun 2024, 03:23

Cat:
if you have only 6 potential clusters (as I surmise from the results of your -label list pop- code), the resulting standard errors would be misleading (see Cameron_Miller_Cluster_Robust_October152013.pdf, ucdavis.edu).
Therefore, I would add your categorical variable -pop- as -i.pop- in the right-hand side of your regression equation or use -weights-.

Kind regards,
Carlo
(Stata 19.0)
Cat Santos

Join Date: Dec 2019

Posts: 62
#5

03 Jun 2024, 03:45

Dear Carlo,

Thank you very much! Yes, I only have 6 potential clusters. Unfortunately the data is not more detailed. If I add the variable as i.pop (fixed effects) the model doesn't run because my sample is relatively small and I'm already adding country and survey-year FEs. I will try the weights option! Thank you very much.

Lisa Ni Dhu

Join Date: Jun 2024

Posts: 6
#6

10 Jul 2024, 05:32

Hi Carlo, thank you for this. I hope you can help me too, as I am facing a very similar question. I have 7 clusters (schools); the reason for clustering is that the intervention was delivered at school level (n = 3 intervention schools) and the rest are controls (n = 4). The data are at the individual level, clustered within schools. I am using linear regression to estimate the difference-in-differences effect of the intervention one year later with a repeated cross-sectional approach: intervention school A vs. all 4 controls, intervention school B vs. all 4 controls, and intervention school C vs. all 4 controls, and then combining the three estimates in a meta-analysis.

I have tried using schoolname as a categorical variable as suggested, and it results in even larger standard errors than using vce(cluster schoolname). I would be really grateful if you could explain why adding the cluster as a categorical variable is preferable to using vce(cluster schoolname) when the number of clusters is small.

Code:

. regress outcome group##wave i.schoolyr i.gender i.ethnicity i.car i.schoolname
note: 4.schoolname omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =       928
-------------+----------------------------------   F(12, 915)      =      4.29
       Model |  12.2942705        12  1.02452255   Prob > F        =    0.0000
    Residual |   218.46004       915  .238754142   R-squared       =    0.0533
-------------+----------------------------------   Adj R-squared   =    0.0409
       Total |   230.75431       927  .248925901   Root MSE        =    .48862

----------------------------------------------------------------------------------------
             outcome | Coefficient  Std. err.        t   P>|t|      [95% conf. interval]
---------------------+------------------------------------------------------------------
               group |
        Intervention |    .0348927   .0632722     0.55   0.581    -.0892827     .1590681
                     |
                wave |
               After |   -.0272526   .0361301    -0.75   0.451      -.09816     .0436549
                     |
          group#wave |
  Intervention#After |   -.0334142   .0798542    -0.42   0.676    -.1901328     .1233044
                     |
            schoolyr |
             Year 10 |    .0294868   .0322677     0.91   0.361    -.0338405     .0928141
                     |
              gender |
          Male (boy) |    .0224743   .0323593     0.69   0.488    -.0410327     .0859814
               Other |     .067571   .2029443     0.33   0.739    -.3307194     .4658614
                     |
         ethnicity_3 |
               White |    .0462259   .0656334     0.70   0.481    -.0825836     .1750353
            Hispanic |   -.0592477   .0390647    -1.52   0.130    -.1359145     .0174191
                     |
             carhome |
                 Yes |   -.1942087   .0437808    -4.44   0.000    -.2801311    -.1082863
                     |
          schoolname |
             School1 |   -.2038297   .0609188    -3.35   0.001    -.3233865    -.0842729
             School2 |   -.0476183   .0455881    -1.04   0.297    -.1370877      .041851
             School3 |   -.2220704   .0654937    -3.39   0.001    -.3506058    -.0935351
             School4 |           0  (omitted)
                     |
               _cons |    .7642983   .0611347    12.50   0.000     .6443178     .8842788
----------------------------------------------------------------------------------------

. regress dodayswk group##wave schoolyr i.gender i.ethnicity i.car Clusterlevel_var, vce(cluster schoolname)

Linear regression                               Number of obs     =        928
                                                F(3, 4)           =          .
                                                Prob > F          =          .
                                                R-squared         =     0.0342
                                                Root MSE          =       .493

                                      (Std. err. adjusted for 5 clusters in schoolname)
----------------------------------------------------------------------------------------
                     |               Robust
             outcome | Coefficient  std. err.        t   P>|t|      [95% conf. interval]
---------------------+------------------------------------------------------------------
               group |
        Intervention |    .1143638   .0572222     2.00   0.116    -.0445105     .2732381
                     |
                wave |
               After |   -.0122701   .0613022    -0.20   0.851    -.1824722      .157932
                     |
          group#wave |
  Intervention#After |   -.0614294   .0670268    -0.92   0.411    -.2475257     .1246669
                     |
            schoolyr |
             Year 10 |    .0304156   .0196893     1.54   0.197    -.0242506     .0850818
                     |
              gender |
          Male (boy) |    .0253428   .0226121     1.12   0.325    -.0374384     .0881239
               Other |    .0457941   .1315896     0.35   0.745    -.3195572     .4111454
                     |
           ethnicity |
               White |   -.0460159   .0864585    -0.53   0.623    -.2860632     .1940314
            Hispanic |   -.0838729   .0670433    -1.25   0.279    -.2700148      .102269
                     |
                 car |
                 Yes |   -.2116598   .0411314    -5.15   0.007     -.325859    -.0974606
                     |
    Clusterlevel_var |   -.0028147   .0081136    -0.35   0.746    -.0253415     .0197121
               _cons |    .7371518   .2453501     3.00   0.040     .0559507     1.418353
----------------------------------------------------------------------------------------


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17700
#7

10 Jul 2024, 07:44

Lisa:
in your case with 5 clusters only, -i.schoolname- is the way to go.

Kind regards,
Carlo
(Stata 19.0)
Lisa Ni Dhu

Join Date: Jun 2024

Posts: 6
#8

10 Jul 2024, 08:44

Carlo - thank you so much for your guidance. I have looked at the Cameron and Miller reference you suggested for Cat but can't seem to find this option in Section IV. Do you have a reference I could use to justify this to my supervisor?

Also, are the default standard errors acceptable in the case of using -i.schoolname- ?

Thank you again
George Ford

Join Date: Aug 2014

Posts: 3135
#9

10 Jul 2024, 08:48

With few clusters, one option is to cluster and use boottest for hypothesis testing.
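
A minimal sketch of that suggestion, for readers following along: cluster in the regression, then test the interaction term with a wild cluster bootstrap. It assumes the community-contributed -boottest- command is installed from SSC and runs on simulated data. The variable names mimic #6, but the data, the 0/1 coding of group and wave, and the choice of Webb weights (often suggested when clusters are very few) are assumptions for illustration, not anything taken from the posters' data.

```stata
* ssc install boottest          // run once if -boottest- is not yet installed

* Hypothetical simulated data: 7 schools, schools 1-3 treated (assumption)
clear
set seed 12345
set obs 928
generate schoolname = ceil(7 * runiform())
generate group = schoolname <= 3

* School-level shock so that errors are correlated within school
bysort schoolname: generate school_effect = rnormal(0, 0.2) if _n == 1
by schoolname: replace school_effect = school_effect[1]

generate wave = runiform() < 0.5
generate outcome = 0.7 + school_effect + rnormal(0, 0.5)

* Difference-in-differences regression with cluster-robust standard errors
regress outcome i.group##i.wave, vce(cluster schoolname)

* Wild cluster bootstrap test of the interaction (H0: coefficient = 0);
* clustering is picked up from the preceding -regress- command
boottest 1.group#1.wave, weighttype(webb) reps(9999) seed(12345) nograph
```

The reported bootstrap p-value and confidence interval would then be used for inference on the interaction in place of the t-based ones from the -regress- table.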
Lisa Ni Dhu

Join Date: Jun 2024

Posts: 6
#10

10 Jul 2024, 09:10

Thanks for your suggestion - I wasn't aware of this so will look into it.

Could I analyse separately and meta-analyse? That is, School 1 vs. all controls, School 2 vs. all controls and so on, grouping the control schools to overcome potential differences in baseline and trajectories amongst controls, so that there would be no need to account for schoolname as either a covariate or a cluster. Or would that not be appropriate?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17700
#11

11 Jul 2024, 02:47

Lisa:
the following statements are convincing:
1) (page 5 of the quoted reference): "In practice the most difficult complication to deal with can be "few" clusters, see Section VI. There is no clear-cut definition of "few"; depending on the situation "few" may range from less than 20 to less than 50 clusters in the balanced case.";
2) (page 21 of the quoted reference): "Second, ..., is the subject of Section VI."

Kind regards,
Carlo
(Stata 19.0)
George Ford

Join Date: Aug 2014

Posts: 3135
#12

11 Jul 2024, 08:00

I do not recommend #10. It solves nothing.

Your treatment coefficient is nowhere near statistically significant in either of your models. Fiddling with the standard errors isn't going to change that. It looks like there is no effect. While the absence of evidence is not evidence of absence, this model suggests that you can't find an effect with this data and this modeling approach. That is a useful result.
Lisa Ni Dhu

Join Date: Jun 2024

Posts: 6
#13

19 Jul 2024, 07:14

Thank you again Carlo and George. I don't want to artificially find a 'statistically significant' effect; I just want to make sure that the estimates I report are the correct ones. With Carlo's approach of using i.schoolname, are the default SEs still acceptable/correct to report? (As mentioned in the Statalist thread "alternative for clustered standard errors when having too few clusters".)

Thanks again,
Lisa
George Ford

Join Date: Aug 2014

Posts: 3135
#14

19 Jul 2024, 11:24

If you do not use cluster(schoolname) but use the traditional SE, then the SE will likely be incorrect and likely far too small.

If you do cluster on schoolname and there are few clusters, then you need to run boottest after the regression. But if the coefficient is insignificant in the model, boottest is not going to change that (in almost all cases).