  • Clustering standard errors and adding variable as covariate?

    Dear Statalist Community,

    I have a somewhat silly question, but I hope you can help me clarify it. When running a regression discontinuity design (or any other model, actually), I cluster my standard errors by the population of the place of the interview to account for within-group correlation:

    Code:
    rdrobust dv running, c(0) all covs(male y1 y2 y3) vce(cluster pop)
    However, my question is: do I also have to add my clustering variable as a covariate? Why/why not?

    Thank you very much!

    Best,
    Cat




  • #2
    Is population fixed and unique within the ID?

    • #3
      Dear George,

      Thank you for your reply.

      The data I'm using is a little rough because it's surveys carried out in the 1990s, and so this variable categorizes from 1 to 6 the population in the place/town where the interview was carried out (basically town size). In addition to country-level, this is the only regional variable I have. I was asking this question because I don't think controlling for this variable will help, as this variable is post-treatment, taking my identification strategy into account. However, because of the way in which data was gathered (targeting certain people in each of these "town sizes") I believe I should cluster my SEs using this variable, following Abadie et al. 2022. But I'm still unsure about also including it or not as a covariate.


      Code:
      . tab pop

         population of |
              place of |
             interview |      Freq.     Percent        Cum.
      -----------------+-----------------------------------
               capital |      5,604       27.27       27.27
       100,000-500,000 |      3,190       15.52       42.79
        50,000-100,000 |      1,831        8.91       51.70
         20,000-50,000 |      1,573        7.65       59.35
          5,000-20,000 |      2,194       10.68       70.03
        rural, village |      6,159       29.97      100.00
      -----------------+-----------------------------------
                 Total |     20,551      100.00
      Code:
      . label list pop
      pop:
                 1 capital
                 2 100,000-500,000
                 3 50,000-100,000
                 4 20,000-50,000
                 5 5,000-20,000
                 6 rural, village
      Once again, thank you so much for your help!
      Last edited by Cat Santos; 03 Jun 2024, 03:22.

      • #4
        Cat:
        if you have only 6 potential clusters (as I surmise from the results of your -label list pop- code), the resulting cluster-robust standard errors would be misleading (see Cameron_Miller_Cluster_Robust_October152013.pdf (ucdavis.edu)).
        Therefore, I would add your categorical variable -pop- as -i.pop- in the right-hand side of your regression equation or use -weights-.
        Kind regards,
        Carlo
        (Stata 19.0)
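
        A minimal sketch of one way the suggestion above could be written for the -rdrobust- call in #1, assuming the town-size indicators are built by hand and listed explicitly in -covs()-. The stub name popd is hypothetical, and dropping -vce(cluster pop)- is an assumption about what "instead of clustering" means here. Whether conditioning on -pop- is advisable given the post-treatment concern raised in #3 is a separate question.

        ```stata
        * Hypothetical stub name popd: creates indicators popd1 ... popd6,
        * one per category of pop (1 = capital, ..., 6 = rural, village)
        tabulate pop, generate(popd)

        * Same specification as in #1, with the town-size indicators added as
        * covariates; popd1 (capital) is left out as the base category
        rdrobust dv running, c(0) all covs(male y1 y2 y3 popd2 popd3 popd4 popd5 popd6)
        ```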

        • #5
          Dear Carlo,

          Thank you very much! Yes, I only have 6 potential clusters. Unfortunately the data is not more detailed. If I add the variable as i.pop (fixed effects) the model doesn't run because my sample is relatively small and I'm already adding country and survey-year FEs. I will try the weights option! Thank you very much.

          • #6
            Hi Carlo, thank you for this and I hope you can also help me. I am having a very similar debate, as I have 7 clusters (schools). The reason for the clustering is that the intervention was delivered at school level (n = 3) and the rest are controls (n = 4); the data are at individual level, clustered within schools. I am using a linear regression to estimate the difference-in-differences effect of the intervention one year later using a repeated cross-sectional approach (intervention school A vs. all 4 controls, intervention school B vs. all 4 controls, intervention school C vs. all 4 controls) and then using a meta-analysis to combine the estimates.


            I have tried to use schoolname as a categorical variable as suggested, and it results in even larger standard errors than using vce(cluster schoolname). I would be really grateful if you could explain why adding the cluster as a categorical variable is preferable to using vce(cluster schoolname) when the number of clusters is small.


            Code:
            . regress outcome group##wave i.schoolyr i.gender i.ethnicity i.car i.schoolname
            note: 4.schoolname omitted because of collinearity.

                  Source |       SS           df       MS      Number of obs   =       928
            -------------+----------------------------------   F(12, 915)      =      4.29
                   Model |  12.2942705        12  1.02452255   Prob > F        =    0.0000
                Residual |   218.46004       915  .238754142   R-squared       =    0.0533
            -------------+----------------------------------   Adj R-squared   =    0.0409
                   Total |   230.75431       927  .248925901   Root MSE        =    .48862

            --------------------------------------------------------------------------------------
                        outcome | Coefficient  Std. err.      t     P>|t|     [95% conf. interval]
            --------------------+-----------------------------------------------------------------
                          group |
                   Intervention |    .0348927   .0632722     0.55   0.581    -.0892827    .1590681
                                |
                           wave |
                          After |   -.0272526   .0361301    -0.75   0.451      -.09816    .0436549
                                |
                     group#wave |
             Intervention#After |   -.0334142   .0798542    -0.42   0.676    -.1901328    .1233044
                                |
                       schoolyr |
                        Year 10 |    .0294868   .0322677     0.91   0.361    -.0338405    .0928141
                                |
                         gender |
                     Male (boy) |    .0224743   .0323593     0.69   0.488    -.0410327    .0859814
                          Other |     .067571   .2029443     0.33   0.739    -.3307194    .4658614
                                |
                    ethnicity_3 |
                          White |    .0462259   .0656334     0.70   0.481    -.0825836    .1750353
                       Hispanic |   -.0592477   .0390647    -1.52   0.130    -.1359145    .0174191
                                |
                        carhome |
                            Yes |   -.1942087   .0437808    -4.44   0.000    -.2801311   -.1082863
                                |
                     schoolname |
                        School1 |   -.2038297   .0609188    -3.35   0.001    -.3233865   -.0842729
                        School2 |   -.0476183   .0455881    -1.04   0.297    -.1370877     .041851
                        School3 |   -.2220704   .0654937    -3.39   0.001    -.3506058   -.0935351
                        School4 |           0  (omitted)
                                |
                          _cons |    .7642983   .0611347    12.50   0.000     .6443178    .8842788
            --------------------------------------------------------------------------------------
                    
            
            . regress dodayswk group##wave schoolyr i.gender i.ethnicity i.car Clusterlevel_var, vce(cluster schoolname)

            Linear regression                               Number of obs     =        928
                                                            F(3, 4)           =          .
                                                            Prob > F          =          .
                                                            R-squared         =     0.0342
                                                            Root MSE          =       .493

                                      (Std. err. adjusted for 5 clusters in schoolname)
            --------------------------------------------------------------------------------------
                                |               Robust
                        outcome | Coefficient  std. err.      t     P>|t|     [95% conf. interval]
            --------------------+-----------------------------------------------------------------
                          group |
                   Intervention |    .1143638   .0572222     2.00   0.116    -.0445105    .2732381
                                |
                           wave |
                          After |   -.0122701   .0613022    -0.20   0.851    -.1824722     .157932
                                |
                     group#wave |
             Intervention#After |   -.0614294   .0670268    -0.92   0.411    -.2475257    .1246669
                                |
                       schoolyr |
                        Year 10 |    .0304156   .0196893     1.54   0.197    -.0242506    .0850818
                                |
                         gender |
                     Male (boy) |    .0253428   .0226121     1.12   0.325    -.0374384    .0881239
                          Other |    .0457941   .1315896     0.35   0.745    -.3195572    .4111454
                                |
                      ethnicity |
                          White |   -.0460159   .0864585    -0.53   0.623    -.2860632    .1940314
                       Hispanic |   -.0838729   .0670433    -1.25   0.279    -.2700148     .102269
                                |
                            car |
                            Yes |   -.2116598   .0411314    -5.15   0.007     -.325859   -.0974606
                                |
               Clusterlevel_var |   -.0028147   .0081136    -0.35   0.746    -.0253415    .0197121
                          _cons |    .7371518   .2453501     3.00   0.040     .0559507    1.418353
            --------------------------------------------------------------------------------------

            • #7
              Lisa:
              in your case with 5 clusters only, -i.schoolname- is the way to go.
              Kind regards,
              Carlo
              (Stata 19.0)

              • #8
                Carlo - thank you so much for your guidance. I have looked at the Cameron and Miller reference you suggested for Cat but can't seem to find this option in Section IV. Do you have a reference I could use to justify this to my supervisor?

                Also, are the default standard errors acceptable in the case of using -i.schoolname- ?

                Thank you again

                • #9
                  With few clusters, one option is to cluster and use boottest for hypothesis testing.
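
                  A minimal sketch of what this could look like for the model in #6, assuming the community-contributed -boottest- package is installed (ssc install boottest) and that group and wave are coded 0/1, so that the difference-in-differences term is 1.group#1.wave. The coding, the seed, and the number of replications are assumptions, not taken from the thread.

                  ```stata
                  * Run from a do-file (uses /// line continuation and // comments).
                  * Assumes boottest is installed: ssc install boottest
                  * Assumes group and wave are 0/1 indicators (hypothetical coding)
                  set seed 12345    // arbitrary seed, for reproducibility only

                  regress outcome i.group##i.wave i.schoolyr i.gender i.ethnicity i.car, ///
                      vce(cluster schoolname)

                  * Wild cluster bootstrap test of H0: interaction coefficient = 0.
                  * Webb six-point weights are often suggested with very few clusters.
                  boottest 1.group#1.wave, weighttype(webb) reps(9999)
                  ```

                  -boottest- reports a bootstrap p-value and confidence set for the tested coefficient; the point estimate from -regress- is unchanged.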

                  • #10
                    Thanks for your suggestion - I wasn't aware of this so will look into it.

                    Could I analyse separately and meta-analyse? That is, School 1 vs all controls, School 2 vs all controls and so on, grouping the control schools to overcome potential differences in baseline and trajectories amongst controls. There would then be no need to account for schoolname as either a covariate or a cluster. Or would that not be appropriate?

                    • #11
                      Lisa:
                      the following statements are convincing: 1) (page 5 of the quoted reference): "In practice the most difficult complication to deal with can be "few" clusters, see Section VI. There is no clear-cut definition of "few"; depending on the situation "few" may range from less than 20 to less than 50 clusters in the balanced case.";
                      2) (page 21 of the quoted reference): "Second, ..., is the subject of Section VI."
                      Kind regards,
                      Carlo
                      (Stata 19.0)

                      • #12
                        I do not recommend #10. It solves nothing.

                        Your treatment coefficient is nowhere near statistically significant in either of your models. Fiddling with the standard errors isn't going to change that. It looks like no effect. While the absence of evidence is not evidence of absence, this model suggests that you can't find an effect with this data and this modeling approach. A useful result.

                        • #13
                          Thank you again Carlo and George. I don't want to artificially find a 'statistically significant' effect; I just want to make sure that the estimates I report are the correct ones. With Carlo's approach of using i.schoolname, are the default SEs still acceptable/correct to report? (As mentioned here: "alternative for clustered standard errors when having too few clusters" - Statalist.)

                          Thanks again,
                          Lisa

                          • #14
                            If you do not use vce(cluster schoolname) but use the traditional SEs, then the SEs will likely be incorrect and likely far too small.

                            If you do cluster on schoolname and there are few clusters, then you need to run boottest after the regression. But if the coefficient is insignificant in the model, boottest is not going to change that (in almost all cases).
