  • Clustered Standard Errors in Stata replication in Matlab

    Hi, I am replicating a paper in Matlab that uses Stata. Problem setup: the paper uses, for example, -reg ... i.clustervar, cluster(clustervar)- to obtain POLS estimates with clustered standard errors. The panel data is clustered at the individual (country) level. When Stata omits some of the clustervar dummies for collinearity, it still uses the within-group variance of those omitted groups to calculate the standard errors for the non-omitted coefficients. I stumbled upon this calculation method by pure luck, as I had only been considering the variances of the groups that were not omitted. More precisely, the sum over groups of u*u' inside the asymptotic variance formula runs over all groups, including those whose dummies were omitted. Is this a feature or a bug? If it is a feature, why?

    It seems to me that, from a model-specification view, we should not be using data from omitted variables to calculate standard errors. On the other hand, this method increases the standard errors and arguably makes them more robust(?), so I can see the appeal from that direction. However, if that is the rationale, it seems there would be a positive bias in the standard errors.
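
    For reference, the asymptotic variance formula I have in mind is the standard cluster-robust sandwich (I am assuming the paper and Stata's cluster() use this form):

    \[
    \widehat{V}(\hat\beta) \;=\; (X'X)^{-1}\Big(\sum_{g=1}^{G} X_g'\,\hat u_g \hat u_g'\,X_g\Big)(X'X)^{-1},
    \qquad \hat u_g = Y_g - X_g \hat\beta ,
    \]

    scaled by the finite-sample factor \(\frac{n-1}{n-k}\cdot\frac{G}{G-1}\). My question is whether the sum over g runs over all groups, including those whose dummies were dropped for collinearity, or only over the groups whose dummies were kept, which is what I had assumed at first.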

  • #2
    Perhaps this is a misunderstanding of how clustered standard errors operate.
    It is not about coefficients but about errors, specifically allowing errors to be correlated within specific groups.
    This link may help:
    https://friosavila.github.io/app_met..._metrics7.html
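
    Roughly, the only assumption being relaxed is (my paraphrase, not a quote from the page):

    \[
    \operatorname{Cov}(u_i, u_j \mid X) \text{ left unrestricted when } g(i) = g(j),
    \qquad
    \operatorname{Cov}(u_i, u_j \mid X) = 0 \text{ when } g(i) \neq g(j),
    \]

    so the point estimates are the same as plain OLS and only the variance estimate changes.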

    • #3
      Originally posted by FernandoRios View Post
      Perhaps this is a misunderstanding of how clustered standard errors operate.
      It is not about coefficients but about errors, specifically allowing errors to be correlated within specific groups.
      This link may help:
      https://friosavila.github.io/app_met..._metrics7.html
      Maybe I wasn't clear enough in my post. As stated in your link, errors are assumed to be uncorrelated between groups. When we omit group dummies from the regressors for collinearity, we are admitting to correlation between at least two group variables and potentially other explanatory variables. From a model-specification view, omitting a group dummy for collinearity is admitting that the change in the dependent variable explained by that dummy can be explained by other explanatory variables, including other groups. Thus, I would venture to say we are admitting that the errors between these two groups/variables are highly correlated, and that those groups should NOT be used to calculate standard errors, at the risk of violating the no-correlation-across-groups assumption.

      • #4
        That is an interesting point.
        Perhaps I should include in my blog the assumption... assume the model is correctly specified.
        If you ignore other variables, and they are correlated across groups for some reason, I do think there would be some misspecification in the estimation of standard errors.
        Now, omitting variables due to collinearity is not a problem. The omitted variables simply do not add any additional explanatory power to the model, so they will not affect how standard errors are estimated.
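
        As a quick toy check of that last point (made-up data, not from the paper being replicated): when a dropped column is an exact linear combination of the columns that are kept, the fitted values and residuals do not change, so the clustered "meat" built from those residuals does not change either.

        % toy check: dropping a perfectly collinear column leaves residuals unchanged
        rng(1);
        n  = 100;
        x1 = randn(n,1);
        x2 = randn(n,1);
        x3 = x1 + x2;                 % exactly collinear with x1 and x2
        y  = 1 + 2*x1 - x2 + randn(n,1);

        Xfull = [ones(n,1) x1 x2 x3]; % rank-deficient design
        Xdrop = [ones(n,1) x1 x2];    % collinear column removed

        bfull = pinv(Xfull)*y;        % minimum-norm least-squares fit
        bdrop = Xdrop\y;

        % residuals agree up to numerical noise
        max(abs((y - Xfull*bfull) - (y - Xdrop*bdrop)))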

        • #5
          Just venturing an idea here, but it seems like two things are happening. Some levels of cluster are omitted from the fixed effects, indicating that no additional information exists in the model to predict a point estimate for those omitted clusters. But, I don’t see how this should affect the variance estimation. Wouldn’t you just have some clusters having the same intracluster correlation estimate?

          • #6
            Originally posted by Leonardo Guizzetti View Post
            Just venturing an idea here, but it seems like two things are happening. Some levels of cluster are omitted from the fixed effects, indicating that no additional information exists in the model to predict a point estimate for those omitted clusters. But, I don’t see how this should affect the variance estimation. Wouldn’t you just have some clusters having the same intracluster correlation estimate?
            This is also what I was thinking at first. However, I compared some of the standard errors and, ceteris paribus, some of them were up to 50% smaller when using only non-omitted groups to calculate the "meat" of the sandwich estimator. I imagine I should post a sample of the (not perfect) MATLAB code I am writing to make sure I am correctly replicating the procedure. This is my POLS function, which returns a structure with coefficients, standard errors, p-values, and n. To translate my concern: gdum here includes ALL group dummy column indexes in X, while the alternative is to include only the non-omitted group dummy indexes in gdum.

            %% POLS Cluster Function

            function pols = pols(Y,X,gdum,omit)

            % pooled OLS with clustered standard errors
            % Y:    dependent var
            % X:    explanatory vars including ALL group dummies
            % gdum: vector of column indexes for ALL group dummies in X
            % omit: vector of column indexes for variables to omit from X

            % remove observations with missing data
            in_sample = [X Y];
            drop = any(ismissing(in_sample),2);
            X(drop,:) = [];
            Y(drop,:) = [];

            Xfull = X;      % Xfull keeps the group dummies that will be omitted
            X(:,omit) = []; % X drops the omitted variables before estimation

            k = size(X,2);
            n = size(Y,1);

            b = (X'*X)\(X'*Y); % POLS coefficients

            V1 = zeros(k,k);   % "meat" of the sandwich estimator
            count = 0;
            % calculating the clustered variance:
            % note: Stata's cluster() still accumulates the meat over omitted groups.
            for i = 1:numel(gdum)
                Xg = X(Xfull(:,gdum(i))==1,:); % X obs in group i (columns exclude omitted
                                               % vars, but rows still come from omitted groups)
                Yg = Y(Xfull(:,gdum(i))==1,:); % same rows for Y
                Ug = Yg - Xg*b;                % within-group residuals
                Vg = Xg'*(Ug*Ug')*Xg;          % group contribution to the meat
                if all(Vg(:)==0)
                    count = count + 1;         % Stata excludes these from the df adjustment
                end
                V1 = V1 + Vg;
            end
            V = (X'*X)\V1/(X'*X);              % sandwich: inv(X'X) * meat * inv(X'X)

            g = numel(gdum) - count;           % number of clusters, excluding zero-variance groups
            df = ((n-1)/(n-k))*(g/(g-1));      % Stata's finite-sample adjustment

            se = sqrt(df*diag(V));

            tstat = b./se;
            pvalue = 2*tcdf(abs(tstat), n-k, "upper");

            pols.b = b;
            pols.se = se;
            pols.n = n;
            pols.p = pvalue;

            end
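
            In case it helps, this is how I would call it on made-up data (variable names and numbers are purely illustrative, not from the paper I am replicating):

            % toy usage sketch: hypothetical balanced panel with G countries, T periods
            rng(0);
            G = 10; T = 20; n = G*T;
            id = kron((1:G)', ones(T,1)); % cluster (country) identifier
            D  = double(id == (1:G));     % all G group dummies
            x  = randn(n,1);
            y  = 0.5*x + D*randn(G,1) + randn(n,1);

            X    = [ones(n,1) x D]; % constant, regressor, ALL group dummies
            gdum = 3:(2+G);         % column indexes of the group dummies in X
            omit = 2+G;             % drop one dummy (collinear with the constant)

            res = pols(y, X, gdum, omit);
            disp([res.b res.se res.p])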
