  • Clustered Standard Errors in Stata replication in Matlab

    Hi, I am replicating a paper in Matlab that uses Stata. Problem setup: the paper uses, for example, -reg ... i.clustervar, cluster(clustervar)- to obtain POLS estimates with clustered standard errors. The panel data is clustered at the individual (country) level. When Stata omits some of the clustervar dummies for collinearity, it still uses the within-group variance of those omitted groups to calculate the standard errors for the non-omitted coefficients. I stumbled upon this calculation method by pure luck, as I had only been considering the variances of the groups that were not omitted. More precisely, the sum over groups of u*u' inside the asymptotic variance formula runs over all groups, including those whose dummies were omitted. Is this a feature or a bug? If it is a feature, why?

    It seems to me that, from a model-specification view, we should not be using data from omitted variables to calculate standard errors. On the other hand, this method increases the standard errors and arguably makes them more robust(?), so I can see the appeal from that direction. However, if that is the rationale, it seems there would be a positive bias in the standard errors.
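
    For reference, the asymptotic variance formula I have in mind is the standard cluster-robust sandwich (I am assuming the paper and Stata's cluster() use this form):

    \[
    \widehat{V}(\hat\beta) \;=\; (X'X)^{-1}\Big(\sum_{g=1}^{G} X_g'\,\hat u_g \hat u_g'\,X_g\Big)(X'X)^{-1},
    \qquad \hat u_g = Y_g - X_g \hat\beta ,
    \]

    scaled by the finite-sample factor \(\frac{n-1}{n-k}\cdot\frac{G}{G-1}\). My question is whether the sum over g runs over all groups, including those whose dummies were dropped for collinearity, or only over the groups whose dummies were kept, which is what I had assumed at first.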

  • #2
    Perhaps this is a misunderstanding of how clustered standard errors operate.
    It is not about coefficients but about errors, specifically allowing errors to be correlated within specific groups.
    This link may help:
    https://friosavila.github.io/app_met..._metrics7.html
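
    Roughly, the only assumption being relaxed is (my paraphrase, not a quote from the page):

    \[
    \operatorname{Cov}(u_i, u_j \mid X) \text{ left unrestricted when } g(i) = g(j),
    \qquad
    \operatorname{Cov}(u_i, u_j \mid X) = 0 \text{ when } g(i) \neq g(j),
    \]

    so the point estimates are the same as plain OLS and only the variance estimate changes.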

    • #3
      Originally posted by FernandoRios View Post
      Perhaps this is a misunderstanding of how clustered standard errors operate.
      It is not about coefficients but about errors, specifically allowing errors to be correlated within specific groups.
      This link may help:
      https://friosavila.github.io/app_met..._metrics7.html
      Maybe I wasn't clear enough in my post. As stated in your link, errors are assumed to be uncorrelated between groups. When we omit group dummies from the regressors for collinearity, we are admitting to correlation between at least two group variables and potentially other explanatory variables. From a model-specification view, omitting a group dummy for collinearity is admitting that the change in the dependent variable explained by that dummy can be explained by other explanatory variables, including other groups. Thus, I would venture to say we are admitting that the errors between these two groups/variables are highly correlated, and that those groups should NOT be used to calculate standard errors, at the risk of violating the no-correlation-across-groups assumption.

      • #4
        That is an interesting point.
        Perhaps I should include in my blog the assumption... assume the model is correctly specified.
        If you ignore other variables, and they are correlated across groups for some reason, I do think there would be some misspecification in the estimation of standard errors.
        Now, omitting variables due to collinearity is not a problem. The omitted variables simply do not add any additional explanatory power to the model, so they will not affect how standard errors are estimated.
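
        As a quick toy check of that last point (made-up data, not from the paper being replicated): when a dropped column is an exact linear combination of the columns that are kept, the fitted values and residuals do not change, so the clustered "meat" built from those residuals does not change either.

        % toy check: dropping a perfectly collinear column leaves residuals unchanged
        rng(1);
        n  = 100;
        x1 = randn(n,1);
        x2 = randn(n,1);
        x3 = x1 + x2;                 % exactly collinear with x1 and x2
        y  = 1 + 2*x1 - x2 + randn(n,1);

        Xfull = [ones(n,1) x1 x2 x3]; % rank-deficient design
        Xdrop = [ones(n,1) x1 x2];    % collinear column removed

        bfull = pinv(Xfull)*y;        % minimum-norm least-squares fit
        bdrop = Xdrop\y;

        % residuals agree up to numerical noise
        max(abs((y - Xfull*bfull) - (y - Xdrop*bdrop)))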

        • #5
          Just venturing an idea here, but it seems like two things are happening. Some levels of cluster are omitted from the fixed effects, indicating that no additional information exists in the model to predict a point estimate for those omitted clusters. But, I don’t see how this should affect the variance estimation. Wouldn’t you just have some clusters having the same intracluster correlation estimate?

          • #6
            Originally posted by Leonardo Guizzetti View Post
            Just venturing an idea here, but it seems like two things are happening. Some levels of cluster are omitted from the fixed effects, indicating that no additional information exists in the model to predict a point estimate for those omitted clusters. But, I don’t see how this should affect the variance estimation. Wouldn’t you just have some clusters having the same intracluster correlation estimate?
            This is also what I was thinking at first. However, I compared some of the standard errors and, ceteris paribus, some of them were up to 50% smaller when using only non-omitted groups to calculate the "meat" of the sandwich estimator. I imagine I should post a sample of the (not perfect) MATLAB code I am writing to make sure I am correctly replicating the procedure. This is my POLS function, which returns a structure with coefficients, standard errors, p-values, and n. To translate my concern: gdum here includes ALL group dummy column indexes in X, while the alternative is to include only the non-omitted group dummy indexes in gdum.

            %% POLS Cluster Function

            function pols = pols(Y,X,gdum,omit)

            % pooled OLS with clustered standard errors
            % Y:    dependent var
            % X:    explanatory vars including ALL group dummies
            % gdum: vector of column indexes for ALL group dummies in X
            % omit: vector of column indexes for variables to omit from X

            % remove observations with missing data
            in_sample = [X Y];
            drop = any(ismissing(in_sample),2);
            X(drop,:) = [];
            Y(drop,:) = [];

            Xfull = X;      % Xfull keeps the group dummies that will be omitted
            X(:,omit) = []; % X drops the omitted variables before estimation

            k = size(X,2);
            n = size(Y,1);

            b = (X'*X)\(X'*Y); % POLS coefficients

            V1 = zeros(k,k);   % "meat" of the sandwich estimator
            count = 0;
            % calculating the clustered variance:
            % note: Stata's cluster() still accumulates the meat over omitted groups.
            for i = 1:numel(gdum)
                Xg = X(Xfull(:,gdum(i))==1,:); % X obs in group i (columns exclude omitted
                                               % vars, but rows still come from omitted groups)
                Yg = Y(Xfull(:,gdum(i))==1,:); % same rows for Y
                Ug = Yg - Xg*b;                % within-group residuals
                Vg = Xg'*(Ug*Ug')*Xg;          % group contribution to the meat
                if all(Vg(:)==0)
                    count = count + 1;         % Stata excludes these from the df adjustment
                end
                V1 = V1 + Vg;
            end
            V = (X'*X)\V1/(X'*X);              % sandwich: inv(X'X) * meat * inv(X'X)

            g = numel(gdum) - count;           % number of clusters, excluding zero-variance groups
            df = ((n-1)/(n-k))*(g/(g-1));      % Stata's finite-sample adjustment

            se = sqrt(df*diag(V));

            tstat = b./se;
            pvalue = 2*tcdf(abs(tstat), n-k, "upper");

            pols.b = b;
            pols.se = se;
            pols.n = n;
            pols.p = pvalue;

            end
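
            In case it helps, this is how I would call it on made-up data (variable names and numbers are purely illustrative, not from the paper I am replicating):

            % toy usage sketch: hypothetical balanced panel with G countries, T periods
            rng(0);
            G = 10; T = 20; n = G*T;
            id = kron((1:G)', ones(T,1)); % cluster (country) identifier
            D  = double(id == (1:G));     % all G group dummies
            x  = randn(n,1);
            y  = 0.5*x + D*randn(G,1) + randn(n,1);

            X    = [ones(n,1) x D]; % constant, regressor, ALL group dummies
            gdum = 3:(2+G);         % column indexes of the group dummies in X
            omit = 2+G;             % drop one dummy (collinear with the constant)

            res = pols(y, X, gdum, omit);
            disp([res.b res.se res.p])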
