Hello Statalisters,
First the question, because it can possibly be answered without regard to my specific case:
Is there a need to use vce(cluster) even if I can detect no correlation between the residuals and the cluster variable?
In "Microeconometrics Using Stata" by Cameron and Trivedi it is said that "Cluster-robust standard errors must be used when data are clustered" [p. 83]. Which, I would say, applies in my case.
On the other hand, the Stata manual has an entry in which clustering is suggested but ultimately discarded: "20.21.2 Correlated errors: cluster–robust standard errors".
************************************************** ************************************************** **************************************************
Now my specific case:
I'm regressing the logarithm of charitable donations on a set of independent variables. My data consist of two combined cross-sectional datasets: one is the German Socio-Economic Panel (SOEP) and the other is a survey among German people considered to be "very rich". The latter is named hvid. I use the hvid dataset to offset the undersampling of rich people in my first sample.
Indeed, the overlap in terms of wealth and donations between the two samples is quite small (around 10 persons out of 13,500 in the combined dataset).
Thus, it seemed obvious to me to use the vce(cluster hvid) option when running my regression. The cluster variable is a dummy variable indicating which original dataset an observation comes from (the hvid dataset or the Socio-Economic Panel). It takes the value 1 if the person came from the hvid dataset and 0 otherwise.
My standard errors become very small when I use this option. The Stata FAQs state that this can happen "when the intracluster correlations are negative".
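To make the comparison concrete, here is a minimal sketch of how the standard errors can be put side by side. It does not use my data: it assumes Stata's shipped auto dataset as a stand-in, with foreign (0/1) playing the role of the hvid dummy, and the regressors and the stored-estimate names (m_ols, m_rob, m_clu) are chosen purely for illustration.

```stata
* Stand-in data (assumption): Stata's shipped auto dataset.
* foreign (0/1) plays the role of the two-valued hvid dummy.
sysuse auto, clear

* Same model with default, robust, and cluster-robust standard errors
regress price mpg weight
estimates store m_ols

regress price mpg weight, vce(robust)
estimates store m_rob

regress price mpg weight, vce(cluster foreign)
estimates store m_clu

* Coefficients are identical across the three fits; only the
* standard errors differ
estimates table m_ols m_rob m_clu, b(%9.3f) se(%9.3f)
```

The table shows each coefficient with its standard error underneath, so any shrinkage under vce(cluster) is visible directly.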
I tried to check for correlation between the residuals and the hvid dummy, using a t-test and a boxplot:
T-Test:
Boxplot:
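The check described above can be sketched roughly as follows. Again, this is not my actual code or data: it assumes the shipped auto dataset as a stand-in, with foreign in place of the hvid dummy, and resid is a hypothetical variable name.

```stata
* Stand-in data (assumption): foreign (0/1) replaces the hvid dummy
sysuse auto, clear

* Fit the model by OLS and save the residuals
regress price mpg weight
predict double resid, residuals

* Compare mean residuals across the two groups
ttest resid, by(foreign)

* Compare the residual distributions visually
graph box resid, over(foreign)
```

Note that both commands compare the level and spread of the residuals between the two groups; they do not measure how residuals are correlated with each other within a group.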
And I would say that there is no correlation between the residuals and the cluster variable. If there were one, I would expect the residuals to be, for example, higher in the HViD sample than in the SOEP sample.
One could argue that there must be some kind of correlation (a negative one), because that is what makes the standard errors so small. But before I found the explanation in the Stata FAQs, I really had no idea how to explain the small standard errors. So there may be yet another reason for them that I just haven't found (if someone could point me to an explanation, I would be very happy).
Thanks in advance for your feedback!
Caspar
PS: I usually use a Heckman regression to correct for selection bias. I use OLS here because it is convenient to have actual residuals that can be compared, and because the OLS and Heckman coefficients are very close, so I believe OLS is not far off.