
  • Is clustering a necessity?

    Hello Statalisters,

First the question, because it can possibly be answered without reference to my specific case:

Is there a need to use vce(cluster) even if I can detect no correlation between the residuals and the cluster variable?

    In "Microeconometrics Using Stata, Cameron and Trivedi" it is said that
    Cluster-robust standard errors must be used when data are clustered.
    [p. 83]. Which, i would say, is the case in my case.
    On the other hand, the stata manual has an entry, where clustering is suggested, but ultimately discarded: "20.21.2 Correlated errors: cluster–robust standard errors".

**************************************************
    Now my specific case:

I'm regressing the logarithm of charitable donations on a set of independent variables. My data consist of two combined cross-sectional datasets: one is the German Socio-Economic Panel (SOEP) and the other is a survey among German people considered to be "very rich". The latter is named HViD. I use the HViD dataset to offset the undersampling of rich people in my first sample.
Indeed, the overlap in terms of wealth and donations between the two samples is quite small (around 10 persons out of 13,500 in the combined dataset).

Thus, it seemed obvious to me to use the vce(cluster hvid) option when running my regression. The cluster variable is a dummy for the original dataset of each observation (the HViD dataset or the Socio-Economic Panel): it takes the value 1 if the person comes from the HViD dataset and 0 otherwise.

My standard errors become very small when I use this option. The Stata FAQs state that this can happen "when the intracluster correlations are negative".

I tried to check for correlation between the residuals and the rich variable using a t-test and a box plot:

    T-Test:
[attachment: image_4167.png (t-test output)]

    Boxplot:
[attachment: resids_box.png (box plot of residuals by group)]


I would say that there is no correlation between the residuals and the cluster variable. If there were one, I would expect the residuals to be, for example, higher in the HViD sample than in the SOEP sample.

One could argue that there must be some kind of correlation (a negative one), because that is what causes the standard errors to become so small. But before I found the explanation in the Stata FAQs mentioned above, I really had no idea how to explain the small standard errors. So there might still be another reason for the small standard errors that I just haven't found yet (if someone could point me to an explanation, I would be very happy).

    Thanks in advance for your feedback!

    Caspar

PS: I usually use a Heckman regression to correct for the selection bias. I use OLS here because it is nice to have actual residuals that one can compare, and because the coefficients of the OLS and Heckman models are very close, so I believe the OLS is not far off.
    Last edited by Caspar Aumueller; 17 Feb 2016, 10:06.

  • #2
Cross-posted on Cross Validated. The original question was posted on Statalist on 17 Feb 2016.

    • #3
      Caspar:
      you should give us some more details concerning your OLS (i.e. posting what you typed and, especially in your case, what Stata gave you back, as per FAQ). Thanks.
At first glance, the small standard errors may well be due to a large N.
      Kind regards,
      Carlo
      (Stata 19.0)

      • #4
        Hello Carlo,

I'm sorry I didn't add this right from the start, but as there is no specific problem with my code, I thought it would just be overload.
I'm using a loop to go through several estimation methods (OLS, probit, logit, tobit, Heckman). This is what happens if I run the OLS:
        Code:
        *** preparation ***
        #delimit ;
        global main = "c.wealth_norm##(c.wealth_norm#c.wealth_norm)
               ib0.soc_eng_2011 ib0.child_dum ib0.married c.age_norm##c.age_norm
               ib0.unempl ib1.sex ib3.casmin_comp";
        
        global income = "c.netinc_norm##(c.netinc_norm#c.netinc_norm)";
        
        global selection = "ib0.married ib0.unempl ib0.retired ib0.position";
        
        #delimit cr
        
        *** regression ***
        regress lgivings ib0.hvid##($main $income $income $selection) ///
                        if givings_dum == 1, vce(cluster hvid)
        
        *** predict ***
        predict latent if e(sample), xb
        predict regressres if e(sample), residuals
        
        gr box regressres, over(hvid)
        ttest regressres, by(hvid)
Results of the regress command. I'm sorry if the image is still too large; I tried both manual rescaling and the size options (small, medium, etc.), but it seems to jump back every time, and I'm quite tired right now.
[attachment: Regress.png (regression output)]
        Last edited by Caspar Aumueller; 21 Feb 2016, 13:12. Reason: Edited regression results.

        • #5
          Your problem here, or at least a big part of it, is the use of vce(cluster hvid). As the output tells you, you have only two such clusters. The cluster robust estimator only works well when the number of clusters is large. While experts differ on how large is large enough, I don't think anyone would say that 2 is adequate. In general, results using vce(cluster) are not reliable with small numbers of clusters.

In your case, however, the problem bites even harder because your model has so many variables. With only two clusters, you only have 1 degree of freedom for estimation. Notice that in the header your model's F statistic has 0 ndf and 1 ddf! That is why no value is reported for it, and this accounts for the bizarre standard errors of many of the variables. You are trying to estimate 44 parameters with only 1 degree of freedom. I'm surprised this doesn't look even worse than it does.
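You can see this directly in the stored results. A minimal sketch, assuming the regression from #4 has just been run:

Code:
* after -regress ..., vce(cluster hvid)-
display e(N_clust)   // number of clusters: only 2 here
display e(df_r)      // residual df with vce(cluster) = number of clusters - 1 = 1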

          In addition, even if all else were fine, this is an incredibly complicated model: you have a four-way interaction term and several three-way interactions as well. How will you be able to interpret the results of all this? Most people have difficulty wrapping their minds around two-way interactions. Unless these interactions are there to just provide very fine-grained adjustment for nuisance variables that might confound the hvid <-> lgivings relationship, I think you are in for a world of difficulty explaining this to anybody.
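If you do keep interactions, interpreting them essentially means reading everything off -margins- rather than off the coefficient table. A minimal sketch, assuming the regression from #4 is in memory:

Code:
* marginal effect of normalized wealth at each level of hvid (illustration only)
margins hvid, dydx(wealth_norm)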

          Finally, a couple of pieces of advice about the use of macros (that have no bearing on the current difficulties). First, use local in preference to global: if any program called by any program called by any program.... also uses a global macro named main, select, or income, then you have a name clash and your global will overwrite that one, or that one will overwrite yours. The results can be entirely unpredictable, and debugging can be a nightmare as you don't even necessarily know what programs are being called several layers down during an analysis. Global macros should be reserved for the rare situation where it is absolutely crucial that a piece of information be available up and down the program chain and it is not possible or intolerably cumbersome to pass the information in program arguments or variables. (And when using a global, it is usually a good idea to give it a name that is unlikely to be used by anyone else for anything else.) Also, for the particular macro definitions you are using, neither the equals sign nor the quotation marks are needed. The quotation marks do no harm. Neither does the equals sign in this instance, but in situations where the text you are putting into the macro is longer than the allowable length of a string variable, it can lead to truncation--which is another thing that can produce strange results that are hard to debug.
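To make that concrete, here is a minimal sketch of the same kind of definitions written as locals, with neither the equals sign nor the quotation marks (variable names taken from your post above):

Code:
* locals instead of globals; no -=- and no quotation marks needed for simple text substitution
local main   ib0.soc_eng_2011 ib0.child_dum ib0.married c.age_norm##c.age_norm
local income c.netinc_norm##(c.netinc_norm#c.netinc_norm)
regress lgivings `main' `income' if givings_dum == 1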

          • #6
            Caspar:
as a footnote to Clyde's superb advice: an unreported F-test value is usually an alarm bell that calls for a comprehensive data check.
            Kind regards,
            Carlo
            (Stata 19.0)

            • #7
Thank you both very much, although this is not happy news ^^.
I used the globals on purpose because I have another do-file that should estimate the exact same model, and I didn't want to have to change the variable list in both do-files. I'm sure there are no other calls to those macros anywhere in the code (all of it is mine), but I will consider changing back to locals to be absolutely sure.
I didn't know about the equals sign and the quotation marks (obviously) - thanks again.

Now to my main problem: I thought that the missing F-test was a result of the clustering. The moment I turn off the clustering, the F-test is reported. It also doesn't seem to matter which cluster variable I use:
household (4,063 clusters), wealth deciles (10 clusters), or data source (the one I used in the regression above). That's why I asked whether the clustering is necessary.

Model size: you are of course right. I only started with it because these are the variables the literature suggests as right-hand-side variables, and the interaction with hvid seems obvious to me, as I would suppose that very rich people donate in a different way, e.g. because they attend charity events and share a social code that changes their attitude towards donating (Why the wealthy give: A study of elite philanthropy in New York City; Ostrower, Francie).

I include the variables from the global/local "selection" because I want to show that they have no significant influence on ln(donations), so that I can safely use them as selection variables in the Heckman regression.
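For reference, a minimal sketch of the kind of Heckman specification I mean (not the exact command from my do-file), with the variables from "selection" in the selection equation:

Code:
* outcome equation for log donations; selection equation from the "selection" list
heckman lgivings c.wealth_norm##c.wealth_norm c.netinc_norm ib1.sex, ///
        select(givings_dum = ib0.married ib0.unempl ib0.retired ib0.position)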

I dropped the interaction terms with everything but wealth (this is imperative to me because I want to see whether the effect of wealth on donations changes when people are "very rich", and because I want to control for reporting bias in the second group of people).

I don't want you to think I would like you to do my work for me, but as Carlo asked me to include my code and results the first time, I'll do it now as well, so that you can see what happened.

              Code:
              #delimit ;
              local wealth = "c.wealth_norm##(c.wealth_norm#c.wealth_norm)" ;
              local main = "ib0.soc_eng_2011 ib0.child_dum ib0.married ib4.age_gr
              ib0.unempl ib1.sex ib3.casmin_comp";
              local income = "c.netinc_norm##(c.netinc_norm#c.netinc_norm)" ;
              local selection = "ib0.married ib0.unempl ib0.retired ib0.position";
              
              #delimit cr
              
              local depvar_lin = "lgivings"
              local depvar_sel = "givings_dum"
              local latent = "(c.latent##c.latent)#(c.latent##c.latent)"
              local clustervar = "vce(cluster wealth_perc)"
              local clustervar2 = "vce(cluster hid)"
              
              local censor = "-0.1"
              *** regression ***
              
regress lgivings ib0.hvid##(`wealth') `main' `income' `selection' ///
        if givings_dum == 1, `clustervar'
eststo regress

regress lgivings ib0.hvid##(`wealth') `main' `income' `selection' ///
        if givings_dum == 1, `clustervar2'
eststo regress2

regress lgivings ib0.hvid##(`wealth') `main' `income' `selection' ///
        if givings_dum == 1
              
esttab . regress regress2, label noomitted nobase ///
    mtitle("no cluster" "wealth-cluster" "household-clusters") ///
    scalars(N r2_a ll ll_0 clustvar N_clust F) ///
    coeflabels(1.retired "retired" 1.unempl "unemployed" ///
               1.soc_eng_2011 "social engagement" 1.child_dum "children_dummy" ///
               1.married "married")
              wealth_perc contains the 10 wealth-deciles and hid is the household identifier.
[attachment: Regress.png (regression results table)]
And I'm sorry about the table again...
              Last edited by Caspar Aumueller; 22 Feb 2016, 03:58.

              • #8
                Caspar:
thanks for providing your feedback.
Again on the necessity of clustering: if you perform a pooled OLS, the assumption that the within variation does not differ from the between variation is hard to defend because of serial correlation; that is why clustered standard errors are almost mandatory (unless the literature in your research field reports something different).
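As a rough check of how much within-cluster correlation there actually is, a minimal sketch (assuming your residuals are stored in regressres and the household identifier is hid):

Code:
* intraclass correlation of the residuals within households (one-way ANOVA)
loneway regressres hid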
Your problem seems to rest on the best perspective from which your regression should be performed (i.e. which -id- should be investigated).
However (if I understood your last post correctly), it sounds weird that the F-test is still unreported when you select household as the -id-.
                Kind regards,
                Carlo
                (Stata 19.0)

                • #9
                  Carlo,

yes, selecting the household id as the cluster variable also doesn't solve the problem of the F-test. If I run the regression on only one of the two groups (with clustering), the F-test is reported.
If I run it on both groups, with clustering, the F-test is not reported, regardless of which cluster variable I use. The same applies to vce(robust).

Could it be that the F-test is not reported because some values of some variables (say wealth) are only encountered in one cluster (the HViD cluster)?

Thank you for the remark on clustering. One question: could it be that I can't detect the correlation between the residuals and the clusters because the cluster variable is absorbing some of the correlation from the residuals?

                  • #10
                    Caspar:
if you're interested in double-clustered SEs, you may find this thread (and related ones) interesting: http://www.stata.com/statalist/archi.../msg01041.html.
The other questions are difficult to answer definitively (for me, at least) without seeing your data.
However, I suppose you have already tried testing the correlation between the residuals and the cluster variable via both the parametric and non-parametric tests that Stata offers.
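For completeness, a minimal sketch of the kind of checks I have in mind, assuming the residuals are stored in regressres and the group dummy is hvid:

Code:
pwcorr regressres hvid, sig    // parametric: correlation with significance level
spearman regressres hvid       // non-parametric: rank correlation
ranksum regressres, by(hvid)   // non-parametric: Wilcoxon rank-sum test across the two groups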
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    • #11
                      Carlo:
                      Thank you very much for your advice.
Actually, I was referring to my very first post: I tried to "see" from the residuals whether there is some kind of relationship between the residuals and the clusters (at that time, the two groups), and ran a t-test to see whether the mean of the residuals differs between the two groups (assuming that, if there is some relation, it should show up in the residuals).

I quoted your previous answer in my original question on Cross Validated because I believe it wraps things up quite well. If you want, you can check whether I quoted you correctly here (third edit in the original question).
                      Last edited by Caspar Aumueller; 22 Feb 2016, 11:42.

                      • #12
I am puzzled by this discussion. As I understand it, you are using GOSEP, which is based on a probability sample and requires the use of Stata's survey analysis capabilities. That means you have to specify PSUs, strata, and weights. It's not clear how the wealthy sample was chosen, but I would guess that it was not a probability sample. Thus, I don't see how you can just graft the supplemental sample of wealthy individuals onto the original sample without taking the design into account. In any case, as Clyde Schechter discussed in some detail, clustering is not a fix for this problem. You could think of the wealthy subsample as a separate stratum, I suppose, but I am not at all clear on how you would specify such a design.

I don't quite understand what the t-test you reported is intended to accomplish, but aside from the survey design issue, I would think that you face an unequal variance problem. The t-test you report is based on residuals from an unspecified model, but if the model involved clustering, I don't think the test tells you what you want to know. To summarize, (a) you need to think about how, if at all, you can combine these two samples, and (b) how you can deal with what I would think would be serious heteroskedasticity in some way other than clustering.
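For concreteness, the kind of survey declaration I have in mind would look something like the sketch below; psu_id, stratum_id, and pw_var are placeholder names, since I do not know what the SOEP files call the design variables:

Code:
* hypothetical survey declaration; replace the placeholder names with the
* PSU, stratum, and weight variables supplied with the SOEP data
svyset psu_id [pweight = pw_var], strata(stratum_id)
svy: regress lgivings i.hvid c.wealth_norm c.netinc_norm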
                        Richard T. Campbell
                        Emeritus Professor of Biostatistics and Sociology
                        University of Illinois at Chicago

                        • #13
                          Caspar:
thanks for correctly quoting my previous reply on Cross Validated (no doubts about that even before checking, anyway), where I'm less active than on Statalist (time constraints impose choices, you know).
                          Kind regards,
                          Carlo
                          (Stata 19.0)

                          • #14
                            Richard:
I don't know GOSEP. I'm using the German SOEP (Socio-Economic Panel), and I thought that I wouldn't have to use the svy design because I'm only using one year (not several). I agree with you that the combination of SOEP and HViD is difficult. I have not yet found any better solution than to assume that I can just add the HViD observations to the first sample. I have no information about the sampling that took place for this survey.

                            • #15
                              This thread is evolving in directions that I have little to contribute to (which is fine!). I just want to briefly comment on

I used the globals on purpose because I have another do-file that should estimate the exact same model, and I didn't want to have to change the variable list in both do-files. I'm sure there are no other calls to those macros anywhere in the code (all of it is mine), but I will consider changing back to locals to be absolutely sure.
I didn't know about the equals sign and the quotation marks (obviously)
This is not a circumstance that requires using globals. I am often in this situation myself. The way to handle it is to take these key local macros and put them in a separate do-file, which I often give the name variable_selections.do (because it is, as in your case, typically a list of variables selected for a variety of analyses). Now, you may be thinking that if they're in a separate do-file, then they are out of scope in the analysis do-files. That is the usual case. But there is a little-used (and I think little-known) Stata command -include-. When you use -include name_of_do_file_here- in another do-file, Stata reads in the contents of that do-file and treats them as if they were actually part of the do-file that calls -include-. So in any analysis file where you want to use these same local macros, all you have to do is -include variable_selections- and the locals are there and in scope.
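A minimal sketch of that pattern (the file name and macro contents here are only examples):

Code:
* ---- variable_selections.do ----
local main   ib0.soc_eng_2011 ib0.child_dum ib0.married ib4.age_gr
local income c.netinc_norm##(c.netinc_norm#c.netinc_norm)

* ---- any analysis do-file that needs the same lists ----
include variable_selections
regress lgivings `main' `income' if givings_dum == 1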

And while you may feel quite sure that there are no other references to global macros having those same names, the truth is you never really know it. After all, do you know what goes on deep inside the bowels of every Stata command you use in a do-file? Have you actually checked the source code of every .ado you use, and of every .ado that is invoked by that first .ado, and every .ado invoked by those, ad infinitum? While the probability of a name clash may be low, certainty is not achievable without Herculean efforts. Moreover, if you are wrong in your assumption, the consequences can be bugs that are maddeningly difficult to find and remove (or worse, subtle changes that invalidate your results but are not so obvious that you delve into them).
