
  • How to analyse using clusters

    Hello, I am doing some analysis at different levels using multilevel analysis. I did that, but now I have been asked to cluster by country.
    I have three models: for one I am using fracreg and for the other two xtlogit. I saw that I can use the command vce(cluster), but I am not clear about how to decide what to use as clusters. In another forum I saw a vce(cluster industry) command; should I use the same command but with country instead of industry?

    Something like:
    xtlogit depvar indvar controlvars i.year, vce(cluster country)?

    Could that be correct? I have 40 countries.
    I am using the latest version of Stata.

    Thank you so much for your kind help!

    Best regards

  • #2
    If you were asked to cluster your standard errors by country, then, yes, the -vce(cluster country)- option (not command) is the way to do that, and the syntax you show is correct.

    If you are not sure why you were asked to do this, you should probably ask the person who told you to do it in the first place. It's really a matter of the substance of your problem and is only indirectly a statistical question. Presumably whoever advised you to do this is concerned about correlation of error terms within countries. Whether that is a reasonable concern depends on what the variables are and how they work in the real world, so it is not a statistical issue.

    As for the number of countries, it is true that with a small number of clusters, the cluster-robust variance estimator is not valid. There is no universally agreed upon definition of "small number" for this purpose. Suffice it to say that most people would say that 40 is enough, though there might be some who would disagree.
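
    As a quick sanity check, it is easy to confirm how many clusters are actually in your estimation sample. A minimal sketch, assuming your country identifier is a variable named country (-tabulate- stores the number of distinct values in r(r)):

    Code:
    quietly tabulate country
    display "Number of clusters: " r(r)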



    • #3
      Hello Clyde,

      Thank you very much for your answer. It's clear and, as you said, it's about how the variables work in the real world.
      I will test with that option then.
      Best regards, and thank you



      • #4
        Dear Clyde,

        I am writing because I ran the analysis with the syntax I wrote before, but all of the results were absolutely strange (at least to me). In one xtlogit, I do not get a robust standard error for my independent variable in the output; in a second analysis with another dependent variable, I do not have any significant variable (out of a total of 26); and in the fracreg analysis I lost significance on several variables, including my independent variable.

        As an example I used (idc is my variable for the country id):

        Code:
        xtlogit dce lib pdh lpd roe bs bi lev siz Polrightinv religdiver lingdiver ethnic democracy autocracy RuleofLaw ShRights CredRight mcap lngdp i.industry i.year, vce(cluster idc)

        In other words, it is a complete disaster.

        I really appreciate any comment.

        Thank you !!



        • #5
          Well, the focus on statistical significance may be causing you to over-react to minor changes that happen to cross the .05 "threshold." Or things may really have changed. For specific advice, please post the actual commands and output, wrapped in code delimiters. (See Forum FAQ #12 for information about code delimiters if you are not familiar with them.)

          My comments will be limited to the statistical issues. If all the analyses have been done correctly, then it is a matter of choosing which model is appropriate, and, as already indicated, that is a content issue, not a statistical one.



          • #6
            Hello Clyde, thank you again.

            Following your instructions, I am sending the test of one of my three models.

            The actual command and output:

            Code:
            xtlogit csr lib lth llt roe bs bi lev siz Polrightinv religdiver lingdiver ethnic democracy autocracy RuleofLaw ShRights CredRight mcap lngdp i.industry i.year, vce(cluster idc)
            Thank you for your comments.


            [Screenshot: CSR clustered.png]



            • #7
              I forgot to say that my independent variable is lib.
              Thank you



              • #8
                Well, the point was to compare the different results, so showing only one set isn't really sufficient for the purpose.

                That said, these results are very, very strange. Logistic regression coefficients with magnitudes like 9 and 10 just don't happen in the real world. Those are equivalent to odds ratios on the order of magnitude of 10,000! When I see results like that, I know something is very, very wrong. Usually the problem is in the data. Take religdiver as an example, with a coefficient of 19. That's just not possible: it corresponds to an odds ratio of roughly 180,000,000. Usually that kind of result arises when there is only a single observation in the data set with religdiver = 1 and all the rest are 0, or something very similar to that. It looks to me as if you are trying to analyze some predictors that have very, very lopsided distributions like that. I would look carefully at the distributions of all of your predictors and get rid of any that have a very small number of observations in any category. I would also look for predictors that correlate very strongly with the outcome variable (sometimes called "perfect" predictors); I think you have that going on in these data. These results simply can't be valid. (A sketch of these checks follows below.)
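
                A minimal sketch of those checks, using the csr model from #6 (which predictors are 0/1 indicators is a guess here; substitute your own variable lists):

                Code:
                * distribution of candidate indicator variables, including missings
                tab1 lib democracy autocracy, missing

                * cross-tab each indicator against the outcome to spot categories
                * that predict csr (almost) perfectly
                foreach v of varlist lib democracy autocracy {
                    tabulate `v' csr, row missing
                }

                * continuous predictors: look for degenerate or lopsided distributions
                summarize religdiver roe bs bi lev siz mcap lngdp, detail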



                • #9
                  Dear Clyde,
                  Thank you again.

                  I am posting the output from another analysis, which has a different dependent variable and more years.

                  I know about perfect predictors; I had that problem once and Stata showed me a warning, but I don't know if there is some way to find them. You mentioned correlation, but I don't have variables with a high correlation (0.65 at most).
                  On the other hand, what do you call a very lopsided distribution? What could I consider a very small number in any category? Is there any rule of thumb for that (any percentage)? How do you recommend I check the distributions? csr has a distribution of 13,310 zeros and 1,550 ones (using codebook); now, why could religdiver have that problem if it looks like this:

                  [Screenshot: Religdiver stats.png]

                  Now, I am sending you a second analysis. I didn't send it before because I didn't want to overwhelm you (I am doing so now).

                  The command and output are:

                  Code:
                  xtlogit dce lib pdh lpd roe bs bi lev siz Polrightinv religdiver lingdiver ethnic democracy autocracy RuleofLaw ShRights CredRight mcap lngdp i.industry i.year, vce(cluster idc)



                  [Screenshot: DCE Clustered.png]

                  Finally, why do I have this problem when using the vce(cluster country) option? I ran the same analysis without that option, using a two-level multilevel analysis, and did not have this problem.

                  Thank you so much for your help.

                  Best Regards



                  • #10
                    Well, that distribution of religdiver is not, itself, a problem. That result may have to do with its relationship to other variables. What catches my eye now is the _cons term in the second analysis, the one you put up a screenshot of in #9: -106.7321. That means that when all your other variables are zero, the probability of dce is about 4 x 10^-47. If your distribution of dce is nearly all zeros, with just a handful of ones, that would explain this. Try -tab dce if e(sample)- and see. If that's what it looks like, then this analysis is simply not feasible.

                    If not, then I suggest trying this over with all of the variables centered around their means (except for the 0/1 variables, which you should leave as they are). That will probably bring the constant term, as well as some of the other coefficients, back towards a more normal range.

                    A similar consideration applies to the analysis with the csr outcome. If csr is almost always zero, then this analysis is not feasible. If there are plenty of 1 values of csr, then try centering the predictor variables (other than the 0/1 variables).
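
                    A minimal sketch of that centering, assuming the continuous predictors in your posted commands (adjust the list to your own data, and leave the 0/1 indicators as they are):

                    Code:
                    foreach v of varlist roe bs bi lev siz mcap lngdp {
                        quietly summarize `v'
                        generate c_`v' = `v' - r(mean)
                    }
                    * then refit using the centered versions, e.g.
                    * xtlogit dce lib c_roe c_bs ... i.industry i.year, vce(cluster idc)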

                    By the way, in the future, please do not post screenshots. The one in #9 is just barely readable on my computer. Often they are completely unreadable. The best way to show Stata results is to copy/paste them directly from the Results window or your log file into the forum, between code delimiters, just as you did to show the command.



                    • #11
                      Hello Clyde, I can't say thank you enough.

                      About dce, I got this:


                      Code:
                              dce |      Freq.     Percent        Cum.
                      ------------+-----------------------------------
                             0.00 |     11,806       59.83       59.83
                             1.00 |      7,926       40.17      100.00
                      ------------+-----------------------------------
                            Total |     19,732      100.00

                      Do you think that the best way to center around the means is to create a new variable by subtracting the mean?


                      Another analysis is:

                      Code:
                      fracreg logit femp lib mah lma Polrightinv religdiver lingdiver ethnic democracy autocracy RuleofLaw ShRights CredRight mcap lngdp femeduc genderquotas roe bs bi lev siz i.industry i.year, vce(cluster idc)
                      the output is below (it is a PNG file again; I couldn't paste it directly, but I will see how I can do it next time):

                      [Screenshot: femp clustered.png]

                      Thank you so much !!



                      • #12
                        Do you think that the best way to center around the means is to create a new variable by subtracting the mean?
                        Yes.

                        And even though most of the results shown in the latest results you posted are not absurd, that constant term is still disturbingly low. At best it suggests that the bulk of your predictor variables are bounded far away from zero. So centering will make the calculations more numerically stable. It probably won't affect this latest set of outputs that much, but I think you will see a big improvement in the other results you have shown. (By which I mean that the results will be within the range of the possible and will make some sense. Whether you will get the statistically significant findings you are hoping for, I cannot say. But that's not really the point. Getting the results correct is what counts, and then the chips fall where they may.)



                        • #13
                          Thank you very much, Clyde. Do you think that standardising is good too? Centering and dividing by the standard deviation, or just centering?
                          I really appreciate it.



                          • #14
                            I hate standardized variables. I almost never use them. The problem with them is that they make the results difficult or impossible to interpret. So if, say, age is a continuous variable, used as is or centered, its marginal effect is the expected rate of change in the outcome per year of age. If you standardize it, then the marginal effect is the expected rate of change in the outcome per standard-deviation change in age. But the standard deviation of age depends specifically on your estimation sample, so it doesn't generalize to anything else. And nobody but you even knows how big a standard-deviation change in age is: it could be quite a few days or many decades or anything in between, depending on your data. And maybe even you don't know how big it is if you haven't specifically calculated it. So when you report the results of an analysis with standardized variables, your audience will have no idea, even qualitatively, whether the age effects are big or small.

                            The one place where I find standardization acceptable is when the variable being standardized has no natural units at all: it is on a totally arbitrary scale that nobody is familiar with or has intuitions about, and when there is no expectation that the results for this variable will be generalized to other settings. Needless to say, there aren't very many variables like that in real life. In fact, in my experience, this mostly describes latent variables in structural equations models, where identification by fixing the variance to 1 is an acceptable approach.

                            And, yes, I'm aware that many people in the mental health discipline disagree with me on this.
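
                            For concreteness, a minimal sketch contrasting the two transformations discussed above (age is a hypothetical variable here; -egen, std()- does the standardizing):

                            Code:
                            quietly summarize age
                            display "one SD of age = " r(sd) " years (sample-specific)"
                            generate c_age = age - r(mean)   // centered: effects still per year of age
                            egen z_age = std(age)            // standardized: effects per sample SD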



                            • #15
                              Thank you, Clyde. One last question about standardized variables: from the other side, why could they be better?
                              Thank you

