  • #31
    Dear Clyde, I just ran -collin- (a multicollinearity test) for the DCE and CSR variables. I ran it with the industry dummies, for DCE alone and for CSR; this is what I got.
    I ran it for all the models and almost all variables have a low VIF, except for my independent variable (VIF of about 10), and the industry dummies' VIFs are huge! But when I run other variables from the DCE model I get very low VIFs.
    What do you think? How can I know which variable is causing collinearity with lib?
    Thank you so much!

    [Attachment: VIF for DCE y CSR.png]

    Last edited by Alejandro Torres; 24 Feb 2018, 19:59.



    • #32
      So, with the industry indicators, you expect the VIF's to be high. The industry variables are a group of variables that, if the reference indicator were included, would be perfectly colinear, adding up to 1. When you have a large number of such indicators, as here, removing 1 still leaves the sum of the rest of them nearly always equal to 1, especially if the reference level is a relatively infrequent one. So very high VIF's for these are not surprising, and they are also not a problem because they don't involve any other variables, and the effects of industry aren't of primary interest. The concern would be with variables that you are interested in. There are several high VIF's here: a few that are above 10 and many that are nearly 10. Those variables may be a problem, and regressing one of them against the others would be a good idea. Actually, if you just regress all of the variables (except ind*) in a linear regression, with one of them, say lib, as the dependent variable, you'll see just how much colinearity there really is by looking at R2, and the coefficients will show you which variables participate strongly in the colinearity.
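      For example, a sketch of that check in Stata (x1 through x5 are placeholders here; substitute your actual non-industry covariates):

      Code:
      regress lib x1 x2 x3 x4 x5
      display e(r2)

      A very high R2 confirms strong colinearity, and the coefficients with the largest t-statistics point to the variables most involved in it.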

      Normally, I don't pay much attention to colinearity, at least not if it doesn't involve my key variables of interest. The reason I'm concerned about it here is that your logistic regression outputs have been implausible, and of the type that is typically seen when you have some variables that are near perfect predictors of the outcome. With your first, very limited, data example, it looked like there were some near perfect predictors. But in your larger data set, that was no longer the case. Then it occurred to me that if your variables had a very strong colinearity relationship, then it could turn out that some of the variables were near perfect predictors when the others are adjusted for. This is what I believe is going on to cause your problems. Now, of course, I haven't seen the full data set, which is very large. And it may be that the strong colinearity observed in the smaller examples does not hold in the larger data. But the later data examples you showed were large enough that it is likely that what I observed there does hold in the full data set. So that's still how I'm thinking about your problems.

      In any case, I still recommend the same approach to solving the problem. Order your variables in order of importance to your research goals. Start out with a model that includes only the most important predictor. Then add variables one at a time and keep going as long as things continue to converge and the results look reasonable.

      A word about what is a reasonable coefficient in a logistic regression. Remember that the logistic regression coefficients are the logarithms of odds ratios (or, for the constant term, the logarithm of the odds of the outcome when all predictors are zero). Since you can easily calculate the overall probability of the outcome in your data set, it is reasonable to expect that the constant term will be of the same order of magnitude as the logit of that probability. For dichotomous variables, if you see a coefficient of magnitude 4 (positive or negative) you are looking at an odds ratio of about 55. In the real world, odds ratios that large simply aren't seen. When you see coefficients that big in a logistic regression it usually means there is a problem with the model or the data. Even coefficients of magnitude 3 correspond to an odds ratio of about 20, which is really stretching the limits of credibility in the real world. For continuous variables, obviously it depends on the scale of the variable. But it is easy enough to interpret them. The exponential of the coefficient is the odds ratio associated with a unit change in the variable. So if you have a coefficient of, say, 10, that's an odds ratio of about 22,000. Based on your knowledge of what the variable is, you can determine whether a 1 unit change could possibly be associated with an odds ratio that large. If the units of measurement of the variable are reasonable ones, the answer will clearly be no. That could only be reasonable if the units of measurement are such that a full 1 unit change is much larger than could occur in nature.
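      A quick way to check these magnitudes in Stata:

      Code:
      display exp(4)    // about 54.6
      display exp(3)    // about 20.1
      display exp(10)   // about 22026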



      • #33
        Dear Clyde, I don't know what to say to express my gratitude for your help and explanation. I really thank you; it has been really helpful.
        Thank you so much,
        best regards!



        • #34
          Dear Clyde,

          I was playing with the data and there are two obviously correlated variables, lib and lpd, for example. The reason is that lpd is an interaction, lib x pdh, because I am testing that interaction; the issue is that I am testing interactions in all models. Actually, I tested collinearity without the interactions and all the VIFs are very low (around 2). I also tested clustering by country, and lib was significant at 10%. The problem now is that I need those interactions.
          What do you recommend I do?
          Thank you!
          Alejandro



          • #35
            Well, a certain amount of colinearity between an interaction and its constituent effects is expected (although you can minimize it by centering the main effects before calculating the interaction--which you should definitely do here). As your original models are now several pages back and hard to find, I'll just point out also that if you have one variable and an interaction term involving it in the model, the other variable that participates in the interaction also needs to be there (unless it is completely colinear with something else in the model, typically a fixed effect in an -xt...,fe- model). I don't recall whether you have violated that rule or not.

            Not that it will help with these estimation issues, but also with modern Stata you should not calculate interaction variables as products of the main variables. You should let Stata create them automatically through factor variable notation. So, instead of
            Code:
            regression_command outcome_var lib pdh lpd...
            do
            Code:
            regression_command outcome_var c.lib##c.pdh
            Note: If either lib or pdh is a discrete variable, replace c. by i. for the corresponding variable.

            That way Stata will create a virtual product variable for the regression, and it will populate the regression command variable list with lib, pdh, and their interaction for you. Best of all, you will be able to use -margins- after the regression to correctly estimate predicted values and marginal effects involving these variables. (With lib pdh lpd, -margins- will run and give incorrect answers with no warning, because -margins- will have no way of knowing that lpd is the interaction of lib and pdh, and that will cause it to treat it as an unrelated variable.)

            Getting back to estimation issues, while some colinearity between lib and lpd is to be expected when lpd is the lib#pdh interaction, if it is extreme enough to be causing this kind of problem, it suggests that pdh does not vary very much, or, if it does, pdh is itself strongly correlated with lib. In either case, your data simply will not support a usable estimation of the interaction between them. I am known on this forum for discouraging people from worrying about multicolinearity among variables, but the one situation where it matters a great deal is where a) it is very strong, and b) a variable whose effect you actually want to estimate carefully for your research goals is involved. If the colinearity is not too strong, centering the variables may help (do not center the interaction: just center lib and pdh and let Stata deal with the interaction directly). If it does not, then you are stuck: there will be no simple fix and your options all involve getting different data. The different data needed would either come from a different study design (one that samples in such a way as to provide much more variation in pdh, or that breaks the correlation between pdh and lib), or from the same study design with a much, much larger sample.
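            A minimal sketch of that centering step (c_lib and c_pdh are new variable names introduced here; regression_command and outcome_var stand for your actual command and outcome, as above):

            Code:
            summarize lib, meanonly
            generate c_lib = lib - r(mean)
            summarize pdh, meanonly
            generate c_pdh = pdh - r(mean)
            regression_command outcome_var c.c_lib##c.c_pdh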



            • #36
              Dear Clyde,

              Thank you again for your time, I will try your code.
              Thanks so much again.
              Alejandro



              • #37
                Dear Clyde, I tested the command you mentioned, and for one model there were important differences. With my usual command, entering the interaction myself, I got a coefficient for lib of -0.72; with the interaction notation you taught me I got lib: 0.133. The second coefficient looks better to me, but I would like to know why there could be such a difference.
                Thank you again!



                • #38
                  Without showing the exact commands you ran and the output Stata gave you from each, it is impossible to say.



                  • #39
                    Thank you Clyde, now I am sending you everything: first the way I was working, and second with the command you taught me. Thank you!

                    Code:
                    fracreg logit femp rightleft mah lma roe bs bi lev siz Polrightinv religdiver lingdiver ethnic democracy autocracy  RuleofLaw ShRights CredRight mcap lngdp femeduc genderquotas  i.industry i.year, vce(robust)
                    [Attachment: femp original.png]


                    Code:
                    fracreg logit femp c.rightleft##c.mah roe bs bi lev siz Polrightinv religdiver lingdiver ethnic democracy autocracy  RuleofLaw ShRights CredRight mcap lngdp femeduc genderquotas  i.industry i.year, vce(robust)
                    [Attachment: femp original new.png]
                    Thank you again



                    • #40
                      So, the commands are identical except that in the first version you have variables rightleft mah and lma, whereas in the second you have c.rightleft##c.mah. This substitution should produce the same results provided the estimation samples are the same (which they appear to be, at least they have the same N), and if lma == rightleft*mah. So I suspect that the latter is not true. Run
                      Code:
                      assert lma == rightleft*mah if e(sample)
                      My bet is that the output will say that the assertion is false, and it will tell you for how many observations it is false. Then run

                      Code:
                      browse if lma != rightleft*mah & e(sample)
                      to find out what's going on.

                      If lma is not the same as rightleft*mah, then substituting c.rightleft##c.mah for the latter changes the model and is probably inappropriate. So you will need to decide whether lma really is supposed to be rightleft*mah or not. If it is supposed to be, then you have to fix the data. If it is not supposed to be, then you cannot use c.rightleft##c.mah.



                      • #41
                        Dear Clyde,

                        you are right, I checked the data and it was not exactly the same, so I fixed it and then the outcomes were the same. Thank you!!

                        Now I think I have the last question, a basic one.

                        My question is: in my data I have several dummies; the variable lib is a dummy, and then I have 9 industry dummies and 8 year dummies. I know that when I have a set of dummies I have to drop one, and its value is absorbed into the constant, or I can keep all the dummies but drop the constant.

                        Now, what happens with my data? What if I have dummies in several variables (lib, 9 industries, and 8 years)? What happens if I use all industries and all years in the model? What happens to the constant if I drop one dummy for industry and one for year?

                        Thank you very much Clyde, you are helping me a lot with my first real (complicated) dataset.

                        Thank you again.
                        Last edited by Alejandro Torres; 27 Feb 2018, 05:42.



                        • #42
                          I know that when I have a set of dummies I have to drop one, and its value is absorbed into the constant, or I can keep all the dummies but drop the constant.
                          Yes, but using all the indicators ("dummies") and omitting the constant only works when you have just one group of indicators. You have two, and you can only apply this trick to one of them.
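                          For the record, the all-indicators-with-no-constant version of that trick can be written in factor-variable notation with ibn. (no base level) plus the -noconstant- option, but only for one of the two groups. A sketch, with regression_command, outcome_var, and other_covariates as placeholders:

                          Code:
                          regression_command outcome_var ibn.industry i.year other_covariates, noconstant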

                          It doesn't really matter which way you do this. The predicted model outcomes will be the same either way, as will any contrasts among levels of the categorical variables provided you subsequently calculate them correctly. Now, at one time you had to give a fair amount of thought to these matters because it would affect the complexity and difficulty of interpreting the regression results. But now we have the -margins- command to simplify that. So, regardless of how you do it, -margins- will enable you to easily calculate the predicted values at all levels of these variables and also any differences between them.

                          The "path of least resistance" is to use the default version of factor-variable notation. So, don't generate your own indicators; use factor-variable notation to do it: i.industry and i.year. If you haven't already read -help fvvarlist- and the associated manual section, do so now. It will also explain the c.var1##c.var2 device recommended earlier in this thread.
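                          For instance, a sketch of the -margins- calls that become available this way (regression_command, outcome_var, and x1 are placeholders for your actual command and variables):

                          Code:
                          regression_command outcome_var x1 i.industry i.year
                          margins industry
                          margins year
                          margins industry, pwcompare

                          The pwcompare option gives all pairwise contrasts between industry levels, regardless of which level was chosen as the base.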



                          • #43
                            Hello Clyde, thank you.
                            I read the information about i. but I am going to read it again, because I didn't pay attention to c.var1##c.var2. Actually, I did finally use i.year and i.industry in my final models.
                            Finally, what if I use all the dummies for industry and year (not dropping any dummy and not dropping the constant either)? I did this in my initial models and I am worried that I may need to run the models again.
                            Thank you Clyde.
                            Alejandro.



                            • #44
                              Finally, what if I use all the dummies for industry and year (not dropping any dummy and not dropping the constant either)?
                              Try though you may, this is not possible. You can put all of them into the regression command if you like, but because they are colinear, Stata will pick one from among the industries and one from among the years and drop those. (It will tell you that it is doing that.) If you think you did this in your initial models, then you did not read the output closely enough: one industry and one year will be omitted; if you don't pick them yourself, Stata picks them for you.



                              • #45
                                Thank you Clyde, you are right, Stata did it.
                                Thank you so much again for your help. Honestly, I really appreciate it.
                                Thank you.
                                (I will try not to write to you again for the rest of the day.)

