Hello Statalist users,
I would really appreciate help on the topic of multicollinearity.
I am running an OLS regression in Stata with fixed effects at the firm (or country or industry) level plus time effects, and standard errors clustered at the firm and time level.
As control variables, I want to include variables that sum to 1 (100%) for each firm (e.g., market shares of certain categories). Each value therefore lies between 0 and 1.
My questions:
- Can I include all these variables directly, or does this create a multicollinearity problem? Basically, I created a dummy for each category (20 dummies) and then calculated the % share for each of them.
- If it is an issue, what are the best ways to address it? I have heard that dropping one category as a reference might help (at least for dummies, but these are percentages between 0 and 100%). But what happens if this reference group does not appear for all firms (i.e., has a 0% share for some firms)? For those firms the remaining variables would still sum to 100%, potentially keeping the collinearity issue. Would transformations like ratios or differences be a better approach? If I leave out one category %, I can run the regression, but for some of these control variables I receive no coefficient. If I instead use a reference category and compute category % minus reference category %, I get coefficients for all of these variables, but the model is still highly singular (at least Stata says so). Are there any other approaches for using these "%-variables" as controls? I want to control for the category composition.
- Are there cases where the sum-to-100% structure is not problematic in a regression? And is it problematic if the sum is 1 for only some firms? I use the difference to the reference category %, but some firms have a 0% share in the reference category, as explained above.
- My constant seems very high to me (around 2000-4000, depending on the fixed effects I use), while the mean of the dependent variable is 400. I thought the constant is the value of the dependent variable when all independent variables are zero. But I also read that this does not apply when fixed effects are used. Is this true?
- Is it common or advisable to cluster SEs at the firm and time level for all fixed-effects models (firm, country, industry)?
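To make the structure of the three variants concrete, here is a minimal sketch in plain Python (not Stata, and not my real data). The numbers are made up: five hypothetical firms and three categories instead of 20, with category 1 as the assumed reference group and two firms holding a 0% share in it. The helper `matrix_rank` is written only for this illustration. It compares the column rank of the regressor matrix (a) with the constant and all shares, (b) with the reference share dropped, and (c) with differences to the reference share.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination on exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Made-up category shares in % (each row is one firm and sums to 100).
# Firms 2 and 5 have a 0% share in category 1, the assumed reference group.
shares = [
    [50, 30, 20],
    [0, 60, 40],
    [25, 25, 50],
    [10, 0, 90],
    [0, 100, 0],
]

# (a) constant + all shares, (b) constant + shares without the reference,
# (c) constant + (share_k - share_reference)
with_all = [[1] + row for row in shares]
drop_ref = [[1] + row[1:] for row in shares]
diff_ref = [[1] + [s - row[0] for s in row[1:]] for row in shares]

rank_all = matrix_rank(with_all)
rank_drop = matrix_rank(drop_ref)
rank_diff = matrix_rank(diff_ref)

print("all shares:     ", len(with_all[0]), "columns, rank", rank_all)
print("reference dropped:", len(drop_ref[0]), "columns, rank", rank_drop)
print("differences:    ", len(diff_ref[0]), "columns, rank", rank_diff)
```

In this toy case the matrix with all shares has one more column than its rank, while the other two variants have as many columns as their rank. It ignores the fixed effects entirely, so it only illustrates the sum-to-100% part of my question.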
Any insights or best practices would be greatly appreciated. Thanks!