
  • Regression with Variables Summing to 100% – Multicollinearity Issue?

    Hello Statalist users,

    I would really appreciate help on the topic of multicollinearity.

    I am running an OLS regression in Stata with fixed effects at the firm (or country or industry) level and time effects, with standard errors clustered at the firm and time level.
    As control variables, I want to include variables that sum to 1 (100%) for each firm (e.g., market shares of certain categories). The values thus range from 0 to 1.

    My questions:
    1. Can I include all these variables directly, or does this create a multicollinearity problem? Basically, I calculated dummies for each category and then calculated % shares for each of these dummies (20 dummies).
    2. If it is an issue, what are the best ways to address it? I have heard that dropping one category as a reference might help (at least for dummies), but these are percentages between 0 and 100%. What happens if this reference group does not appear for all firms (i.e., has a 0% share for some firms)? The remaining variables would still sum to 100%, potentially keeping the collinearity issue. Would transformations like ratios or differences be a better approach? If I leave out one category %, I can run the regression, but for some of these control variables I receive no coefficient. If I use a reference category and calculate category % minus reference-category %, I get coefficients for all of these variables, but the matrix is still highly singular (at least Stata says so). Or are there any other approaches to make use of these "%-variables" as controls? I want to control for these category compositions.
    3. Are there cases where the sum-to-100% structure is not problematic in a regression? And is it problematic if the sum is 1 for only some firms because I use the difference to the reference-group category %, while some firms have a 0% share of the reference-group category, as explained above?
    4. My constant is very high in my opinion (around 2000-4000, depending on the fixed effects I use), while the mean of the dependent variable is 400. I thought the constant is the value of the dependent variable when all independent variables are zero. But I also read that this does not apply when fixed effects are used. Is this true?
    5. Is it common or advisable to use clustered SEs at the firm and time level for all fixed-effects models (firm, country, industry)?
    In Stata I am able to run the regression, but it says that the variance-covariance matrix is highly singular (F-test p-value < 0.05, R2 = 55%).

    Any insights or best practices would be greatly appreciated. Thanks!

  • #2
    Can I include all these variables directly, or does this create a multicollinearity problem? Basically, I calculated dummies for each category and then calculated % shares for each of these dummies (20 dummies).
    No, you can't include them all; they are completely collinear. From a mathematical point of view, your variables are no different from a set of indicator ("dummy") variables where you forgot to omit one of the categories.
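    A minimal sketch of that point in Stata (variable names here are hypothetical: y is the outcome, share1-share20 the 20 category shares, firmid the firm identifier). If you put all 20 shares in, Stata will simply drop one of them and note that it was omitted because of collinearity:

    Code:
    * all 20 shares included; Stata flags one as "omitted because of collinearity"
    regress y share1-share20, vce(cluster firmid)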

    If it is an issue, what are the best ways to address it? I have heard that dropping one category as a reference might help (at least for dummies), but these are percentages between 0 and 100%. What happens if this reference group does not appear for all firms (i.e., has a 0% share for some firms)? The remaining variables would still sum to 100%, potentially keeping the collinearity issue.
    The alternative to omitting one of them is to impose a constraint on their coefficients instead. Remember that when you have a set of collinear variables in your model, the model is unidentifiable. It's like modeling something where two of the variables are the temperature in degrees Fahrenheit and the temperature in Kelvin. There is no way to identify the coefficients of those variables because you can pick either of the coefficients to be anything at all, and then there will be a corresponding coefficient for the other that produces the same best-fit linear combination.
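    If you prefer the constraint route, a minimal sketch (again with hypothetical variable names) uses constraint and cnsreg; constraining one coefficient to zero is exactly equivalent to omitting that variable, but any single identifying constraint would do:

    Code:
    * constrain the coefficient on share1 to zero (equivalent to omitting share1)
    constraint 1 share1 = 0
    cnsreg y share1-share20, constraints(1)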

    Also, it's important to remember that while you can get coefficients, either through a constraint or by omitting one variable (which is the same thing as constraining that variable's coefficient to be zero), those coefficients do not meaningfully express relationships between those variables and the outcome: if you impose a different constraint (or choose a different variable to omit), you will get different coefficients, although the two models will be equivalent in terms of predicted values.

    Would transformations like ratios or differences be a better approach? If I leave out one category %, I can run the regression, but for some of these control variables I receive no coefficient. If I use a reference category and calculate category % minus reference-category %, I get coefficients for all of these variables, but the matrix is still highly singular (at least Stata says so). Or are there any other approaches to make use of these "%-variables" as controls? I want to control for these category compositions.
    If you instead use as your variables the difference between each variable and some chosen base category variable, that will work, because it is more or less equivalent to constraining the coefficient of the base category variable to be zero. However, the fact that you are getting other covariates ("control variables") omitted, or singular results, tells me that these covariates are constant within each category; this is yet another collinearity that is not allowable. If you have variables that effectively identify the categories, then you cannot have invariant category attributes as covariates in the model. Again, it's linear algebra and there is no getting around it in a linear model.
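    A sketch of the difference-to-base-category transformation you describe (hypothetical names again; share1 is the chosen base category):

    Code:
    * express each share as its difference from the base category's share
    forvalues j = 2/20 {
        generate double d_share`j' = share`j' - share1
    }
    regress y d_share2-d_share20, vce(cluster firmid)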

    [The following paragraph added in edit]
    Also, if you want to "control" for (really, in observational data you never actually control for anything, you adjust for things) these categories, you accomplish that completely by including all but one of them. The omitted category's effect gets distributed over the others and the constant term. So with one category omitted, you have accomplished your goal of adjusting for all of the categories. What you have not done, and cannot do, no matter what clever trick you try, is estimate the effects of all of the categories. Because of the collinearity, they do not have separate effects: they exert their effects as a group.

    Are there cases where the sum-to-100% structure is not problematic in a regression?
    No. This is linear algebra and there is no getting around it.

    And is it problematic if the sum is 1 for only some firms because I use the difference to the reference-group category %, while some firms have a 0% share of the reference-group category, as explained above?
    No, this would not be problematic. Now, if there were only one firm where the sum was different, you would get results, but with the data that close to collinear, your standard errors would probably be very wide and your results pretty inconclusive. And given how you have defined these variables, I don't see how they could not sum to 100%. You just have to reconcile yourself to getting rid of one, or otherwise constraining the unidentifiable model.

    My constant is very high in my opinion (around 2000-4000, depending on the fixed effects I use), while the mean of the dependent variable is 400. I thought the constant is the value of the dependent variable when all independent variables are zero. But I also read that this does not apply when fixed effects are used. Is this true?
    That's right, it doesn't apply. The constant term in a fixed effects model is complicated to explain. The most important thing to know about it is that it is almost always meaningless. You have already wasted too much of your time even glancing once at it, let alone thinking about its value. Just ignore it.

    Is it common or advisable to use clustered SEs at the firm and time level for all fixed-effects models (firm, country, industry)?
    Not always. You shouldn't use clustered SE if there aren't a sufficiently large number of different values of the clustering variable. While there is no simple rule to say how many is sufficient, if you have, say, only 10 countries, you should not cluster on country. A simple rule of thumb would be about 50 or more to support clustered standard errors, although others might give you a higher or lower number or a more complicated approach. Assuming you have sufficiently many of them, the clustering is typically done at the highest level.
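    For concreteness, a sketch of the two variants (hypothetical variable names; reghdfe is a community-contributed command from SSC, not official Stata):

    Code:
    * one-way clustering at the firm level, firm fixed effects plus year dummies
    xtset firmid year
    xtreg y d_share2-d_share20 i.year, fe vce(cluster firmid)

    * two-way clustering on firm and year, only if both dimensions have enough clusters
    * ssc install reghdfe
    reghdfe y d_share2-d_share20, absorb(firmid year) vce(cluster firmid year)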
    Last edited by Clyde Schechter; 06 Feb 2025, 17:39.

    • #3
      Also see the last section of https://www.maartenbuis.nl/publicati...oportions4.pdf
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------

      • #4
        To complicate the question further: if market shares are predictors, I wouldn't assume in principle that their effect is linear over the entire possible range.

        • #5
          Hello and thanks for your answers!
          Your detailed answer really helped me, Clyde, as did the document from Maarten; it helped me understand more of this topic, as I could not find any papers about it before. Thanks!

          I now leave out one category % (these are not market shares, but it is a bit too complicated to explain, so that was just an example) and my regression works.
          I just have the "problem" that for some categories the average % in each firm is very low (e.g., 0.000012 or lower). These now have very high coefficients and standard errors, but some are still significant.
          My question now would be: should I exclude these categories, given that they have a very low average % in the sample, so that a lot of firms have 0%?

          I would be very happy if you could help me out further!

          • #6
            So, first, just remember that nothing is wrong with what you are getting. When a variable has very small values, its coefficient will be large compared to that of a variable which takes on moderate or large values. Similarly, the large standard error is exactly what you would expect. I'm not sure why you want to fix this: it doesn't sound to me like it's broken.

            That said, if you have some reason you can't abide this situation, you might consider combining some of these "undersubscribed" categories into a single category. It's kind of like when you have a variable like, say, religious preference, where two or three categories predominate and then there are five or six others with just a handful of people choosing them: you might combine those others into a single "Other" bucket. You can do the equivalent here by creating a new variable that is the sum of these small ones and using it in place of the small ones in your regression. If the meanings of the categories are sufficiently compatible that combining them into a single "Other" category makes sense, this would be better than omitting them. Remember also that if you omit them, you are, in effect, combining them not just with each other but with the one category you previously selected as the omitted reference category. So it is actually a bigger deviation from the original data. I think that when we revise models, we should do the minimum data tampering possible consistent with fixing whatever problem the current model has.
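            A minimal sketch of that combining step, assuming (hypothetically) that share18-share20 are the undersubscribed categories and share1 is the omitted reference:

            Code:
            * pool the rarely observed categories into a single "Other" share
            egen double share_other = rowtotal(share18 share19 share20)
            regress y share2-share17 share_other, vce(cluster firmid)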

            And, once again, it doesn't look broken to me, and I'd think twice or more about fixing it.

            • #7
              Are they actually percentages (summing to 100) or proportions (summing to 1)? I guess the former, but the sum of those averages seems small.

              • #8
                Thank you Clyde!
                I have thought of combining these groups as well! I was just not sure whether this is a good way of doing research.
                So this would have been my next question, which you have already answered. I think I will use this method and have a look at the results. I really appreciate your comments and time.

                Hello Jeff, they are actually proportions, but I scale them by 100 to obtain percentages. Either way, some of them are very small in comparison to the other categories.

                • #9
                  I would also like to hear your opinion on whether it matters which category I leave out of the model. I know that the interpretations are relative to the reference group. Does it matter which group is used and, if yes, should I use a group that is highly represented in my data or, in my case, the group "others"?
                  Thank you in advance!

                  • #10
                    whether it matters which category I leave out of the model
                    It depends on what your modeling purpose is. If you are building a model to predict expected outcomes, it makes no difference. If you are doing a model to estimate causal effects or associations, but these variables are included only as "control variables," it also makes no difference. The only time it makes a difference is when you are trying to estimate the effects of these particular variables themselves. Now, due to the collinearity among them, no estimate of their absolute effects is even possible. But effects relative to the omitted category are estimated. So you would choose to omit the category such that effects relative to that category are of greatest interest. This in turn depends on exactly what these categories are and what they mean in the real world.

                    For example, when we do health disparities research in the United States, the usual interest is how everybody else is doing compared to white people (the conventional term for people of European ancestry). So when we introduce a race variable into the analysis, we would usually make white the omitted reference category. However, I have worked at a place where there are very few whites and the disparities of greatest interest are between black people (of African ancestry) and latino people (of Central and South American or Mexican ancestry), so making whites the reference group would not make the difference of interest immediately obvious in the outputs: an additional calculation subtracting two coefficients would be needed. So in that setting, typically the reference category would be either black or latino.

                    In short, in most situations it makes no difference. In the situations where it does matter, the choice is not based on statistics, it is based on the real-world aspects of the focus of your study.

                    • #11
                      Thank you for your detailed response.
                      Now I can adjust my model as you recommended, and, as you said, the estimates of the "control" variables are not of great interest to me.
                      I am now able to apply your recommended approach while being aware of the interpretation!

                      • #12
                        Hello Clyde,

                        I have further questions about the interpretation and standardization of variables or coefficients.
                        I am unsure about the interpretation of the coefficients.
                        When using dummies, I know that we would interpret the coefficients as the effect in comparison to the reference group, e.g., white has a positive effect on smoking in comparison to latino, right?

                        Now, if I use proportions of categories instead of dummies and use a reference group, how would I interpret the coefficient?
                        Does it tell me: "If the proportion of A is greater than B (reference group), then the proportion of A has a positive effect on the dependent variable,"
                        or: "A higher proportion of A has a positive effect on the dependent variable in comparison to B (reference), where a higher proportion of B has a lower or negative effect"?

                        Also I thought about standardizing these category % variables using z-scores. Do you think this would be a good idea?
                        I know that standardizing is good if I want to compare how important each variable is for explaining the dependent variable.
                        In theory every proportion can be in the range of 0 to 1, but in reality this is not the case.
                        Is the real range important?

                        Can I apply z-scores to just these proportion variables and not the others?

                        If the independent variable I am interested in has the same metric as the dependent variable (though of course the range is different), wouldn't it be unwise to standardize this variable, since the interpretation changes and I am not really interested in the relative importance of most variables (except maybe the proportion variables)?
                        Thanks in advance!

                        • #13
                          With proportions, the role of the omitted category is a bit different from what happens with indicator ("dummy") variables. If we have N categories, and have variables for the proportions of something among them, and we omit one category to serve as the reference category, the coefficient, b, for any of the other categories is interpreted as follows: for a given difference of amount D between observations in that proportion variable, the expected outcome difference is b*D, provided that the other non-reference proportions are unchanged and the amount D is compensated by a change of -D in the reference category.
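                          For illustration, with made-up numbers: if the coefficient on category A's share is b = 3, and firm 2's share in A is 10 percentage points higher than firm 1's, with those 10 points coming entirely out of the reference category and all other shares equal, then firm 2's expected outcome is 3*10 = 30 units higher than firm 1's.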

                          I don't think z-scores would be helpful here at all. In fact, they would only muddy the waters. If you tell people that you have found that a 1 percentage point increase in the share of X in category J is associated with a difference delta in the expected outcome, they know what that means. If you tell them that a 1 standard deviation increase in the share of X in category J is associated with a difference delta in the expected outcome, what does that mean? Who knows what the standard deviation of the share of X in category J is? Maybe you do, since you will have run the analysis yourself. But your audience won't, unless you tell them. And even if you do, their first instinct will be to try to convert that from a 1 standard deviation increase to some corresponding increase measured in percentage points, which will require them to do mental arithmetic. At best, this will be regarded as a not reader-friendly presentation of the findings.

                          I know that standardizing is good if I want to compare how important each variable is for explaining the dependent variable.
                          You may know it, but it isn't true. This is a widely believed fallacy. At most it is true when the variables being compared are all measured on the same scale. Even then, it isn't clear that it's true in any useful or meaningful way. Standardizing variables generally serves only two purposes: it turns easily understandable findings into obscure, opaque claims, and it gives the illusion that the author/presenter is mathematically/statistically sophisticated. There are exceptional situations where standardizing variables is useful, but they are actually pretty rare in practice, and this isn't one of them.

                          • #14
                            You are surely right about that. I will leave it at the simplest possible interpretation.
