  • How to treat Compositional Explanatory Variables

    Dear Community,

    I have a question about the treatment of independent variables that are compositional, i.e. each variable is a relative contribution to a whole and together they are subject to a sum constraint (they all sum to 1 or 100%).
    In an earlier post, I considered simply dropping one variable as the reference, or dropping the constant term, to avoid multicollinearity. However, after going deeper into the topic, I came across other problems that come with compositional independent variables. First, following Aitchison (1986), compositional data can be regarded as singular, meaning that their covariance matrix is singular. Hence the interpretation of the coefficients can be difficult: because of the constant-sum constraint, it is impossible to alter one proportion without changing at least one other. Hron, Filzmoser and Thompson (2009) speak of a singularity problem of the data. They therefore suggest log-transforming the data so that they are represented in standard Euclidean space (whereas untransformed they lie in a simplex).
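
To make the singularity point concrete, here is a minimal sketch on simulated data (the four-part composition, the sample size and the tolerance are assumptions chosen only for illustration, not anything from the cited papers). It checks the rank of the covariance matrix of the shares, and of a design matrix with and without one share dropped.

```python
import numpy as np

# Simulated 4-part composition: positive amounts rescaled so each row sums to 1.
# (Hypothetical data, used only to illustrate the sum constraint.)
rng = np.random.default_rng(0)
raw = rng.gamma(shape=2.0, size=(200, 4))
shares = raw / raw.sum(axis=1, keepdims=True)

# Covariance matrix of the shares: the sum constraint removes one dimension.
cov = np.cov(shares, rowvar=False)
rank_cov = np.linalg.matrix_rank(cov, tol=1e-10)

# Design matrix with a constant and all four shares is rank deficient,
# because the shares add up to the constant column.
X_full = np.column_stack([np.ones(len(shares)), shares])
rank_full = np.linalg.matrix_rank(X_full, tol=1e-10)

# Dropping one share (the reference category) restores full column rank.
X_drop = X_full[:, :-1]
rank_drop = np.linalg.matrix_rank(X_drop, tol=1e-10)

print("rank of covariance matrix:", rank_cov, "of", cov.shape[0])
print("rank with constant + all shares:", rank_full, "of", X_full.shape[1], "columns")
print("rank with one share dropped:", rank_drop, "of", X_drop.shape[1], "columns")
```

Dropping a share or the constant fixes the rank problem only; it does not change the fact that one share cannot move while all the others are held fixed.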

    All of this causes me some confusion. Is it right that I have to treat compositional independent variables differently and log-transform them? Further, if some of my compositional variables are zero (a household with a zero share of spending on tobacco, let's say), it is not possible to apply a log transformation. Aitchison (1986) proposed replacing zero values with very small positive values.

    This seems to be a rarely discussed topic, because I cannot find much about it. Does anybody know more about it and can suggest how to treat compositional independent variables when some values are zero? Can I just stick to dropping one compositional part or the constant, or do I really have to find a solution that involves log-transforming the variables?

    I hope for some helpful pointers on this.
    Thank you very much in advance!

    Literature:
    Hron, Filzmoser and Thompson (2009) "A linear regression with compositional explanatory variables". Journal of Applied Statistics, Vol 00, No 00, pp. 1-15.
    J. Aitchison (2003[1986]) "The statistical analysis of compositional data". Blackburn Press





  • #2
    There is a substantial literature on compositional data analysis (there is a new book about every other year) but it is extraordinarily self-contained.

    Watch out: replacing 0 by smidgen, where smidgen is very small, is likely to create massive outliers, as log(smidgen) will be arbitrarily large and negative if smidgen is arbitrarily close to zero.
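
A small sketch of that effect, assuming a few hypothetical replacement values and a hypothetical "typical" share of 5% for comparison:

```python
import math

# Hypothetical typical non-zero share, for scale.
typical_share = 0.05
log_typical = math.log(typical_share)

# Hypothetical choices of smidgen used to replace an exact zero.
smidgens = [1e-2, 1e-6, 1e-12]
log_smidgens = [math.log(s) for s in smidgens]

print(f"log({typical_share}) = {log_typical:.2f}")
for s, ls in zip(smidgens, log_smidgens):
    # The gap to the typical log share keeps growing as smidgen shrinks.
    print(f"log({s:g}) = {ls:.2f}   gap to typical = {log_typical - ls:.2f}")
```

The transformed value for a replaced zero depends entirely on the arbitrary choice of smidgen, so the size of the resulting outlier is arbitrary too.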

    I have never seen a really convincing solution for zero proportions in this context.



    • #3
      Thanks a lot Nick!



      • #4
        I have found, in my limited sampling, that the compositional data analysis literature is often in seeming denial about exact zeros. There are papers on the issue, to be sure, but the prevailing tone seems to include reluctance to admit that it is a common problem. I would be happy to learn that I exaggerate.

        I would be more worried about getting closer to linearity than about covariance problems.

        In many ways logit is a more obvious transform for proportions, but you would then also have problems with exact ones.

        I would mention also folded power transformations for proportions p such as

        square root of p - square root of (1 - p)

        cube root of p - cube root of (1 - p)

        which have no problem at 0 or 1. These go back to John W. Tukey.

        Logit is evidently log of p - log of (1 - p).
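
As a rough sketch of how these transforms behave at the boundaries (the function names are my own labels for illustration, not from any package):

```python
import math

def folded_root(p):
    """Folded square root: sqrt(p) - sqrt(1 - p), defined on all of [0, 1]."""
    return math.sqrt(p) - math.sqrt(1 - p)

def folded_cube_root(p):
    """Folded cube root: p^(1/3) - (1 - p)^(1/3), defined on all of [0, 1]."""
    return p ** (1 / 3) - (1 - p) ** (1 / 3)

def logit(p):
    """Logit: log(p) - log(1 - p); undefined at exactly 0 or 1."""
    return math.log(p) - math.log(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    try:
        lg = f"{logit(p):.3f}"
    except ValueError:
        # math.log(0) raises ValueError, which is the boundary problem with logit.
        lg = "undefined"
    print(f"p={p:.1f}  folded root={folded_root(p):.3f}  "
          f"folded cube root={folded_cube_root(p):.3f}  logit={lg}")
```

All three are zero at p = 0.5 and symmetric around it; only the folded powers stay finite at 0 and 1, where they equal -1 and 1.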

