
  • PCA index

    I constructed an index in Stata using "pca rotate predict". When I produced the summary statistics table, I noticed that the index has a mean of zero and a standard deviation of 1. The index is composed of five pollutants, all measured on the same scale, μg/m3. Therefore, I am not sure whether standardisation is necessary in this case.
    The air pollution index (API) takes values between -3 and 5. Can I interpret these values in a cardinal way, i.e. do large values represent a high level of pollution? What about negative values?
    Moreover, I have panel data. The dependent variable "hospital admissions" is specified in logs, and the main explanatory variable is the API. I'm not sure how to interpret coefficients.
    Thank you in advance!
    Last edited by Kate Richards; 22 Sep 2020, 06:38. Reason: Added tags

  • #2
    The sign of PC scores is arbitrary given the symmetry of an ellipsoid (otherwise put, which end of a rugby ball or American football is which?). To check whether e.g. PC1 is aligned with your variables as you expect, you would need to check the sign of its correlations with those variables.

    In this example the fact that the coefficients are positive is sufficient (but not necessary) to ensure that alignment anyway. But for the variables shown the correlations are all positive between various size measures and PC1, which according to the usual PCA sales pitch is our best single summary of those measures. Here cpcorr is from SSC and isn't essential, as correlate will show the same correlations (and several others not so relevant).

    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    
    . pca headroom trunk length displacement
    
    Principal components/correlation                 Number of obs    =         74
                                                     Number of comp.  =          4
                                                     Trace            =          4
        Rotation: (unrotated = principal)            Rho              =     1.0000
    
        --------------------------------------------------------------------------
           Component |   Eigenvalue   Difference         Proportion   Cumulative
        -------------+------------------------------------------------------------
               Comp1 |      2.92212      2.28986             0.7305       0.7305
               Comp2 |      .632263       .32651             0.1581       0.8886
               Comp3 |      .305754      .165892             0.0764       0.9650
               Comp4 |      .139861            .             0.0350       1.0000
        --------------------------------------------------------------------------
    
    Principal components (eigenvectors) 
    
        --------------------------------------------------------------------
            Variable |    Comp1     Comp2     Comp3     Comp4 | Unexplained 
        -------------+----------------------------------------+-------------
            headroom |   0.4446    0.7399    0.4944   -0.1022 |           0 
               trunk |   0.5142    0.2423   -0.7596    0.3160 |           0 
              length |   0.5328   -0.3755   -0.0732   -0.7548 |           0 
        displacement |   0.5041   -0.5028    0.4161    0.5656 |           0 
        --------------------------------------------------------------------
    
    . predict PC1
    (score assumed)
    (3 components skipped)
    
    Scoring coefficients 
        sum of squares(column-loading) = 1
    
        ------------------------------------------------------
            Variable |    Comp1     Comp2     Comp3     Comp4 
        -------------+----------------------------------------
            headroom |   0.4446    0.7399    0.4944   -0.1022 
               trunk |   0.5142    0.2423   -0.7596    0.3160 
              length |   0.5328   -0.3755   -0.0732   -0.7548 
        displacement |   0.5041   -0.5028    0.4161    0.5656 
        ------------------------------------------------------
    
    . cpcorr headroom trunk length displacement \ PC1
    (obs=74)
    
                     PC1
        headroom  0.7601
           trunk  0.8789
          length  0.9108
    displacement  0.8617
    
    .
    If the correlations are all or mostly negative, negate PC1. I don't address any extra complication from rotation.
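
    A minimal sketch of that check and sign flip, continuing from the session above and using only official commands. PC1 is the variable created by predict; the replace line is commented out because in this example the correlations are all positive, so no flip is needed.

    Code:
    * correlations of PC1 with the input variables are in the last row of the matrix
    correlate headroom trunk length displacement PC1

    * flip the sign only if those correlations are all or mostly negative
    * replace PC1 = -PC1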

    Standardisation of PC scores to mean 0 and SD 1 is conventional. So a value of 0 just implies mean conditions, and large negative scores imply in your case relatively low pollution (or, exceptionally, relatively high pollution if scores are not signed as wanted).

    That said, I'd expect these pollution measures to be highly skewed, so it's a moot point whether your purposes are better served by transforming before PCA. It's also a moot point whether you wouldn't be better off using the original pollution measures as predictors (and letting them drop out of a model if any is redundant). What PCA does is mush together different measures. Even if they are all very highly correlated it doesn't follow that PC1 is better than any original variable for predicting something else.

    PCA is one of several techniques that divide the statistical world. Some smart people have devoted large fractions of their careers to becoming expert in nuances of its theory and application while others dismiss it as obvious, over-sold or both. The brief discussion in https://www.springer.com/gp/book/9780387954578 is a minor classic (although even in 2002 they missed several books on PCA in their count).



    • #3
      Nick Cox thank you so much for your prompt and detailed reply.
      In my case, I have a positive correlation between each pollutant and PC1. So, based on your explanation, it is not necessary to negate PC1, and I can interpret negative values as low pollution levels. Also, as you correctly mention, the distribution of the index is skewed to the right even after standardisation. However, since the index is the explanatory variable, I do not think I should concern myself with its skewness (correct me if I'm wrong). When I regress ln(y) on the air pollution index (API), I get a coefficient of -0.024 (I use a GMM estimator). However, I'm not sure how to accurately interpret this coefficient. Shall I say "a one standard deviation increase in the API lowers hospital admissions by about 2.4 per cent"? My concern is that this is not an intuitive result. Is there perhaps a way to convert a "one standard deviation" unit into the scale used by pollutants: μg/m3?
      While I perfectly agree with you that it is a moot point whether to use the index or not, my supervisor specifically asked me to use PCA to form an index.



      • #4
        You're right that skewness as such in a predictor is not of immense concern, but outliers might be and any nonlinearity in the relationship should be. My comment was about skewness in the original variables and whether they should be transformed before PCA; transformation of PC scores after estimation is much more problematic.

        Your question underlines my prejudice that principal component scores can be awkward to talk about. As you say, I don't think there is any known mechanism whereby higher pollution is benign medically so what you're seeing is presumably saying more about the other predictors than about pollution.

        "I am just following orders" is something I find myself saying too in quite different contexts.



        • #5
          Nick Cox thank you for your reply! The original variables are also skewed. So, do you recommend that I standardize them before I conduct PCA? Also, I am really sorry to ask again, but is my interpretation of the coefficient right? Because I have serious doubts about it.



          • #6
            No; standardizing variables before PCA makes zero difference to the result so long as you are using the default version of PCA based on eigenvectors and eigenvalues of the correlation matrix. The only idea I've floated is whether you should transform variables first.
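
            A rough way to see this for yourself, sketched on the auto data rather than your pollutants. The names z_*, PC1_raw and PC1_std are made up for illustration; egen's std() function standardizes to mean 0 and SD 1.

            Code:
            sysuse auto, clear

            * PCA of the correlation matrix (the default) on the raw variables
            pca headroom trunk length displacement
            predict PC1_raw

            * standardize each variable first, then repeat
            foreach v in headroom trunk length displacement {
                egen z_`v' = std(`v')
            }
            pca z_headroom z_trunk z_length z_displacement
            predict PC1_std

            * the two sets of scores should be perfectly correlated
            correlate PC1_raw PC1_std

            * by contrast, PCA of the covariance matrix does depend on units and scales
            pca headroom trunk length displacement, covariance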

            You are I think correct in your wording on the coefficient but the bigger deal is that, as you delicately put it, the negative sign is not intuitive, or as I might say on the face of it epidemiologically absurd. You didn't say anything about a confidence interval.
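
            For the arithmetic behind that wording: with a logged response, a coefficient b on a predictor corresponds to a change of 100 * (exp(b) - 1) per cent in the response per unit of the predictor, which is close to 100 * b when b is small. A one-line check, assuming the index has SD 1 as reported in #1 so that one unit is one standard deviation:

            Code:
            * exact percentage change implied by the coefficient quoted in #3
            display 100 * (exp(-0.024) - 1)

            This should come out at about -2.4, in line with the "about 2.4 per cent" wording.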



            • #7
              I'm not really understanding the part about transforming variables first. I re-ran the estimation and realised that I had copied the wrong coefficient. The correct coefficient is 0.034, which is now intuitive and significant at the 5% level. Regarding your previous comment about other predictors, I also include a set of control variables.



              • #8
                The results of PCA will not be invariant under transformations. Take logarithms of the variables first, and you will get different PCs. I did indeed realise that you would have other predictors too, as in my comment in #4.
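
                A sketch of the comparison on the auto data, not your pollutants. The names ln_*, PC1 and PC1_ln are made up for illustration, and ln() returns missing for zero or negative values, which matters if any pollutant is ever recorded as zero.

                Code:
                sysuse auto, clear

                * PCA on the original variables
                pca headroom trunk length displacement
                predict PC1

                * PCA on the logged variables
                foreach v in headroom trunk length displacement {
                    generate ln_`v' = ln(`v')
                }
                pca ln_headroom ln_trunk ln_length ln_displacement
                predict PC1_ln

                * the two sets of scores will be related, but not identical
                correlate PC1 PC1_ln
                scatter PC1_ln PC1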

