  • Principal component analysis

    Dear all,
    I have a question. I would like to use PCA to construct a single index. My variables are measured in different units.
    Before applying PCA, should I normalize the variables?

  • #2
    Normalize can mean anything from (1) scale by measures of level and spread -- most commonly, (value - mean) / SD, also called standardizing -- to (2) transform one or more variables to get closer to symmetry or even a normal distribution.

    I will guess at (1). Such prior standardization is not needed, because pca by default bases its calculations on the correlation matrix, which is equivalent to standardizing first. Conversely, it is only when all variables are measured in the same units that PCA of the covariance matrix has any substantive meaning at all, and even then it may or may not be a good idea.
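
    To make the equivalence concrete, here is a minimal sketch in plain Python (not Stata). The three variables and their values are made up for illustration, and the helper names are hypothetical. It checks that the covariance matrix of the standardized variables matches the correlation matrix of the raw ones, and shows how the first PC of the raw covariance matrix is dominated by whichever variable has the largest numerical variance.

```python
import math
import statistics

# Made-up illustrative data in very different units (an assumption, not real data).
income = [21000, 35000, 50000, 42000, 28000, 61000]   # dollars
height_cm = [160, 172, 181, 168, 175, 190]            # centimetres
weight_kg = [55, 70, 85, 62, 74, 95]                  # kilograms
cols = [income, height_cm, weight_kg]

def standardize(col):
    """Return (value - mean) / SD for each value, using the sample SD."""
    m, s = statistics.fmean(col), statistics.stdev(col)
    return [(v - m) / s for v in col]

def covariance_matrix(columns):
    """Sample covariance matrix (divisor n - 1) of a list of columns."""
    n = len(columns[0])
    means = [statistics.fmean(c) for c in columns]
    return [[sum((a[i] - ma) * (b[i] - mb) for i in range(n)) / (n - 1)
             for b, mb in zip(columns, means)]
            for a, ma in zip(columns, means)]

def correlation_matrix(columns):
    """Correlation matrix derived from the sample covariance matrix."""
    cov = covariance_matrix(columns)
    sd = [math.sqrt(cov[j][j]) for j in range(len(columns))]
    return [[cov[j][k] / (sd[j] * sd[k]) for k in range(len(columns))]
            for j in range(len(columns))]

def first_pc(matrix, iters=500):
    """Leading eigenvector of a symmetric PSD matrix by power iteration."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

corr_raw = correlation_matrix(cols)
cov_std = covariance_matrix([standardize(c) for c in cols])

# Largest elementwise gap between the two matrices (rounding error only).
max_gap = max(abs(corr_raw[i][j] - cov_std[i][j])
              for i in range(3) for j in range(3))
print("max difference, corr(raw) vs cov(standardized):", max_gap)

# Loadings of the first PC under each choice of matrix.
print("first PC, covariance of raw data:", first_pc(covariance_matrix(cols)))
print("first PC, correlation matrix:    ", first_pc(corr_raw))
```

    With these made-up numbers, the covariance-based first PC loads almost entirely on income simply because dollars produce the largest variance, which is the sense in which covariance PCA on mixed units lacks substantive meaning.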

    Questions about getting a single index out of PCA (why? perhaps before a regression) are common here. It would be interesting if people asking this gave literature references to successful applications. I can't see any advantage in mushing together predictor variables into a PC of dubious pedigree or meaning over a direct approach in which different predictors show what they can do individually (or combined).

    • #3
      I think the main advantage is the intuitive appeal. I think it is common when you want to rank units based on one measure which is not directly observed, but rather likely approximated by a number of variables. For instance, ranking poverty status based on asset ownership of 50-60 different items. The PCA-index is then put into a regression and interpreted directly as the composite we were trying to measure. Here is a famous application.
      Filmer, D., & Pritchett, L. H. (2001). Estimating wealth effects without expenditure data—or tears: an application to educational enrollments in states of India. Demography, 38(1), 115-132.
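
      As a rough sketch of what such an index involves (not a reproduction of the cited paper), the plain-Python example below standardizes a few asset indicators, takes the first principal component of their correlation matrix, and ranks households by the resulting score. The households, assets, and values are invented for illustration, and the helper names are hypothetical.

```python
import math
import statistics

# Invented 0/1 asset ownership for six hypothetical households.
households = ["A", "B", "C", "D", "E", "F"]
assets = {
    "radio":   [0, 1, 1, 1, 1, 1],
    "bicycle": [0, 0, 1, 1, 0, 1],
    "tv":      [0, 0, 0, 1, 0, 1],
}

def standardize(col):
    """Return (value - mean) / SD for each value, using the sample SD."""
    m, s = statistics.fmean(col), statistics.stdev(col)
    return [(v - m) / s for v in col]

def first_pc(matrix, iters=500):
    """Leading eigenvector of a symmetric PSD matrix by power iteration."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

z = [standardize(col) for col in assets.values()]
n = len(households)

# Covariance of standardized columns, i.e. the correlation matrix.
corr = [[sum(a[i] * b[i] for i in range(n)) / (n - 1) for b in z] for a in z]

loadings = first_pc(corr)

# Index score per household: loading-weighted sum of standardized indicators.
scores = [sum(w * col[i] for w, col in zip(loadings, z)) for i in range(n)]

# Households ordered from highest to lowest index score.
ranking = [households[i] for i in sorted(range(n), key=lambda i: -scores[i])]

print("loadings:", dict(zip(assets, loadings)))
print("ranking (highest score first):", ranking)
```

      The score is only as meaningful as the first component itself: the loadings depend on which indicators are included, so a different asset list can produce a different ranking.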

      Stas Kolenikov argued that in such cases it would be more sensible to model the unobserved factor directly, using system-of-equations modelling:
      https://www.stata.com/meeting/chicag..._kolenikov.pdf

      From your skeptical reply, I presume you would also prefer such an approach?
      Last edited by Felix Stips; 15 Jun 2021, 09:27.

      • #4
        Felix Stips Thanks for your thoughts and the specific references.

        In my own work, I prefer direct, simple approaches that are effective and easy to explain, and I look for that in work I grade or review or read.

        That's a counsel of perfection as better theories have often seemed strongly challenging or even crazy, and personal preferences have little or no bearing on how nature or society works.

        I am always wary of appeals to intuition, however, as (1) what happens when my intuition clashes with yours? (2) in practice they often seem largely rhetorical.

        In this context, the appeal to an ideal of quantifying an interesting and important latent variable, the thing you really care about, is good sales talk, but what I read often falls a long way short. First off, you can't know how well you have measured your latent variable, which need not stop people trying. Second off, in practice different teams try similar but not identical sets of variables. We then have to compare the results of different teams producing different PCs from overlapping but not identical sets of variables. That seems to me to be moving away from any palpable concept of poverty, or intelligence, or whatever else is the target.

        My field is geography and related sciences. In the late 1960s and early 1970s PCA was just about the most talked about statistical method. Fifty years later, not nearly so much. These shifts and contrasts in what different groups use are puzzling, because the underlying problems are not that different.

        • #5
          Good points about the lack of robustness of latent variable methods. I think the problem of latent variables measured by a battery of proxies is more common in the social sciences: think of concepts like personality traits, health status, opinions or values, socio-economic status, political indicators, etc. I don't think social scientists will want to stop measuring these, even if imperfectly. So I guess you can expect many more posts about PCA unless other latent variable methods gain traction.
          Last edited by Felix Stips; 15 Jun 2021, 11:18.
