  • PCA for a composite index

    Hello Statalist users,

    I have read questions posted on this forum regarding PCA and forming an index. But I have a suite of queries that need to be tackled together.

    I have panel data for 45 countries over the period 2003-2018. The data comprise 9 variables on gender statistics retrieved from the World Bank. I would like to create a single composite index, which I call the Women Empowerment index, from all 9 variables using PCA. This index would then be used as an independent variable in a later regression analysis. Making this index is important—instead of performing the usual multivariate regression with these variables separated—because the other independent variables in my regression framework contain a host of cognitive-test-score parameters and, thus, through this index I intend to capture women empowerment prevalent in the macro-environment of different countries.

    The following are my queries:

    1. This may be basic, but I'm finding it hard to decide how many principal components (PCs) to retain. According to Kaiser's criterion (eigenvalue > 1), I get 3 such PCs. Together the 3 PCs explain 69.62% of the variance, which could be considered acceptable. On the other hand, the scree plot shows a wide gap and a sharp elbow between component 1 and component 2, suggesting that only component 1 should be retained. However, if I pick only the 1st component (Comp1), it explains only 44.26% of the total variance. Am I good to go with 3 components?

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |       3.9832      2.80695             0.4426       0.4426
           Comp2 |      1.17625     .0699101             0.1307       0.5733
           Comp3 |      1.10634      .349979             0.1229       0.6962
           Comp4 |      .756364      .124735             0.0840       0.7802
           Comp5 |      .631629     .0777993             0.0702       0.8504
           Comp6 |      .553829      .101858             0.0615       0.9120
           Comp7 |      .451971      .159765             0.0502       0.9622
           Comp8 |      .292207      .244002             0.0325       0.9946
           Comp9 |     .0482048            .             0.0054       1.0000
    --------------------------------------------------------------------------
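The two retention rules in question 1 can be checked directly from the eigenvalues in the table above. The following is a minimal Python sketch (not Stata), using only the numbers posted; the variable names are illustrative.

```python
# Eigenvalues copied from the table above.
eigenvalues = [3.9832, 1.17625, 1.10634, 0.756364, 0.631629,
               0.553829, 0.451971, 0.292207, 0.0482048]

# With 9 standardized variables the eigenvalues sum to (about) 9.
total = sum(eigenvalues)
proportion = [ev / total for ev in eigenvalues]

# Running total of the proportions, i.e. the "Cumulative" column.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

# Kaiser's criterion: count components with eigenvalue > 1.
kaiser_count = sum(1 for ev in eigenvalues if ev > 1)

# Scree-style check: after which component is the drop in eigenvalue largest?
gaps = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
largest_gap_after = gaps.index(max(gaps)) + 1

print(kaiser_count, round(cumulative[kaiser_count - 1], 4), largest_gap_after)
```

On these numbers the two rules disagree, as the post describes: Kaiser's criterion keeps three components, while the largest drop comes right after the first.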


    2. Having selected the components, I obtain the loadings of each variable on them. These are unrotated for now. The loadings on Comp1 make sense from a theoretical standpoint, while those on Comp2 and Comp3 don't. This matters a great deal for my question.
    ----------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3 |  Unexplained
    -------------+------------------------------+-------------
             v4_ |   0.4169    0.2472   -0.2823 |        .1477
             v5_ |   0.1017   -0.7868   -0.1225 |         .214
             v8_ |  -0.3573   -0.0330    0.0064 |        .4902
             v9_ |  -0.1447    0.1295    0.8026 |        .1841
            v10_ |   0.4157    0.2776   -0.1901 |         .181
            v11_ |   0.3866   -0.2515    0.1857 |        .2922
            v13_ |  -0.3753    0.2580   -0.2979 |        .2626
            v14_ |   0.3552   -0.0617    0.2460 |        .4261
            v15_ |   0.2794    0.3021    0.2027 |        .5362
    ----------------------------------------------------------
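The "Unexplained" column can be recovered from the loadings and eigenvalues already posted. The sketch below is in Python and assumes that the loadings are unit-length eigenvectors of the correlation matrix, so that the variance of a standardized variable explained by component k is the eigenvalue times the squared loading. That assumption reproduces the posted column to within rounding.

```python
# Eigenvalues of the three retained components (from the first table).
eigenvalues = [3.9832, 1.17625, 1.10634]

# Unrotated loadings (Comp1, Comp2, Comp3) and the posted "Unexplained" value.
table = {
    "v4_":  ([0.4169, 0.2472, -0.2823], 0.1477),
    "v5_":  ([0.1017, -0.7868, -0.1225], 0.2140),
    "v8_":  ([-0.3573, -0.0330, 0.0064], 0.4902),
    "v9_":  ([-0.1447, 0.1295, 0.8026], 0.1841),
    "v10_": ([0.4157, 0.2776, -0.1901], 0.1810),
    "v11_": ([0.3866, -0.2515, 0.1857], 0.2922),
    "v13_": ([-0.3753, 0.2580, -0.2979], 0.2626),
    "v14_": ([0.3552, -0.0617, 0.2460], 0.4261),
    "v15_": ([0.2794, 0.3021, 0.2027], 0.5362),
}

# Unexplained variance = 1 - sum over components of eigenvalue * loading^2.
unexplained = {}
for name, (loadings, posted) in table.items():
    explained = sum(ev * l ** 2 for ev, l in zip(eigenvalues, loadings))
    unexplained[name] = 1 - explained
    print(f"{name:>5}  computed {unexplained[name]:.4f}  posted {posted:.4f}")
```

Read this way, the three components together leave about half of the variance of v8_ and v15_ unexplained.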



    After that, I run
    Code:
    rotate, varimax
    This yields the following result:

    Rotated components

    ----------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3 |  Unexplained
    -------------+------------------------------+-------------
             v4_ |   0.3739   -0.3607   -0.2113 |        .1477
             v5_ |   0.0030   -0.1346    0.7914 |         .214
             v8_ |  -0.3505    0.0769    0.0004 |        .4902
             v9_ |   0.0287    0.8145   -0.1332 |        .1841
            v10_ |   0.3938   -0.2703   -0.2406 |         .181
            v11_ |   0.3909    0.1076    0.2878 |        .2922
            v13_ |  -0.4014   -0.2199   -0.2944 |        .2626
            v14_ |   0.3895    0.1716    0.0966 |        .4261
            v15_ |   0.3403    0.1415   -0.2731 |        .5362
    ----------------------------------------------------------



    Comp1 now shows somewhat different loadings. The most striking change is the sign of v9_, which flips from negative to positive; this is not consistent with theory. Can someone please explain why this is happening, and whether I should rotate at all?
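One way to compare the two loadings tables is to check what the rotation left unchanged. An orthogonal rotation such as varimax re-expresses the same three-dimensional space in new axes, so each variable's sum of squared loadings should be unchanged while individual loadings, including their signs, can move. The Python sketch below checks this using only the posted numbers; it is an illustration, not a diagnosis of the data.

```python
# Loadings (Comp1, Comp2, Comp3) copied from the two tables above.
unrotated = {
    "v4_":  [0.4169, 0.2472, -0.2823],
    "v5_":  [0.1017, -0.7868, -0.1225],
    "v8_":  [-0.3573, -0.0330, 0.0064],
    "v9_":  [-0.1447, 0.1295, 0.8026],
    "v10_": [0.4157, 0.2776, -0.1901],
    "v11_": [0.3866, -0.2515, 0.1857],
    "v13_": [-0.3753, 0.2580, -0.2979],
    "v14_": [0.3552, -0.0617, 0.2460],
    "v15_": [0.2794, 0.3021, 0.2027],
}
rotated = {
    "v4_":  [0.3739, -0.3607, -0.2113],
    "v5_":  [0.0030, -0.1346, 0.7914],
    "v8_":  [-0.3505, 0.0769, 0.0004],
    "v9_":  [0.0287, 0.8145, -0.1332],
    "v10_": [0.3938, -0.2703, -0.2406],
    "v11_": [0.3909, 0.1076, 0.2878],
    "v13_": [-0.4014, -0.2199, -0.2944],
    "v14_": [0.3895, 0.1716, 0.0966],
    "v15_": [0.3403, 0.1415, -0.2731],
}

def sum_of_squares(row):
    return sum(x * x for x in row)

# Largest change in a variable's sum of squared loadings (rounding noise only).
max_diff = max(abs(sum_of_squares(unrotated[v]) - sum_of_squares(rotated[v]))
               for v in unrotated)

# Variables whose Comp1 loading changes sign after rotation.
flipped = [v for v in unrotated if unrotated[v][0] * rotated[v][0] < 0]

print(f"max difference in row sums of squares: {max_diff:.5f}")
print("Comp1 sign flips:", flipped)
```

The row sums agree to within rounding, and the only sign flip on Comp1 is for v9_, whose loading on Comp1 is small in both tables.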

    3. My final aim is to produce a single composite Women Empowerment index. Some of my peers have suggested weighting each PC by its proportion of explained variance and adding the weighted PCs together. Is this a correct approach? Are there any alternatives? Or should I not combine the components at all? Looking at the loadings, it is increasingly difficult to justify the existence and character of each PC in my regression framework.
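For concreteness, the weighting scheme described in question 3 might look like the Python sketch below. It shows the arithmetic only and is not an endorsement of the approach. Rescaling the weights to sum to 1 is an assumption made here, and `example_scores` is a hypothetical set of component scores for one country-year, not real data.

```python
# Proportions of variance for the three retained components (from the first table).
proportions = [0.4426, 0.1307, 0.1229]

# Assumption: rescale the proportions so that the weights sum to 1.
total = sum(proportions)
weights = [p / total for p in proportions]

def composite_index(pc_scores):
    """Weighted sum of component scores for a single observation."""
    return sum(w * s for w, s in zip(weights, pc_scores))

# Hypothetical scores on Comp1, Comp2 and Comp3 for one country-year.
example_scores = [1.2, -0.4, 0.3]
index_value = composite_index(example_scores)

print([round(w, 4) for w in weights], round(index_value, 4))
```

Whatever weights are chosen, the result is still a linear combination of the original nine standardized variables, so the choice of weights determines what the index measures.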

  • #2
    These are hard questions. But they don't even seem consistent. You start by saying that a single measure or index of women's empowerment is needed first and will then be used in a regression with other predictors. You finish by saying that the index is your final goal.

    All the indications here are that the PCA doesn't identify a single component well enough to serve as an index. Without seeing your data I am totally unsurprised. Many important ideas in social science can't be captured well by single precise measures. Some of the reasons might include nonlinearity, skewness and outliers messing up correlations, but the main reason is just too few strong correlations. Rotation cannot be a solution to this. If you want to use more PCs in the regression you can, but then all ways of combining them are arbitrary, and even if you let the regression sort out the weights

    response = intercept + terms using the PCs + terms using the other predictors

    that is in practice just a messy alternative to

    response = intercept + terms using women's empowerment variables + terms using the other predictors.
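The link between the two regressions above can be made concrete with a small Python sketch. It assumes each PC score is the sum of the standardized variables weighted by the posted unrotated loadings. The coefficients in `gamma` and the observation `z` are hypothetical values chosen only for illustration.

```python
# Unrotated loadings (Comp1, Comp2, Comp3) from the question.
loadings = {
    "v4_":  [0.4169, 0.2472, -0.2823],
    "v5_":  [0.1017, -0.7868, -0.1225],
    "v8_":  [-0.3573, -0.0330, 0.0064],
    "v9_":  [-0.1447, 0.1295, 0.8026],
    "v10_": [0.4157, 0.2776, -0.1901],
    "v11_": [0.3866, -0.2515, 0.1857],
    "v13_": [-0.3753, 0.2580, -0.2979],
    "v14_": [0.3552, -0.0617, 0.2460],
    "v15_": [0.2794, 0.3021, 0.2027],
}

# Hypothetical regression coefficients on the three PCs.
gamma = [0.5, -0.2, 0.1]

# Hypothetical standardized values of the nine variables for one observation.
z = {"v4_": 0.8, "v5_": -1.1, "v8_": 0.3, "v9_": 0.0, "v10_": 1.5,
     "v11_": -0.6, "v13_": 0.2, "v14_": 0.9, "v15_": -0.4}

# PC scores: each is a linear combination of the standardized variables.
pc = [sum(loadings[v][k] * z[v] for v in z) for k in range(3)]
via_pcs = sum(g * s for g, s in zip(gamma, pc))

# The same contribution written as coefficients on the original variables.
implied = {v: sum(g * l for g, l in zip(gamma, loadings[v])) for v in loadings}
via_variables = sum(implied[v] * z[v] for v in z)

print(round(via_pcs, 6), round(via_variables, 6))
```

The two totals agree, so a regression on the PCs is a regression on the original variables with the coefficients constrained by the loadings.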

    Much of this pivots on your statement

    Making this index is important—instead of performing the usual multivariate regression with these variables separated—because the other independent variables in my regression framework contain a host of cognitive-test-score parameters
    (where by "multivariate" you presumably mean "multiple")

    I'd say that on your evidence a useful single index is unattainable, and is not necessarily the best goal, even given a project aim of focusing on women's empowerment in the context of other predictors too.
