Hello Statalist users,
I have read questions posted on this forum regarding PCA and forming an index. But I have a suite of queries that need to be tackled together.
I have panel data on 45 countries for the period 2003-2018. The data comprise 9 gender-statistics variables retrieved from the World Bank. I would like to combine all 9 variables into a single composite index, which I call the Women Empowerment index, using PCA. This index would then be used as an independent variable in a later regression analysis. I want an index rather than the usual multivariate regression with the 9 variables entered separately because the other independent variables in my regression framework include a host of cognitive-test-score parameters, and through this index I intend to capture the level of women's empowerment prevalent in the macro-environment of each country.
The following are my queries:
1. This one may be obvious, but I'm finding it hard to wrap my head around how to select the number of principal components (PCs). By Kaiser's criterion (eigenvalue > 1), I get 3 PCs, which together explain 69.62% of the variance; this could be considered acceptable. On the other hand, the scree plot shows a wide gap and a sharp elbow between component 1 and component 2, suggesting that only component 1 should be retained. However, the 1st component (Comp1) alone would explain only 44.26% of the total variance. Am I good to go with 3 components?
--------------------------------------------------------------------------
   Component |   Eigenvalue   Difference         Proportion   Cumulative
-------------+------------------------------------------------------------
       Comp1 |       3.9832      2.80695             0.4426       0.4426
       Comp2 |      1.17625     .0699101             0.1307       0.5733
       Comp3 |      1.10634      .349979             0.1229       0.6962
       Comp4 |      .756364      .124735             0.0840       0.7802
       Comp5 |      .631629     .0777993             0.0702       0.8504
       Comp6 |      .553829      .101858             0.0615       0.9120
       Comp7 |      .451971      .159765             0.0502       0.9622
       Comp8 |      .292207      .244002             0.0325       0.9946
       Comp9 |     .0482048            .             0.0054       1.0000
--------------------------------------------------------------------------
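For reference, a minimal sketch of how a table and scree plot of this kind can be produced in Stata. It assumes the nine variables (named as in the loadings tables) are passed to pca directly and that the default correlation matrix is used; the eigenvalues above sum to 9, which is consistent with that.

```stata
* Assumption: variable names as they appear in the loadings tables.
* By default pca works on the correlation matrix, so the eigenvalues
* sum to the number of variables (9) and Proportion = Eigenvalue / 9.
pca v4_ v5_ v8_ v9_ v10_ v11_ v13_ v14_ v15_

* Scree plot with a horizontal line at the mean eigenvalue (1 here),
* which is the cutoff used by Kaiser's criterion.
screeplot, mean
```

With this setup, Comp1's proportion is 3.9832 / 9 = 0.4426, and the three components above the line together give (3.9832 + 1.17625 + 1.10634) / 9 = 0.6962.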
2. Having selected the components, I obtain their loadings on each variable. These are unrotated for now. The loadings on Comp1 make sense from a theoretical standpoint, while those on Comp2 and Comp3 do not. This matters a great deal for my question.
----------------------------------------------------------
    Variable |    Comp1     Comp2     Comp3 | Unexplained
-------------+------------------------------+-------------
         v4_ |   0.4169    0.2472   -0.2823 |       .1477
         v5_ |   0.1017   -0.7868   -0.1225 |        .214
         v8_ |  -0.3573   -0.0330    0.0064 |       .4902
         v9_ |  -0.1447    0.1295    0.8026 |       .1841
        v10_ |   0.4157    0.2776   -0.1901 |        .181
        v11_ |   0.3866   -0.2515    0.1857 |       .2922
        v13_ |  -0.3753    0.2580   -0.2979 |       .2626
        v14_ |   0.3552   -0.0617    0.2460 |       .4261
        v15_ |   0.2794    0.3021    0.2027 |       .5362
----------------------------------------------------------
After that, I run
Code:
rotate, varimax
This yields the following result:
Rotated components
----------------------------------------------------------
    Variable |    Comp1     Comp2     Comp3 | Unexplained
-------------+------------------------------+-------------
         v4_ |   0.3739   -0.3607   -0.2113 |       .1477
         v5_ |   0.0030   -0.1346    0.7914 |        .214
         v8_ |  -0.3505    0.0769    0.0004 |       .4902
         v9_ |   0.0287    0.8145   -0.1332 |       .1841
        v10_ |   0.3938   -0.2703   -0.2406 |        .181
        v11_ |   0.3909    0.1076    0.2878 |       .2922
        v13_ |  -0.4014   -0.2199   -0.2944 |       .2626
        v14_ |   0.3895    0.1716    0.0966 |       .4261
        v15_ |   0.3403    0.1415   -0.2731 |       .5362
----------------------------------------------------------
Comp1 now shows somewhat different loadings. The most striking change is the sign of v9_ on Comp1, which flips from negative (-0.1447) to positive (0.0287); this is not right from theory. Can someone please explain why this is happening, and whether I should rotate at all?
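To make the comparison between the two loadings tables easier to reproduce, here is a small sketch. It assumes the pca was run on the nine variables with three components retained; the components(3) option is my assumption about how the three-component solution was requested.

```stata
* Assumption: same nine variables, three components retained.
pca v4_ v5_ v8_ v9_ v10_ v11_ v13_ v14_ v15_, components(3)
rotate, varimax

* List the rotated and unrotated loadings side by side.
estat rotatecompare

* Discard the rotation and return to the unrotated components.
rotate, clear
```

Running rotate, clear before any predict step matters here, since otherwise the scores would be based on the rotated solution.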
3. My final aim is to produce a single composite Women Empowerment index. Some of my peers have suggested weighting each PC by its proportion of explained variance and summing the weighted PCs. Is this a correct approach? Are there alternatives? Or should I not combine the components at all? Looking at the loadings, it is increasingly difficult to justify the existence and interpretation of each PC in my regression framework.
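The weighting my peers suggest could be sketched as follows. This only illustrates that suggestion and is not a claim that it is the right approach. It assumes the unrotated pca with three retained components; pc1, pc2, pc3, ev_tot and we_index are hypothetical names.

```stata
* Assumption: unrotated solution, three components retained.
pca v4_ v5_ v8_ v9_ v10_ v11_ v13_ v14_ v15_, components(3)

* Component scores for the three retained components.
predict pc1 pc2 pc3, score

* e(Ev) is the row vector of eigenvalues. Each weight is a component's
* share of the variance explained by the three retained components,
* which is proportional to its proportion of explained variance.
matrix ev = e(Ev)
scalar ev_tot = ev[1,1] + ev[1,2] + ev[1,3]
generate we_index = (ev[1,1]*pc1 + ev[1,2]*pc2 + ev[1,3]*pc3) / scalar(ev_tot)
```

With the eigenvalues reported above, the weights would be about 0.636, 0.188 and 0.177 (3.9832, 1.17625 and 1.10634, each divided by 6.26579). The alternative I am weighing against this is to use pc1 alone as the index.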
