Dear Stata users,
Over the last few weeks, I have been trying to understand in depth how PCA works, with the aim of including a wealth index (WI) as a control variable in my regression. I have read all the similar Statalist posts but haven't found a straightforward solution to my issue.
I am using the DHS for Ethiopia. As you may know, the WI provided is survey-specific (calculated separately and differently for each survey), so for a proper inter-survey comparison (i.e., comparing household wealth across DHS 2000, DHS 2005, and DHS 2011) I need to build my own index that follows the same construction methodology in every round. I ran a PCA on 18 variables (all of them dummies) that are available in the three survey rounds. The pca command and its output are shown in the Code blocks of this post.
Then I run predict pc1 pc2 pc3 pc4, score to obtain my WI; in this case I want to use pc1. However, I face what seems to me a big issue: the pc1 score generated by predict has many repeated values, so a large share of my observations have exactly the same score. This can easily be seen with the extremes command or by reporting the duplicates (output in the Code blocks of this post).
From my perspective, having that many duplicates makes my index useless. A non-negligible share of the individuals seem to have exactly the same characteristics across the 18 variables, or at least that is my interpretation.
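One way this interpretation could be checked is sketched below. It assumes pca and predict have already been run as described in this post, and the names profile, tag_profile and tag_pc1 are hypothetical helper variables, not part of my original code. Because pc1 is a linear combination of the standardized dummies, two observations with the same 18 answers must get the same score, so the number of distinct scores should not exceed the number of distinct profiles.

```stata
* Sketch only: check whether tied pc1 scores come from identical dummy profiles.
* Assumes pca and predict were run beforehand; helper names are hypothetical.
local wealthvars floor roof crowding drinkingsource toilet electricity radio ///
    tv car motorcycle bicycle telephone cleanfuel ownhouse lifestock ///
    electricmitad keropresslamp agriland

* One id per distinct combination of the 18 dummies
* (observations with a missing value in any variable get a missing id)
egen profile = group(`wealthvars')

* Flag one observation per distinct profile and per distinct pc1 value
egen tag_profile = tag(profile)
egen tag_pc1 = tag(pc1)

* Number of distinct profiles, then number of distinct pc1 values
count if tag_profile
count if tag_pc1

* Within a profile, pc1 should not vary; assert stops with an error if it does
bysort profile (pc1): assert pc1[1] == pc1[_N]
```

If the two counts are equal (or nearly so), the ties simply reflect how few distinct combinations of the 18 dummies occur in the data.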
Am I right? How could I fix this issue? What is the origin of this problem?
I know that this is a pretty basic question, but I have seen other research papers that use fewer variables and apparently do not run into this issue. I have not been able to find any explanation other than the one I gave above.
Thanks in advance!
Daniel.
Code:
pca floor roof crowding drinkingsource toilet electricity radio tv car motorcycle bicycle telephone cleanfuel ownhouse lifestock electricmitad keropresslamp agriland
Code:
Principal components/correlation                 Number of obs    =     68,348
                                                 Number of comp.  =         18
                                                 Trace            =         18
    Rotation: (unrotated = principal)            Rho              =     1.0000

--------------------------------------------------------------------------
   Component |   Eigenvalue   Difference         Proportion   Cumulative
-------------+------------------------------------------------------------
       Comp1 |      6.99464      5.37066             0.3886       0.3886
       Comp2 |      1.62398      .460976             0.0902       0.4788
       Comp3 |        1.163      .150166             0.0646       0.5434
       Comp4 |      1.01284     .0735526             0.0563       0.5997
       Comp5 |      .939284     .0398593             0.0522       0.6519
       Comp6 |      .899424      .152798             0.0500       0.7018
       Comp7 |      .746626     .0782401             0.0415       0.7433
       Comp8 |      .668386      .121691             0.0371       0.7805
       Comp9 |      .546695     .0355317             0.0304       0.8108
      Comp10 |      .511163     .0650795             0.0284       0.8392
      Comp11 |      .446084    .00538749             0.0248       0.8640
      Comp12 |      .440697     .0130105             0.0245       0.8885
      Comp13 |      .427686     .0287318             0.0238       0.9122
      Comp14 |      .398954     .0382597             0.0222       0.9344
      Comp15 |      .360694     .0306717             0.0200       0.9545
      Comp16 |      .330023      .026312             0.0183       0.9728
      Comp17 |      .303711       .11759             0.0169       0.9897
      Comp18 |      .186121            .             0.0103       1.0000
--------------------------------------------------------------------------
Code:
predict pc1 pc2 pc3 pc4, score
(14 components skipped)

Scoring coefficients
    sum of squares(column-loading) = 1

-----------------------------------------------------------------------------
    Variable |    Comp1    Comp2    Comp3    Comp4    Comp5    Comp6    Comp7
-------------+---------------------------------------------------------------
       floor |   0.3083   0.0614  -0.0138  -0.0149  -0.0764   0.0772   0.1524
        roof |   0.2766  -0.1797   0.1068  -0.0383  -0.0001   0.0135   0.0219
    crowding |   0.0874  -0.0655  -0.2392   0.4493   0.7908   0.2515   0.1169
drinkingso~e |   0.2628  -0.1818   0.0968  -0.0482  -0.0189  -0.0012  -0.1181
      toilet |   0.2846  -0.1141   0.0721  -0.0149  -0.0030   0.0238   0.0408
 electricity |   0.3367  -0.1441  -0.0005  -0.0046  -0.0631  -0.0222  -0.0049
       radio |   0.2476  -0.0288   0.2645  -0.0969   0.0511   0.0877   0.2702
          tv |   0.2633   0.3629  -0.0331  -0.0174  -0.0144   0.0378   0.1201
         car |   0.1328   0.4287  -0.0183  -0.0070   0.1082  -0.0016  -0.7948
  motorcycle |   0.0128  -0.0013   0.3267   0.7723  -0.4273   0.3176  -0.0862
     bicycle |   0.0857   0.1674   0.3606   0.2995   0.1876  -0.8225   0.1212
   telephone |   0.2272   0.4533  -0.0534   0.0014  -0.0061   0.0546   0.0503
   cleanfuel |   0.2986   0.0410  -0.1204  -0.0426  -0.1090   0.0543   0.1153
    ownhouse |  -0.2279   0.3807   0.0697   0.0118   0.0535   0.1484   0.2338
   lifestock |  -0.2358   0.2149   0.2314  -0.1054  -0.0563   0.0743   0.2242
electricmi~d |   0.2693   0.3254  -0.1182  -0.0535  -0.0802   0.0947   0.2206
keropressl~p |   0.0165  -0.0164   0.7202  -0.2863   0.3282   0.3080  -0.1215
    agriland |  -0.2838   0.2099   0.0041   0.0081   0.0115   0.0754   0.1466
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
    Variable |    Comp8    Comp9   Comp10   Comp11   Comp12   Comp13   Comp14
-------------+---------------------------------------------------------------
       floor |   0.0051  -0.0820  -0.0200   0.0127   0.3519   0.0790   0.1846
        roof |   0.3330  -0.1002   0.2459   0.3747  -0.3669   0.5875  -0.0414
    crowding |   0.0761  -0.0289   0.1222  -0.0387   0.0243  -0.0528  -0.0113
drinkingso~e |  -0.0023   0.6912   0.3579  -0.4871  -0.0332  -0.0001   0.0614
      toilet |   0.1622   0.3407  -0.1438   0.6108   0.2635  -0.3837   0.2277
 electricity |   0.0818   0.0281   0.0501   0.0305  -0.0219   0.0708  -0.0410
       radio |   0.4763  -0.1255  -0.5530  -0.4197  -0.1475  -0.1026  -0.0199
          tv |  -0.1791  -0.0541   0.0073  -0.0002  -0.0942  -0.1350   0.3332
         car |   0.3432  -0.0451  -0.0666  -0.0304   0.1538   0.0356  -0.0489
  motorcycle |  -0.0403  -0.0373  -0.0044  -0.0180  -0.0079  -0.0043  -0.0086
     bicycle |  -0.0486  -0.0052   0.0284  -0.0044   0.0961   0.0593  -0.0277
   telephone |  -0.1902   0.0404   0.0659   0.0424  -0.5483  -0.1171   0.1202
   cleanfuel |  -0.0702  -0.2184   0.1589  -0.1492   0.4932   0.1849  -0.2024
    ownhouse |   0.0793   0.5253  -0.2357   0.1750   0.0800   0.2586  -0.4324
   lifestock |   0.4782  -0.1382   0.6026   0.0060   0.0521  -0.3794  -0.0780
electricmi~d |  -0.1487  -0.0337   0.1086  -0.0217   0.0945   0.0684  -0.2825
keropressl~p |  -0.3842  -0.1074   0.0380   0.0563   0.0882   0.0678   0.0057
    agriland |   0.1541   0.0927   0.0175  -0.1040   0.1915   0.4382   0.6810
-----------------------------------------------------------------------------

--------------------------------------------------
    Variable |   Comp15   Comp16   Comp17   Comp18
-------------+------------------------------------
       floor |  -0.3515   0.6620  -0.3158  -0.1809
        roof |  -0.0868  -0.0913   0.0333  -0.2395
    crowding |  -0.0024   0.0026  -0.0044   0.0135
drinkingso~e |  -0.0214  -0.0319  -0.0139  -0.1490
      toilet |   0.2599  -0.1404   0.0190  -0.0775
 electricity |  -0.0806   0.0915   0.0302   0.9096
       radio |   0.0911  -0.0725   0.0093  -0.0653
          tv |  -0.5913  -0.3851   0.3316  -0.0235
         car |  -0.0142  -0.0168  -0.0422  -0.0023
  motorcycle |   0.0156  -0.0424  -0.0045  -0.0120
     bicycle |   0.0355   0.0060  -0.0201  -0.0071
   telephone |   0.4376   0.3969   0.1126  -0.0038
   cleanfuel |   0.3369  -0.0158   0.5679  -0.0997
    ownhouse |  -0.1807   0.1232   0.1945   0.0645
   lifestock |  -0.0528   0.0610   0.0198   0.0637
electricmi~d |   0.1580  -0.4240  -0.6377   0.0207
keropressl~p |   0.0440  -0.0035  -0.0157   0.0556
    agriland |   0.2600  -0.1102  -0.0712   0.1653
--------------------------------------------------
Code:
extremes pc1

  +------------------+
  | obs:         pc1 |
  |------------------|
  |  27.   -1.899374 |
  |  28.   -1.899374 |
  |  29.   -1.899374 |
  |  30.   -1.899374 |
  |  31.   -1.899374 |
  +------------------+

  +-------------------+
  | 66542.   10.23505 |
  | 66543.   10.23505 |
  | 66544.   10.23505 |
  | 66545.   10.23505 |
  | 66546.   10.23505 |
  +-------------------+

note: 8116 values of -1.899374
note: 7 values of 10.23505
Code:
duplicates report pc1

Duplicates in terms of pc1

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |           46             0
        2 |          116            58
        3 |          153           102
        4 |          332           249
        5 |          385           308
        6 |          690           575
        7 |          679           582
        8 |          528           462
        9 |          378           336
       10 |          450           405
       11 |          374           340
       12 |          300           275
       13 |          442           408
       14 |          378           351
       15 |          255           238
       16 |          304           285
       17 |          323           304
       18 |          288           272
       19 |          247           234
       20 |          200           190
       21 |          294           280
       22 |          132           126
       23 |          138           132
       24 |          120           115
       25 |          250           240
       26 |          182           175
       27 |          135           130
       28 |          224           216
       29 |          145           140
       30 |          120           116
       31 |          124           120
       32 |          224           217
       33 |           99            96
       34 |          102            99
       35 |          140           136
       36 |           36            35
       37 |          185           180
       38 |          114           111
       39 |           78            76
       40 |          200           195
       41 |          287           280
       42 |           42            41
       43 |           43            42
       44 |           88            86
       45 |          180           176
       46 |          230           225
       47 |          329           322
       48 |          192           188
       50 |           50            49
       51 |           51            50
       52 |          104           102
       53 |          212           208
       54 |          108           106
       55 |          165           162
       56 |          168           165
       57 |          171           168
       58 |           58            57
       59 |           59            58
       60 |          120           118
       62 |           62            61
       63 |          126           124
       64 |          128           126
       65 |           65            64
       67 |          134           132
       68 |          204           201
       69 |           69            68
       70 |          140           138
       71 |           71            70
       73 |          146           144
       74 |           74            73
       76 |          152           150
       77 |          154           152
       78 |          156           154
       80 |           80            79
       81 |          162           160
       82 |          164           162
       83 |           83            82
       84 |          168           166
       86 |          172           170
       88 |           88            87
       89 |           89            88
       90 |           90            89
       96 |           96            95
       98 |           98            97
      100 |          100            99
      103 |          103           102
      108 |          108           107
      110 |          220           218
      112 |          112           111
      113 |          113           112
      114 |          114           113
      115 |          115           114
      121 |          242           240
      124 |          124           123
      127 |          127           126
      131 |          131           130
      133 |          133           132
      136 |          136           135
      137 |          137           136
      139 |          139           138
      143 |          143           142
      144 |          144           143
      147 |          147           146
      149 |          149           148
      155 |          155           154
      157 |          157           156
      159 |          159           158
      169 |          169           168
      170 |          170           169
      171 |          171           170
      173 |          173           172
      177 |          177           176
      181 |          181           180
      182 |          182           181
      189 |          378           376
      193 |          193           192
      194 |          194           193
      201 |          201           200
      213 |          213           212
      223 |          223           222
      226 |          226           225
      228 |          228           227
      233 |          233           232
      235 |          235           234
      236 |          236           235
      243 |          243           242
      244 |          244           243
      250 |          250           249
      254 |          254           253
      280 |          280           279
      282 |          282           281
      289 |          289           288
      294 |          294           293
      297 |          297           296
      307 |          307           306
      329 |          329           328
      334 |          668           666
      357 |          357           356
      395 |          395           394
      418 |          418           417
      461 |          461           460
      473 |          473           472
      482 |          482           481
      567 |          567           566
      620 |          620           619
      651 |          651           650
      735 |          735           734
      775 |          775           774
      904 |          904           903
     1037 |         1037          1036
     1071 |         1071          1070
     1219 |         1219          1218
     1268 |         1268          1267
     1291 |         1291          1290
     1462 |         1462          1461
     1501 |         1501          1500
     1688 |         1688          1687
     1723 |         1723          1722
     3673 |         3673          3672
     8116 |         8116          8115
    11489 |        11489         11488
--------------------------------------