Dear Stata users,
Over the last few weeks, I have been trying to understand in depth how PCA works, with the aim of including a wealth index (WI) as a control variable in my regression. I have read all the similar Statalist posts but haven't found a straightforward solution to my issue.
I am using the DHS for Ethiopia. As you may know, the WI provided with the data is a survey-specific measure (calculated separately and differently for each survey), so I need to construct my own index, following the same methodology in every round, for a proper inter-survey comparison (i.e., to compare household wealth across DHS 2000, DHS 2005, and DHS 2011). I ran a PCA on 18 variables (all of them dummies) that are available in all three survey rounds. The code I used and its output are shown further below.
I then run predict pc1 pc2 pc3 pc4, score to obtain my WI; I want to use pc1. However, I face what seems to me a big issue: the pc1 scores generated by predict contain many repeated values, so large groups of observations share exactly the same score. This is easy to see with the extremes command or by reporting the duplicates (output below).
From my perspective, having that many duplicates makes my index useless. My interpretation is that a non-negligible share of the individuals have exactly the same values on all 18 variables.
Code:
pca floor roof crowding drinkingsource toilet electricity radio tv car motorcycle bicycle telephone cleanfuel ownhouse lifestock electricmitad keropresslamp agriland
Code:
Principal components/correlation                 Number of obs    =     68,348
                                                 Number of comp.  =         18
                                                 Trace            =         18
    Rotation: (unrotated = principal)            Rho              =     1.0000

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      6.99464      5.37066             0.3886       0.3886
           Comp2 |      1.62398      .460976             0.0902       0.4788
           Comp3 |        1.163      .150166             0.0646       0.5434
           Comp4 |      1.01284     .0735526             0.0563       0.5997
           Comp5 |      .939284     .0398593             0.0522       0.6519
           Comp6 |      .899424      .152798             0.0500       0.7018
           Comp7 |      .746626     .0782401             0.0415       0.7433
           Comp8 |      .668386      .121691             0.0371       0.7805
           Comp9 |      .546695     .0355317             0.0304       0.8108
          Comp10 |      .511163     .0650795             0.0284       0.8392
          Comp11 |      .446084    .00538749             0.0248       0.8640
          Comp12 |      .440697     .0130105             0.0245       0.8885
          Comp13 |      .427686     .0287318             0.0238       0.9122
          Comp14 |      .398954     .0382597             0.0222       0.9344
          Comp15 |      .360694     .0306717             0.0200       0.9545
          Comp16 |      .330023      .026312             0.0183       0.9728
          Comp17 |      .303711       .11759             0.0169       0.9897
          Comp18 |      .186121            .             0.0103       1.0000
    --------------------------------------------------------------------------
Code:
predict pc1 pc2 pc3 pc4, score
(14 components skipped)
Scoring coefficients
sum of squares(column-loading) = 1
------------------------------------------------------------------------------------
Variable | Comp1 Comp2 Comp3 Comp4 Comp5 Comp6 Comp7
-------------+----------------------------------------------------------------------
floor | 0.3083 0.0614 -0.0138 -0.0149 -0.0764 0.0772 0.1524
roof | 0.2766 -0.1797 0.1068 -0.0383 -0.0001 0.0135 0.0219
crowding | 0.0874 -0.0655 -0.2392 0.4493 0.7908 0.2515 0.1169
drinkingso~e | 0.2628 -0.1818 0.0968 -0.0482 -0.0189 -0.0012 -0.1181
toilet | 0.2846 -0.1141 0.0721 -0.0149 -0.0030 0.0238 0.0408
electricity | 0.3367 -0.1441 -0.0005 -0.0046 -0.0631 -0.0222 -0.0049
radio | 0.2476 -0.0288 0.2645 -0.0969 0.0511 0.0877 0.2702
tv | 0.2633 0.3629 -0.0331 -0.0174 -0.0144 0.0378 0.1201
car | 0.1328 0.4287 -0.0183 -0.0070 0.1082 -0.0016 -0.7948
motorcycle | 0.0128 -0.0013 0.3267 0.7723 -0.4273 0.3176 -0.0862
bicycle | 0.0857 0.1674 0.3606 0.2995 0.1876 -0.8225 0.1212
telephone | 0.2272 0.4533 -0.0534 0.0014 -0.0061 0.0546 0.0503
cleanfuel | 0.2986 0.0410 -0.1204 -0.0426 -0.1090 0.0543 0.1153
ownhouse | -0.2279 0.3807 0.0697 0.0118 0.0535 0.1484 0.2338
lifestock | -0.2358 0.2149 0.2314 -0.1054 -0.0563 0.0743 0.2242
electricmi~d | 0.2693 0.3254 -0.1182 -0.0535 -0.0802 0.0947 0.2206
keropressl~p | 0.0165 -0.0164 0.7202 -0.2863 0.3282 0.3080 -0.1215
agriland | -0.2838 0.2099 0.0041 0.0081 0.0115 0.0754 0.1466
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
Variable | Comp8 Comp9 Comp10 Comp11 Comp12 Comp13 Comp14
-------------+----------------------------------------------------------------------
floor | 0.0051 -0.0820 -0.0200 0.0127 0.3519 0.0790 0.1846
roof | 0.3330 -0.1002 0.2459 0.3747 -0.3669 0.5875 -0.0414
crowding | 0.0761 -0.0289 0.1222 -0.0387 0.0243 -0.0528 -0.0113
drinkingso~e | -0.0023 0.6912 0.3579 -0.4871 -0.0332 -0.0001 0.0614
toilet | 0.1622 0.3407 -0.1438 0.6108 0.2635 -0.3837 0.2277
electricity | 0.0818 0.0281 0.0501 0.0305 -0.0219 0.0708 -0.0410
radio | 0.4763 -0.1255 -0.5530 -0.4197 -0.1475 -0.1026 -0.0199
tv | -0.1791 -0.0541 0.0073 -0.0002 -0.0942 -0.1350 0.3332
car | 0.3432 -0.0451 -0.0666 -0.0304 0.1538 0.0356 -0.0489
motorcycle | -0.0403 -0.0373 -0.0044 -0.0180 -0.0079 -0.0043 -0.0086
bicycle | -0.0486 -0.0052 0.0284 -0.0044 0.0961 0.0593 -0.0277
telephone | -0.1902 0.0404 0.0659 0.0424 -0.5483 -0.1171 0.1202
cleanfuel | -0.0702 -0.2184 0.1589 -0.1492 0.4932 0.1849 -0.2024
ownhouse | 0.0793 0.5253 -0.2357 0.1750 0.0800 0.2586 -0.4324
lifestock | 0.4782 -0.1382 0.6026 0.0060 0.0521 -0.3794 -0.0780
electricmi~d | -0.1487 -0.0337 0.1086 -0.0217 0.0945 0.0684 -0.2825
keropressl~p | -0.3842 -0.1074 0.0380 0.0563 0.0882 0.0678 0.0057
agriland | 0.1541 0.0927 0.0175 -0.1040 0.1915 0.4382 0.6810
------------------------------------------------------------------------------------
------------------------------------------------------
Variable | Comp15 Comp16 Comp17 Comp18
-------------+----------------------------------------
floor | -0.3515 0.6620 -0.3158 -0.1809
roof | -0.0868 -0.0913 0.0333 -0.2395
crowding | -0.0024 0.0026 -0.0044 0.0135
drinkingso~e | -0.0214 -0.0319 -0.0139 -0.1490
toilet | 0.2599 -0.1404 0.0190 -0.0775
electricity | -0.0806 0.0915 0.0302 0.9096
radio | 0.0911 -0.0725 0.0093 -0.0653
tv | -0.5913 -0.3851 0.3316 -0.0235
car | -0.0142 -0.0168 -0.0422 -0.0023
motorcycle | 0.0156 -0.0424 -0.0045 -0.0120
bicycle | 0.0355 0.0060 -0.0201 -0.0071
telephone | 0.4376 0.3969 0.1126 -0.0038
cleanfuel | 0.3369 -0.0158 0.5679 -0.0997
ownhouse | -0.1807 0.1232 0.1945 0.0645
lifestock | -0.0528 0.0610 0.0198 0.0637
electricmi~d | 0.1580 -0.4240 -0.6377 0.0207
keropressl~p | 0.0440 -0.0035 -0.0157 0.0556
agriland | 0.2600 -0.1102 -0.0712 0.1653
------------------------------------------------------
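For reference, a short sketch of how each score is built from the coefficients above may help. It assumes that predict with the score option, after a pca on the correlation matrix, standardizes each variable and weights it by its coefficient (my understanding of the default behaviour). The notation is introduced here for illustration only: x_{ij} is variable j for observation i, \bar{x}_j and s_j are its sample mean and standard deviation, and w_{j1} is its Comp1 coefficient.

```latex
\mathrm{pc1}_i \;=\; \sum_{j=1}^{18} w_{j1}\,\frac{x_{ij} - \bar{x}_j}{s_j},
\qquad x_{ij} \in \{0, 1\}
```

Under that assumption, the score depends only on which of the 18 dummies equal 1. There can therefore be at most 2^{18} = 262,144 distinct values, and observations with the same 0/1 pattern necessarily receive the same score.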
Code:
extremes pc1

  +------------------+
  | obs:         pc1 |
  |------------------|
  | 27.    -1.899374 |
  | 28.    -1.899374 |
  | 29.    -1.899374 |
  | 30.    -1.899374 |
  | 31.    -1.899374 |
  +------------------+

  +-------------------+
  | 66542.   10.23505 |
  | 66543.   10.23505 |
  | 66544.   10.23505 |
  | 66545.   10.23505 |
  | 66546.   10.23505 |
  +-------------------+

note: 8116 values of -1.899374
note: 7 values of 10.23505
Code:
duplicates report pc1
Duplicates in terms of pc1
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 46 0
2 | 116 58
3 | 153 102
4 | 332 249
5 | 385 308
6 | 690 575
7 | 679 582
8 | 528 462
9 | 378 336
10 | 450 405
11 | 374 340
12 | 300 275
13 | 442 408
14 | 378 351
15 | 255 238
16 | 304 285
17 | 323 304
18 | 288 272
19 | 247 234
20 | 200 190
21 | 294 280
22 | 132 126
23 | 138 132
24 | 120 115
25 | 250 240
26 | 182 175
27 | 135 130
28 | 224 216
29 | 145 140
30 | 120 116
31 | 124 120
32 | 224 217
33 | 99 96
34 | 102 99
35 | 140 136
36 | 36 35
37 | 185 180
38 | 114 111
39 | 78 76
40 | 200 195
41 | 287 280
42 | 42 41
43 | 43 42
44 | 88 86
45 | 180 176
46 | 230 225
47 | 329 322
48 | 192 188
50 | 50 49
51 | 51 50
52 | 104 102
53 | 212 208
54 | 108 106
55 | 165 162
56 | 168 165
57 | 171 168
58 | 58 57
59 | 59 58
60 | 120 118
62 | 62 61
63 | 126 124
64 | 128 126
65 | 65 64
67 | 134 132
68 | 204 201
69 | 69 68
70 | 140 138
71 | 71 70
73 | 146 144
74 | 74 73
76 | 152 150
77 | 154 152
78 | 156 154
80 | 80 79
81 | 162 160
82 | 164 162
83 | 83 82
84 | 168 166
86 | 172 170
88 | 88 87
89 | 89 88
90 | 90 89
96 | 96 95
98 | 98 97
100 | 100 99
103 | 103 102
108 | 108 107
110 | 220 218
112 | 112 111
113 | 113 112
114 | 114 113
115 | 115 114
121 | 242 240
124 | 124 123
127 | 127 126
131 | 131 130
133 | 133 132
136 | 136 135
137 | 137 136
139 | 139 138
143 | 143 142
144 | 144 143
147 | 147 146
149 | 149 148
155 | 155 154
157 | 157 156
159 | 159 158
169 | 169 168
170 | 170 169
171 | 171 170
173 | 173 172
177 | 177 176
181 | 181 180
182 | 182 181
189 | 378 376
193 | 193 192
194 | 194 193
201 | 201 200
213 | 213 212
223 | 223 222
226 | 226 225
228 | 228 227
233 | 233 232
235 | 235 234
236 | 236 235
243 | 243 242
244 | 244 243
250 | 250 249
254 | 254 253
280 | 280 279
282 | 282 281
289 | 289 288
294 | 294 293
297 | 297 296
307 | 307 306
329 | 329 328
334 | 668 666
357 | 357 356
395 | 395 394
418 | 418 417
461 | 461 460
473 | 473 472
482 | 482 481
567 | 567 566
620 | 620 619
651 | 651 650
735 | 735 734
775 | 775 774
904 | 904 903
1037 | 1037 1036
1071 | 1071 1070
1219 | 1219 1218
1268 | 1268 1267
1291 | 1291 1290
1462 | 1462 1461
1501 | 1501 1500
1688 | 1688 1687
1723 | 1723 1722
3673 | 3673 3672
8116 | 8116 8115
11489 | 11489 11488
--------------------------------------
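One way to check whether the repeated scores come from identical answer patterns would be to compare the number of distinct 0/1 combinations with the number of distinct pc1 values. The following is a minimal sketch, not a definitive procedure. It assumes the 18 variables and pc1 are in memory and that it is run from a do-file (because of the /// line continuations). The names pattern and score_id are hypothetical and introduced only for this illustration.

```stata
* Sketch only: assumes the 18 indicators and pc1 from the post are in memory.
* pattern and score_id are hypothetical names, not part of the original code.
local assets floor roof crowding drinkingsource toilet electricity radio tv ///
    car motorcycle bicycle telephone cleanfuel ownhouse lifestock ///
    electricmitad keropresslamp agriland

* One integer id (1, 2, ..., k) per distinct combination of the 18 indicators
egen long pattern = group(`assets')
quietly summarize pattern
display "distinct indicator patterns: " r(max)

* One integer id per distinct pc1 value
egen long score_id = group(pc1)
quietly summarize score_id
display "distinct pc1 values:         " r(max)

* Within a pattern, pc1 should not vary
bysort pattern (pc1): assert pc1[1] == pc1[_N] if !missing(pattern)
```

If the two counts are equal, every repeated pc1 value corresponds to one combination of the 18 dummies, which would support the interpretation that the ties reflect identical answers rather than a problem with predict.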
Am I right? How could I fix this issue? What is the origin of this problem?
I know that this is a pretty basic question, but I have seen other research papers that use fewer variables and apparently do not run into this issue. I have not been able to find any explanation other than the one I gave above.
Thanks in advance!
Daniel.
