  • PCA with many repeated scores

    Dear Stata users,

    Over the last few weeks I have been trying to understand in depth how PCA works, with the aim of including a wealth index (WI) as a control variable in my regression. I have read all the similar Statalist posts but haven't found a straightforward solution to my issue.

    I am using the DHS for Ethiopia. As you may know, the WI provided with each survey is survey-specific (calculated separately and differently for each round), so I need to build my own index, with a single construction methodology, to compare household wealth properly across rounds (i.e., DHS 2000, DHS 2005, and DHS 2011). I ran a PCA on 18 variables (all of them dummies) that are available in all three survey rounds. This is the code I used:

    Code:
    pca floor roof crowding drinkingsource toilet electricity radio tv car motorcycle bicycle telephone cleanfuel ownhouse lifestock electricmitad keropresslamp agriland
    Code:
    Principal components/correlation                 Number of obs    =     68,348
                                                     Number of comp.  =         18
                                                     Trace            =         18
        Rotation: (unrotated = principal)            Rho              =     1.0000
    
        --------------------------------------------------------------------------
           Component |   Eigenvalue   Difference         Proportion   Cumulative
        -------------+------------------------------------------------------------
               Comp1 |      6.99464      5.37066             0.3886       0.3886
               Comp2 |      1.62398      .460976             0.0902       0.4788
               Comp3 |        1.163      .150166             0.0646       0.5434
               Comp4 |      1.01284     .0735526             0.0563       0.5997
               Comp5 |      .939284     .0398593             0.0522       0.6519
               Comp6 |      .899424      .152798             0.0500       0.7018
               Comp7 |      .746626     .0782401             0.0415       0.7433
               Comp8 |      .668386      .121691             0.0371       0.7805
               Comp9 |      .546695     .0355317             0.0304       0.8108
              Comp10 |      .511163     .0650795             0.0284       0.8392
              Comp11 |      .446084    .00538749             0.0248       0.8640
              Comp12 |      .440697     .0130105             0.0245       0.8885
              Comp13 |      .427686     .0287318             0.0238       0.9122
              Comp14 |      .398954     .0382597             0.0222       0.9344
              Comp15 |      .360694     .0306717             0.0200       0.9545
              Comp16 |      .330023      .026312             0.0183       0.9728
              Comp17 |      .303711       .11759             0.0169       0.9897
              Comp18 |      .186121            .             0.0103       1.0000
        --------------------------------------------------------------------------
    Then I run predict pc1 pc2 pc3 pc4, score to obtain my WI; in this case I want to use pc1. However, I face what seems to me a big issue: the pc1 score generated by predict has many repeated values, so large groups of observations share exactly the same score. This is easy to see with the extremes command or by reporting the duplicates:

    Code:
    predict pc1 pc2 pc3 pc4, score
    (14 components skipped)
    
    Scoring coefficients
        sum of squares(column-loading) = 1
    
        ------------------------------------------------------------------------------------
            Variable |    Comp1     Comp2     Comp3     Comp4     Comp5     Comp6     Comp7
        -------------+----------------------------------------------------------------------
               floor |   0.3083    0.0614   -0.0138   -0.0149   -0.0764    0.0772    0.1524
                roof |   0.2766   -0.1797    0.1068   -0.0383   -0.0001    0.0135    0.0219
            crowding |   0.0874   -0.0655   -0.2392    0.4493    0.7908    0.2515    0.1169
        drinkingso~e |   0.2628   -0.1818    0.0968   -0.0482   -0.0189   -0.0012   -0.1181
              toilet |   0.2846   -0.1141    0.0721   -0.0149   -0.0030    0.0238    0.0408
         electricity |   0.3367   -0.1441   -0.0005   -0.0046   -0.0631   -0.0222   -0.0049
               radio |   0.2476   -0.0288    0.2645   -0.0969    0.0511    0.0877    0.2702
                  tv |   0.2633    0.3629   -0.0331   -0.0174   -0.0144    0.0378    0.1201
                 car |   0.1328    0.4287   -0.0183   -0.0070    0.1082   -0.0016   -0.7948
          motorcycle |   0.0128   -0.0013    0.3267    0.7723   -0.4273    0.3176   -0.0862
             bicycle |   0.0857    0.1674    0.3606    0.2995    0.1876   -0.8225    0.1212
           telephone |   0.2272    0.4533   -0.0534    0.0014   -0.0061    0.0546    0.0503
           cleanfuel |   0.2986    0.0410   -0.1204   -0.0426   -0.1090    0.0543    0.1153
            ownhouse |  -0.2279    0.3807    0.0697    0.0118    0.0535    0.1484    0.2338
           lifestock |  -0.2358    0.2149    0.2314   -0.1054   -0.0563    0.0743    0.2242
        electricmi~d |   0.2693    0.3254   -0.1182   -0.0535   -0.0802    0.0947    0.2206
        keropressl~p |   0.0165   -0.0164    0.7202   -0.2863    0.3282    0.3080   -0.1215
            agriland |  -0.2838    0.2099    0.0041    0.0081    0.0115    0.0754    0.1466
        ------------------------------------------------------------------------------------
    
        ------------------------------------------------------------------------------------
            Variable |    Comp8     Comp9    Comp10    Comp11    Comp12    Comp13    Comp14
        -------------+----------------------------------------------------------------------
               floor |   0.0051   -0.0820   -0.0200    0.0127    0.3519    0.0790    0.1846
                roof |   0.3330   -0.1002    0.2459    0.3747   -0.3669    0.5875   -0.0414
            crowding |   0.0761   -0.0289    0.1222   -0.0387    0.0243   -0.0528   -0.0113
        drinkingso~e |  -0.0023    0.6912    0.3579   -0.4871   -0.0332   -0.0001    0.0614
              toilet |   0.1622    0.3407   -0.1438    0.6108    0.2635   -0.3837    0.2277
         electricity |   0.0818    0.0281    0.0501    0.0305   -0.0219    0.0708   -0.0410
               radio |   0.4763   -0.1255   -0.5530   -0.4197   -0.1475   -0.1026   -0.0199
                  tv |  -0.1791   -0.0541    0.0073   -0.0002   -0.0942   -0.1350    0.3332
                 car |   0.3432   -0.0451   -0.0666   -0.0304    0.1538    0.0356   -0.0489
          motorcycle |  -0.0403   -0.0373   -0.0044   -0.0180   -0.0079   -0.0043   -0.0086
             bicycle |  -0.0486   -0.0052    0.0284   -0.0044    0.0961    0.0593   -0.0277
           telephone |  -0.1902    0.0404    0.0659    0.0424   -0.5483   -0.1171    0.1202
           cleanfuel |  -0.0702   -0.2184    0.1589   -0.1492    0.4932    0.1849   -0.2024
            ownhouse |   0.0793    0.5253   -0.2357    0.1750    0.0800    0.2586   -0.4324
           lifestock |   0.4782   -0.1382    0.6026    0.0060    0.0521   -0.3794   -0.0780
        electricmi~d |  -0.1487   -0.0337    0.1086   -0.0217    0.0945    0.0684   -0.2825
        keropressl~p |  -0.3842   -0.1074    0.0380    0.0563    0.0882    0.0678    0.0057
            agriland |   0.1541    0.0927    0.0175   -0.1040    0.1915    0.4382    0.6810
        ------------------------------------------------------------------------------------
    
        ------------------------------------------------------
            Variable |   Comp15    Comp16    Comp17    Comp18
        -------------+----------------------------------------
               floor |  -0.3515    0.6620   -0.3158   -0.1809
                roof |  -0.0868   -0.0913    0.0333   -0.2395
            crowding |  -0.0024    0.0026   -0.0044    0.0135
        drinkingso~e |  -0.0214   -0.0319   -0.0139   -0.1490
              toilet |   0.2599   -0.1404    0.0190   -0.0775
         electricity |  -0.0806    0.0915    0.0302    0.9096
               radio |   0.0911   -0.0725    0.0093   -0.0653
                  tv |  -0.5913   -0.3851    0.3316   -0.0235
                 car |  -0.0142   -0.0168   -0.0422   -0.0023
          motorcycle |   0.0156   -0.0424   -0.0045   -0.0120
             bicycle |   0.0355    0.0060   -0.0201   -0.0071
           telephone |   0.4376    0.3969    0.1126   -0.0038
           cleanfuel |   0.3369   -0.0158    0.5679   -0.0997
            ownhouse |  -0.1807    0.1232    0.1945    0.0645
           lifestock |  -0.0528    0.0610    0.0198    0.0637
        electricmi~d |   0.1580   -0.4240   -0.6377    0.0207
        keropressl~p |   0.0440   -0.0035   -0.0157    0.0556
            agriland |   0.2600   -0.1102   -0.0712    0.1653
        ------------------------------------------------------
    Code:
    extremes pc1
    
      +------------------+
      | obs:         pc1 |
      |------------------|
      |  27.   -1.899374 |
      |  28.   -1.899374 |
      |  29.   -1.899374 |
      |  30.   -1.899374 |
      |  31.   -1.899374 |
      +------------------+
    
      +-------------------+
      | 66542.   10.23505 |
      | 66543.   10.23505 |
      | 66544.   10.23505 |
      | 66545.   10.23505 |
      | 66546.   10.23505 |
      +-------------------+
    
    note: 8116 values of -1.899374
    note: 7 values of 10.23505
    Code:
    duplicates report pc1
    
    Duplicates in terms of pc1
    
    --------------------------------------
       copies | observations       surplus
    ----------+---------------------------
            1 |           46             0
            2 |          116            58
            3 |          153           102
            4 |          332           249
            5 |          385           308
            6 |          690           575
            7 |          679           582
            8 |          528           462
            9 |          378           336
           10 |          450           405
           11 |          374           340
           12 |          300           275
           13 |          442           408
           14 |          378           351
           15 |          255           238
           16 |          304           285
           17 |          323           304
           18 |          288           272
           19 |          247           234
           20 |          200           190
           21 |          294           280
           22 |          132           126
           23 |          138           132
           24 |          120           115
           25 |          250           240
           26 |          182           175
           27 |          135           130
           28 |          224           216
           29 |          145           140
           30 |          120           116
           31 |          124           120
           32 |          224           217
           33 |           99            96
           34 |          102            99
           35 |          140           136
           36 |           36            35
           37 |          185           180
           38 |          114           111
           39 |           78            76
           40 |          200           195
           41 |          287           280
           42 |           42            41
           43 |           43            42
           44 |           88            86
           45 |          180           176
           46 |          230           225
           47 |          329           322
           48 |          192           188
           50 |           50            49
           51 |           51            50
           52 |          104           102
           53 |          212           208
           54 |          108           106
           55 |          165           162
           56 |          168           165
           57 |          171           168
           58 |           58            57
           59 |           59            58
           60 |          120           118
           62 |           62            61
           63 |          126           124
           64 |          128           126
           65 |           65            64
           67 |          134           132
           68 |          204           201
           69 |           69            68
           70 |          140           138
           71 |           71            70
           73 |          146           144
           74 |           74            73
           76 |          152           150
           77 |          154           152
           78 |          156           154
           80 |           80            79
           81 |          162           160
           82 |          164           162
           83 |           83            82
           84 |          168           166
           86 |          172           170
           88 |           88            87
           89 |           89            88
           90 |           90            89
           96 |           96            95
           98 |           98            97
          100 |          100            99
          103 |          103           102
          108 |          108           107
          110 |          220           218
          112 |          112           111
          113 |          113           112
          114 |          114           113
          115 |          115           114
          121 |          242           240
          124 |          124           123
          127 |          127           126
          131 |          131           130
          133 |          133           132
          136 |          136           135
          137 |          137           136
          139 |          139           138
          143 |          143           142
          144 |          144           143
          147 |          147           146
          149 |          149           148
          155 |          155           154
          157 |          157           156
          159 |          159           158
          169 |          169           168
          170 |          170           169
          171 |          171           170
          173 |          173           172
          177 |          177           176
          181 |          181           180
          182 |          182           181
          189 |          378           376
          193 |          193           192
          194 |          194           193
          201 |          201           200
          213 |          213           212
          223 |          223           222
          226 |          226           225
          228 |          228           227
          233 |          233           232
          235 |          235           234
          236 |          236           235
          243 |          243           242
          244 |          244           243
          250 |          250           249
          254 |          254           253
          280 |          280           279
          282 |          282           281
          289 |          289           288
          294 |          294           293
          297 |          297           296
          307 |          307           306
          329 |          329           328
          334 |          668           666
          357 |          357           356
          395 |          395           394
          418 |          418           417
          461 |          461           460
          473 |          473           472
          482 |          482           481
          567 |          567           566
          620 |          620           619
          651 |          651           650
          735 |          735           734
          775 |          775           774
          904 |          904           903
         1037 |         1037          1036
         1071 |         1071          1070
         1219 |         1219          1218
         1268 |         1268          1267
         1291 |         1291          1290
         1462 |         1462          1461
         1501 |         1501          1500
         1688 |         1688          1687
         1723 |         1723          1722
         3673 |         3673          3672
         8116 |         8116          8115
        11489 |        11489         11488
    --------------------------------------
    From my perspective, having that many duplicates makes my index useless. A non-negligible share of the observations seem to have exactly the same characteristics across all 18 variables, or at least that is my interpretation.
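
    One way this interpretation could be checked is sketched below. It assumes the pca and predict commands above have already been run, so that pc1 and the 18 indicators are in memory; the macro name assets and the variable name profile are illustrative and not part of the original analysis. The idea is to count how many observations share each combination of the 18 dummies and then confirm that pc1 is constant within each combination.

    ```stata
    * Sketch only: run from a do-file (the /// line continuations need one).
    * "assets" and "profile" are hypothetical names chosen for illustration.
    local assets floor roof crowding drinkingsource toilet electricity radio tv ///
        car motorcycle bicycle telephone cleanfuel ownhouse lifestock ///
        electricmitad keropresslamp agriland

    * How many observations share each combination of the 18 indicators?
    duplicates report `assets'

    * Give every distinct combination its own integer id
    * (observations with a missing indicator get a missing id)
    egen long profile = group(`assets')

    * Within one combination the score should be constant, up to rounding
    bysort profile (pc1): assert reldif(pc1[1], pc1[_N]) < 1e-6 if !missing(profile)
    ```

    If the assert passes silently, every repeated pc1 value within a profile comes from observations with identical answers on all 18 indicators. Two different combinations could still, in principle, land on the same score, so duplicates report pc1 may show somewhat fewer distinct values than duplicates report on the indicators.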

    Am I right? How could I fix this issue? What is the origin of this problem?

    I know this is a pretty basic question, but I have seen other research papers that use fewer variables and (apparently) do not run into this issue. I wasn't able to find any explanation other than the one I gave above.

    Thanks in advance!

    Daniel.
    Last edited by Daniel Perez Parra; 18 Aug 2022, 08:06.

  • #2
    If it's fine that several people have the same value on an indicator variable, because that is correct, it's surely no worse that some people have the same value on a set of indicator variables. Otherwise put, what else would be expected from this kind of data?
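
    The point can be written out as a short formula. This is a sketch in notation introduced here rather than taken from the output above: a_j stands for the Comp1 scoring coefficient of variable j, and \bar{x}_j and s_j for that variable's sample mean and standard deviation. For a correlation-based PCA, the first-component score of observation i is a weighted sum of the standardized indicators:

    ```latex
    \[
      \mathrm{pc1}_i
      \;=\; \sum_{j=1}^{18} a_j \,\frac{x_{ij} - \bar{x}_j}{s_j}
      \;=\; \underbrace{-\sum_{j=1}^{18} \frac{a_j\,\bar{x}_j}{s_j}}_{\text{same for every } i}
      \;+\; \sum_{j=1}^{18} \frac{a_j}{s_j}\, x_{ij},
      \qquad x_{ij} \in \{0, 1\}.
    \]
    ```

    Because every x_{ij} is either 0 or 1, the score depends only on which indicators are switched on. Eighteen dummies allow at most 2^18 = 262,144 combinations, and the data can contain no more distinct scores than the number of combinations that actually occur, so observations with the same answers necessarily share a score.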

    Some researchers seem to find some flavour of correspondence analysis to be more appropriate here, but I won't go beyond that impression.


    • #3
      Originally posted by Nick Cox
      If it's fine that several people have the same value on an indicator variable, because that is correct, it's surely no worse that some people have the same value on a set of indicator variables. Otherwise put, what else would be expected from this kind of data?

      Some researchers seem to find some flavour of correspondence analysis to be more appropriate here, but I won't go beyond that impression.

      Well, that's an argument I hadn't considered, but you're right. I guess there is no problem with a bunch of observations having the same wealth score.

      Thanks, Mr Cox!
