  • Interpretation of scoring coefficients of polychoric PCA

    Dear Statalist Community,

    Aiming to construct an SES index based on several (continuous and dichotomous) asset and housing variables, I am applying a polychoric PCA:
    Code:
    polychoricpca $SES3, score(pca_poly32) nscore(2)
    Among other things, this command displays the scoring coefficients, which I want to report as weights. However, it is not clear to me why the binary variables sometimes have more than one scoring coefficient per value (0/1), such as:

    Scoring coefficients

        Variable    |  Coeff. 1  |  Coeff. 2  |  Coeff. 3
    ------------------------------------------------------
     Car
                 0  | -0.305196  |  0.078348  | -0.044379
                 0  |  0.250862  | -0.064400  |  0.036478
                 0  |  0.283528  | -0.072786  |  0.041228
                 0  |  0.334635  | -0.085906  |  0.048660
                 1  |  0.376041  | -0.096535  |  0.054680
                 1  |  0.709849  | -0.182228  |  0.103220

    How can this be interpreted when reporting on the weight of the variable "Car" in the PCA?

    THANKS!

  • #2
    Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. Section 12.1 is particularly pertinent.

    12.1 What to say about your commands and your problem
    Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!
    We don't know what the actual list of variables in your polychoricpca command was. We don't know what else was presented in your output.

    Here's the way to present polychoricpca output, using CODE delimiters for readability as described in the FAQ.
    Code:
    . polychoricpca foreign mpg rep78, score(pca) nscore(2)
    
    Polychoric correlation matrix
    
               foreign        mpg      rep78
    foreign          1
        mpg  .55443553          1
      rep78  .80668066  .42655384          1
    
    Principal component analysis
    
     k  |  Eigenvalues  |  Proportion explained  |  Cum. explained
    ----+---------------+------------------------+------------------
      1 |    2.206757   |    0.735586            |   0.735586
      2 |    0.615445   |    0.205148            |   0.940734
      3 |    0.177798   |    0.059266            |   1.000000
    
                   Scoring coefficients
    
        Variable    |  Coeff. 1  |  Coeff. 2  |  Coeff. 3 
    ------------------------------------------------------
     foreign       
                 0  | -0.361911  |  0.127408  |  0.429707 
                 1  |  0.649795  | -0.228756  | -0.771520 
     mpg            |  0.499489  |  0.849728  |  0.168739 
     rep78         
                 1  | -1.404227  |  1.126627  | -1.516724 
                 2  | -0.886922  |  0.711587  | -0.957976 
                 3  | -0.276327  |  0.221700  | -0.298464 
                 4  |  0.243116  | -0.195054  |  0.262592 
                 5  |  0.791159  | -0.634756  |  0.854541



    • #3
      Ok, then let me try again:
      Code:
       global SES3 CarsPP Size_house Tractor Plough Harrow
      
      . su $SES3
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
            CarsPP |        144    .2668943    .3107161          0          2
        Size_house |        144    62.45578    31.45135          6        160
           Tractor |        144    .2671097    .4247629          0          1
            Plough |        144    .2317539    .4025776          0          1
            Harrow |        144    .2243288    .4024766          0          1
      
      . polychoricpca $SES3, score(pca_poly32) nscore(1)
      
      Polychoric correlation matrix
      
                      CarsPP  Size_house     Tractor      Plough      Harrow
          CarsPP           1
      Size_house   .18724863           1
         Tractor   .09607885   .02779113           1
          Plough   .19964741   .06711594   .93751077           1
          Harrow   .19769896   .07041003    .9350099   .99664232           1
      
      Principal component analysis
      
       k  |  Eigenvalues  |  Proportion explained  |  Cum. explained
      ----+---------------+------------------------+------------------
        1 |    2.962512   |    0.592502            |   0.592502
        2 |    1.152852   |    0.230570            |   0.823073
        3 |    0.805322   |    0.161064            |   0.984137
        4 |    0.075997   |    0.015199            |   0.999337
        5 |    0.003317   |    0.000663            |   1.000000
      
                     Scoring coefficients
      
          Variable    |  Coeff. 1  |  Coeff. 2  |  Coeff. 3
      ------------------------------------------------------
       CarsPP         |  0.149819  |  0.660683  | -0.730477
       Size_house     |  0.062531  |  0.732557  |  0.677569
       Tractor       
                   0  | -0.305196  |  0.078348  | -0.044379
                   0  |  0.250862  | -0.064400  |  0.036478
                   0  |  0.283528  | -0.072786  |  0.041228
                   0  |  0.334635  | -0.085906  |  0.048660
                   1  |  0.376041  | -0.096535  |  0.054680
                   1  |  0.709849  | -0.182228  |  0.103220
       Plough        
                   0  | -0.286118  |  0.028366  | -0.008369
                   0  |  0.333755  | -0.033089  |  0.009763
                   0  |  0.388533  | -0.038520  |  0.011365
                   0  |  0.427195  | -0.042353  |  0.012496
                   0  |  0.453987  | -0.045009  |  0.013279
                   1  |  0.783636  | -0.077691  |  0.022922
       Harrow        
                   0  | -0.285869  |  0.027725  | -0.010508
                   0  |  0.304065  | -0.029490  |  0.011177
                   0  |  0.345396  | -0.033498  |  0.012696
                   0  |  0.400869  | -0.038878  |  0.014735
                   0  |  0.440131  | -0.042686  |  0.016178
                   1  |  0.782956  | -0.075935  |  0.028780
      
      .
      If I understand it correctly, the scoring coefficients can be interpreted as the weights the variables receive in the newly created PCA index. Why do the binary variables have more than one coefficient at all (one for 0 and one for 1), and why several coefficients for the same value (several rows for 0 and several for 1) in particular?

      Any help is highly appreciated. Thanks!



      • #4
        I agree that the results you show do not make sense. I cannot explain them.

        When I perform factor analyses using polychoric correlations, I create the correlation matrix and mean and standard deviation vectors and input them into the factormat command. The same approach should work with the pcamat command, although for what you are doing, I would expect factor analysis to be a preferred approach. I'd suggest you use the following code as a starting point; see if the results are more sensible than those from polychoricpca.
        Code:
        tempfile mydata
        save `mydata'
        
        collapse (mean) $SES3
        mkmat `scc', matrix(fa_m)
        use `mydata', clear
        
        collapse (sd) $SES3
        mkmat `scc', matrix(fa_s)
        use `mydata', clear
        
        polychoric $SES3
        scalar fa_N = r(N)
        matrix fa_r = r(R)
        
        pcamat fa_r, n(`=fa_N') means(fa_m) sds(fa_s) factors(1)
        predict pca1 pca2



        • #5
          William, thank you!
          Replacing `scc' with $SES3 in your code above gives me exactly the same results for Rho and for the scoring coefficients of the continuous variables as using
          Code:
          polychoricpca $SES3, score(pca_poly32) nscore(1)
          However, it shows only one scoring coefficient per dummy variable and therefore seems to solve my problem:
          Code:
           pcamat fa_r, n(`=fa_N') means(fa_m) sds(fa_s) factors(1)
          
          Principal components/correlation                 Number of obs    =        144
                                                           Number of comp.  =          1
                                                           Trace            =          5
              Rotation: (unrotated = principal)            Rho              =     0.5925
          
              --------------------------------------------------------------------------
                 Component |   Eigenvalue   Difference         Proportion   Cumulative
              -------------+------------------------------------------------------------
                     Comp1 |      2.96251      1.80966             0.5925       0.5925
                     Comp2 |      1.15285       .34753             0.2306       0.8231
                     Comp3 |      .805322      .729325             0.1611       0.9841
                     Comp4 |     .0759968     .0726795             0.0152       0.9993
                     Comp5 |    .00331734            .             0.0007       1.0000
              --------------------------------------------------------------------------
          
          Principal components (eigenvectors)
          
              --------------------------------------
                  Variable |    Comp1 | Unexplained
              -------------+----------+-------------
                    CarsPP |   0.1498 |       .9335
                Size_house |   0.0625 |       .9884
                   Tractor |   0.5575 |      .07921
                    Plough |   0.5759 |      .01733
                    Harrow |   0.5754 |      .01903
              --------------------------------------
          
          . predict pcaN1 pcaN2
          (score assumed)
          (extra variables dropped)
          
          Scoring coefficients
              sum of squares(column-loading) = 1
          
              ------------------------
                  Variable |    Comp1
              -------------+----------
                    CarsPP |   0.1498
                Size_house |   0.0625
                   Tractor |   0.5575
                    Plough |   0.5759
                    Harrow |   0.5754
              ------------------------
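          If I read this right, each variable now enters the index through a single weight applied to its standardized value. Here is a quick hand check of that reading (my own sketch: ses_check is a made-up variable name, the weights are copied from the Comp1 column above, and I am assuming predict standardizes the variables with the means and sds supplied to pcamat):
          Code:
          * hand-rolled index: coefficient-weighted sum of the standardized variables
          * (fa_m and fa_s are the mean and sd row vectors built in post #4)
          generate double ses_check =                              ///
                0.1498 * (CarsPP     - fa_m[1,1]) / fa_s[1,1]      ///
              + 0.0625 * (Size_house - fa_m[1,2]) / fa_s[1,2]      ///
              + 0.5575 * (Tractor    - fa_m[1,3]) / fa_s[1,3]      ///
              + 0.5759 * (Plough     - fa_m[1,4]) / fa_s[1,4]      ///
              + 0.5754 * (Harrow     - fa_m[1,5]) / fa_s[1,5]

          * should track pcaN1 up to rounding of the displayed coefficients
          correlate ses_check pcaN1
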
          What has changed, and why would you expect factor analysis to be the preferred approach for me?



          • #6
            Apologies for overlooking those two instances of `scc' in modifying code stolen from one of my do-files. Good catch.

            With regard to what has changed in running pcamat vs. polychoricpca: for reasons unknown to me, but perhaps due to the age of the polychoric package (it dates back to Stata 8.2), polychoricpca does not use the pcamat command but has its own code. My reading on tetrachoric correlations (in the output of help tetrachoric) suggests that the correlation matrix returned by tetrachoric is suitable input for pcamat and factormat, and polychoric correlation is just a generalization of tetrachoric correlation to ordered categorical variables. So I'm very much inclined to use the pcamat and factormat commands rather than rely on polychoricpca.
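            For what it's worth, a minimal sketch of that route for a set of purely binary items (d1-d3 are made-up variable names, n(144) is just a placeholder for your estimation sample size, and I am going by my reading of help tetrachoric for the stored matrix r(Rho)):
            Code:
            * illustrative only: d1-d3 are hypothetical binary variables
            tetrachoric d1 d2 d3
            matrix tr = r(Rho)                  // pairwise tetrachoric correlations
            factormat tr, n(144) factors(1)     // or: pcamat tr, n(144) factors(1)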

            With regard to factor analysis vs. PCA, I commonly see "principal component analysis" used as shorthand for "factor analysis using principal component analysis for factor extraction", but the two are not the same. This confusion is enhanced by SPSS's apparent lack of a separate command for doing principal component analysis other than as the first step of a factor analysis. Wikipedia's discussions of principal component analysis and factor analysis help clarify the distinction. In particular, from the article on principal component analysis,

            PCA is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when the goal is to detect the latent construct or factors. ... Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.
            The description of the objective in post #1 suggests to me that factor analysis, not principal component analysis, is the objective here, as it was for me in the work that led to the code shown in post #4.



            • #7
              William, thank you very much for this detailed explanation!!!
              I find the distinction between PCA and factor analysis rather vague. My aim is data reduction (creating an index from several variables rather than using them separately as input variables), and I then want to use that index in econometric analysis, but I wouldn't necessarily speak of causal estimation/modelling. If I want to compare the two approaches, I only need to replace
              Code:
              pcamat fa_r, n(`=fa_N') means(fa_m) sds(fa_s) factors(1)
              with
              Code:
              factormat fa_r, n(`=fa_N') means(fa_m) sds(fa_s) factors(1)
              right?




              • #8
                One other question that just came to my mind: in your code in post #4 you estimate polychoric correlations before using the pcamat:
                Code:
                polychoric $SES3
                scalar fa_N = r(N)
                matrix fa_r = r(R)  
                
                pcamat fa_r, n(`=fa_N') means(fa_m) sds(fa_s) factors(1)
                predict pca1 pca2
                Is this preferred over tetrachoric because the variable list does not include only binary variables?



                • #9
                  In reverse order of your questions, yes, tetrachoric will not work for your data, but polychoric will.

                  With regard to substituting factormat for pcamat, the simple substitution you suggest will work, but it then uses the default for the type of factor estimation to be performed, and you can perhaps do better with other alternatives.
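                  For example, here is a sketch reusing the matrices from post #4 (ses_f1 is a made-up variable name; whether ipf, or ml, actually improves on the default pf depends on your data, so treat this as a starting point rather than a recommendation):
                  Code:
                  * iterated principal factors instead of the default method (pf);
                  * ml is a further alternative (see help factor)
                  factormat fa_r, n(`=fa_N') means(fa_m) sds(fa_s) factors(1) ipf
                  predict ses_f1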

                  Beyond that, there's not much more I can say about the comparison of factor analysis and principal component analysis that hasn't already been said, much better, in textbooks and even in the Wikipedia articles I referenced in post #6.

                  https://en.wikipedia.org/wiki/Factor_analysis
                  https://en.wikipedia.org/wiki/Principal_component_analysis



                  • #10
                    Thanks a lot!

