polychoric and tetrachoric commands (factor analysis on binary variables)

Malene Christensen

Join Date: Sep 2016

Posts: 14
#16

09 Jul 2017, 11:06

Okay. I didn't post the table with the correlation coefficient as it seemed a bit "overkill".

Thank you, I will try the proposed method.

Malene Christensen

Join Date: Sep 2016
Posts: 14

#17

18 Jul 2017, 07:57

I have had time to return to the data, and I had some colleagues look at it as well. We have not been able to solve this issue, which is why I am returning to your comment, William, with some follow-up questions.

Here is the entire output from the tetrachoric command:

Code:

 tetrachoric requirement funding competition discussion event citizen_science cam
> paign platform training guiding rules policy network unit strategy standard coop
> eration res_area sup_res sup_ed
(obs=217)

matrix with tetrachoric correlations is not positive semidefinite;
  it has 5 negative eigenvalues
  maxdiff(corr,adj-corr) =  0,3721
  (adj-corr: tetrachoric correlations adjusted to be positive semidefinite)

             | requir~t  funding compet~n discus~n    event citize~e campaign
-------------+---------------------------------------------------------------
 requirement |   1,0000
     funding |   0,4324   1,0000
 competition |  -0,0793   0,3978   1,0000
  discussion |   0,2243  -0,1203  -0,0237   1,0000
       event |  -0,0324   0,5958   0,6756   0,1967   1,0000
citizen_sc~e |   0,1179   0,7278   0,0196   0,0858   0,7500   1,0000
    campaign |  -0,0662   0,5213  -0,0535   0,1996   0,6992   0,7144   1,0000
    platform |   0,0072  -0,0525   0,3951   0,0028   0,4322   0,3194   0,0325
    training |  -0,1158   0,4146   0,4114   0,4036   0,4363   0,1631   0,1344
     guiding |  -0,0856   0,1019  -0,1200   0,0301  -0,0822   0,1609   0,2408
       rules |   0,5276   0,2780  -0,3215   0,1724  -0,1953  -0,0759  -0,1423
      policy |   0,3280   0,0655   0,0826   0,4743   0,4133   0,2716   0,4290
     network |   0,2335  -0,0340  -1,0000   0,5163   0,1251   0,1201   0,1619
        unit |   0,0249  -0,1447   0,4409   0,0978   0,4082  -0,0638  -0,0774
    strategy |   0,3853   0,3276   0,3085   0,1831   0,2463   0,4806   0,1703
    standard |   0,3935  -0,0214  -0,1923  -0,1689  -0,3900  -0,2688  -0,1765
 cooperation |   0,3161   0,4847  -0,0482   0,3448   0,5356   0,7006   0,7144
    res_area |   0,1842   0,7106   0,1242  -0,1252   0,5347   0,7205   0,5274
     sup_res |   0,5837   0,5239  -0,1151   0,1727   0,1608   0,0809   0,1774
      sup_ed |   0,3814   0,5254   0,2286   0,0858   0,3090  -0,2188   0,1445

             | platform training  guiding    rules   policy  network     unit
-------------+---------------------------------------------------------------
    platform |   1,0000
    training |   0,2358   1,0000
     guiding |  -0,0352  -0,1662   1,0000
       rules |  -0,0759   0,0277   0,0701   1,0000
      policy |   0,2716   0,3622   0,0701   0,0145   1,0000
     network |   0,2362   0,4568   0,0614   0,4572   0,7082   1,0000
        unit |   0,3637   0,1421   0,1180  -0,0419  -0,0419   0,1168   1,0000
    strategy |   0,0485   0,0263   0,3949   0,0036  -0,1743   0,0887   0,3676
    standard |  -0,0065   0,0206   0,2743   0,4876   0,2042   0,1660  -0,0323
 cooperation |   0,2833   0,5182   0,3405   0,1989   0,6876   0,7500   0,1872
    res_area |   0,1416   0,4321   0,3638   0,2500   0,3215   0,1607  -0,0112
     sup_res |  -0,0281   0,3880  -0,1474   0,7382   0,2875   0,2829   0,0360
      sup_ed |  -0,2188   0,4902   0,1609   0,1728   0,4389   0,2362   0,0175

             | strategy standard cooper~n res_area  sup_res   sup_ed
-------------+------------------------------------------------------
    strategy |   1,0000
    standard |   0,3688   1,0000
 cooperation |   0,3327   0,1095   1,0000
    res_area |   0,4191  -0,0972   0,6530   1,0000
     sup_res |   0,1619   0,4269   0,5665   0,3928   1,0000
      sup_ed |  -0,0829   0,3397   0,5830   0,0466   0,6018   1,0000

. matrix r = r(Rho)

. factormat r, n(217)
r not positive (semi)definite
r(506);

end of do-file

r(506);

A colleague of mine noticed that the correlation between network and competition is -1. Using the polychoric command, this correlation coefficient equals 0.0. I understand now why you requested the entire output. It turns out that no organization in the dataset both has networking activities and hosts competitions. Can this be why I get the error message? I tried recoding competition so that the competition=1/network=1 combination was represented in the dataset; I do, however, still get the above error message.

The requirement that the correlation matrix input to principal components analysis be positive semidefinite is a serious statistical concern, so the error message from factormat about the correlation matrix from tetrachoric is important.

The problem can be solved with the posdef option, as you mentioned, but what does this mean? Is it legitimate? I am highly confused, as I did a similar analysis on other variables in the dataset, which are also all binary. These concern which understandings of responsibility the organizations worked with, and that analysis worked just fine with both polychoricpca and tetrachoric varlist followed by matrix r = r(Rho) and factormat r, n(217). However, these two methods, which yield the same correlation coefficients, suggest three dimensions for the first method and five dimensions for the latter. I believe they should be similar? These analyses are done with all 217 observations in the dataset. I hope you can help me; I am on the verge of quitting this factor analysis altogether.

Last edited by Malene Christensen; 18 Jul 2017, 08:01.


William Lisowski

Join Date: Dec 2014

Posts: 10150
#18

18 Jul 2017, 09:04

Code:

polychoricpca

and

Code:

tetrachoric varlist
matrix r = r(Rho)
factormat r, n(217)

... these two methods - that yield the same correlation coefficients - suggest three dimensions for the first method and 5 dimensions for the latter. I believe it should be similar?

Reviewing the documentation in help polychoricpca, help factormat, and help pca suggests that you are comparing apples with oranges. polychoricpca produces a principal components analysis, whereas factormat produces a factor analysis, by default using the principal-factor method, although optionally using the principal-component factor method.

I think that to be comparable to polychoricpca, you would have to use the pca command rather than the factormat command. Or conversely, going back to your post #1, since you want a factor analysis, polychoricpca is not the tool for you; factormat is. However, it is possible that

Code:

factormat r, n(217) pcf

would produce results more similar to those from polychoricpca.

Malene Christensen

Join Date: Sep 2016
Posts: 14

#19

19 Jul 2017, 03:23

Wow, I have indeed confused these terms! Thank you for clearing this up! I believe that principal component analysis is the most appropriate method in this case, as the primary purpose is data reduction in an exploratory setting. So I need either a polychoric PCA, which does not work, or I could store the tetrachoric/polychoric correlation coefficients and run pcamat, which also does not work.

Code:

 polychoricpca requirement funding competition discussion event citizen_science c
> ampaign platform training guiding rules policy network unit strategy standard co
> operation res_area sup_res sup_ed
could not calculate numerical derivatives
missing values encountered
could not calculate numerical derivatives
missing values encountered

Polychoric correlation matrix

                     requirement          funding      competition
    requirement                1
        funding        ,43360211                1
    competition       -,07963118        ,39906673                1
     discussion        ,22511135       -,12062146       -,02376225
          event       -,03243606        ,59658092        ,67674132
citizen_science        ,11872629        ,72926096        ,01991276
       campaign       -,06642385        ,52262024       -,05369341
       platform        ,00737133       -,05270511        ,39700047
       training        -,1162005        ,41540034        ,41257345
        guiding       -,08579849        ,10211343       -,12030967
          rules        ,52943032         ,2791685       -,32344164
         policy        ,32958669        ,06590443        ,08317795
        network        ,23488741       -,03405168                .
           unit        ,02502208       -,14509878        ,44205054
       strategy        ,38606621        ,32795366        ,30917024
       standard        ,39458861       -,02141349       -,19300705
    cooperation        ,31689848        ,48534115       -,04827437
       res_area        ,18502194        ,71155498        ,12484554
        sup_res        ,58525384        ,52510344       -,11564683
         sup_ed        ,38322625        ,52704529        ,22991512

                      discussion            event  citizen_science
     discussion                1
          event        ,19717124                1
citizen_science        ,08629164        ,75138492                1
       campaign        ,20043572        ,70035335        ,71610885
       platform        ,00290081        ,43367535        ,32136945
       training        ,40441367        ,43696061         ,1639602
        guiding        ,03021667       -,08227838        ,16150499
          rules        ,17318865       -,19611518        -,0762605
         policy        ,47571215        ,41453787        ,27321665
        network        ,51792978        ,12572792        ,12105779
           unit        ,09806615        ,40882123        -,0639711
       strategy        ,18334256        ,24649133        ,48180778
       standard       -,16930525       -,39110504       -,27012386
    cooperation        ,34534373        ,53610536        ,70199426
       res_area        -,1256119        ,53572544        ,72206206
        sup_res        ,17333193        ,16136692        ,08144885
         sup_ed        ,08629164        ,31020763       -,22033115

                        campaign         platform         training
       campaign                1
       platform        ,03285636                1
       training        ,13500998        ,23688703                1
        guiding        ,24149222       -,03522224       -,16647947
          rules       -,14311739        -,0762605        ,02786731
         policy        ,43079963        ,27321665        ,36346845
        network        ,16301905        ,23786434        ,45842578
           unit       -,07768877        ,36506261        ,14251282
       strategy        ,17071145        ,04870872        ,02642428
       standard       -,17717211       -,00640102        ,02070659
    cooperation        ,71552201        ,28435191        ,51882571
       res_area        ,52891711        ,14245858        ,43315516
        sup_res        ,17836936       -,02810993        ,38906015
         sup_ed        ,14547083       -,22033115        ,49178715

                         guiding            rules           policy
        guiding                1
          rules        ,07038232                1
         policy        ,07038232        ,01469897                1
        network        ,06168779        ,45937531        ,71012239
           unit        ,11815959       -,04198881       -,04198881
       strategy        ,39487928         ,0036789        -,1747876
       standard        ,27461083         ,4889252        ,20498582
    cooperation         ,3407695        ,19958837        ,68885162
       res_area        ,36451458        ,25116821        ,32285854
        sup_res       -,14781197        ,73962198        ,28886311
         sup_ed        ,16150499        ,17401462         ,4409355

                         network             unit         strategy
        network                1
           unit        ,11746568                1
       strategy        ,08910018        ,36786705                1
       standard        ,16677031       -,03233762        ,36906466
    cooperation        ,75144149        ,18751041        ,33280116
       res_area        ,16164549       -,01119367        ,41976814
        sup_res        ,28442846         ,0362096        ,16230451
         sup_ed        ,23786434        ,01768492       -,08308129

                        standard      cooperation         res_area
       standard                1
    cooperation        ,10972283                1
       res_area       -,09751973        ,65382663                1
        sup_res        ,42791047        ,56748995        ,39403875
         sup_ed        ,34095752        ,58448547        ,04701396

                         sup_res           sup_ed
        sup_res                1
         sup_ed        ,60365551                1
matrix symeigen: matrix has missing values
r(504);

Code:

. tetrachoric requirement funding competition discussion event citizen_science cam
> paign platform training guiding rules policy network unit strategy standard coop
> eration res_area sup_res sup_ed
(obs=217)

matrix with tetrachoric correlations is not positive semidefinite;
  it has 5 negative eigenvalues
  maxdiff(corr,adj-corr) =  0,3721
  (adj-corr: tetrachoric correlations adjusted to be positive semidefinite)

             | requir~t  funding compet~n discus~n    event citize~e campaign
-------------+---------------------------------------------------------------
 requirement |   1,0000
     funding |   0,4324   1,0000
 competition |  -0,0793   0,3978   1,0000
  discussion |   0,2243  -0,1203  -0,0237   1,0000
       event |  -0,0324   0,5958   0,6756   0,1967   1,0000
citizen_sc~e |   0,1179   0,7278   0,0196   0,0858   0,7500   1,0000
    campaign |  -0,0662   0,5213  -0,0535   0,1996   0,6992   0,7144   1,0000
    platform |   0,0072  -0,0525   0,3951   0,0028   0,4322   0,3194   0,0325
    training |  -0,1158   0,4146   0,4114   0,4036   0,4363   0,1631   0,1344
     guiding |  -0,0856   0,1019  -0,1200   0,0301  -0,0822   0,1609   0,2408
       rules |   0,5276   0,2780  -0,3215   0,1724  -0,1953  -0,0759  -0,1423
      policy |   0,3280   0,0655   0,0826   0,4743   0,4133   0,2716   0,4290
     network |   0,2335  -0,0340  -1,0000   0,5163   0,1251   0,1201   0,1619
        unit |   0,0249  -0,1447   0,4409   0,0978   0,4082  -0,0638  -0,0774
    strategy |   0,3853   0,3276   0,3085   0,1831   0,2463   0,4806   0,1703
    standard |   0,3935  -0,0214  -0,1923  -0,1689  -0,3900  -0,2688  -0,1765
 cooperation |   0,3161   0,4847  -0,0482   0,3448   0,5356   0,7006   0,7144
    res_area |   0,1842   0,7106   0,1242  -0,1252   0,5347   0,7205   0,5274
     sup_res |   0,5837   0,5239  -0,1151   0,1727   0,1608   0,0809   0,1774
      sup_ed |   0,3814   0,5254   0,2286   0,0858   0,3090  -0,2188   0,1445

             | platform training  guiding    rules   policy  network     unit
-------------+---------------------------------------------------------------
    platform |   1,0000
    training |   0,2358   1,0000
     guiding |  -0,0352  -0,1662   1,0000
       rules |  -0,0759   0,0277   0,0701   1,0000
      policy |   0,2716   0,3622   0,0701   0,0145   1,0000
     network |   0,2362   0,4568   0,0614   0,4572   0,7082   1,0000
        unit |   0,3637   0,1421   0,1180  -0,0419  -0,0419   0,1168   1,0000
    strategy |   0,0485   0,0263   0,3949   0,0036  -0,1743   0,0887   0,3676
    standard |  -0,0065   0,0206   0,2743   0,4876   0,2042   0,1660  -0,0323
 cooperation |   0,2833   0,5182   0,3405   0,1989   0,6876   0,7500   0,1872
    res_area |   0,1416   0,4321   0,3638   0,2500   0,3215   0,1607  -0,0112
     sup_res |  -0,0281   0,3880  -0,1474   0,7382   0,2875   0,2829   0,0360
      sup_ed |  -0,2188   0,4902   0,1609   0,1728   0,4389   0,2362   0,0175

             | strategy standard cooper~n res_area  sup_res   sup_ed
-------------+------------------------------------------------------
    strategy |   1,0000
    standard |   0,3688   1,0000
 cooperation |   0,3327   0,1095   1,0000
    res_area |   0,4191  -0,0972   0,6530   1,0000
     sup_res |   0,1619   0,4269   0,5665   0,3928   1,0000
      sup_ed |  -0,0829   0,3397   0,5830   0,0466   0,6018   1,0000

. matrix r = r(R)

. pcamat r, n(217)
matrix r has missing values
r(504);

end of do-file

r(504);

If I exclude either the network variable or the competition variable, it works, which is highly peculiar! The only odd thing here is, as already mentioned, that there are no observations in the dataset that score 1 on both of these variables:

Code:

           |      competition
   network |         0          1 |     Total
-----------+----------------------+----------
         0 |       168         28 |       196
         1 |        21          0 |        21
-----------+----------------------+----------
     Total |       189         28 |       217

I don't understand, however, why this should be a problem for a PCA on tetrachoric coefficients. Can anyone enlighten me on this?
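One way to look for other pairs with the same pattern is to scan every pair of items for an empty cell in its 2x2 table. This is only a sketch: it assumes the 20 items from the tetrachoric call above are in memory and coded 0/1, the local macro names are arbitrary, and it is meant to be run from a do-file because of the line continuations.

```stata
* Sketch: list every pair of binary items whose 2x2 table has an empty cell.
* Assumes the items are in memory and coded 0/1; macro names are arbitrary.
local items requirement funding competition discussion event citizen_science ///
    campaign platform training guiding rules policy network unit strategy    ///
    standard cooperation res_area sup_res sup_ed
local k : word count `items'

forvalues i = 1/`k' {
    local a : word `i' of `items'
    local first = `i' + 1
    * When `first' exceeds `k' the inner loop simply does not run
    forvalues j = `first'/`k' {
        local b : word `j' of `items'
        forvalues x = 0/1 {
            forvalues y = 0/1 {
                quietly count if `a' == `x' & `b' == `y'
                if r(N) == 0 display "`a'=`x', `b'=`y': no observations"
            }
        }
    }
}
```

Given the table above, this should at least flag the competition=1, network=1 combination; any other pair it lists would be worth inspecting before computing tetrachoric correlations.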

Last edited by Malene Christensen; 19 Jul 2017, 03:41.


William Lisowski

Join Date: Dec 2014
Posts: 10150

#20

19 Jul 2017, 10:31

In general, when working with categorical variables like network and competition, if knowing that in your data one of the variables has a particular value (e.g. network=1) allows you to state that another variable has a particular value (e.g. competition=0), the methodology breaks down. The best estimate of P{competition=0 given network=1} is 1, but do you really believe that if you had 217 million observations, not a single one with competition=1 and network=1 would occur? Let's look at the following example.

I start by reproducing your data for network and competition (which I shorten to net and comp), and create a third binary variable z.
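The post does not show how the example data were built. A minimal sketch that reproduces the comp by net cell counts from the table is given below. How z is generated (the seed, the uniform draw, the 0.5 cutoff) is purely an assumption, so z and anything computed from it will differ from the output in this post.

```stata
* Sketch: rebuild the comp/net cross-tabulation from its cell counts.
* The empty comp=1, net=1 cell is simply left out.
clear
input comp net freq
0 0 168
0 1  21
1 0  28
end
expand freq
drop freq

* z is an arbitrary binary variable; seed and cutoff are assumptions,
* so it will not match the z used in the post.
set seed 12345
generate byte z = runiform() < 0.5

tabulate comp net
```

The expand step turns each row into as many copies as its frequency, giving 168 + 21 + 28 = 217 observations.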

Code:

. summarize

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         net |        217    .0967742    .2963336          0          1
        comp |        217    .1290323    .3360108          0          1
           z |        217    .4976959    .5011507          0          1

. tab comp net

           |          net
      comp |         0          1 |     Total
-----------+----------------------+----------
         0 |       168         21 |       189 
         1 |        28          0 |        28 
-----------+----------------------+----------
     Total |       196         21 |       217

Now let's look at what happens when we try to model comp as a function of net and z.

Code:

. logit comp net z

note: net != 0 predicts failure perfectly
      net dropped and 21 obs not used

Iteration 0:   log likelihood = -80.382798  
Iteration 1:   log likelihood = -78.760538  
Iteration 2:   log likelihood = -78.731712  
Iteration 3:   log likelihood = -78.731702  
Iteration 4:   log likelihood = -78.731702  

Logistic regression                             Number of obs     =        196
                                                LR chi2(1)        =       3.30
                                                Prob > chi2       =     0.0692
Log likelihood = -78.731702                     Pseudo R2         =     0.0205

------------------------------------------------------------------------------
        comp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         net |          0  (omitted)
           z |   .7548408   .4237117     1.78   0.075    -.0756188      1.5853
       _cons |  -2.208275   .3331501    -6.63   0.000    -2.861237   -1.555312
------------------------------------------------------------------------------

In your data, every observation with net != 0 has the dependent variable comp == 0, and that is what logit tells us in the note at the top of its output. It cannot deal with that, so it drops the variable net and the 21 observations for which net != 0. The objective of this is to show you that the problem you are experiencing is not unique to polychoric and tetrachoric.

Now let's look at tetrachoric.

Code:

. tetrachoric comp net z
(obs=217)

matrix with tetrachoric correlations is not positive semidefinite;
  it has 1 negative eigenvalue
  maxdiff(corr,adj-corr) =  0.0661
  (adj-corr: tetrachoric correlations adjusted to be positive semidefinite)

             |     comp      net        z
-------------+---------------------------
        comp |   1.0000 
         net |  -1.0000   1.0000 
           z |   0.2235   0.1723   1.0000 

. matrix r = r(Rho)

. matrix symeigen e v = r

. matrix list v

v[1,3]
            e1          e2          e3
r1   2.0013614   1.0716784  -.07303978

. pcamat r, n(217)
r not positive (semi)definite

These are the same sort of results you got: tetrachoric tells us that the correlation matrix is not positive semidefinite and that adjusting it to be positive semidefinite would change the correlations by no more than 0.0661, and it shows us the correlation matrix, in which the correlation between net and comp is -1.0. For later reference, I capture and display the three eigenvalues, and we indeed see that the smallest is negative. And then pcamat declines to perform.
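Why a correlation of exactly -1 breaks things can be seen directly in the three-variable case. For a 3x3 correlation matrix in which the first two variables correlate at -1, the determinant reduces to -(r13 + r23)^2, so the matrix can be positive semidefinite only if the third variable's two correlations are exact opposites of each other. Here they are 0.2235 and 0.1723, both positive, which gives a determinant of about -0.157 and therefore a negative eigenvalue. A small sketch follows; it types in the rounded correlations from the output above instead of using the data, so the eigenvalues should only approximately match those listed.

```stata
* Rounded tetrachoric correlations typed in from the output above
matrix r = (1, -1, .2235 \ -1, 1, .1723 \ .2235, .1723, 1)
matrix rownames r = comp net z
matrix colnames r = comp net z

* Determinant; equals -(r13 + r23)^2 when r12 = -1
display det(r)
display -(.2235 + .1723)^2

* Eigenvalues come back in descending order, so the last one is the smallest
matrix symeigen X v = r
matrix list v
display el(v, 1, colsof(v))
```

The same logic applies to the 20-variable matrix: a correlation of -1 between network and competition requires every other variable to correlate with those two in exactly opposite ways, which estimated correlations will not do.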

Now let's try tetrachoric with the posdef option to actually do the adjusting referred to in the previous output.

Code:

. tetrachoric comp net z, posdef
(obs=217)

matrix with tetrachoric correlations is not positive semidefinite;
  it has 1 negative eigenvalue
  maxdiff(corr,adj-corr) =  0.0661
  (adj-corr: tetrachoric correlations adjusted to be positive semidefinite)

    adj-corr |     comp      net        z
-------------+---------------------------
        comp |   1.0000 
         net |  -0.9339   1.0000 
           z |   0.2068   0.1567   1.0000 

. matrix r = r(Rho)

. matrix symeigen e v = r

. matrix list v

v[1,3]
            e1          e2          e3
r1   1.9352714   1.0647286  -6.661e-16

. pcamat r, n(217)

Principal components/correlation                 Number of obs    =        217
                                                 Number of comp.  =          2
                                                 Trace            =          3
    Rotation: (unrotated = principal)            Rho              =     1.0000

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      1.93527      .870543             0.6451       0.6451
           Comp2 |      1.06473      1.06473             0.3549       1.0000
           Comp3 |            0            .             0.0000       1.0000
    --------------------------------------------------------------------------

Principal components (eigenvectors) 

    ------------------------------------------------
        Variable |    Comp1     Comp2 | Unexplained 
    -------------+--------------------+-------------
            comp |   0.7104    0.1483 |           0 
             net |  -0.7027    0.2040 |           0 
               z |   0.0393    0.9677 |           0 
    ------------------------------------------------

We note that as advertised, when we compare the adjusted correlations to the earlier unadjusted correlations, the largest adjustment was to the correlation between comp and net, which changed from -1.0 to -0.9339. We see that the third eigenvalue has been set to (effectively) zero, the first two are somewhat different, and that pcamat uses the adjusted correlation matrix to produce two principal components that account for 100% of the variance.

Not reported here, I also tried tetrachoric with the zeroadjust option. It succeeded in producing a positive semidefinite correlation matrix, and pcamat produced three principal components. I don't encourage this approach: fiddling with the data is a somewhat older approach with little theoretical justification, and I like the fact that tetrachoric, posdef reduces the number of principal components, which feels a lot like logit dropping the variable.
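For reference, the zeroadjust run mentioned above would look roughly like the following. The output was not posted, so none is shown here; the sketch assumes comp, net and z are in memory as in the earlier examples.

```stata
* Sketch of the zeroadjust alternative (output not shown in the post).
* Assumes comp, net and z are already in memory.
tetrachoric comp net z, zeroadjust
matrix r = r(Rho)

* Inspect the eigenvalues before handing the matrix to pcamat
matrix symeigen X v = r
matrix list v

pcamat r, n(217)
```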

Finally, for completeness, polychoric.

Code:

. polychoric comp net z
could not calculate numerical derivatives
missing values encountered
could not calculate numerical derivatives
missing values encountered

Polychoric correlation matrix

           comp        net          z
comp          1
 net          .          1
   z  .22407535  .17295066          1

. matrix r = r(R)

. matrix list r

symmetric r[3,3]
           comp        net          z
comp          1
 net          .          1
   z  .22407535  .17295066          1

. pcamat r, n(217)
matrix r has missing values

I note that polychoric reports a missing correlation between net and comp, rather than -1.0 as in tetrachoric. It provides no options for accommodating the problem data, and pcamat is of course unable to extract principal components. My own feeling is that tetrachoric reporting a correlation of -1.0 is appropriate; after all, whenever one of the variables is 1 the other is 0. Perhaps an expert would feel otherwise.
