polychoric and tetrachoric commands (factor analysis on binary variables)

Malene Christensen

Join Date: Sep 2016

Posts: 14
#1

05 Jul 2017, 08:12

Hi Statalisters!

I wish to check correlations between a range of binary variables and run a factor analysis on this basis to see whether the variables are in fact measuring underlying dimensions (which is theoretically sound). As far as I understand, I should use tetrachoric coefficients and base the principal component analysis on them? However, I have read that the tetrachoric varlist command in Stata is imprecise (http://john-uebersax.com/stat/tetra.htm#tsoft). When I compare to regular Pearson's r correlations, my results vary substantially! Some suggest using the polychoric command by Stas Kolenikov, which should be able to provide: "routines to estimate the polychoric, tetrachoric, polyserial and biserial correlations and use them in principal component analysis." How do I use this command for tetrachoric coefficients? And does anyone have any references on the discussion of the use and merits of this method and the interpretation of its results?

I hope you can help!

Last edited by Malene Christensen; 05 Jul 2017, 08:21.
daniel klein

Join Date: Mar 2014

Posts: 3847
#2

05 Jul 2017, 08:49

I probably cannot help much here, but I would in general tend to be a bit more reluctant to take something claimed on one person's homepage - which is neither reviewed nor leaves any space for critical comments - too seriously. This is especially true for claims that are made about specific values without providing any details about the data that are supposed to have generated the result. As you can read in the help file and documentation of the tetrachoric command, it does not use the Edwards and Edwards estimator as John Uebersax falsely claims. Long story short, I think it is fairly safe to use the command.

Obtaining different results than those you get with Pearson's correlation should not be surprising. After all, that is the point of using tetrachoric correlations, right?

As for polychoric (from Stas Kolenikov's site), it comes with a help file that explains that in the case of all binary variables the tetrachoric correlation is estimated. So there is nothing special to do as long as the variables are coded 0 and 1.

If I am not mistaken, results from subsequent factor analysis are interpreted the usual way.

Best
Daniel
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

05 Jul 2017, 10:20

Well, Uebersax may have some standing, since a close reading of the documentation for Stata's tetrachoric command in the Stata Base Reference Manual PDF (as of version 14) finds Uebersax (2000) cited as a justification for factor analysis of dichotomous variables using the tetrachoric correlation coefficient (see Example 2). However, perhaps his online comment reflects outdated information on Stata. As Daniel accurately points out, the online comment lacks any information about, for example, the version of Stata used, or anything else allowing the reader to understand the basis for his assertion.

Looking at the Methods and Formulas section of the documentation, we see the following, which suggests that StataCorp is well aware of the limitations of the Edwards and Edwards noniterative estimator and does not make naive use of it.

tetrachoric provides two estimators for the tetrachoric correlation ρ of two binary variables with the frequencies n_ij, i, j = 0, 1. tetrachoric defaults to the slower (iterative) maximum likelihood estimator obtained from bivariate probit without explanatory variables (see [R] biprobit) by using the Edwards and Edwards noniterative estimator as the initial value. A fast (noniterative) estimator is also available by specifying the edwards option (Edwards and Edwards 1984; Digby 1983).
...
The Edwards and Edwards estimator is fast, but may be inaccurate if the margins are very skewed.
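
To see the two estimators side by side, here is a minimal sketch on simulated data. The variable names, seed, sample size and cutoffs are made up for illustration; only the default maximum likelihood behaviour and the edwards option come from the documentation quoted above.

```stata
* Simulated data (hypothetical): two binary items cut from correlated
* normals, with y2 deliberately skewed (roughly 5% ones).
clear
set seed 20170705
set obs 500
generate z1 = rnormal()
generate z2 = 0.6*z1 + sqrt(1 - 0.6^2)*rnormal()
generate byte y1 = z1 > 0
generate byte y2 = z2 > 1.645

* Default: iterative maximum likelihood estimator
tetrachoric y1 y2

* Fast noniterative Edwards and Edwards estimator
tetrachoric y1 y2, edwards
```

How far the two results differ will depend on the data; the skewed margin of y2 is only there to mimic the situation the manual warns about.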
1 like
daniel klein

Join Date: Mar 2014

Posts: 3847
#4

05 Jul 2017, 12:33

I did not want to imply that statistical statements by John Uebersax are in any way unreliable. His standing is probably absolutely justified and well earned. Yet this does not imply taking anything he states for granted, either, at least not when it is stated without further explanation or references, in a way not open to criticism. His statement in question is marked as updated in 2015, yet Stata has implemented the current default at least since release 9 in 2006, perhaps even before that. Anyway, all I basically wanted to state is that the proper place to look up how statistical software works is the documentation of that software. If there is still confusion, in the case of Stata, this forum is obviously the right place to ask and discuss further.

Best
Daniel
Malene Christensen

Join Date: Sep 2016

Posts: 14
#5

06 Jul 2017, 02:20

Thank you for your replies! I have now run the tetrachoric and polychoric commands, and these yield very similar results, so maybe the issue mentioned by Uebersax is not so prevalent - at least in this case. These issues are exactly why I am looking for peer-reviewed book chapters or articles that can shed light on the advantages and disadvantages of the mentioned methods, and especially on basing a factor analysis on the results. Unfortunately, this seems hard to find.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

06 Jul 2017, 05:36

Let me add that while as Daniel notes Uebersax's web page http://john-uebersax.com/stat/tetra.htm is marked as updated in 2015, we have no idea when he last reviewed his undated assertion on Stata.

I see no reason to assume any reliability when applied to Stata's current techniques, because of lack of supporting information for the assertion he makes and for lack of Stata context in which the assertion is made.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

06 Jul 2017, 07:28

More history of tetrachoric technique in Stata.

Per help whatsnew8to9 the tetrachoric command was introduced in Stata release 9 in April 2005.

Per help whatsnew9 the 11 July 2006 update to Stata release 9.2 updated the tetrachoric command to what appears to be the current methodology, acknowledging the shortcomings in the earlier methodology.

Code:

2. tetrachoric's default algorithm for computing tetrachoric correlations has been changed from the Edwards and Edwards estimator to a maximum likelihood estimator. The Edwards and Edwards estimator performed poorly with skewed data. tetrachoric's features are now similar to those of spearman and ktau. Standard errors and two-sided significance tests are now included. A frequency adjustment when one cell has a zero count is now available with the zeroadjust option. This change has been made without version control; to get the old behavior, the edwards option can be used.

Searching found Uebersax's original Compuserve page on Tetrachoric and Polychoric Correlation (http://ourworld.compuserve.com:80/ho...rsax/tetra.htm) in the Internet Archive (his current site john-uebersax.com dates from the shutdown of Compuserve in 2009). The paragraph about Stata appears first, exactly as it now appears, sometime between June 15 and July 6, 2006. The purpose of that paragraph appears to be to announce Stas Kolenikov's polychoric command for Stata, which per Kolenikov, S., and Angeles, G. (2004) was developed on Stata 8. (Note that the link to Kolenikov's work is outdated.)

I think it's safe to assume that Uebersax was reporting Stata results he learned of elsewhere, rather than experienced himself, and has not stayed up-to-date on Stata's methodology.

I would give no credence to the 11-year-old assertion about Stata's tetrachoric methodology insofar as it relates to any version of Stata from release 9.2 onward.
1 like
Malene Christensen

Join Date: Sep 2016

Posts: 14
#8

06 Jul 2017, 07:42

Thank you for the reply. I have now encountered yet another, related problem. I hope you can help. I have tried this:
polychoricpca varlist
which gives the tetrachoric correlation coefficients but will not perform the Principal Component Analysis, stopping with the following error: "could not calculate numerical derivatives, missing values encountered". I have also tried this:
tetrachoric varlist
matrix r = r(R)
factormat r, n(186)
which also gives the correlation coefficients but will not perform the PCA, stopping with the following message: "matrix r has missing values". I do not understand this. There are NO missing values in my dataset, and I have deleted observations that have 0 across all the variables in varlist. Can anyone shed light on this?
daniel klein

Join Date: Mar 2014

Posts: 3847
#9

06 Jul 2017, 08:30

You need to look at the documentation more carefully. tetrachoric saves the correlation matrix in r(Rho) (note capitalization) not r(R).
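
A quick way to check what a command leaves behind is return list. The following is only a sketch on simulated data; the variables x1-x4, the seed and the sample size are made up and have nothing to do with your dataset.

```stata
* Simulated binary items (hypothetical names x1-x4) driven by one factor
clear
set seed 12345
set obs 200
generate z = rnormal()
forvalues j = 1/4 {
    generate byte x`j' = (z + rnormal()) > 0
}

tetrachoric x1-x4

* Show everything tetrachoric stored in r(), including the matrix r(Rho)
return list

* Copy the correlation matrix: r(Rho), not r(R)
matrix r = r(Rho)
matrix list r
```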

I am no expert on this, but

I have deleted observations that have 0 across all the variables in varlist

strikes me as a dubious approach. Why would you do this?

Best
Daniel
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

06 Jul 2017, 09:13

I think we need to back up a bit. You are telling us only selected parts of the story, the parts you think are important. You need to tell us the whole story, and let us figure out what's important.

Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

Section 12.1 is particularly pertinent

12.1 What to say about your commands and your problem

Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!

We need you to copy from the Stata Results window everything from the tetrachoric or polychoric command through the error message and paste it into your Statalist post using CODE delimiters to ensure readability. That way we can see everything Stata told you, which may provide clues you have overlooked in your focus on the error message. I will add that I agree with Daniel and would like to understand why you chose to delete observations.

For an example of using CODE delimiters, the following:

[code]
. sysuse auto, clear
(1978 Automobile Data)

. describe make price

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
[/code]

will be presented in the post as the following:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. describe make price

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
Malene Christensen

Join Date: Sep 2016

Posts: 14
#11

07 Jul 2017, 04:56

Okay, sorry. Here are some explanations and specifications.

First, I deleted some observations, as I understand that the eigenvalues of the Principal Component Analysis cannot be calculated if there are some observations in the matrix that have the value zero across all variables(?). My dataset contains 217 organizations across 16 European countries that work with responsible research practices in different ways, indicated by 38 binary variables. This is, for instance, presence (1) or absence (0) of an online science communication platform etc. My dataset is, so to speak, a quantitative "translation" of qualitative data from 16 country reports. Some of the sections in these reports were poor, which means a high degree of missing information for some organizations, which is coded as 0 in my dataset. In this sense, it can be justified to remove the most problematic observations from the dataset, which very likely are those that score 0 across all the variables. I intend to run the entire analysis with and without these problematic observations.

Second, my ultimate goal is to run a cluster analysis to see whether there are substantial differences in focus across countries or specific types of organizations. Some of the variables are moderately or strongly correlated, which may be a problem for the cluster analysis. This is why I started with a factor analysis, to see whether some of these variables are more or less measuring the same underlying concept. Since they are all binary, I have used the polychoricpca command, which gives me the following error message:

Code:

. polychoricpca requirement funding competition discussion event citizen_science c
> ampaign platform training guiding rules policy network unit strategy standard co
> operation res_area sup_res sup_ed
could not calculate numerical derivatives
missing values encountered
could not calculate numerical derivatives
missing values encountered

Now I do get the correlation matrix, but the further Principal Component analysis stops with this error message:

Code:

matrix symeigen: matrix has missing values
r(504);

With the tetrachoric command, which should do the same thing, I get the following:

Code:

. tetrachoric requirement funding competition discussion event citizen_science cam
> paign platform training guiding rules policy network unit strategy standard coop
> eration res_area sup_res sup_ed
(obs=186)

matrix with tetrachoric correlations is not positive semidefinite;
    it has 4 negative eigenvalues
    maxdiff(corr,adj-corr) = 0,3289
    (adj-corr: tetrachoric correlations adjusted to be positive semidefinite)

Then the correlation matrix and

Code:

matrix r = r(Rho)

. factormat r, n(186)
r not positive (semi)definite
r(506);

While I might simply be using the second command incorrectly (?), I should be getting some useful results with the first method - as I said earlier, there are no missing values across the specified variables.

Last edited by Malene Christensen; 07 Jul 2017, 04:59.
daniel klein

Join Date: Mar 2014

Posts: 3847
#12

07 Jul 2017, 05:16

I will just pick the first major problem that I see and hope others can give additional/better advice.

Some of the sections in these reports were poor, which means a high degree of missing information for some organizations, which is coded as 0 in my dataset

I would stop right there and code those zeros as missing values, which they really are. Coding them zero you essentially claim that the corresponding research practice is not present. How do you know that if the information is just not available? Excluding cases with many zeros afterwards, assuming/hoping that these represent the ones with missing values is, in my view, the wrong approach.

You can force the matrix you get from tetrachoric to be positive semidefinite. Please do read the documentation closely. Whether you want this, we cannot tell. You would want to look at the negative values and see whether they are close enough to zero so a replacement could somehow be justified. However, I would not even start there before you get the correct coding of missing values done in the first step.
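
To make this concrete, here is a minimal sketch on simulated data, not your variables. The names x1-x6, the seed and the data-generating step are assumptions for illustration; posdef is the tetrachoric option described in help tetrachoric. With well-behaved simulated data the matrix may already be positive semidefinite, in which case the adjustment changes nothing.

```stata
* Simulated binary items (hypothetical names x1-x6), 186 observations
clear
set seed 12345
set obs 186
generate z = rnormal()
forvalues j = 1/6 {
    generate byte x`j' = (z + rnormal()) > 0
}

* Step 1: inspect the eigenvalues of the unadjusted matrix
tetrachoric x1-x6
matrix C = r(Rho)
matrix symeigen V lambda = C
matrix list lambda

* Step 2: request the matrix adjusted to be positive semidefinite
tetrachoric x1-x6, posdef
matrix Cadj = r(Rho)
factormat Cadj, n(186)
```

Looking at lambda first shows how negative the offending eigenvalues are, which is what you would need to judge whether the adjustment is defensible.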

Best
Daniel

Last edited by daniel klein; 07 Jul 2017, 05:19.
Malene Christensen

Join Date: Sep 2016

Posts: 14
#13

07 Jul 2017, 06:32

Thank you for your reply. I realize that the procedure is quite problematic. Unfortunately, the initial data (hundreds of pages of reports) do not allow me to do it differently. This is hard to explain without getting into details about the country reports, which I believe is beside the point - trust me, it will be thoroughly discussed in the methodology section, and the analysis will be performed with and without the problematic organizations, i.e. those that are poorly described in the reports.

Last edited by Malene Christensen; 07 Jul 2017, 06:39.
daniel klein

Join Date: Mar 2014

Posts: 3847
#14

07 Jul 2017, 06:57

I must admit, it sounds as simple as assigning 3 codes instead of 2. Instead of 0 "not present" and 1 "present" for each of your variables, just add a third code .a "no information". It is hard to see how any data source would prohibit such coding. Anyway, I trust you know what you are doing.
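
A minimal sketch of that coding on a toy dataset follows. The indicator info_available is hypothetical; it stands for whatever tells you that a report said nothing about the practice. The variable name platform and the four data rows are likewise made up.

```stata
* Toy data: platform is the 0/1 item, info_available is a hypothetical
* flag that is 0 when the report gives no information on the item
clear
input byte platform byte info_available
1 1
0 1
0 0
0 0
end

* Recode "no information" from 0 to the extended missing value .a
replace platform = .a if info_available == 0

* Label all three codes so the distinction stays visible
label define presence 0 "not present" 1 "present" .a "no information"
label values platform presence
tabulate platform, missing
```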

Best
Daniel
William Lisowski

Join Date: Dec 2014

Posts: 10150
#15

07 Jul 2017, 08:11

First, I deleted some observations, as I understand that the eigenvalues of the Principal Component Analysis cannot be calculated if there are some observations in the matrix that have the value zero across all variables(?)

Unless you can support that understanding with a reference, I'd suggest you do nothing based on your understanding, which may be based on a misunderstanding. Your "(?)" at the end suggests you are uncertain of your understanding.

I am sympathetic with your need to deal with missing values in a productive fashion. I have in the past resorted, in a regression setting, to coding yes/no/unknown as a categorical variable. And that's akin to what Daniel suggests in post #14, which would be a more productive approach than coding as Stata missing values, which would likely reduce your data to a pittance.

As your data is coded now, however, you cannot distinguish between a country whose report makes it clear that there is no online science communication platform (perhaps because it discusses a plan to create such a platform over the coming years) and a country whose report says nothing about an online science communication platform. That most certainly does represent a loss of information and a poor choice in coding. As a general rule, the coding should be in as much detail as possible, allowing the analyst rather than the coder to decide, at the time of analysis, how best to handle the codes. As it stands, your variable can only be interpreted as "known to have an online science communication platform" vs. "not known to have an online science communication platform", which weakens your analysis.

The requirement that the correlation matrix input to principal components analysis be positive semidefinite is a serious statistical concern, so the error message from factormat about the correlation matrix from tetrachoric is important.

It is likely that the correlation matrix in polychoricpca is similarly problematic. You could confirm this supposition by running polychoric rather than polychoricpca and then feeding the correlation matrix (which polychoric returns as r(R) rather than r(Rho)) into factormat as you did for the tetrachoric correlation matrix.

Code:

polychoric requirement ... sup_ed
matrix r = r(R)
factormat r, n(186)

Now, then, as Daniel said, you can specify an option to tetrachoric to have it adjust the correlation matrix to be positive semidefinite, as the documentation in help tetrachoric specifies. There does not seem to be such a capability in polychoric. That is a shame because the 0/1/2 coding scheme would require the use of polychoric, but we could not guarantee that it would be able to work.
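
One possible workaround on the matrix side, offered only as a sketch and not something tested on these data: as far as I can tell from the documentation, pcamat and factormat accept a forcepsd option that sets negative eigenvalues to 0 before the analysis. The 3 x 3 matrix and the names a, b and c below are invented, chosen so that the matrix is deliberately not positive semidefinite. This would not help with a matrix that contains missing values.

```stata
* Hand-entered "correlation" matrix (hypothetical values) that is
* deliberately not positive semidefinite
matrix C = (1, .9, -.9 \ .9, 1, .9 \ -.9, .9, 1)

* Inspect the eigenvalues; one of them is negative
matrix symeigen V lambda = C
matrix list lambda

* forcepsd modifies the matrix to be positive semidefinite first;
* names() supplies the (hypothetical) variable names
pcamat C, n(186) names(a b c) forcepsd
```

Whether an analysis of a matrix altered this way is statistically defensible is a separate question from whether the command runs.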

I also note the existence of an option to tetrachoric that adjusts for cells with a zero count, but I am uncertain of the relevance of this to the improved methodology - it seems relevant only to the original version of the estimator.

Finally, please note that I had requested you "copy from the Stata Results window everything from the tetrachoric or polychoric command through the error message and paste it into your Statalist post using CODE delimiters to ensure readability" so that "we can see everything Stata told you, which may provide clues you have overlooked in your focus on the error message". That isn't quite what you did, however. I'm not suggesting you go back and redo the post, but I would very much have appreciated seeing all the output from the polychoric and tetrachoric commands.