  • Help with Polychoric correlation command and multiple imputation

    I have a survey dataset containing 70 or so Likert scale items. Some of these are, unsurprisingly, missing. My boss and I would like to run Stas Kolenikov's polychoric correlation command on the data, with multiple imputation if possible. It looks like this might not be possible, but I'd like to confirm. I'm aware that the polychoric command is NOT natively supported by mi estimate. This is what I did:

    Code:
    mi impute mvn $surveyitems = age i.race i.gender ... , add(20)
    mi estimate: polychoric $surveyitems, pw

    Multiple-imputation estimates     Imputations     =        20
                                      Number of obs   =       530
                                      Average RVI     =    0.0000
                                      Largest FMI     =    0.0000
    DF adjustment:   Large sample     DF:     min     =         .
                                              avg     =         .
                                              max     =         .
    Within VCE type:       Robust     F(   0,      .) =         .
                                      Prob > F        =         .

    ------------------------------------------------------------------------------
    __000005 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    _cons | .4450337 .0502578 8.86 0.000 .3465303 .5435371
    ------------------------------------------------------------------------------

    . return list

    scalars:
    r(level) = 95

    matrices:
    r(table) : 9 x 1

    Needless to say, I do get a full matrix (43x43 in this case) when I run polychoric without MI. Did I miss anything? Is there an alternative polychoric command which supports MI?

    Thanks for any help you can provide.
    Weiwen Ng, MPH

    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

  • #2
    If you could provide a bit more information about the overall goal, people may be able to suggest alternative methods that would get you to an equally desirable result.



    • #3
      Sure. Overall goal is to determine the pairwise correlations between the survey items, using imputation if possible. I've got 70 items on a 4-point Likert scale. I've already calculated Spearman's correlation and the polychoric correlations, both without imputation, both using pairwise deletion for missing responses (i.e. not listwise deletion).


      • #4
        I would assume you might be able to get something relatively close using -gsem- using the ordinal family and logit link functions. It won't be exact, but you should be able to set things up to freely estimate the covariances between the variables and can get correlations by standardizing them. I don't remember off the top of my head if -gsem- uses full information methods, but it should provide you with something.



        • #5
          You can get the polychoric correlation coefficient using gsem. Rather than the logit link function, use the probit link function. Syntax is
          Code:
          gsem (x1@1 x2@1 <- F), oprobit
          nlcom rho:_b[var(F):_cons] / (1 + _b[var(F):_cons])
          I've attached two do-files and their log files. The first pair (PCCfxpc.do and .smcl) shows that the values obtained from the two official Stata commands above give the same polychoric correlation coefficient (and standard error) as John Uebersax's xpc.exe across a range of correlations.

          The second pair (PCMissing.do and .smcl) show that gsem uses all available data—if a value is available for one of the two variables, then it is used, just as the user's manual entry says.

          So, Weiwen can use these commands to get polychoric correlation coefficients without pairwise deletion. With 70 variables, though, Weiwen will probably need to adjust the matrix of pairwise polychoric correlation coefficients because it is likely to be nonpositive definite.
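          For readers who run into the non-positive-definite problem, one common repair is to clip negative eigenvalues and rescale. Below is a minimal, untested Mata sketch; it assumes the pairwise polychoric matrix has already been stored in a Stata matrix named R, and the 1e-6 floor is an arbitrary choice:

```stata
* Sketch only: force a correlation matrix to be positive definite by
* flooring its eigenvalues and restoring a unit diagonal.
mata:
C = st_matrix("R")                       // pairwise polychoric matrix (assumed)
symeigensystem(C, X=., L=.)              // X: eigenvectors, L: eigenvalues (row vector)
L = colmax(L \ J(1, cols(L), 1e-6))      // floor eigenvalues at a small positive value
C = X * diag(L) * X'                     // rebuild the matrix
D = diag(1 :/ sqrt(diagonal(C)))         // rescale so the diagonal is exactly 1
st_matrix("Rfixed", D * C * D)
end
```

          The repaired matrix Rfixed can then be passed to -factormat-. This is only one of several possible adjustments; shrinkage toward the identity is another.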



          • #6
            I am looking to create a polychoric correlation matrix in order to set up a factor analysis using dichotomous and continuous variables in a non-probability weighted dataset. I have missing data of about 20% in one of the variables. Is anyone aware of syntax that could be used to complete multiple imputation and then create the polychoric matrix?

            Thank you. Any help is much appreciated.



            • #7
              Originally posted by Sarah Williamson View Post
              I am looking to create a polychoric correlation matrix in order to set up a factor analysis using dichotomous and continuous variables in a non-probability weighted dataset. I have missing data of about 20% in one of the variables. Is anyone aware of syntax that could be used to complete multiple imputation and then create the polychoric matrix?

              Thank you. Any help is much appreciated.
              I'm picking up this topic after a bit of inattention.

              Sarah, in general, you can do multiple imputation this way:

               Code:
               * every question with missing values goes here; explanatory
               * variables with missing values can go here as well
               global imputed_vars q1 q2 q3 ....
               * all variables with NO missing values go here
               global complete_vars ...
               mi set flong
               mi register imputed $imputed_vars
               mi register regular $complete_vars
               mi impute chained (ologit) $imputed_vars = $complete_vars, add(20)

              It's what comes after that that's less clear to me. I assume Sarah's goal is to do some sort of factor analysis, and to handle the missing data in a principled way. One answer I found is below, and the person giving that answer said to either average the polychoric correlation matrices from each imputed dataset, or to do some transformation of those correlations, and then average them:

              http://stackoverflow.com/questions/2...mplex-sample-d
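               If one goes the averaging route, a rough, untested sketch follows. Item names q1-q3 and the 20 imputations are illustrative; it assumes the data are -mi set flong-, that Stas Kolenikov's -polychoric- is installed, and that it leaves its matrix in r(R):

```stata
* Sketch only: pool polychoric correlation matrices across imputations
* by averaging on the Fisher-z scale, then back-transforming.
local M 20
matrix Z = J(3, 3, 0)
forvalues m = 1/`M' {
    preserve
    keep if _mi_m == `m'                 // the m-th completed dataset
    polychoric q1 q2 q3
    mata: A = st_matrix("r(R)"); _diag(A, 0)           // zero the diagonal
    mata: st_matrix("Z", st_matrix("Z") + atanh(A))    // accumulate Fisher z
    restore
}
mata: R = tanh(st_matrix("Z") / `M'); _diag(R, 1); st_matrix("Rbar", R)
matrix list Rbar                         // pooled correlation matrix
```

               Averaging on the z scale rather than the raw correlations is the transformation the linked answer alludes to; it behaves better for correlations near the boundaries.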

              Now, it's not clear to me why you couldn't just do:

               Code:
               polychoric q1 q2 q3 ...
               matrix x = r(R)
               factormat x, n(#) factors(m)   // n() takes the N reported by polychoric

               That would basically calculate a polychoric correlation matrix using every imputed observation (sort of like averaging the correlations), which you would then pass to the factor analysis.

              In general, I have been searching for a principled way to handle missing data in factor analysis. I have not found an answer that is both satisfactory to me and that I can understand.

              For the record, I may have not answered W Buchanan's question adequately. The overall goal of the project was to get an exploratory factor analysis of a 70-item survey measuring experience of care (a bit like satisfaction) for people who visited a nursing facility for rehab before going home. We had a lot of missing data. Each respondent left missing an average of 3 questions, but only 1/3 of our cases were complete. I could factor analyze the complete cases, but I'm sure we all can see this is problematic.

              I had been hoping to either do MI then factor analysis, or to do MI, then calculate a polychoric correlation matrix, then pass that to factormat (i.e. factor analysis on a variance/covariance matrix or correlation matrix).

               My professor ended up doing the analysis in Mplus, which is the go-to software for this purpose. Mplus supports full information maximum likelihood estimation for factor analysis, which constructs a likelihood function for all observations, including those with missing data. The thing is, from what I am reading, maximum likelihood estimation doesn't appear to be ideal for ordinal data (which is what we have). In fact, there seem to be some people who say that you should not run ordinal or binary questions through factor analysis at all, and that you want continuous data. Moreover, as I'm going through the Mplus output, I'm not sure if my professor used FIML or WLSMV (weighted least squares, mean- and variance-adjusted). WLSMV is said to support ordinal response variables better than FIML, but it essentially uses the equivalent of pairwise deletion (i.e. it estimates the responses based on any variables that are pairwise complete, which is a considerable step up from listwise deletion and may be good enough).

              That was part of the motivation to do the polychoric correlation matrix on imputed data, as polychoric correlation assumes that the underlying trait is multivariate normal and calculates its correlation based on that assumption. Except it's not clear to me that there is an accepted method to do this. The explanation of what to do and how to do it is a) difficult to implement for most Stata programmers and b) unclear as to the theoretical justification.

               You could say, well, you should have designed your questionnaire to produce fewer missing cases in the first place. I agree. But I am stuck with the questionnaire I have. We are trying to measure a complex, multidimensional phenomenon. Some missing information is inevitable, and I think the field would benefit immensely from some clear guidance on how to handle it.

              PS Joseph, thanks for the input, but to be honest, if the matrix of pairwise correlations is likely to not be positive definite, I have no clue what that means, and I have no clue how to correct that. Moreover, if your GSEM approach is essentially doing pairwise estimation, then why not just use polychoric ... , pw ? I tried this on my data and I was able to extract roughly the same factors my professor did, except that again, it's not quite the same as imputation.


              • #8
                At the risk of sounding nihilistic, I am skeptical of the entire project. Without judging anyone for trying to administer a 70-item questionnaire to people leaving a rehab nursing facility, do bear in mind that even if you could get a FIML estimator for factor analysis or induce MI to support it, both of those rely on the assumption of missingness at random. This context sounds to me like one where you can't say missing at random with a straight face. People leaving a rehab nursing facility will have varying degrees of physical and mental stamina. Those at the short end of that scale will likely skip more questions, or just stop the survey early. Now if the questionnaire is about something that would have nothing to do with their health, nor with their experience at the facility (say, a questionnaire asking their opinion about deregulation of the electricity markets), I suppose you could argue that missingness at random is plausible. But if, as I suspect, the items pertain to health or the facility experience, it seems very likely that there will be strong associations between the values of the missing responses and missingness itself. Unless you can break those associations by conditioning on other observables, missingness at random sounds like a huge stretch, at best.

                It's important to remember that MI and FIML are not magic; they are not oracles. They substitute a model built on a highly restrictive assumption for the missing data: they do not create information ex nihilo. I realize that these techniques are all the rage these days, but when used inappropriately they create only an illusion. Unless you can provide a persuasive case that missingness at random is a credible assumption in your context, I don't think that using these techniques makes your analysis any more credible than just analyzing the complete cases.
                Last edited by Clyde Schechter; 25 Nov 2016, 19:26.



                • #9
                  Although everything that Clyde says is in general true and serves as a reminder that you should not apply these methods mechanically, I would not be as pessimistic in this (i.e. Weiwen Ng's) case. Think about it, you have 70 items presumably measuring a few related latent traits (which is what you would like to get from the factor analysis). If there are on average 3 missing answers, I would argue that you could probably get pretty close to missing at random by conditioning on the other 67 (presumably closely related item answers).

                  Best
                  Daniel



                  • #10
                    Daniel makes a good point.



                    • #11
                      Originally posted by Clyde Schechter View Post

                      It's important to remember that MI and FIML are not magic; they are not oracles. They substitute a model built on a highly restrictive assumption for the missing data: they do not create information ex nihilo. I realize that these techniques are all the rage these days, but when used inappropriately they create only an illusion. Unless you can provide a persuasive case that missingness at random is a credible assumption in your context, I don't think that using these techniques makes your analysis any more credible than just analyzing the complete cases.
                      Some context. The questionnaire was mailed out to respondents a while after they left. We got a response rate of about 50%, which I understand is actually quite good for a mailed survey. Some people did receive a telephone follow up if they didn't mail the survey back by a certain time. There are actually 2 versions of the survey, a 70-question one, and a 44-question one. The 44-question survey is nested in the longer survey.

                      In response to Daniel, while there are 3 missing answers on average, my problem is that only about a third of the cases are complete. I am not sure I'm willing to lose that much information to do an EFA.

                      In response to Clyde:

                      On the one hand, I'm skeptical of the project as well. It's not just the high rates of missing items; it's the low survey response rate. Response rates are even worse for many other mailed surveys, like the Consumer Assessment of Health Plans (and providers) surveys in my context.

                      On the other hand, I would also be unwilling to just tell them to go redo the entire survey without any indication of whether or not the items load on the factors they designed the survey for.

                      Moreover, if you say that a 50% response rate is too low, then you surely also know that mailed survey response rates in general are declining, and people are also less and less willing to answer their landlines, and that the same is maybe even more true for cellphones plus it's harder to get a cellphone sampling frame. The state could conduct face to face exit interviews before they are discharged, I guess, but this would make for very challenging logistics. Whatever the problems with bias due to item and person non-response right now, they will simply continue to get worse and worse over time. We can move to online surveys, eventually, and thus worsen any sort of response bias.

                      Now, perhaps we should just skip regular measurement of facility performance entirely, if our methods for getting the information are so terrible. We can already get non-person centered measures of performance from administrative data. For example, I could calculate a facility's rate of discharge to the community, their rate of readmission to hospital, their rates of immunizations, complications, infections, etc. However, none of those measures actually get at the patient's experience of how good their healthcare (in this case, the quality of rehab care) was. This is problematic. Then again, if people aren't willing to respond to surveys, maybe it's not our problem.

                      So ... where does that leave me? I know what MAR is, and I know that the conclusions drawn from MI or FIML are dependent on the MAR assumption being not too badly off. Most people aren't missing too much information, at least; on average, I believe I have about 6% missing information, i.e. a total of 6% out of the (44 short form questions * number of total respondents) is missing. I'm willing to make some preliminary conclusions based on that, plus we measured a number of variables that we know are predictive of nonresponse and that could also be predictive of experience of care (and these are administrative data variables, so much less missing information). If anybody can present me a reasonable alternative, then I'm willing to hear it. When I get my PhD, I promise to try to push the survey firm to design the questionnaire better, at least. If all that anybody can tell me is that FIML doesn't create information out of nothing, then thanks, but I knew that before.


                      • #12
                        Weiwen, I'm sorry if my earlier post was condescending in tone. That wasn't my intent. I work with a lot of collaborators with varying degrees of understanding of statistical principles. Nowadays it seems like everybody is trying to use multiple imputation with missing data, even in contexts where the MAR assumption has no credibility at all. And I have met several researchers who really did think that MI creates information out of nothing. So it has become one of my pet peeves, particularly because it seems to me that some journals are mindlessly insisting on its use with all missing data. It struck me that this might be one of those situations, and so I went off on my rant against it. My apologies for not considering that you might actually have given the matter some thought. On top of all that, as Daniel Klein points out, I missed the fact that with so many items and so few missing on average, if the items do reasonably reflect a small number of underlying latent traits, conditioning on the other items would in fact produce reasonable missingness at random, so I was wrong about the substantive point to boot.

                        Even though there are a large number of observations that have missing data, if the number of missing items per observation is, as you state, small, the MAR assumption would, indeed, be reasonable when conditioning on the remaining items. So I think you are OK. Although there are many slightly incomplete cases, that shouldn't be a problem. At worst you may have to use a much larger number of imputations than the 20 you have generated so far. Even though -mi estimate- doesn't support factor analysis, you could get the equivalent using -gsem- with a probit link. Here's a toy example:

                        Code:
                        clear
                        webuse mhouses1993s30.dta
                        mi estimate, cmdok: gsem (F -> ne custom corner, probit)
                        I certainly understand the difficulties of getting good survey response rates. And nowadays 50% is quite good. And 6% item non-response is also excellent. So no criticisms there--I didn't elaborate on that in my response, but I did say that my comments were "without judging" the use of a very long survey.

                        I hope this response is helpful.



                        • #13
                          Weiwen Ng if there is already a "short form" that would suggest that the underlying structure of the data is known (e.g., should be able to constrain a lot of paths instead of freely estimating all of the paths). Were the additional 26 items added to measure other constructs, or in an attempt to measure existing constructs with greater precision? Ideally, the designer should be using IRT to maximize the information functions while minimizing the number of items needed to estimate the parameters reliably.

                          Another consideration to make with regard to imputing the data is whether or not your imputation model adequately models the missing data generating process. So were there any items that seemed to have more or less missingness? Were the missing items all clustered towards the end of the form (if so, you could fit a 4PL IRT model under the assumption that a test fatigue effect imposed a ceiling on the upper asymptote of the response probability)? Has anyone checked the item stems to make sure the wording didn't cause confusion or have any prima facie issues with bias? There are a lot of reasons that people may not have responded to things, and getting some understanding of why would help to fit an imputation model that better models that missingness process.

                          WLS is the equivalent of using the asymptotically distribution free estimator in Stata; can't remember off the top of my head whether or not there is an option available to specify anything about missing values. WLS is, however, the better estimator to retrieve the parameters when working with nominal/ordinal scale data.



                          • #14
                            Weiwen Ng I have another thought about doing a factor analysis following MI in your data. The advice I gave in #12 would get you a confirmatory factor analysis, but you want an exploratory one. I haven't tried this out because I don't have a suitable data set to experiment with, but I think you could use David Roodman's -cmp- command (-ssc install cmp-) to calculate the polychoric correlation matrix for your data under -mi estimate, cmdok:-. The result would not be a conventional correlation matrix but would be found in e(). It shouldn't take more than a few matrix commands to convert that into a polychoric correlation matrix which you could then feed to -factormat-. This is similar to your original approach, assuming that -cmp- works nicely with -mi estimate, cmdok:-.

                            I don't know about the theoretical underpinnings of an exploratory factor analysis obtained in this way.
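                            To make the idea concrete, here is an untested sketch for a single pair of items. The item names q1 and q2 are placeholders, and the parameter name atanhrho_12 is my guess at how -cmp- labels the pairwise correlation parameter; it may differ by version:

```stata
* Sketch only: pairwise polychoric correlation via -cmp- under MI.
* Assumes -cmp- from SSC and that the data are already mi set and imputed.
mi estimate, cmdok: cmp (q1 = ) (q2 = ), indicators($cmp_oprobit $cmp_oprobit)
* -cmp- estimates the correlation on the inverse-hyperbolic-tangent scale,
* so the polychoric correlation is tanh of the pooled coefficient.
display tanh(_b[atanhrho_12:_cons])
```

                            Looping that over all pairs of the 70 items and collecting the results into a matrix would reproduce the pairwise polychoric matrix, now pooled across imputations.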



                            • #15
                              Originally posted by wbuchanan View Post
                              Weiwen Ng if there is already a "short form" that would suggest that the underlying structure of the data is known (e.g., should be able to constrain a lot of paths instead of freely estimating all of the paths). Were the additional 26 items added to measure other constructs, or in an attempt to measure existing constructs with greater precision? Ideally, the designer should be using IRT to maximize the information functions while minimizing the number of items needed to estimate the parameters reliably.

                              Another consideration to make with regard to imputing the data is whether or not your imputation model adequately models the missing data generating process. So were there any items that seemed to have more or less missingness? Were the missing items all clustered towards the end of the form (if so, you could fit a 4PL IRT model under the assumption that a test fatigue effect imposed a ceiling on the upper asymptote of the response probability)? Has anyone checked the item stems to make sure the wording didn't cause confusion or have any prima facie issues with bias? There are a lot of reasons that people may not have responded to things, and getting some understanding of why would help to fit an imputation model that better models that missingness process.

                              WLS is the equivalent of using the asymptotically distribution free estimator in Stata; can't remember off the top of my head whether or not there is an option available to specify anything about missing values. WLS is, however, the better estimator to retrieve the parameters when working with nominal/ordinal scale data.
                              W, thanks for the reply.

                              Substantive issues first:

                              The additional 26 items were added to measure the same 8 constructs measured in the short form.

                              In healthcare research, the impression I am getting is that a lot of researchers are not that technically savvy about the statistical methodology, and that they don't always think through the theoretical definition of the construct they're trying to measure. That's the case here. The survey firm was based in a different state, so I'm not in deep conversation with them. I think they used focus groups to derive the 8 constructs, but I don't know that they did a lot of theoretical work on why these are constructs that should matter. I very much doubt they used IRT. We do seem to operate more in a classical test theory framework. Actually, I'm not sure if we even do that. In any case, at least in my program, we don't appear to have a lot of measurement researchers. They are probably all in education or psychology.

                              There were indeed some items that had higher missingness levels, but they did not appear to be towards the end of the form.

                              I agree that we should have conducted cognitive interviews with actual patients in a pilot test. I doubt that anyone did this. In looking at the survey form with a different professor, he said that to his eyes (he has a PhD in survey research), the form, the layout of some of the questions, and the wording of some other questions did look like they could cause confusion.

                              Technical issues:
                              Stata's SEM intro, chapter 4, states that the sem command will use listwise deletion unless you specify MLMV (which I think is equivalent to FIML) as the estimation method. That said, MLMV appears to drop an observation only if all of its variables are missing. GSEM uses equation-wise deletion: for each individual equation estimated, it drops observations with missing values in that equation alone.

                              http://www.stata.com/manuals13/semintro4.pdf#semintro4

                              Hence, it seems like SEM with ADF estimation would not be the best option for the data as they currently exist. It seems like some sort of SEM on the MI dataset might not be a crazy option (either SEM + ADF, or GSEM with ordinal probit or logit).
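                              For comparison, a minimal sketch of the MLMV route on the original (unimputed) data, with hypothetical item names q1-q5; note that this treats the 4-point Likert items as continuous, which is itself an approximation:

```stata
* Sketch only: FIML-style one-factor model on the original data.
* method(mlmv) uses all available information under MAR, but models the
* ordinal items as continuous; q1-q5 are placeholder item names.
sem (F -> q1 q2 q3 q4 q5), method(mlmv)
estat gof, stats(all)
```

                              The GSEM ordinal-probit route in #12 avoids the continuity assumption at the cost of working through MI rather than FIML.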

                              Meta issues:
                              This discussion has helped me a lot, actually. Craig, I clearly sounded a bit frustrated - I do apologize. It's hard to judge tone and intent in text over the internet. I've certainly learned a few things to look out for the next time someone purports to have some sort of instrument on which they've done EFA/CFA.
