Factor analysis with data missing completely at random (MCAR)

Sam Richardson

Join Date: Feb 2018

Posts: 1
#1

16 Feb 2018, 15:03

I conducted a survey in which each respondent answered 10 questions randomly chosen from 20 possibilities. I would like to do a factor analysis of all 20 questions, but every observation has 10 values missing completely at random. Any help on how to approach this would be much appreciated.
Tags: factor analysis, MCAR, missing, survey
Richard Williams

Join Date: Apr 2014

Posts: 5008
#2

16 Feb 2018, 15:27

Check out

https://stats.idre.ucla.edu/stata/fa...data-in-stata/

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

16 Feb 2018, 15:28

Well, I hope somebody else can come up with a better solution than I can, because this is really just putting a bandaid on the foot in which you have shot yourself with this data-gathering design.

1. Build up a "covariance matrix" for the 20 items by looping over all pairs of questions, calculating the covariance for each pair (-corr q1 q2, cov-), and storing the result returned in r(cov_12) in the appropriate cell of the matrix. (The two variances, for the diagonal, are returned in r(Var_1) and r(Var_2).)

2. Attempt to do your factor analysis on this "covariance matrix" using the -factormat- command.

3. If #2 fails because the matrix is not positive semidefinite, use the -forcepsd- option.

This is not a bona fide factor analysis, because the "covariance matrix" is not a real covariance matrix. (And if you end up needing -forcepsd- then you will be doing your factor analysis on a corrupted version of an ersatz covariance matrix.) But if you are only planning to rely on the results for heuristic purposes, it might be usable.
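Steps 1 to 3 might look roughly like the sketch below. It is only an illustration under stated assumptions: the data are simulated, the item names q1-q20, the two-factor structure, and the sample size are all made up, and using the smallest pairwise N for n() is a judgment call rather than anything prescribed in this thread.

```stata
* Toy data (hypothetical): 20 items loading on two made-up factors
clear
set seed 2018
set obs 2000
generate long id = _n
generate double f1 = rnormal()
generate double f2 = rnormal()
forvalues j = 1/20 {
    generate double q`j' = 0.7*cond(`j' <= 10, f1, f2) + rnormal()
}
drop f1 f2

* Planned missingness: each respondent keeps 10 randomly chosen items
reshape long q, i(id) j(item)
generate double u = runiform()
bysort id (u): replace q = . if _n > 10
drop u
reshape wide q, i(id) j(item)

* Step 1: fill a 20 x 20 matrix from available-case variances and covariances
local items
forvalues j = 1/20 {
    local items `items' q`j'
}
matrix S = J(20, 20, .)
local nmin = _N
forvalues i = 1/20 {
    quietly summarize q`i'
    matrix S[`i', `i'] = r(Var)
    local next = `i' + 1
    forvalues j = `next'/20 {
        quietly correlate q`i' q`j', covariance
        matrix S[`i', `j'] = r(cov_12)
        matrix S[`j', `i'] = r(cov_12)
        * assumption: track the smallest pairwise N to pass to n()
        local nmin = min(`nmin', r(N))
    }
}
matrix rownames S = `items'
matrix colnames S = `items'

* Step 2: try the factor analysis on the assembled matrix
capture noisily factormat S, n(`nmin') factors(2)

* Step 3: fall back to forcepsd only if step 2 returned an error
if _rc {
    factormat S, n(`nmin') factors(2) forcepsd
}
```

The diagonal here uses every available response for an item, while each off-diagonal cell uses only the respondents who answered both items, which is exactly why the result is not a real covariance matrix.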

I hope somebody knows of something better than this, because this is not very good. But it's all I can think of.

(Well, not quite: I thought about trying to do confirmatory factor analysis on the data using -sem, method(mlmv)-, but I tried a few toy examples and could not get any of the estimations to run.)

Added: Crossed with #2. That looks a lot better than what I proposed.

Last edited by Clyde Schechter; 16 Feb 2018, 15:30.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#4

16 Feb 2018, 15:47

Actually, my initial inclination was very similar to Clyde's, except that for step 1 I would use the pwcorr command, which returns the pairwise correlation matrix in r(C). Since the data are missing completely at random by design, I would think this would be more or less OK. I suppose the much more complicated UCLA procedure is better, but I wonder how much better it is in this case.
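A minimal sketch of the pwcorr variant, again on simulated data. The item names, factor structure, and sample size are hypothetical, and the N of the first pair (r(N)) is used for n() purely as a rough stand-in, since the pairwise Ns differ from cell to cell. The eigenvalue check is one possible way to decide whether forcepsd is needed.

```stata
* Toy data (hypothetical): 20 items, each respondent answers 10 at random
clear
set seed 2018
set obs 2000
generate long id = _n
generate double f1 = rnormal()
generate double f2 = rnormal()
local items
forvalues j = 1/20 {
    generate double q`j' = 0.7*cond(`j' <= 10, f1, f2) + rnormal()
    local items `items' q`j'
}
drop f1 f2
reshape long q, i(id) j(item)
generate double u = runiform()
bysort id (u): replace q = . if _n > 10
drop u
reshape wide q, i(id) j(item)

* Pairwise correlations; pwcorr leaves the matrix in r(C)
quietly pwcorr `items'
local n = r(N)            // N for the first pair only (assumption: rough stand-in)
matrix R = r(C)

* Check the smallest eigenvalue to see whether R is positive semidefinite
matrix symeigen V lambda = R
scalar smallest = lambda[1, 20]
display "smallest eigenvalue of R: " smallest

if smallest < 0 {
    factormat R, n(`n') factors(2) forcepsd
}
else {
    factormat R, n(`n') factors(2)
}
```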

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

17 Feb 2018, 10:09

#2 relies on assuming that the items follow a multivariate normal distribution in order to impute the missing data (if I understood correctly). In my experience, this assumption can look questionable in real-life applications. In one case, I was using satisfaction data, which were very left-skewed (i.e., many people responded at the maximum value of each question), and this is common in analyses of similar data. I'm not sure how far from the truth a multivariate normality-based imputation method will be. I think I would go with Richard's suggestion and calculate the pairwise correlation matrix instead.

A side note: my understanding is that Mplus allows exploratory factor analysis with multiple imputation or with full information maximum likelihood estimation (which I think is equivalent to Stata's -sem- option method(mlmv), maximum likelihood with missing values).
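For reference, the Stata syntax for that kind of estimation would look something like the sketch below. This is a confirmatory two-factor model on simulated data, with hypothetical item names and a made-up factor structure. Note that #3 reports toy attempts of this kind failing to run, so convergence is not guaranteed; iterate(50) is there only to cap the number of iterations.

```stata
* Toy data (hypothetical): 20 items, each respondent answers 10 at random
clear
set seed 2018
set obs 2000
generate long id = _n
generate double f1 = rnormal()
generate double f2 = rnormal()
forvalues j = 1/20 {
    generate double q`j' = 0.7*cond(`j' <= 10, f1, f2) + rnormal()
}
drop f1 f2
reshape long q, i(id) j(item)
generate double u = runiform()
bysort id (u): replace q = . if _n > 10
drop u
reshape wide q, i(id) j(item)

* Spell out the two item blocks explicitly (assumed structure, not from the thread)
local g1
local g2
forvalues j = 1/10 {
    local k = `j' + 10
    local g1 `g1' q`j'
    local g2 `g2' q`k'
}

* CFA using all available data; mlmv assumes joint normality and MAR/MCAR
sem (F1 -> `g1') (F2 -> `g2'), method(mlmv) iterate(50)
```

Like the imputation approach, method(mlmv) rests on a multivariate normality assumption, so the concern about skewed items applies here too.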

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.