Generating a socioeconomic score using PCA (tetrachoric)

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#1

Generating a socioeconomic score using PCA (tetrachoric)

27 May 2016, 14:50

Hello Statalisters,

I am trying to create a final socioeconomic (SE) measure (binary) out of multiple binary socioeconomic indicators (occupation, participant education, crowding in house, presence or absence of window, drinking water, material of wall, etc.). I guess PCA is the way to go rather than factor analysis, as I am trying to summarize these variables into a single SE measure. Am I correct? I understand that since my variables are binary (and I have already fixed which indicators to use based on descriptive analysis), I cannot run the PCA directly but first have to produce a polychoric (tetrachoric, to be precise) correlation matrix. The steps that I need to undertake are: 1) getting the tetrachoric correlation matrix, 2) using this matrix to get the components, 3) rotating, 4) deciding how many components to use, 5) getting the score for the component(s) using predict, 6) dichotomizing the predicted score to get the final binary SE measure (I will be using this binary measure in other analyses). Please correct me if there is anything wrong in these steps.

Getting into the analysis, I am able to perform a straightforward pca in Stata 13, but I am totally confused by the available commands (-polychoric-, -polychoricpca-, -tetrachoric-, -pcamat-) and by how to proceed after creating the matrix. Example code of what I tried with the -tetrachoric- command is

Code:

tetrachoric Occup crowd water window wall edu
matrix C=r(corr)
pcamat C, n(102) // 102 observations in the sample data set
rotate, varimax
predict pc1 pc2

1) Is this the way to go?
2) I used varimax here but I have also seen quartimin and promax rotations being used for creating final SE scores. How can I decide which one to use in my case?
3) How will the whole scenario differ if one or two indicators are ordinal, categorical variables?

Given below is an example data set produced by -dataex-.

Thank you

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Occup edu crowd) byte(wall window water)
1 0 1 1 1 1
1 1 1 1 1 1
0 0 0 0 0 0
1 1 0 1 0 0
1 1 1 0 0 0
0 0 0 0 0 0
1 1 1 1 1 1
1 0 0 1 1 0
0 0 1 0 0 0
1 1 1 1 1 1
0 0 0 0 0 0
1 0 1 0 0 0
1 0 0 0 0 0
1 1 0 1 1 1
0 0 0 0 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 0 0 0 0 1
0 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 0
1 1 1 1 1 1
0 1 1 1 1 0
1 0 1 0 0 0
1 1 0 1 0 0
1 0 0 1 0 1
1 1 1 1 1 1
1 0 1 1 1 1
0 0 0 1 0 0
0 0 0 0 0 0
1 1 1 1 1 1
0 0 0 0 0 0
0 0 0 0 0 1
1 1 1 1 1 1
1 0 0 0 0 0
1 1 0 1 1 1
1 0 0 0 0 1
1 1 1 1 1 1
1 1 1 1 1 1
0 0 0 0 0 0
1 1 1 1 1 1
1 1 1 1 1 1
1 0 0 1 1 1
1 1 0 1 1 1
1 1 1 1 1 1
1 0 1 1 1 1
1 0 1 1 0 1
1 1 1 1 1 1
1 1 0 1 1 1
1 1 1 1 1 1
1 1 1 1 1 0
1 1 0 1 0 1
1 1 1 1 1 1
1 1 1 1 1 0
0 0 0 1 0 1
1 1 0 1 0 0
1 1 1 1 1 1
1 1 0 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 0 1 1 1 1
0 0 0 0 0 0
1 1 1 1 1 1
0 0 0 0 0 0
1 0 0 0 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
0 0 0 0 0 1
1 1 1 0 0 1
0 0 0 0 0 0
1 1 0 1 1 1
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 1 0 0
1 1 1 1 1 0
0 0 0 0 0 0
1 0 0 1 1 1
1 0 0 0 0 0
1 0 1 1 1 0
1 0 1 1 0 0
0 1 0 1 0 1
0 0 1 1 0 1
1 1 1 1 1 1
1 0 0 0 0 1
1 1 1 1 1 1
1 1 1 1 1 0
1 1 1 1 1 1
1 1 1 1 0 0
1 1 0 1 1 1
0 0 1 0 0 0
1 1 0 1 1 0
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 0
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 0
1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
end
label values edu Edu
label def Edu 0 "high", modify
label def Edu 1 "low", modify
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#2

28 May 2016, 08:46

If I use the -predict- command as given in the code in #1, does it predict scores using maximum likelihood?
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#3

29 May 2016, 15:03

Hello Statalisters,

I am trying to predict a score from a CFA on binary variables using tetrachoric correlations. I see it has been advised on the forum not to use predicted scores in subsequent regression models, but rather to go ahead with SEM. But my goal is to use my score in an inverse probability weighted regression model following counterfactual theory, hence the requirement. Below is the code for what I have done so far.

//CFA on binary variables

Code:

tetrachoric crowd wall clock water , pos
clear
ssd init crowd wall clock water
ssd set obs 200
ssd set cor 1.0000 \ ///
    0.4791 1.0000 \ ///
    0.3843 0.6650 1.0000 \ ///
    0.3618 0.6894 0.4505 1.0000
sem (F1-> crowd wall clock water)
estat gof, stat(all)
predict F1, latent

This gave me an error saying "predict not possible with summary statistics data". I checked the help for -predict- after -sem-, and it does indeed say that predict may not be used with summary statistics data. So how can I get a predicted score from this CFA? Or will the predicted score from the following EFA code serve the same purpose?

Code:

tetrachoric crowd wall clock water
matrix Rho = r(Rho)
sca nobs=r(N)
factormat Rho, n(`=nobs') pcf
fapara, pca reps(2000)
/* The parallel analysis shows that I should extract only 1 factor. */
factormat Rho, n(`=nobs') ipf factors(1) blank (0.5)
rotate, promax sortl
predict F1

Any help will be appreciated.
Thanks
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#4

29 May 2016, 20:36

I'm not sure how what you're trying to do will lead to an inverse probability-weighted regression model, but why not just

Code:

gsem (crowd wall clock water <- F, probit)
predict double F, latent
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#5

30 May 2016, 17:43

Hello Joseph,
Thank you very much for your response. But I tried the following code and I am still getting an error message,
gsem not allowed with summary statistic data
r(111)

Code:

tetrachoric crowd wall clock water , pos
clear
ssd init crowd wall clock water
ssd set obs 200
ssd set cor 1.0000 \ ///
    0.4791 1.0000 \ ///
    0.3843 0.6650 1.0000 \ ///
    0.3618 0.6894 0.4505 1.0000
gsem (crowd wall clock water <- F, probit)
predict double F, latent

Just to clarify, I didn't mean that this analysis will lead to IP weighting. What I meant is that the score predicted from this model will be dichotomized and used as the outcome in the exposure model for creating IP weights. I mentioned this only to explain why I need a predicted score from the factor analysis.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#6

30 May 2016, 18:22

I think that Joseph Coveney, in #4, meant for you to apply his code to your original data, not to the tetrachoric correlation matrix. What he has provided you with is the code to do confirmatory factor analysis on those variables. If you apply it with the original data, -predict- should work after that. The probit link specified in the -gsem- command will adequately account for the fact that your indicator variables are dichotomous. In fact, it is the equivalent of using tetrachoric correlations: it is estimating based on latent normally distributed variables underlying the dichotomies.
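
Putting #4 together with the dichotomizing step described in #5, a minimal sketch might look like the following. It assumes the raw 0/1 variables (not the summary statistics data) are in memory; the median cutpoint and the names F_hat and se_high are illustrative assumptions, not something specified in this thread.

Code:

* One-factor model fitted to the raw binary indicators, as in #4
gsem (crowd wall clock water <- F, probit)

* Predicted values of the latent variable, stored under a new name
predict double F_hat, latent

* Split at the sample median (an assumed cutpoint, chosen only for illustration)
summarize F_hat, detail
generate byte se_high = F_hat > r(p50) if !missing(F_hat)

Which side of the cutpoint counts as "high" depends on how the indicators are coded, so it is worth checking the sign of the estimated loadings before interpreting se_high.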
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#7

30 May 2016, 18:55

Thank you very much Clyde, your clarification fixed the issue. I have one more question. What (absolute) fit statistics can I use to assess the fit of my gsem model, and what command should I use to get them? I guess we cannot get CFI and RMSEA with -gsem-.
Comment
Juliana Jacobowiski

Join Date: Jun 2016

Posts: 1
#8

27 Jun 2016, 11:47

Hello, Thekke!

Well, Google drove me to your post, and I'm very interested in whether what you did in Stata was successful, because I'm having the same issue here. But in my case I have a questionnaire filled with binary items (present or absent). It's divided into issues, like "Financial Planning", "Financial Controlling" and so on, where each issue consists of a few questions.

I'd like to turn each of these issues into a kind of "indicator" and test each question as a variable, so I'll be able to see which one is most relevant in building the issue. My question is how to do this in Stata after creating the tetrachoric matrix, so that the PCA takes its input from the tetrachoric matrix instead of the Pearson one. I did it the way you did, but I was a bit uncertain about that.

Thanks!
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#9

01 Jul 2016, 16:58

Hi Juliana,
The above worked for me, but in the end I did not use FA as described above because of some theoretical issues. I did a PCA. In my case, I had many binary indicators, so I created a tetrachoric correlation matrix from the indicators and then checked for highly correlated variables. If two variables were highly correlated, I removed one and retained the other, as one of them might be redundant. I also checked the factorability of the matrix using KMO. Next I ran the PCA on the final correlation matrix to get the components. My aim was to reduce the indicators to components and extract the component that explained the most variance. Hope this helps.

Code:

factortest a b c d e f g // gives the factorability test results
tetrachoric a b c d e f g // gives the tetrachoric correlation matrix
matrix C=r(Rho) // stores the matrix in C (-tetrachoric- returns it as r(Rho))
pcamat C, n(N) comp(1) blank(0.3) // runs PCA on the stored matrix; specify your sample size in place of N
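
If a score for each observation is needed afterwards, one possible next step is sketched below. It assumes that the indicator variables a to g are still in memory after -pcamat- (the use of -predict- after -pcamat- in #1 suggests this is how it was done); pc1 is an illustrative variable name.

Code:

* Score on the first principal component (requires the indicators in memory)
predict pc1, score

* Quick look at the distribution of the score before using it further
summarize pc1
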
Comment
Gina Allen

Join Date: Oct 2016

Posts: 2
#10

28 Oct 2016, 14:08

Google also brought me here (long time reader, first time poster). I would like to use a polychoric correlation matrix in a sem model for confirmatory factor analysis. As far as I can tell, I need to use the ssd commands to do this, and I can do this fine, thanks to the help here. However, I want the factor scores for use in subsequent regression models and understand I cannot use -predict- after ssd. Results are practically the same using the standard covariance structure in CFA with SEM, but my observed data are ordinal scales, so it seems polychoric is "better". Are there any workarounds for this issue, or any other suggestions? I appreciate any help or advice on using a polychoric correlation matrix for the measurement model component and subsequent regression models.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#11

28 Oct 2016, 15:29

Rather than using polychoric correlation matrices to do a factor analysis, why not do the factor analysis directly from your data using -gsem- with the -ologit- link?
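
A minimal sketch of that suggestion, assuming four ordinal items with the hypothetical names q1 to q4 held in memory as raw data (not as summary statistics data):

Code:

* One-factor measurement model for ordinal indicators (q1-q4 are hypothetical names)
gsem (q1 q2 q3 q4 <- F, ologit)

* Predicted values of the latent variable, available because raw data are used
predict double F_hat, latent
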
Comment
Dick Campbell

Join Date: Apr 2014

Posts: 279
#12

29 Oct 2016, 15:50

There is an argument to be made that a CFA or PCA is not the best idea in the first place. Such models assume that there are one or more latent variables which are the cause of the observed associations among the indicators. An alternative point of view is that SES is caused by the indicators. A 2001 Annual Review of Sociology paper by Ken Bollen and others makes this point in detail and in the context of research in developing countries. You might find it worth your time to read. See Bollen et al.: Socioeconomic Status and Class in Studies of Fertility and Health in Developing Countries, Annu. Rev. Sociol. 2001. 27:153–85.

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Comment
Gina Allen

Join Date: Oct 2016

Posts: 2
#13

31 Oct 2016, 06:33

Thanks Clyde. I will look into gsem. The ologit link option makes conceptual sense (in lieu of predict).
Comment
Jaishri Dutt

Join Date: Sep 2021

Posts: 4
#14

12 Oct 2021, 04:58

Originally posted by Thekke Purakkal

If I use the -predict- command as given in the code in #1, does it predict scores using maximum likelihood?

Did you not get any error code saying Matrix C has missing values?
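
One possible cause, offered as a guess rather than a confirmed diagnosis: -tetrachoric- leaves its correlation matrix in r(Rho) (as used in #3), so r(corr) in #1 does not point to a stored result and C may end up holding a missing value. A minimal sketch using the variables from #1 follows; the -posdef- option and taking the sample size from r(N) are additions that are not in the original code.

Code:

tetrachoric Occup crowd water window wall edu, posdef
local N = r(N)         // number of observations used by -tetrachoric-
matrix C = r(Rho)      // correlation matrix as stored by -tetrachoric-
matrix list C          // check for missing entries before running -pcamat-
pcamat C, n(`N')
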
Comment
