  • Generating a socioeconomic score using PCA (tetrachoric)

    Hello Statalisters,

    I am trying to create a final binary socioeconomic (SE) measure out of multiple binary socioeconomic indicators (occupation, participant education, crowding in house, presence or absence of window, drinking water, material of wall, etc.). I guess PCA is the way to go rather than factor analysis, as I am trying to summarize these variables into a single SE measure. Am I correct? I have learned that since my variables are binary (and I have predetermined and fixed which indicators to use from descriptive analysis), I cannot do this in a straightforward way but first have to obtain a polychoric correlation matrix (a tetrachoric correlation matrix, to be precise). The steps that I need to undertake are:
    1) getting the tetrachoric correlation matrix,
    2) using this matrix to get the components,
    3) rotating,
    4) deciding how many components to use,
    5) getting the score for the component(s) using predict,
    6) dichotomizing the predicted score to get the final binary SE measure (I will be using this binary measure in other analyses).
    Please correct me if there is anything wrong in these steps.

    Getting into the analysis, I am able to perform a straightforward PCA in Stata 13, but I am totally confused about which command to use (-polychoric-, -polychoricpca-, -tetrachoric-, -pcamat-) and how to proceed after creating the matrix. Example code of what I tried with the -tetrachoric- command is


    Code:
    tetrachoric Occup crowd water window wall edu
    matrix C=r(corr)
    pcamat C, n(102) // 102 observations in the sample data set
    rotate, varimax
    predict pc1 pc2

    1) Is this the way to go?
    2) I used varimax here but I have also seen quartimin and promax rotations being used for creating final SE scores. How can I decide which one to use in my case?
    3) How will the whole scenario differ if one or two indicators are ordinal, categorical variables?

    Given below is an example data set produced by -dataex-.

    Thank you

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(Occup edu crowd) byte(wall window water)
    1 0 1 1 1 1
    1 1 1 1 1 1
    0 0 0 0 0 0
    1 1 0 1 0 0
    1 1 1 0 0 0
    0 0 0 0 0 0
    1 1 1 1 1 1
    1 0 0 1 1 0
    0 0 1 0 0 0
    1 1 1 1 1 1
    0 0 0 0 0 0
    1 0 1 0 0 0
    1 0 0 0 0 0
    1 1 0 1 1 1
    0 0 0 0 0 0
    1 0 0 0 0 0
    1 1 1 1 1 1
    1 0 0 0 0 1
    0 1 1 1 1 1
    1 1 1 1 1 1
    1 1 1 1 1 0
    1 1 1 1 1 1
    0 1 1 1 1 0
    1 0 1 0 0 0
    1 1 0 1 0 0
    1 0 0 1 0 1
    1 1 1 1 1 1
    1 0 1 1 1 1
    0 0 0 1 0 0
    0 0 0 0 0 0
    1 1 1 1 1 1
    0 0 0 0 0 0
    0 0 0 0 0 1
    1 1 1 1 1 1
    1 0 0 0 0 0
    1 1 0 1 1 1
    1 0 0 0 0 1
    1 1 1 1 1 1
    1 1 1 1 1 1
    0 0 0 0 0 0
    1 1 1 1 1 1
    1 1 1 1 1 1
    1 0 0 1 1 1
    1 1 0 1 1 1
    1 1 1 1 1 1
    1 0 1 1 1 1
    1 0 1 1 0 1
    1 1 1 1 1 1
    1 1 0 1 1 1
    1 1 1 1 1 1
    1 1 1 1 1 0
    1 1 0 1 0 1
    1 1 1 1 1 1
    1 1 1 1 1 0
    0 0 0 1 0 1
    1 1 0 1 0 0
    1 1 1 1 1 1
    1 1 0 1 1 1
    1 1 1 1 1 1
    1 1 1 1 1 1
    1 0 1 1 1 1
    0 0 0 0 0 0
    1 1 1 1 1 1
    0 0 0 0 0 0
    1 0 0 0 0 0
    1 0 0 0 0 0
    1 1 1 1 1 1
    1 1 1 1 1 1
    1 1 1 1 1 1
    0 0 0 0 0 1
    1 1 1 0 0 1
    0 0 0 0 0 0
    1 1 0 1 1 1
    1 0 0 0 0 0
    1 1 1 1 1 1
    1 1 1 1 0 0
    1 1 1 1 1 0
    0 0 0 0 0 0
    1 0 0 1 1 1
    1 0 0 0 0 0
    1 0 1 1 1 0
    1 0 1 1 0 0
    0 1 0 1 0 1
    0 0 1 1 0 1
    1 1 1 1 1 1
    1 0 0 0 0 1
    1 1 1 1 1 1
    1 1 1 1 1 0
    1 1 1 1 1 1
    1 1 1 1 0 0
    1 1 0 1 1 1
    0 0 1 0 0 0
    1 1 0 1 1 0
    1 1 1 1 1 1
    1 1 1 1 1 1
    1 1 1 1 1 0
    1 1 1 1 1 1
    1 1 1 1 1 1
    1 1 1 1 1 0
    1 0 0 0 0 0
    1 1 0 0 0 0
    1 1 1 0 0 0
    end
    label values edu Edu
    label def Edu 0 "high", modify
    label def Edu 1 "low", modify

  • #2
    If I use the -predict- command as given in the code in #1, does it predict scores using maximum likelihood?



    • #3
      Hello Statalisters,

      I am trying to predict a score from a CFA on binary variables using tetrachoric correlations. I see it has been advised on the forum not to use predicted scores in subsequent regression models, but rather to go ahead with SEM. However, my goal is to use my score in an inverse probability weighted regression model following counterfactual theory, hence the requirement. Below is the code for what I have done so far.

      // CFA on binary variables
      Code:
      tetrachoric  crowd wall clock water , pos 
      clear
      ssd init  crowd wall clock water
      ssd set obs 200
      ssd set cor 1.0000\ ///
                  0.4791   1.0000 \ ///
                  0.3843   0.6650   1.0000 \ ///
                  0.3618   0.6894   0.4505   1.0000 
      
          sem (F1-> crowd wall clock water) 
          estat gof, stat(all)
          predict F1, latent


      This gave me an error saying "predict not possible with summary statistics data". I checked the help for -predict- after -sem-, and indeed it says there that predict may not be used with summary statistics data. So how can I get a predicted score from this CFA? Or will the predicted score from the following EFA code serve the same purpose?

      Code:
      tetrachoric crowd wall clock water
      matrix Rho = r(Rho)
      sca nobs=r(N)
      
      factormat Rho, n(`=nobs') pcf
      fapara, pca reps(2000)
      
      /* The parallel analysis shows that I should extract only 1 factor.  */
      
      factormat Rho, n(`=nobs') ipf factors(1) blank (0.5) 
      rotate, promax 
      sortl 
      predict F1


      Any help will be appreciated.
      Thanks



      • #4
        I'm not sure how what you're trying to do will lead to an inverse probability-weighted regression model, but why not just
        Code:
        gsem (crowd wall clock water <- F, probit)
        predict double F, latent



        • #5
          Hello Joseph,
          Thank you very much for your response. But I tried the following code and I am still getting an error message,

          gsem not allowed with summary statistic data
          r(111)


          Code:
          tetrachoric  crowd wall clock water , pos 
          clear
          ssd init  crowd wall clock water
          ssd set obs 200
          ssd set cor 1.0000\ ///
                      0.4791   1.0000 \ ///
                      0.3843   0.6650   1.0000 \ ///
                      0.3618   0.6894   0.4505   1.0000 
          gsem (crowd wall clock water <- F, probit)
          predict double F, latent


          Just to clarify, I didn't mean that this analysis will lead to IP weighting. What I meant is that the score predicted from this model will be dichotomized and used as the outcome in the exposure model for creating IP weights. I mentioned this only to explain why I need a predicted score from the factor analysis.



          • #6
            I think that Joseph Coveney, in #4, meant for you to apply his code to your original data, not to the tetrachoric correlation matrix. What he has provided you with is the code to do confirmatory factor analysis on those variables. If you apply it with the original data, -predict- should work after that. The probit link specified in the -gsem- command will adequately account for the fact that your indicator variables are dichotomous. In fact, it is the equivalent of using tetrachoric correlations: it is estimating based on latent normally distributed variables underlying the dichotomies.
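
            A minimal sketch of that suggestion, assuming the raw 0/1 indicators (crowd, wall, clock, water) are in memory rather than the -ssd- summary statistics data. The names F_hat and se_high and the median cut point are illustrative assumptions, not something proposed in this thread:

            ```stata
            * Sketch only: run on the original data, not after -ssd init-
            gsem (crowd wall clock water <- F, probit)

            * Empirical Bayes means of the latent variable, one per observation
            predict double F_hat, latent

            * Hypothetical cut point: median split of the predicted score
            summarize F_hat, detail
            generate byte se_high = F_hat > r(p50) if !missing(F_hat)
            tabulate se_high
            ```

            Where to cut the score is a substantive choice; the median split above is only a placeholder.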



            • #7
              Thank you very much Clyde, your clarification fixed the issue. I have one more question. What (absolute) fit statistics can I use to assess the fit of my gsem model, and what command should I use to get them? I guess we cannot get CFI and RMSEA with -gsem-.



              • #8
                Hello, Thekke!

                Well, Google drove me to your post, and I'm very interested in whether what you did in Stata was successful, because I'm having the same issue here. In my case I have a questionnaire filled with binary items (present or absent). It's divided into some issues, like "Financial Planning", "Financial Controlling" and so on, where each issue consists of a few questions.

                I'd like to turn each of these issues into a kind of "indicator" and test each question as a variable, so that I can see which questions are most relevant in building the issue. My question is how to do this in Stata after creating the tetrachoric matrix, so that the PCA uses the tetrachoric correlations instead of Pearson's. I did it like you did, but I was a bit uncertain about that.

                Thanks!



                • #9
                  Hi Juliana,
                  The above worked for me, but in the end I did not use FA as mentioned above because of some theoretical issues. I did a PCA instead. In my case, I had many binary indicators, so I created a tetrachoric correlation matrix from the indicators and then checked for highly correlated variables. If two variables were highly correlated, I removed one and retained the other, as one of them might be redundant. I also checked the factorability of the matrix using KMO. Next I ran the PCA on the final correlation matrix to get the components. My aim was to reduce the indicators to components and extract the component that explained the most variance. Hope this helps.

                  Code:
                  factortest a b c d e f g   // gives the factorability test results
                  tetrachoric a b c d e f g  // gives the tetrachoric correlation matrix
                  matrix C = r(Rho)          // stores the matrix in C (-tetrachoric- returns it in r(Rho))
                  pcamat C, n(N) comp(1) blank(0.3)  // runs PCA on the stored matrix; specify your sample size instead of N



                  • #10
                    Google also brought me here (long time reader, first time poster). I would like to use a polychoric correlation matrix in a sem model for confirmatory factor analysis. As far as I can tell, I need to use the ssd commands to do this, and I can do this fine, thanks to the help here. However, I want the factor scores for use in subsequent regression models, and I understand I cannot use -predict- after ssd. Results are practically similar using the standard covariance structure in CFA with SEM, but my observed data are ordinal scales, so it seems polychoric is "better". Are there any workarounds for this issue, or any other suggestions? I appreciate any help or advice on using a polychoric correlation matrix for the measurement model component and subsequent regression models.



                    • #11
                      Rather than using polychoric correlation matrices to do a factor analysis, why not do the factor analysis directly from your data using -gsem- with the -ologit- link?
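
                      A minimal sketch of what this could look like, assuming four ordinal indicators in memory. The names item1 to item4, F_hat, y and x are hypothetical placeholders, not variables from this thread:

                      ```stata
                      * Sketch only: one-factor model fit directly to the ordinal items
                      gsem (item1 item2 item3 item4 <- F, ologit)

                      * Empirical Bayes means of the latent variable as factor scores
                      predict double F_hat, latent

                      * The score can then enter a later model (y and x are hypothetical)
                      regress y F_hat x
                      ```

                      Because the model is fit to the observations themselves rather than to summary statistics, -predict- is available afterwards.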



                      • #12
                        There is an argument to be made that a CFA or PCA is not the best idea in the first place. Such models assume that there are one or more latent variables which are the cause of the observed associations among the indicators. An alternative point of view is that SES is caused by the indicators. A 2001 Annual Review of Sociology paper by Ken Bollen and others makes this point in detail and in the context of research in developing countries. You might find it worth your time to read. See Bollen et al., Socioeconomic Status and Class in Studies of Fertility and Health in Developing Countries, Annu. Rev. Sociol. 2001. 27:153–85.
                        Richard T. Campbell
                        Emeritus Professor of Biostatistics and Sociology
                        University of Illinois at Chicago



                        • #13
                          Thanks Clyde. I will look into gsem. The ologit link option makes conceptual sense (in lieu of predict).



                          • #14
                            Originally posted by Thekke Purakkal
                            If i use the -predict- command as given in the code in #1, does it predict scores using maximum likelihood?
                            Did you not get any error code saying Matrix C has missing values?
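
                            For context, a hedged sketch of why that error would be expected: as far as I know, -tetrachoric- leaves its correlation matrix in r(Rho), not r(corr), so the matrix C in #1 would not hold the correlations. Assuming the example data from #1 are in memory, the first steps could instead read:

                            ```stata
                            tetrachoric Occup crowd water window wall edu
                            return list          // shows what -tetrachoric- left in r(), including r(Rho) and r(N)
                            local n = r(N)       // number of observations used
                            matrix C = r(Rho)    // the tetrachoric correlation matrix
                            matrix list C        // check that C holds no missing values
                            pcamat C, n(`n')
                            ```

                            Whether the resulting matrix is usable for PCA still depends on the data; the -posdef- option of -tetrachoric- (used as "pos" in #3) is one way to handle a matrix that is not positive semidefinite.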
