
  • Best approach for creating "separate" indices using PCA


    Good morning everyone,

    I am working on constructing three social skills indices—communication, understanding, and engagement. The global macros would look like this:
    Code:
    global com qs1 qs2 qs3
    global understanding qs4 qs5 qs6 qs7
    global engagement qs8 qs9 qs10
    ...later to be used as dependent variables in separate regression models:
    Code:
    reg index1_com $demographic_control
    reg index2_understanding $demographic_control
    reg index3_engagement $demographic_control
    To create those dependent variables, I plan to use principal component analysis (PCA) to create three indices, but I am unsure about the best approach to ensure each index corresponds to its respective group (communication, understanding, engagement). Specifically, I have two options in mind:

    1) Option 1 (single PCA): run PCA on all variables together and extract three components:
    Code:
    pca $com $understanding $engagement, comp(3)
    predict comp1 comp2 comp3
    But my concern is that I cannot guarantee that comp1, comp2, and comp3 will correspond to $com, $understanding, and $engagement, respectively. How can I ensure the components align with my predefined groups?
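    One way to at least see which variables dominate which component is to inspect the rotated loadings; as a sketch, assuming the globals above are defined (rotate and estat loadings are pca postestimation commands):
    Code:
    pca $com $understanding $engagement, comp(3)
    rotate, varimax          // orthogonal rotation for more interpretable loadings
    estat loadings           // check which qs* load on which component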

    2) Option 2 (separate PCAs): run PCA separately for each group, extracting one component per group:
    Code:
    pca $com, comp(1)
    predict index1_com, score 
    
    pca $understanding, comp(1)
    predict index2_understanding, score
    
    pca $engagement, comp(1)
    predict index3_engagement, score
    This seems more intuitive as each index would derive only from its respective questions.
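    To check how well a single component summarizes each group, I could also look at the proportion of variance it explains; as a sketch (after pca, e(rho) stores the proportion of variance explained by the retained components):
    Code:
    pca $com, comp(1)
    display "share of variance explained: " e(rho)
    screeplot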

    My questions are:
    1. Which method, Option 1 or Option 2, should I use to create distinct indices?
    2. I also want to have a "composite index" that best explains $com, $understanding, and $engagement altogether. Should I use Option 1 and take the predicted score from "pca $com $understanding $engagement, comp(1)"? Or should I use Option 2 and take the average of the three scores?
    Let's assume that the allocation of qs1-qs10 in my global macros is theoretically correct. Can you give me some advice?
    Thanks and have a great week

  • #2
    I would go for option 3, not stated here, which is that you have 10 candidate predictor variables. Mushing them into principal components won't make anything clearer or more manageable.

    Alternatively, if some of the variables are very strongly correlated with others in the same group, then including them all is self-evidently pointless and PCA isn't needed to see that.

    I wouldn't trust any theory that (communication, understanding, engagement) are disjoint and separable in principle more than the evidence of any measurements or indicators.

    So, why should anyone assume that your split is "theoretically correct"? But if it is, which regression results would contradict it?



    • #3
      Originally posted by Lucia Credito:
      I am working on constructing three social skills indices—communication, understanding, and engagement . . . to use as dependent variables in separate regression models . . . How can I ensure the components align with my predefined groups? . . . Let's assume that the allocation of qs1-qs10 . . . is theoretically correct.
      Well, given that assumption, you could try something like the following. (Begin at the "Begin here" comment; what's above is just to create a fictional dataset conforming to your assumption for use in illustration.)
      Code:
      version 19
      
      clear *
      
      // seedem
      set seed 989388669
      
      tempname Corr
      matrix define `Corr' = J(3, 3, 0.5) + I(3) * 0.5
      drawnorm qs1 qs2 qs3, double corr(`Corr') n(350)
      generate `c(obs_t)' pid = _n
      
      tempfile tmpfil0
      quietly save `tmpfil0'
      
      drop _all
      drawnorm qs8 qs9 qs10, double corr(`Corr') n(350)
      generate `c(obs_t)' pid = _n
      merge 1:1 pid using `tmpfil0', assert(match) nogenerate noreport
      quietly save `tmpfil0', replace
      
      drop _all
      matrix define `Corr' = J(4, 4, 0.5) + I(4) * 0.5
      drawnorm qs4 qs5 qs6 qs7, double corr(`Corr') n(350)
      generate `c(obs_t)' pid = _n
      merge 1:1 pid using `tmpfil0', assert(match) nogenerate noreport
      
      generate double demographic_control = runiform(-1, 1)
      
      *
      * Begin here
      *
      sem ///
          (qs1 qs2 qs3 <- Communication) ///
          (qs4 qs5 qs6 qs7 <- Understanding) ///
          (qs8 qs9 qs10 <- Engagement) ///
          (Communication Understanding Engagement <- demographic_control), ///
          nocnsreport nodescribe nofootnote nolog
      
      exit
      As Nick intimates, your audience might expect you to assess the plausibility of your assumption, especially given the nature of the three concepts.* Maybe you could begin approaching that with, say, something like factor.
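      As an illustration of that first step, a minimal sketch, assuming the ten items are in memory (pcf requests the principal-component factor method; an oblique rotation seems appropriate since the three concepts are likely correlated):
      Code:
      factor qs1-qs10, pcf factor(3)
      rotate, promax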

      I also want to have a "composite index", which explain best $com, $understanding, and $engagement altogether
      I'm not exactly sure what you're after here, but you could look into extending the confirmatory factor analysis (CFA) model to two levels, with an additional latent factor whose indicators are the three first-level latent factors.
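      A sketch of that two-level extension, reusing the setup above (SocialSkills is a hypothetical name for the second-order latent factor):
      Code:
      sem ///
          (qs1 qs2 qs3 <- Communication) ///
          (qs4 qs5 qs6 qs7 <- Understanding) ///
          (qs8 qs9 qs10 <- Engagement) ///
          (Communication Understanding Engagement <- SocialSkills), ///
          nocnsreport nodescribe nofootnote nolog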

      *Not strictly pertinent to the advice you're seeking, but if you're curious my take on this kind of thing is here.



      • #4
        Adding to the already excellent replies... Coming from a psychological measurement perspective, you can use PCA along with factor analysis: first to ascertain whether the variance in your 10 items falls neatly into three components (PCA), and then to use exploratory factor analysis to see whether the correlations among the items themselves suggest the items belong to the latent factors as you hope. The two work quite nicely in combination. See this didactic post on Cross Validated for a helpful illustration of the two approaches.
        Last edited by Erik Ruzek; 09 Jun 2025, 20:24. Reason: Clarification



        • #5
          An extra point is that the approach in #1 does not even seem consistent: if three factors, or whatever else you call them, are justified theoretically and substantiated empirically, then their combination in a multiple regression is the story you need. There is no need for, and no point in, mushing those factors together into a higher-order composite; that is what the regression will already have done.

          Joseph Coveney: Thanks for the link to your 2018 post, but the link there to something by Peter Westfall is broken.



          • #6
            Originally posted by Nick Cox:
            Thanks for the link to your 2018 post, but the link there to something by Peter Westfall is broken.
            Sorry about that. When linking to that old post I didn’t check whether its link to his essay was still working. A Google search doesn’t find an updated link, and I can’t get the search function on the since-renewed CiteSeerX website to work at all.

            At the risk of relying too much on memory, I recall that he was admonishing against the reification of latent factors common in some fields of study and was advocating restricting latent factor modeling to accommodation of measurement error of “real” phenomena. You can get a sense of his arguments from the slide deck of one of his lectures that I guess he’s delivered in the interim and that shares some thematic elements with what I recall of his essay.
            Last edited by Joseph Coveney; 10 Jun 2025, 19:20.



            • #7
              Joseph Coveney Thanks for digging that up. The arguments in his slide deck seem close to several of those made in this thread.

