
  • Best approach for creating "separate" indices using PCA


    Good morning everyone,

    I am working on constructing three social skills indices—communication, understanding, and engagement. The global macros would look like this:
    Code:
    global com qs1 qs2 qs3
    global understanding qs4 qs5 qs6 qs7
    global engagement qs8 qs9 qs10
    ...later to be used as dependent variables in separate regression models:
    Code:
    reg index1_com $demographic_control
    reg index2_understanding $demographic_control
    reg index3_engagement $demographic_control
    To create those dependent variables, I plan to use principal component analysis (PCA) to create three indices, but I am unsure about the best approach to ensure each index corresponds to its respective group (communication, understanding, engagement). Specifically, I have two options in mind:

    1) Option 1 (single PCA): run PCA on all variables together and extract three components:
    Code:
    pca $com $understanding $engagement, comp(3)
    predict comp1 comp2 comp3
    But my concern is that I cannot guarantee that comp1, comp2, and comp3 will correspond to $com, $understanding, and $engagement, respectively. How can I ensure the components align with my predefined groups?
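    One way to at least see which variables dominate which component is to inspect the rotated loadings; as a sketch, assuming the globals above are defined (rotate and estat loadings are pca postestimation commands):
    Code:
    pca $com $understanding $engagement, comp(3)
    rotate, varimax          // orthogonal rotation for more interpretable loadings
    estat loadings           // check which qs* load on which component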

    2) Option 2 (separate PCAs): run PCA separately for each group, extracting one component per group:
    Code:
    pca $com, comp(1)
    predict index1_com, score 
    
    pca $understanding, comp(1)
    predict index2_understanding, score
    
    pca $engagement, comp(1)
    predict index3_engagement, score
    This seems more intuitive as each index would derive only from its respective questions.
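    To check how well a single component summarizes each group, I could also look at the proportion of variance it explains; as a sketch (after pca, e(rho) stores the proportion of variance explained by the retained components):
    Code:
    pca $com, comp(1)
    display "share of variance explained: " e(rho)
    screeplot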

    My questions are:
    1. Which method, Option 1 or Option 2, should I use to create distinct indices?
    2. I also want to have a "composite index" that best explains $com, $understanding, and $engagement altogether. Should I use Option 1 and take the predicted score from "pca $com $understanding $engagement, comp(1)"? Or should I use Option 2 and take the average of the three scores?
    Let's assume that the allocation of qs1-qs10 in my global macros is theoretically correct. Can you give me some advice?
    Thanks and have a great week

  • #2
    I would go for option 3, not stated here, which is that you have 10 candidate predictor variables. Mushing them into principal components won't make anything clearer or more manageable.

    Alternatively, if some of the variables are very strongly correlated with others in the same group, then including them all is self-evidently pointless and PCA isn't needed to see that.

    I wouldn't trust any theory that (communication, understanding, engagement) are disjoint and separable in principle more than the evidence of any measurements or indicators.

    So, why should anyone assume that your split is "theoretically correct"? But if it is, which regression results would contradict it?



    • #3
      Originally posted by Lucia Credito:
      I am working on constructing three social skills indices—communication, understanding, and engagement . . . to use as dependent variables in separate regression models . . . How can I ensure the components align with my predefined groups? . . . Let's assume that the allocation of qs1-qs10 . . . is theoretically correct.
      Well, given that assumption, you could try something like the following. (Begin at the "Begin here" comment; what's above is just to create a fictional dataset conforming to your assumption for use in illustration.)
      Code:
      version 19
      
      clear *
      
      // seedem
      set seed 989388669
      
      tempname Corr
      matrix define `Corr' = J(3, 3, 0.5) + I(3) * 0.5
      drawnorm qs1 qs2 qs3, double corr(`Corr') n(350)
      generate `c(obs_t)' pid = _n
      
      tempfile tmpfil0
      quietly save `tmpfil0'
      
      drop _all
      drawnorm qs8 qs9 qs10, double corr(`Corr') n(350)
      generate `c(obs_t)' pid = _n
      merge 1:1 pid using `tmpfil0', assert(match) nogenerate noreport
      quietly save `tmpfil0', replace
      
      drop _all
      matrix define `Corr' = J(4, 4, 0.5) + I(4) * 0.5
      drawnorm qs4 qs5 qs6 qs7, double corr(`Corr') n(350)
      generate `c(obs_t)' pid = _n
      merge 1:1 pid using `tmpfil0', assert(match) nogenerate noreport
      
      generate double demographic_control = runiform(-1, 1)
      
      *
      * Begin here
      *
      sem ///
          (qs1 qs2 qs3 <- Communication) ///
          (qs4 qs5 qs6 qs7 <- Understanding) ///
          (qs8 qs9 qs10 <- Engagement) ///
          (Communication Understanding Engagement <- demographic_control), ///
          nocnsreport nodescribe nofootnote nolog
      
      exit
      As Nick intimates, your audience might expect you to assess the plausibility of your assumption, especially given the nature of the three concepts.* Maybe you could begin approaching that with, say, something like factor.
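      As an illustration of that first step, a minimal sketch, assuming the ten items are in memory (pcf requests the principal-component factor method; an oblique rotation seems appropriate since the three concepts are likely correlated):
      Code:
      factor qs1-qs10, pcf factor(3)
      rotate, promax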

      I also want to have a "composite index", which explain best $com, $understanding, and $engagement altogether
      I'm not exactly sure what you're after here, but you could look into extending the confirmatory factor analysis (CFA) model to two levels, with an additional latent factor whose indicators are the three first-level latent factors.
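      A sketch of that two-level extension, reusing the setup above (SocialSkills is a hypothetical name for the second-order latent factor):
      Code:
      sem ///
          (qs1 qs2 qs3 <- Communication) ///
          (qs4 qs5 qs6 qs7 <- Understanding) ///
          (qs8 qs9 qs10 <- Engagement) ///
          (Communication Understanding Engagement <- SocialSkills), ///
          nocnsreport nodescribe nofootnote nolog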

      *Not strictly pertinent to the advice you're seeking, but if you're curious my take on this kind of thing is here.



      • #4
        Adding to the already excellent replies... Coming from a psychological measurement perspective, you can use PCA along with factor analysis: first to ascertain whether the variance in your 10 items falls neatly into three components (PCA), and then to use exploratory factor analysis to see whether the correlations among the items themselves suggest the items belong to the latent factors as you hope. The two work quite nicely in combination. See this didactic post on Cross Validated for a helpful illustration of the two approaches.
        Last edited by Erik Ruzek; 09 Jun 2025, 20:24. Reason: Clarification



        • #5
          An extra point is that the approach in #1 does not even seem consistent: if three factors, or whatever else you call them, are justified theoretically and substantiated empirically, then their combination in a multiple regression is the story you need. There is no need for, and no point in, mushing those factors together into a higher-order composite; that is what the regression will already have done.

          Joseph Coveney: Thanks for the link to your 2018 post, but the link there to something by Peter Westfall is broken.



          • #6
            Originally posted by Nick Cox:
            Thanks for the link to your 2018 post, but the link there to something by Peter Westfall is broken.
            Sorry about that. When linking to that old post I didn’t check whether its link to his essay was still working. A Google search doesn’t find an updated link, and I can’t get the search function on the since-renewed CiteSeerX website to work at all.

            At the risk of relying too much on memory, I recall that he was admonishing against the reification of latent factors common in some fields of study and was advocating restricting latent factor modeling to accommodation of measurement error of “real” phenomena. You can get a sense of his arguments from the slide deck of one of his lectures that I guess he’s delivered in the interim and that shares some thematic elements with what I recall of his essay.
            Last edited by Joseph Coveney; 10 Jun 2025, 19:20.



            • #7
              Joseph Coveney Thanks for digging that up. The arguments in his slide deck seem close to several of those made in this thread.

