Hi statalisters,
When calculating a factor or sum score of correlated items, is it always necessary to reverse code items that are negatively correlated with the other items before creating a summary score?
For example, something like this: https://www.theanalysisfactor.com/pr...tive-loadings/
Say we have four variables describing an animal's propensity to be eaten by predators. All of the items are rated originally with higher scores indicating a greater survival advantage (small, unappetizing, hidden, and always sleeps/less exposed - and thus is not vulnerable). It's clear that high scores on each of these items will make the animal less likely to be eaten. So according to the question scale, we should expect all individuals to be rated similarly for each item.
But what if, in the study sample, the animals tend to have high scores on the first three items and low scores on the fourth item (and vice versa), so that the fourth item is negatively correlated with the others (small, unappetizing, hidden, BUT never sleeps and is moving around and more exposed, thus more vulnerable)?
A factor analysis will show this fourth item as having a high but negative loading. The usual advice would be to reverse code the negatively loading item before creating a summary score.
But doesn't that assume only that the QUESTION might be negatively worded (not the response choices)?
What if we want species with a high component score to be those with heavy weight, more appetizing, more visible, but low hours of sleep (always moving around and exposed)?
In this case, shouldn't we keep the original scaling for the items before summing (despite the items being negatively correlated) so that we can keep the meaning of the component score?
Code:
/*generate three variables that are positively correlated across 1000 individuals*/
set seed 12345
forvalues i=1/3 {
    clear
    set obs 1000
    gen x`i'=rpoisson(4)
    gsort x`i'
    gen id=_n
    save "x`i'.dta", replace
}

/*generate fourth variable that is correlated negatively with the first three variables*/
clear
set obs 1000
gen x4=rpoisson(4)
gsort -x4
gen id=_n

/*merge the four variables*/
merge 1:1 id using "x1.dta"
tab _merge, missing
drop _merge
merge 1:1 id using "x2.dta"
tab _merge, missing
drop _merge
merge 1:1 id using "x3.dta"
tab _merge, missing
drop _merge
rm "x1.dta"
rm "x2.dta"
rm "x3.dta"

/*code the data on a scale of 1 to 5*/
replace x1=4 if x1>4
replace x2=4 if x2>4
replace x3=4 if x3>4
replace x4=4 if x4>4
replace x1=x1+1
replace x2=x2+1
replace x3=x3+1
replace x4=x4+1

label define x1 1 "1 Very small (bad)" 2 "2 Somewhat small " 3 "3 Moderate" 4 "4 Somewhat heavy" 5 "5 Very heavy (good)"
label values x1 x1
tab x1, missing

label define x2 1 "1 Very unappetizing (bad)" 2 "2 Somewhat unappetizing " 3 "3 Moderate" 4 "4 Somewhat appetizing" 5 "5 Very appetizing (good)"
label values x2 x2
tab x2, missing

label define x3 1 "1 Very hidden (bad)" 2 "2 Somewhat hidden " 3 "3 Moderate" 4 "4 Somewhat visible" 5 "5 Very visible (good)"
label values x3 x3
tab x3, missing

label define x4 1 "1 Always sleeping (bad)" 2 "2 A lot of sleep" 3 "3 Average sleep" 4 "4 Minimal sleep " 5 "5 Never sleeping (good)"
label values x4 x4
tab x4, missing

/*reverse score the fourth item*/
sum x4
gen x4_reversed=r(max)-x4+r(min)
tab x4 x4_reversed, missing
order id x1 x2 x3 x4_reversed
label define x4_reversed 5 "5 Always sleeping (bad)" 4 "4 A lot of sleep" 3 "3 Average sleep" 2 "2 Minimal sleep " 1 "1 Never sleeping (good)"
label values x4_reversed x4_reversed
tab x4_reversed, missing
tab x4 x4_reversed, missing

/*confirm that the fourth variable is negatively correlated*/
corr x1 x2 x3 x4
factor x1 x2 x3 x4, ml
rotate, promax horst blanks(0.4)

/*now test with a manually reverse-scaled item, all loadings should now be positive*/
corr x1 x2 x3 x4_reversed
factor x1 x2 x3 x4_reversed, ml
rotate, promax horst blanks(0.4)

gsort -x1

/*
option 1: do we sum the items on the original scale despite them being negatively correlated?
here, we would assign high points to greater levels of the first 3 criteria, and low points for greater levels of the last criteria
this would maintain a scale that is reflective of the propensity to be eaten by predators
*/

/*create summary scores, keeping fourth item in original scale*/
gen sum1=x1+x2+x3+x4
tab sum1, missing

/*list examples*/
list x1 x2 x3 x4 sum1 in 1
list x1 x2 x3 x4 sum1 in 500
list x1 x2 x3 x4 sum1 in 1000

/*option 2: do we reverse code the item and then sum as suggested by the factor analysis?*/
/*i.e., if we reversed the fourth item, always sleeping (which is bad) would be assigned a higher score*/

/*create summary scores, reversing fourth item*/
gen sum2=x1+x2+x3+x4_reversed
tab sum2, missing

/*list examples*/
list x1 x2 x3 x4_reversed sum2 in 1
list x1 x2 x3 x4_reversed sum2 in 500
list x1 x2 x3 x4_reversed sum2 in 1000