  • coarsening a set of normally distributed variables into three, five, seven... categories

    To make a point to some friends about declining returns in the number of possible responses on a scale, and the number of items to be scaled, I wanted to whip up a quick Stata example. But short of taking the normal curve and chopping it up with a bunch of "replace x1_3=0 if x1<.5 & x1>-.5" statements (and it of course gets worse as the number of responses grows), I'm somehow having a mental block.

    Ideally the process would be this:
    1) generate normal variable
    2) generate a dozen more normal variables, with 50% of their variances shared with the variable from step 1 -- that is, a dozen indicators of the original variable, with known loadings (a minimal sketch of steps 1 and 2 follows this list).
    3) for each of the dozen indicators of the original variable, "coarsen" them into three, five, seven, nine, and eleven-category ordinal variables, approximately normally distributed.
    4) compute alphas for the different levels of coarsening, and for using different number of items (e.g. a twelve-by-six grid). Yes, I know alpha is inappropriate for ordinal variables, but half the point is to see where and how far you can violate assumptions and approach the results you get under normality.
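
    A minimal sketch of steps 1 and 2 (the variable names, seed, and sample size here are arbitrary): adding equal-variance noise to the latent variable gives each indicator roughly 50% shared variance with it, i.e. a standardized loading of about 0.71.

    Code:
    clear
    set obs 500
    set seed 12345
    generate double latent = rnormal()
    forvalues i = 1/12 {
        generate double x`i' = latent + rnormal()
    }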

    Hmmmm... I know a simple lit search would turn up oodles of information on this issue, with tables just like I describe, but it seemed at first blush such a simple task, and some people respond better to numbers they just saw generated (plus, we could play additional what-ifs with CFAs and such).

    If this is confusing or doesn't seem worth doing, please don't bother putting much thought into it; like I said, there already exists a pretty good literature on the basic theme. But if somebody wants a bit of a brain teaser and/or knows exactly how to crank this out, much obliged. Thanks for any thoughts!


    BTW -- steps 1 and 2 are obvious no-brainers. Step 3 is where I get befuddled.

    Coarsening them into a *uniform* distribution is easy (sketched below). But I want them to roughly approximate normality, that is, more answers in the middle and fewer in the extremes, though I realize that "true" normality would be lost.
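
    For the uniform version, something like -xtile- does it in one line (this just carries on with x1 from the sketch above; five categories as an example):

    Code:
    * equal-frequency (roughly uniform) coarsening into five categories
    xtile x1_u5 = x1, nquantiles(5)
    tabulate x1_u5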

    Step 4 I can do, though getting the relevant output compiled in an efficient and creative way would be a bonus.
    Last edited by ben earnhart; 28 Oct 2014, 20:37.

  • #2
    What I have so far. Improvements? Fatal flaws in my approach?

    Code:
    clear
    set more off
    set seed 1971
    set obs 500
    
    *=====STEP 1
    gen latent=rnormal()
    label variable latent "original source of 'true' variance"
    *=====STEP 2
    forvalues i=1/10 {
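        * latent + equal-variance noise: about 50% of each indicator's variance
        * is shared with the latent (loading ~0.71); then rescale to mean 0, SD 1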
        gen x`i'=latent+rnormal()
        egen trash=std(x`i')
        replace x`i'=trash
        drop trash
    
    *=====GETTING READY FOR STEP 3    
        *=======pull in tails
        gen trash`i'=x`i'
        replace trash`i'=1.96 if x`i'>1.96
        replace trash`i'=-1.96 if x`i'<-1.96
    
    *======STEP 3, COARSENING
            forvalues j=3(2)11 {
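                * clipped values run from -1.96 to 1.96, so round(value*j/4) yields
                * integers from -(j-1)/2 to (j-1)/2, i.e. j roughly normal-shaped categories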
                gen x`i'_`j'=round((trash`i'*`j')/4)
                }
    drop trash*
        
    }
    
    sum
    
    *=======by number of points
    forvalues i=3(2)11 {
        display "number of possible responses alone `i', 10 item scale" 
        alpha x*_`i', std
        }
        
    *=======by number of items, using 11 point scale
    alpha x1_11 x2_11, std
    alpha x1_11 x2_11 x3_11, std
    alpha x1_11 x2_11 x3_11 x4_11, std
    alpha x1_11 x2_11 x3_11 x4_11 x5_11, std
    alpha x1_11 x2_11 x3_11 x4_11 x5_11 x6_11, std
    alpha x1_11 x2_11 x3_11 x4_11 x5_11 x6_11 x7_11, std
    alpha x1_11 x2_11 x3_11 x4_11 x5_11 x6_11 x7_11 x8_11, std
    alpha x1_11 x2_11 x3_11 x4_11 x5_11 x6_11 x7_11 x8_11 x9_11, std
    alpha x1_11 x2_11 x3_11 x4_11 x5_11 x6_11 x7_11 x8_11 x9_11 x10_11, std
    
    
    *======realistic scenario: five items, five points
    alpha x1_5 x2_5 x3_5 x4_5 x5_5, std

    • #3
      Ben, how about something like the following? I didn't do the various numbers of items, but it should be easy to add.
      Code:
      *! Coarsened-normal.do
      
      version 13.1
      
      clear *
      set more off
      
      program define skutosis
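          * report the mean skewness and kurtosis across a varlist, using the
          * r(skewness) and r(kurtosis) left behind by -summarize, detail-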
          version 13.1
          syntax varlist(numeric) [if] [in]
      
          marksample touse
      
          tempname sum_skewness sum_kurtosis
          scalar define `sum_skewness' = 0
          scalar define `sum_kurtosis' = 0
          foreach var of varlist `varlist' {
              quietly summarize `var' if `touse', detail
              scalar define `sum_skewness' = `sum_skewness' + r(skewness)
              scalar define `sum_kurtosis' = `sum_kurtosis' + r(kurtosis)
          }
          foreach coefficient in skewness kurtosis {
              display in smcl as text "Mean " proper("`coefficient'") " = " as result %04.2f `sum_`coefficient'' / `: word count `varlist''
          }
      end
      
      tempname Corr
      matrix define `Corr' = I(12) * 0.5 + J(12, 12, 0.5)
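      * the correlation matrix has 1s on the diagonal and 0.5 everywhere else (compound symmetry)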
      drawnorm y_1-y_12, double corr(`Corr') seed(`=date("2014-10-29", "YMD")') n(500)
      
      tempname range bin_width
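      * cut each variable's observed range into k equal-width bins; with normal
      * data the middle bins catch the most observations, so the coarsened
      * variables stay roughly bell-shaped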
      forvalues i = 1/12 {
          summarize y_`i', meanonly
          scalar define `range' = r(max) - r(min)
          forvalues k = 3(2)11 {
              generate byte y`k'_`i' = 0
              scalar define `bin_width' = `range' / `k'
              forvalues j = 1/`=`k'-1' {
                  quietly replace y`k'_`i' = y`k'_`i' + 1 if y_`i' > r(min) + ///
                      `j' * `bin_width'
              }
          }
      }
      
      skutosis y_*
      alpha y_* // , std
      
      forvalues k = 3(2)11 {
          display in smcl as text _newline(1) as result "`k' bins"
          skutosis y`k'_*
          alpha y`k'_* // , std
      }
      
      exit
      Running it, it seems as if you wouldn't take much of a hit even with as few as three categories, at least if there's little or no skewness and they're reasonably mesokurtic.

      • #4
        Thanks, Joseph, and sorry to be so long getting back to you. I was surprised by my/our findings. But as you mention, it holds "...if there's little or no skewness and they're reasonably mesokurtic." I need to hit the literature instead of re-inventing the wheel. What you and I came up with was based on the ideal case -- forcing things to be as close to normally distributed as possible, an absolutely pristine factor structure, etc. In real life there's a lot more noise, which is probably why more scale points and more items are advisable.

        • #5
          The results took me by surprise, too. I expected much more degradation when it got down to five categories, and especially three. Let us know what you uncover in the literature--I'd be curious to see how things hold up with discretized data compared to the corresponding continuous when there is a bit of skew or leptokurtosis.

          • #6
            BTW -- you saw the #s yourself; see below for a nicer presentation of them.

            Code:
             
                                     # choices
            # items       3        5        7        9        11
               2       0.5503   0.6364   0.6413   0.6443   0.6553
               3       0.6781   0.7326   0.7371   0.7423   0.7492
               4       0.7237   0.7748   0.7811   0.7864   0.7927
               5       0.7692   0.8077   0.8148   0.8188   0.8240
               6       0.8031   0.8374   0.8436   0.8470   0.8520
               7       0.8268   0.8588   0.8641   0.8660   0.8704
               8       0.8459   0.8741   0.8794   0.8809   0.8842
               9       0.8611   0.8865   0.8912   0.8928   0.8955
              10       0.8726   0.8964   0.9008   0.9023   0.9045
            [Attached image: scales1.png]
            Last edited by ben earnhart; 31 Oct 2014, 21:38.

            • #7
              I hadn't actually seen them before you posted them. Good graph. I did a little exploring using the Fleishman-Vale-Maurelli algorithm, which generates correlated nonnormal variables with a specified correlation structure, skewness, and kurtosis. It looks as if peakedness hits harder than skewness. (I've attached the Stata SMCL log file.) Very preliminary, but it breaks down like this:
              0.9 0.9
              0.7 0.4
              0.8 0.6
              0.8 0.7
              Values are Cronbach's alpha (unstandardized) for a six-item scale, first with the target skewness (the maximum with no "excess kurtosis" is 0.83 with the algorithm) and separately with a target kurtosis of 27. The undiscretized (continuous) variables targeted a compound-symmetric correlation structure of 50%. Alpha for the continuous case is similar for the increased-skewness-only and increased-kurtosis-only conditions: about 0.86 (rounded to 0.9 in the table). Discretizing into three categories causes a larger drop in the increased-kurtosis situation, and recovery of alpha with an increasing number of categories is slower on the increased-kurtosis side, too. If I get time, I'll send the Fleishman transformation and the Vale-Maurelli multivariate extension to SSC, but I've promised to finish updating firthlogit first.
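
              In case the name is unfamiliar: the Fleishman step is just a cubic transformation of a standard normal deviate, with coefficients solved numerically to hit the target moments. A bare-bones sketch of the form only (the coefficients below are the trivial solution for skewness 0 and excess kurtosis 0, not values for the targets above):

              Code:
              * Fleishman power method: y = a + b*z + c*z^2 + d*z^3 with a = -c,
              * where (b, c, d) are solved so y has the desired skewness and kurtosis
              local b 1
              local c 0
              local d 0
              clear
              set obs 500
              generate double z = rnormal()
              generate double y = -`c' + `b'*z + `c'*z^2 + `d'*z^3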

              • #8
                Well, that table didn't render properly--it looked good when I was composing the post, but it lost its headers at the top and on the left when I hit the Post Reply button.

                • #9
                  Well, a cursory review of the literature points to something that, well, we kinda already knew -- the ideal number of scale responses is between four and seven, as is the number of items, with five probably good enough. I didn't find anything on a first pass that explored kurtosis and skewness in a serious way, but there were cited references, which I could follow up on, claiming that alpha is robust to non-normality. People varied sample sizes, inter-item correlations, and of course the number of items and the number of scale points -- and all told the same story.

                  Let me know if/when you get -valemaurelli- uploaded, and I think I'll need -fleishman-, too, if I'm going to pursue this in a rigorous way (my approach was pretty crude).

                  There was one interesting finding that supports an oddity you mentioned (and that I observed but didn't mention): reduced skewness hurts. A two-point scale actually performs better than a three-point scale, since the variance is increased. I increased skewness and alphas went up. So I dunno how much further I'll go with this; really, we're more or less proving what people already knew.

                  • #10
                    BTW, the more I think about it, it's probably moot, since reducing the items to ordinal wipes out most of the impact of kurtosis until one gets out to nine, fifteen, or even more scale points. Skewness increases shared variance, since all (clean) indicators derived from the same latent variable would be skewed in the same direction.

                    Hmm. One could make an interesting argument that skewness artificially inflates alpha, which in turn is an argument against using alpha as *the* gold standard for reliability -- a rabbit-hole I acknowledge but don't feel like going down at the moment. Hmm...

                    So the vague four-to-seven (which is sometimes simplified as five-by-five) rule of thumb seems pretty robust. I suppose the only way to really get at it would be large-scale ("scale") testing on real humans, taking into account issues like satisficing (picking the middle due to laziness or cognitive overload), anchoring (does each point have a clear label?), and other human factors.

                    I appreciate your time and attention on this... you have a lot going on that's more important than a wild-goose chase. Or maybe there's something there and I'm just not seeing it.
                    Last edited by ben earnhart; 02 Nov 2014, 20:22.
