  • Analyzing discordance

    Hi all,

    I’m looking for advice on how to quantify discordance between a continuous variable and its categorical classification.

    For example, I have a continuous variable "height" and a categorical variable "heightclass" with three groups: SHA (shorter than average), AHA (average), and THA (taller than average). The categories are derived from an index taking other factors like age into account. The categories are meant to reflect ordinal differences (THA > AHA > SHA).

    While the groups differ in height as expected, I want to focus on the exceptions. For instance, someone who is 163cm may be classified as SHA, while someone shorter, 158cm, is classified as AHA. I’d like to systematically measure how often this happens.

    Is there a way in Stata to quantify these misclassifications/overlaps? For example, by comparing the observed ranking of height with the assigned heightclass and calculating the proportion of discordant cases?

    I hope my question is clear; please feel free to ask for clarification.

  • #2
    As you do not give example data, I have made a demonstration data set for my approach. It compares the heights themselves, not their ranks, across the different height classes and identifies pairs in which the person in the higher height class is the shorter of the two. Each such pair is reported only once. (That is, if person X and person Y form such a pair, it reports either X, Y or Y, X, but not both.) What it does not do is attempt to decide which member of the pair is the error, since you imply that the classification is based on things other than height.

    In fact, although I have followed your usage of the word misclassification, it seems to be the wrong term. I think you are just looking for pairs in which one person is placed in a higher height class than the other even though that person is the shorter of the two. It may well be that, after taking into account the other variables that determine height class, both are properly classified. Anyway:

    Code:
    clear*
    
    //  CREATE A DEMONSTRATION DATA SET OF HEIGHTS
    set seed 1234
    set obs 100
    gen height = rnormal(175, 7.5)
    
    //  CLASSIFY THEM INTO BOTTOM QUARTILE, MIDDLE HALF, AND TOP QUARTILE
    //  BUT WITH SOME RANDOM MISCLASSIFICATION ERROR
    centile height, centile(25 75)
    gen heightclass = "SHA" if height < `r(c_1)'
    replace heightclass = "AHA" if inrange(height, `r(c_1)', `r(c_2)')
    replace heightclass = "THA" if height > `r(c_2)'
    
    //  INTRODUCE A LITTLE MISCLASSIFICATION
    replace heightclass = "SHA" if heightclass == "AHA" & runiform() < 0.05
    replace heightclass = "THA" if heightclass == "AHA" & runiform() < 0.05
    
    label define heightclass    1   "SHA"   2   "AHA"   3   "THA" // NOTE THE ORDER!
    encode heightclass, gen(_heightclass) label(heightclass)
    drop heightclass
    rename _heightclass heightclass
    
    tabstat height, by(heightclass) statistics(min max) format(%2.1f)
    
    
    //  SOLUTION BEGINS HERE
    //  FORM ALL PAIRS FROM DIFFERENT HEIGHT CLASSES, AND DO NOT RETAIN THE SAME PAIR
    //  IN REVERSE ORDER
    gen `c(obs_t)' obs_no = _n
    preserve
    rename _all =2
    tempfile copy
    save `copy'
    restore
    rename _all =1
    cross using `copy'
    keep if heightclass1 != heightclass2 & obs_no1 < obs_no2
    
    //  NOW RETAIN THE MISCLASSIFIED PAIRS
    keep if sign(heightclass2 - heightclass1) != sign(height2 - height1)
    
    //  COUNT THE NUMBER OF MISCLASSIFIED PAIRS IN EACH COMBINATION OF HEIGHT CLASSES
    tab heightclass1 heightclass2

    Note: If your data set is very large, the -cross- command will take a long time to run, and might even exhaust available memory. If you encounter this problem, post back, and I'll show you an alternate, more complicated approach that might work better.
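
    Since the original question asks for the proportion of discordant cases, the pairwise comparison above can be extended from counts to a proportion. The following is only a minimal sketch under assumptions of my own: the five-person data set is hypothetical, heightclass is assumed to be coded 1 = SHA, 2 = AHA, 3 = THA, and the variable name discordant is introduced here purely for illustration. Instead of dropping the concordant pairs, it flags the discordant ones and averages the flag.

    ```stata
    clear

    //  HYPOTHETICAL TOY DATA: heightclass CODED 1 = SHA, 2 = AHA, 3 = THA
    input float height byte heightclass
    158 2
    163 1
    170 2
    181 3
    176 3
    end
    gen long obs_no = _n

    //  FORM ALL PAIRS FROM DIFFERENT HEIGHT CLASSES, EACH PAIR ONLY ONCE
    preserve
    rename _all =2
    tempfile copy
    save `copy'
    restore
    rename _all =1
    cross using `copy'
    keep if heightclass1 != heightclass2 & obs_no1 < obs_no2

    //  FLAG DISCORDANT PAIRS (1) RATHER THAN DROPPING THE CONCORDANT ONES (0)
    gen byte discordant = sign(heightclass2 - heightclass1) != sign(height2 - height1)

    //  THE MEAN OF A 0/1 FLAG IS THE PROPORTION OF DISCORDANT PAIRS
    summarize discordant, meanonly
    display "Proportion of discordant cross-class pairs: " %5.3f r(mean)
    ```

    In this toy data set, 1 of the 8 cross-class pairs is discordant (the 158 cm person classed AHA and the 163 cm person classed SHA), so the displayed proportion is 0.125. Note that under this sign comparison a pair with equal heights but different classes is also counted as discordant; whether that is desirable depends on the application.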
