Analyzing discordance

Dev Lex

Join Date: Jun 2025

Posts: 2
#1

10 Sep 2025, 09:48

Hi all,

I’m looking for advice on how to quantify discordance between a continuous variable and its categorical classification.

For example, I have a continuous variable "height" and a categorical variable "heightclass" with three groups: SHA (shorter than average), AHA (average), and THA (taller than average). The categories are derived from an index taking other factors like age into account. The categories are meant to reflect ordinal differences (THA > AHA > SHA).

While the groups differ in height as expected, I want to focus on the exceptions. For instance, someone who is 163 cm may be classified as SHA, while someone shorter, at 158 cm, is classified as AHA. I’d like to systematically measure how often this happens.

Is there a way in Stata to quantify these misclassifications/overlaps? For example, by comparing the observed ranking of height with the assigned heightclass and calculating the proportion of discordant cases?

I hope my question is clear; please feel free to ask for clarification.

Clyde Schechter

Join Date: Apr 2014
Posts: 30188

10 Sep 2025, 10:22

As you do not give example data, I have made a demonstration data set for my approach. It compares the heights themselves, not their ranks, across the height classes and identifies pairs in which the person in the higher height class is the shorter of the two. Each such misclassified pair is reported only once. (That is, if person X and person Y are a misclassified pair, it reports either X, Y or Y, X, but not both.) What it does not do is attempt to decide which member of the pair is the error, since you imply that the classification is based on things other than height. In fact, although I have followed your usage of the word misclassification, it seems to be the wrong word here. I think you are just looking for pairs in which one person is placed in a higher height class than another even though the former is shorter than the latter. It may well be that, after taking into account the other variables determining height class, both are properly classified. Anyway:

Code:

clear*

//  CREATE A DEMONSTRATION DATA SET OF HEIGHTS
set seed 1234
set obs 100
gen height = rnormal(175, 7.5)

//  CLASSIFY THEM INTO BOTTOM QUARTILE, MIDDLE HALF, AND TOP QUARTILE
//  BUT WITH SOME RANDOM MISCLASSIFICATION ERROR
centile height, centile(25 75)
gen heightclass = "SHA" if height < `r(c_1)'
replace heightclass = "AHA" if inrange(height, `r(c_1)', `r(c_2)')
replace heightclass = "THA" if height > `r(c_2)'

//  INTRODUCE A LITTLE MISCLASSIFICATION
replace heightclass = "SHA" if heightclass == "AHA" & runiform() < 0.05
replace heightclass = "THA" if heightclass == "AHA" & runiform() < 0.05

label define heightclass    1   "SHA"   2   "AHA"   3   "THA" // NOTE THE ORDER!
encode heightclass, gen(_heightclass) label(heightclass)
drop heightclass
rename _heightclass heightclass

tabstat height, by(heightclass) statistics(min max) format(%2.1f)


//  SOLUTION BEGINS HERE
//  FORM ALL PAIRS FROM DIFFERENT HEIGHT CLASSES, AND DO NOT RETAIN THE SAME PAIR
//  IN REVERSE ORDER
gen `c(obs_t)' obs_no = _n
preserve
rename _all =2
tempfile copy
save `copy'
restore
rename _all =1
cross using `copy'
keep if heightclass1 != heightclass2 & obs_no1 < obs_no2

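//  NOW RETAIN THE MISCLASSIFIED PAIRS: THOSE WHERE THE CLASS ORDER AND THE
//  HEIGHT ORDER DISAGREE (PAIRS WITH EXACTLY EQUAL HEIGHTS ARE ALSO RETAINED)
keep if sign(heightclass2 - heightclass1) != sign(height2 - height1)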

//  COUNT THE NUMBER OF MISCLASSIFIED PAIRS IN EACH COMBINATION OF HEIGHT CLASSES
tab heightclass1 heightclass2
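
The tabulation above gives counts of discordant pairs. To get the proportion of discordant cases asked about in the original question, one option is to flag the discordant pairs instead of dropping the concordant ones. The following is only a minimal sketch under stated assumptions: the five toy observations are made up for illustration, heightclass is assumed to be coded 1 = SHA, 2 = AHA, 3 = THA, and the variable name discordant is hypothetical.

```stata
clear
// Hypothetical toy data: heightclass coded 1 = SHA, 2 = AHA, 3 = THA
input height heightclass
158 2
163 1
170 2
181 3
168 3
end

// Form all cross-class pairs once, as in the code above
gen long obs_no = _n
preserve
rename _all =2
tempfile copy
save `copy'
restore
rename _all =1
cross using `copy'
keep if heightclass1 != heightclass2 & obs_no1 < obs_no2

// Flag discordant pairs instead of dropping the concordant ones
gen byte discordant = sign(heightclass2 - heightclass1) != sign(height2 - height1)

// The mean of a 0/1 indicator is the proportion of discordant pairs
summarize discordant, meanonly
display "Discordant pairs: " r(sum) " of " r(N) " (" %5.3f r(mean) ")"
```

In this toy example, 2 of the 8 cross-class pairs are discordant (158 cm in AHA versus 163 cm in SHA, and 168 cm in THA versus 170 cm in AHA), so the proportion is 0.250. Replacing the final lines with -tab heightclass1 heightclass2 discordant- or similar would break the proportion down by class combination.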

Note: If your data set is very large, the -cross- command will take a long time to run, and might even exhaust available memory. If you encounter this problem, post back, and I'll show you an alternate, more complicated approach that might work better.
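
The alternate approach mentioned in the note is not shown in this thread. As one possible memory-light sketch, and not necessarily the approach meant there, the pairs need not be materialized at all: for each person, count how many people in a higher class are no taller. This keeps the data at N observations, at the cost of N passes over the data, so it may still be slow for very large N. The toy data, the 1/2/3 coding of heightclass, and the variable name n_discordant are assumptions for illustration.

```stata
clear
// Hypothetical toy data: heightclass coded 1 = SHA, 2 = AHA, 3 = THA
input height heightclass
158 2
163 1
170 2
181 3
168 3
end

// For each person, count the people in a higher class who are no taller.
// Each discordant pair is counted once, on its lower-class member.
// Ties in height count as discordant, matching the sign() comparison above.
// Assumes no missing values in height or heightclass.
gen long n_discordant = 0
forvalues i = 1/`=_N' {
    quietly count if heightclass > heightclass[`i'] & height <= height[`i']
    quietly replace n_discordant = r(N) in `i'
}

// Total number of discordant pairs, and the lower-class members involved
summarize n_discordant, meanonly
display "Total discordant pairs: " r(sum)
list height heightclass n_discordant if n_discordant > 0
```

A side benefit of this layout is that n_discordant is a per-person measure, which may help when the aim is to look at the individual exceptions rather than only the overall count.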
