Reading a previous thread in which a measure of “correlation” [sic] was requested for a binary-by-nominal crosstabulation prompted me to think about the absence from the Stata world (built-in or user-written) of an elegant but almost forgotten measure of association, “Goodman and Kruskal’s Tau.” (Goodman, L. A., & Kruskal, W. H. 1954. "Measures of association for cross classifications." Journal of the American Statistical Association, 49, 732-764.) I wanted to offer a brief didactic note and code fragment for it here, as I think what follows is a bit too scholarly for that previous thread. Other data analysis programs (e.g., SPSS) do include Tau.
Tau is an asymmetric measure of association, normed to the 0/1 range, for two non-ordered (nominal) categorical variables. It was originally derived from the same proportional reduction in expected prediction error (PRE) rationale as Goodman and Kruskal's Lambda (available in Stata as Nick Cox's -lambda- at SSC), but Tau is much superior to the latter, for reasons I’ll leave aside here. Tau can also be understood and calculated as an R2-type measure, [1 - (conditional variation)/(total variation)], with variation measured by the Simpson diversity index (see, e.g., -ssc describe entropyetc-), and this is how I like to approach it.
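For concreteness, the R2-type form can be written out as below. This is only a sketch: the notation is an assumption made for illustration (n_{ij} is the count in row i of the response variable and column j of the explanatory variable, N is the table total), and the 2 x 2 table in the worked line is hypothetical.

```latex
% Assumed notation: n_{ij} = cell count (row i = response, column j = explanatory),
% n_{i.} and n_{.j} = row and column totals, N = grand total.
\[
\begin{aligned}
E_1 &= N\Bigl(1 - \sum_i p_{i\cdot}^2\Bigr), \qquad p_{i\cdot} = n_{i\cdot}/N
      && \text{(total variation)} \\
E_2 &= \sum_j n_{\cdot j}\Bigl(1 - \sum_i \bigl(n_{ij}/n_{\cdot j}\bigr)^2\Bigr)
      && \text{(conditional variation)} \\
\tau &= \frac{E_1 - E_2}{E_1} = 1 - \frac{E_2}{E_1}
\end{aligned}
\]
% Hypothetical 2 x 2 table with rows (10, 5) and (5, 10), so N = 30:
\[
E_1 = 30\,\bigl(1 - \tfrac{1}{2}\bigr) = 15, \qquad
E_2 = 2 \cdot 15\,\bigl(1 - \tfrac{5}{9}\bigr) = \tfrac{40}{3}, \qquad
\tau = \frac{15 - 40/3}{15} = \tfrac{1}{9} \approx 0.111
\]
```

In this made-up table two thirds of each column fall in a single row category, yet Tau comes out at only about 0.11.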
I’d say that Tau has never had the use or recognition it might deserve, given its simple and elegant rationale, and its connections to other measures. My completely unsupported explanation would be that Tau is a severe judge of relationships, giving uncomfortably low values on the 0/1 scale.
While I don’t know that Tau deserves an “official” SSC entry, here’s some code I’ve used to calculate it, for whatever interest it might have to others.
Code:
cap mata mata drop gkt()
mata:
// sf: name of a Stata matrix of cell frequencies
// (rows = response categories, columns = explanatory categories)
void gkt(string scalar sf) {
    f = st_matrix(sf)
    nrow = rows(f)
    ncol = cols(f)
    N = sum(f)
    // Total variation: Simpson index of the row marginals, scaled by N
    rowmarg = (rowsum(f))/N
    E1 = (1 - (rowmarg' * rowmarg)) * N
    printf(" Total variation= %f\n", E1 )
    //
    // Conditional variation: Simpson index within each column,
    // weighted by the column total
    colsum = colsum(f)
    f = f :/ colsum
    E2 = 0
    for (j = 1; j <= ncol; j++) {
        p = f[.,j]
        next = (1- (p' * p)) * colsum[j]
        printf(" Variation for col = %f: %f\n", j, next)
        E2 = E2 + next // (1- (p' * p)) * colsum[j]
    }
    printf(" Sum conditional variation = %f\n", E2)
    // Pass the results back to Stata as scalars
    st_rclear()
    st_numscalar("tau", (E1-E2)/E1)
    st_numscalar("E1", E1)
    st_numscalar("E2", E2)
}
end
//
capture prog drop gktau
program gktau, rclass
    * This program calculates the Goodman and Kruskal tau measure,
    * using the Simpson index
    // Use: gktau ResponseVariable ExplanatoryVariable
    // Exactly two variables are expected
    syntax varlist(min=2 max=2) [if] [in]
    marksample touse
    tempname f
    local y: word 1 of `varlist'
    local x: word 2 of `varlist'
    tab2 `y' `x' if `touse', matcell(`f') col chi2
    di ""
    return add
    // Could be calculated in Stata, but more convenient in Mata.
    mata: gkt("`f'")
    di as text " G & K Tau = ", as result %7.4f tau
    return scalar tau = tau
    return scalar E1 = E1
    return scalar E2 = E2
end
//
// Illustration
sysuse auto, clear
gktau rep78 foreign