Displaying p-values from Kruskal-Willis tests in a loop

Sanjana Ravi

Join Date: Nov 2021

Posts: 16
#1

Displaying p-values from Kruskal-Willis tests in a loop

02 Jul 2024, 15:30

Hi there! I'd really appreciate some feedback on some code I wrote (below). Basically, I'd like to run Kruskal-Wallis tests on a long list of variables, and display only those with p-values less than 0.05.

Code:

sysuse auto, clear foreach var of varlist price mpg rep78 { display "`var'" quietly kwallis `var', by(weight) if chi2tail(r(df), r(chi2)) < 0.05 { local test `test' `var' } } display `"`test'"'

After running the KW test on each variable individually, I can see that this code seems to have produced the correct result (i.e., no p-values less than 0.05). But it's hard to confirm without having a significant p-value to check against.

Thanks for any/all feedback!
Tags: None

Girish Venkataraman

Join Date: Dec 2021
Posts: 281

03 Jul 2024, 08:24

This is code I use for exploratory chi-square test for binary variables in my dataset looking at complete response rates for 4 binary variables. You could possibly adapt this easily for your purpose. You will get the logic. Uses frames and Stata18.

Code:

local x gender mdsrgrp tazy2 ep7
local by cr1        
local subset    "" 


tempname chi
frame create `chi' str30 varname str50 var chi2 N

foreach v of local x{
    qui tab `v' `by' `subset', chi2
    frame post `chi' ("`v'") (`"`: variable label `v''"') (round(r(p),0.0001)) (r(N))
}
frame change `chi'
frame `chi' { 
    gsort chi2
    drop if chi2 > 0.2
}
list, noobs

Comment

George Ford

Join Date: Aug 2014

Posts: 3163
#3

03 Jul 2024, 09:21

[CODE]
sysuse auto, clear foreach var of varlist price mpg rep78 {
display "`var'"
quietly kwallis `var', by(weight)
if chi2tail(r(df), r(chi2)) < 0.5 { //change here to 0.5 just to see it works
local test `test' `var'
}
}

display `"`test'"'
/CODE]
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35734

03 Jul 2024, 13:35

Code:

weight

takes on 64 distinct values in the auto dataset, it's an implausible choice here. At the same time it's clearly just part of an arbitrary example.

I wouldn't apply round() to a P-value as in #2 for two reasons:

1. round() is not a good way to round for presentation purposes unless the second argument is a power of 2, This point often arises here on Statalist. The issue is precision: as at machine level Stata works in binary, so also it doesn't work with decimals in the way that people expect from arithmetic as learned in childhood.

https://journals.sagepub.com/doi/10....867X1801800311 Section 4 addresses the main point.

2. More fundamentally, if you care about P-values at all, you should want the full precision to be preserved on a first loop through the data, so that later you can decide what to work with and what to present.

Here's a first stab at my guess of what the OP might want. I've borrowed some notions from earlier posts in the thread.

Code:

 
. sysuse auto, clear
(1978 automobile data)

. mykwallis mpg weight price, by(foreign) frame(kw_results)

Kruskal–Wallis equality-of-populations rank test

  +---------------------------+
  |  foreign | Obs | Rank sum |
  |----------+-----+----------|
  | Domestic |  52 |  1688.50 |
  |  Foreign |  22 |  1086.50 |
  +---------------------------+

  chi2(1) =  9.564
     Prob = 0.0020

  chi2(1) with ties =  9.614
               Prob = 0.0019

Kruskal–Wallis equality-of-populations rank test

  +---------------------------+
  |  foreign | Obs | Rank sum |
  |----------+-----+----------|
  | Domestic |  52 |  2379.50 |
  |  Foreign |  22 |   395.50 |
  +---------------------------+

  chi2(1) = 25.800
     Prob = 0.0001

  chi2(1) with ties = 25.804
               Prob = 0.0001

Kruskal–Wallis equality-of-populations rank test

  +---------------------------+
  |  foreign | Obs | Rank sum |
  |----------+-----+----------|
  | Domestic |  52 |  1862.00 |
  |  Foreign |  22 |   913.00 |
  +---------------------------+

  chi2(1) =  1.083
     Prob = 0.2980

  chi2(1) with ties =  1.083
               Prob = 0.2980

. frame change kw_results

. list

     +-------------------------------+
     | varname      chisq     Pvalue |
     |-------------------------------|
  1. |     mpg   9.563952   .0019844 |
  2. |  weight   25.80003   3.79e-07 |
  3. |   price   1.083074   .2980108 |
     +-------------------------------+

. format chisq %2.1f

. format Pvalue %6.5f

. list

     +---------------------------+
     | varname   chisq    Pvalue |
     |---------------------------|
  1. |     mpg     9.6   0.00198 |
  2. |  weight    25.8   0.00000 |
  3. |   price     1.1   0.29801 |
     +---------------------------+

Here is mykwallis.ado

Code:

*! 1.0.0 free to anyone who wants to take it further 
program mykwallis 
        version 16 
        syntax varlist(numeric) [if] [in] , by(varname) frame(str)
        
        marksample touse 
        markout `touse' `by', strok 
        quietly count if `touse'
        if r(N) == 0 error 2000 
        if r(N) == 1 error 2001 
        
        frame create `frame' str32 varname chisq Pvalue 
        
        foreach v of local varlist {
                kwallis `v' if `touse', by(`by')
                frame post `frame' ("`v'") (r(chi2)) (chi2tail(r(df), r(chi2))) 
        }
end

It is just a sketch. Much could be done differently or in addition. Do you prefer the adjustment for ties?

All that said, conservative statistical thinking is that you should think about penalising yourself for so many significance tests performed shotgun-fashion.

Comment

Sanjana Ravi

Join Date: Nov 2021

Posts: 16
#5

08 Jul 2024, 08:12

I appreciate all the input - thank you all so much!
Comment

Announcement