  • kappaetc: Testing for significance of a difference in before / after intervention

    I am using kappaetc (from SSC) for inter-rater reliability, comparing diagnostic agreement before vs. after a diagnostic skills training event. I'd like to ask for advice on how to test the significance of the difference between those two measurements, before vs. after training.

    kappaetc returns both the coefficients and standard errors as matrices.
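
    For reference, the stored matrices can be inspected after estimation. A minimal sketch, with Rater1-Rater7 as placeholder variable names:

    Code:
    kappaetc Rater1-Rater7
    return list        // shows all stored results, including the matrices
    matrix list r(b)   // the coefficient matrix (used again in #13 below)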



  • #2
    You can compare two correlated ICCs using the method described by Donner and Zou.

    Donner, A., & Zou, G. (2002). Testing the equality of dependent intraclass correlation coefficients. Journal of the Royal Statistical Society: Series D (The Statistician), 51(3), 367–379. https://doi.org/10.1111/1467-9884.00324

    Alternatively, you could build an appropriate regression model that incorporates time as a fixed effect and perform the usual Wald-type test.
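
    A minimal sketch of that regression approach, assuming a fully long dataset with one row per rating and hypothetical variables rating (numeric), subject, rater, and time:

    Code:
    * crossed random effects for subjects and raters; time as a fixed effect
    mixed rating i.time || _all: R.subject || _all: R.rater
    * Wald-type test of the time effect
    testparm i.time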

    Comment


    • #3
      You might be interested in Gwet's (2016) discussion of comparing agreement coefficients, such as Cohen's kappa. The approach basically boils down to a paired t-test; this is implemented in kappaetc:

      Code:
      help kappaetc ttest
      The implementation might fail for large datasets, i.e., many observations.

      Gwet, K. L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement, 76, 609–637.
      Last edited by daniel klein; 13 Jun 2020, 03:14.

      Comment


      • #4
        Thank you, Daniel, for the -kappaetc ttest- tool.
        I have implemented it using a dataset with 7 raters who assessed diagnoses before and after a diagnostic training course. The code is:

        Code:
        kappaetc Rater1-Rater7 if reading=="before", store(before)
        kappaetc Rater1-Rater7 if reading=="after", store(after)
        kappaetc before == after, ttest
        In the output I show here, would 0.221 represent the (nonsignificant) p-value for the comparison between the two ICCs?
        Attached file: kappaetc ttest for Statalist.pdf

        And... thank you also Leonardo, for the statistical paper.

        Comment


        • #5
          Originally posted by Michael McCulloch View Post
          In the output I show here, 0.221 would represent the (nonsignificant) p-value for the comparison between the two ICCs?
          Note that Gwet's AC is not the ICC; nor are any of the other agreement coefficients. A (quadratically) weighted Cohen's kappa should be close to the ICC, but it is not the same thing. You do not provide many details about your research, so it is hard to comment on whether you want agreement coefficients or the ICC. In general, you could use agreement coefficients if the rating categories are known in advance and/or categorical in nature.
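
          As an aside, if the ICC itself were wanted and the ratings were coded numerically, official Stata's -icc- command could compute it from long-form data. A hypothetical sketch, with rating and rater as placeholder variable names:

          Code:
          * two-way mixed-effects, consistency-of-agreement ICC
          icc rating subject rater if reading == "before", mixed consistency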


          Originally posted by Michael McCulloch View Post
          I have implemented the tool using a dataset with 7 raters, who assessed diagnosis before and after a diagnostic training course.
          The syntax seems correct. The output (presenting it in code delimiters instead of a PDF would be preferred) suggests that you have only 7 subjects that were rated. Note that the standard errors for the agreement coefficients, as well as the paired t-test, are based on large-sample approximations. You might want to consider a different approach, such as bootstrapping.


          Comment


          • #6
            Sorry Daniel for the ICC typo. I chose Gwet's AC because of some missing data we found.

            Thank you for pointing out the small-sample issue. Our study had 7 raters, and we compared before- vs. after-training diagnostic assessments of 7 subjects. How would bootstrapping be approached, given the reporting provided by kappaetc?

            Comment


            • #7
              And yes, rating categories were specified in advance, and all categorical.

              Comment


              • #8
                If there are missing ratings, the ICC as Daniel has implemented it (based on Gwet's methods) follows from a generalization of ANOVA, and so would be appropriate.

                A quadratically weighted kappa (for an ordinal outcome) has been shown to be asymptotically identical to an ICC, though small differences will be found in small samples or with missing data.

                Nevertheless, Daniel's advice about a bootstrap is worthwhile. It's possible to perform cluster resampling, where the cluster is the target of the rating, which preserves the correlation structure of the data.

                Comment


                • #9
                  Thank you, Leonardo. I believe I can implement this with code such as:

                  Code:
                  bootstrapbootstrap exp_list, cluster(varlist) : kappaetc
                  However, may I ask: I'm not sure I understand what you mean when you say
                  "the cluster is the target of the rating."

                  Comment


                  • #10
                    Obviously, correction:
                    Code:
                    bootstrap exp_list, cluster(varlist) : kappaetc

                    Comment


                    • #11
                      And if my raters are clinicians, and I'm comparing Gwet's AC before vs. after a diagnostic skill teaching intervention, then could the binary "Before/After" variable be the cluster?

                      Comment


                      • #12
                        Originally posted by Michael McCulloch View Post
                        And if my raters are clinicians, and I'm comparing Gwet's AC before vs. after a diagnostic skill teaching intervention, then could the binary "Before/After" variable be the cluster?
                        I can't speak to Gwet's AC because I'm not familiar enough with it. That aside, the usual setup I encounter is physicians (i.e., judges) evaluating patients (i.e., subjects). You have the same judges in your setup, and presumably they are also making ratings about patients (or diagnostic images), right? So the patient would be the cluster level. Time wouldn't work here because it would imply only two clusters.
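
                        To illustrate cluster resampling (a hypothetical sketch, assuming long data with one row per subject per reading, as later posted in #14): drawing whole subject clusters keeps both readings of a drawn subject together.

                        Code:
                        preserve
                        set seed 2020
                        * resample subjects with replacement; idcluster() tags each drawn copy
                        bsample, cluster(subject) idcluster(bs_id)
                        sort bs_id reading
                        list bs_id subject reading in 1/6, sepby(bs_id)
                        restore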

                        Comment


                        • #13
                          Let us be very clear about the setup. I understand that there are 7 raters/judges/clinicians and 7 subjects (patients or whatever). The 7 subjects are rated twice, i.e., the raters rate the same group of 7 subjects twice. I further understand that the dataset contains 7 variables representing the raters and one variable representing the time, i.e., before or after training. There are 14 observations, representing the 7 subjects at two points in time.

                          Given this setup, I would first reshape the (small) dataset to wide form so that the first seven variables, say Rater1_1-Rater7_1, represent the ratings before training and the last seven variables, say Rater1_2-Rater7_2, represent the ratings after training. This could be along the lines of

                          Code:
                          rename (Rater#) (Rater#_)
                          unab Raters : Rater1_-Rater7_
                          reshape wide `Raters' , i(id) j(time)
                          assuming id identifies the 7 subjects and time identifies the 2 points in time.

                          We could then simply bootstrap from the 7 observations. Here is a sketch:

                          Code:
                          capture program drop kappaetc_bs
                          program kappaetc_bs , rclass
                              tempname before after diff
                              // agreement coefficients before training
                              kappaetc Rater1_1-Rater7_1
                              matrix `before' = r(b)
                              // agreement coefficients after training
                              kappaetc Rater1_2-Rater7_2
                              matrix `after' = r(b)
                              // return the difference in coefficients for -bootstrap-
                              matrix `diff' = `before' - `after'
                              return matrix diff = `diff'
                          end
                          We could then run

                          Code:
                          // el(r(diff), 1, 5) extracts the 5th coefficient difference from r(b),
                          // presumably Gwet's AC (hence the name delta_ac)
                          bootstrap delta_ac = el(r(diff), 1, 5) , reps(500) : kappaetc_bs
                          estat bootstrap , all
                          Note that this would be easier with example data to play with.

                          Hope that helps.
                          Last edited by daniel klein; 14 Jun 2020, 02:11.

                          Comment


                          • #14
                            Thank you, Daniel, for suggesting the reshape as a first step. I have implemented it, and a data example is below.
                            I'm starting to work through the -capture program- code sketch, which is new to me.

                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input byte subject str18(depth1 depth2 depth3 depth4 depth5 depth6 depth7) str6 reading
                            51 "Neither"  "Neither"  "Deep"     "Neither" "Neither"  "Deep"     "Deep"     "after"
                            52 "Neither"  "Neither"  "Floating" "Neither" "Neither"  "Neither"  "Neither"  "after"
                            53 "Neither"  "Neither"  "Floating" "Neither" "Neither"  "Neither"  "Neither"  "after"
                            54 "Neither"  "Neither"  "Floating" "Neither" "Neither"  "Neither"  "Neither"  "after"
                            55 "Neither"  ""         "Neither"  "Neither" "Neither"  "Floating" "Floating" "after"
                            56 "Neither"  "Neither"  "Neither"  "Neither" "Neither"  "Floating" "Deep"     "after"
                            57 "Neither"  "Deep"     "Floating" "Neither" "Neither"  "Neither"  "Floating" "after"
                            51 "Floating" ""         "Deep"     "Neither" "Neither"  "Neither"  "Deep"     "before"
                            52 "Floating" "Floating" "Neither"  "Neither" "Floating" "Neither"  "Neither"  "before"
                            53 "Floating" "Floating" "Neither"  "Neither" "Floating" "Neither"  "Floating" "before"
                            54 "Floating" "Deep"     "Deep"     "Neither" "Floating" "Floating" "Floating" "before"
                            55 "Deep"     "Deep"     "Deep"     "Neither" "Floating" "Deep"     "Neither"  "before"
                            56 "Floating" "Floating" "Deep"     "Deep"    "Neither"  "Deep"     "Deep"     "before"
                            57 "Deep"     "Floating" "Neither"  "Neither" "Floating" "Deep"     "Neither"  "before"
                            end
                            
                            sort reading  subject
                            l subject reading depth1 depth2 depth3, noo sepby(reading)


                            Comment


                            • #15
                              NB:
                              The diagnostic training is in palpation skills, with 3 categorical answers for raters to choose from: Floating, Deep, or Neither.
                              Variable subject identifies the 7 subjects, with IDs 51-57.
                              Variable reading indicates whether ratings were provided before or after the diagnostic skills training intervention.
                              Variables depth1-depth7 record the choices of raters 1 through 7, with depth1 being the rating choice of rater 1, depth2 of rater 2, and so on.
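
                              Given these definitions, a minimal sketch adapting the reshape from #13 to this dataset, mapping the string variable reading to a numeric time (1 = before, 2 = after) so that the kappaetc_bs program from #13 applies unchanged:

                              Code:
                              * code reading numerically so reshape creates Rater#_1 / Rater#_2
                              generate byte time = cond(reading == "before", 1, 2)
                              drop reading
                              rename (depth#) (Rater#_)
                              unab Raters : Rater1_-Rater7_
                              reshape wide `Raters' , i(subject) j(time)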
                              Last edited by Michael McCulloch; 14 Jun 2020, 20:38.

                              Comment
