
  • Confidence intervals for Kappa for each rating category?

    Hello,
    I am calculating Kappas to assess inter-rater agreement on specific features in radiographs; there are 6 raters, 5 possible outcomes, some missing data (i.e. not all raters rated all radiographs). Using

    kap rater1 rater2 rater3 rater4 rater5 rater6

    produces a nice table with Kappas for each rating category as well as a combined Kappa. I then use the 'kapci' command to produce a confidence interval for the overall/combined Kappa. So far so good!

    However, I'd also like to calculate confidence intervals for the Kappas for each rating category - does anyone know how to do that, or if that is even possible? I've drawn a blank searching the documentation and this forum. I am aware of kappaetc and kappa2 commands but neither seem to have an option to calculate the rating-specific CIs.

    I am calculating both unweighted and weighted Kappas, in case that matters.

    Many thanks,
    Kristien

  • #2
    Originally posted by Kristien Verheyen:
    I am calculating Kappas to assess inter-rater agreement [...] [u]sing

    kap rater1 rater2 rater3 rater4 rater5 rater6

    [...]

    I am calculating both unweighted and weighted Kappas, in case that matters.
    Out of curiosity, how do you do that? Does kap support weighted estimation for multiple raters in Stata 19? Up to Stata 18, kap did not allow any options with more than two raters, and the current syntax diagram does not indicate any change to this behavior.

    Anyway, regarding your question: unfortunately, kap does not return the category-specific kappa values, so you cannot bootstrap them. You're right that kappaetc (from SSC, I suppose) does not compute category-specific kappas either. However, you can get them with a nested loop. Here is an example:
    Code:
    version 18
    
    webuse rvary2
    
    forvalues c = 1/3 { // loop over categories
        
        preserve
        
        forvalues r = 1/5 { // loop over raters
            
            replace rater`r' = (rater`r'==`c') if !mi(rater`r')
            
        }
        
        kappaetc rater1 rater2 rater3 rater4 rater5
        
        restore
        
    }
    The relevant output:
    Code:
    Interrater agreement                             Number of subjects =      10
                                               Ratings per subject: min =       3
                                                                    avg =     4.7
                                                                    max =       5
                                            Number of rating categories =       2
    ------------------------------------------------------------------------------
                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
    ---------------------+--------------------------------------------------------
       Percent Agreement |  0.6233    0.0866   7.19   0.000     0.4273     0.8193
    Brennan and Prediger |  0.2467    0.1733   1.42   0.188    -0.1453     0.6387
    Cohen/Conger's Kappa |  0.2688    0.1518   1.77   0.110    -0.0745     0.6121
     Scott/Fleiss' Kappa |  0.2260    0.1776   1.27   0.235    -0.1757     0.6277
               Gwet's AC |  0.2662    0.1831   1.45   0.180    -0.1480     0.6805
    Krippendorff's Alpha |  0.2759    0.1761   1.57   0.152    -0.1224     0.6742
    ------------------------------------------------------------------------------
    
    (output omitted)
    
    Interrater agreement                             Number of subjects =      10
                                               Ratings per subject: min =       3
                                                                    avg =     4.7
                                                                    max =       5
                                            Number of rating categories =       2
    ------------------------------------------------------------------------------
                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
    ---------------------+--------------------------------------------------------
       Percent Agreement |  0.8700    0.0667  13.03   0.000     0.7190     1.0000
    Brennan and Prediger |  0.7400    0.1335   5.54   0.000     0.4380     1.0000
    Cohen/Conger's Kappa |  0.6435    0.0685   9.39   0.000     0.4885     0.7985
     Scott/Fleiss' Kappa |  0.6384    0.0633  10.08   0.000     0.4952     0.7817
               Gwet's AC |  0.7970    0.1441   5.53   0.000     0.4711     1.0000
    Krippendorff's Alpha |  0.6515    0.0597  10.91   0.000     0.5165     0.7865
    ------------------------------------------------------------------------------
    Confidence intervals are clipped at the upper limit.
    
    (output omitted)
    
    Interrater agreement                             Number of subjects =      10
                                               Ratings per subject: min =       3
                                                                    avg =     4.7
                                                                    max =       5
                                            Number of rating categories =       2
    ------------------------------------------------------------------------------
                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
    ---------------------+--------------------------------------------------------
       Percent Agreement |  0.6733    0.0929   7.25   0.000     0.4631     0.8836
    Brennan and Prediger |  0.3467    0.1859   1.87   0.095    -0.0738     0.7671
    Cohen/Conger's Kappa |  0.3140    0.1744   1.80   0.105    -0.0805     0.7086
     Scott/Fleiss' Kappa |  0.2788    0.1943   1.43   0.185    -0.1608     0.7185
               Gwet's AC |  0.4028    0.2058   1.96   0.082    -0.0627     0.8683
    Krippendorff's Alpha |  0.3044    0.2037   1.49   0.169    -0.1564     0.7653
    ------------------------------------------------------------------------------



    • #3
      Hi Dan,
      Thanks so much for your response, I've no experience with nested loops but will give it a go!

      I use Stata 14 and get the weighted Kappa for multiple raters, multiple outcomes, and missing data using the kappa2 command. It doesn't give Kappas for each outcome category, though; just a combined Kappa estimate and no confidence interval. If anyone can advise on how to get a CI for this weighted Kappa, I'd be most grateful. Example output from kappa2:

      . kappa2 Rater1BD Rater2BD Rater3BD Rater4BD Rater5BD Rater6BD, wgt(mine)

      Ratings weighted by:
      1.0000 0.7500 0.5000 0.0000 0.0000
      0.7500 1.0000 0.7500 0.5000 0.0000
      0.5000 0.7500 1.0000 0.7500 0.0000
      0.0000 0.5000 0.7500 1.0000 0.0000
      0.0000 0.0000 0.0000 0.0000 1.0000


      +--------------------------------------------------+
      |      AGREEMENT |    Po         Pe          K     |
      |----------------+---------------------------------|
      |    pairwise we |      .876   .7112301   .5705923 |
      +--------------------------------------------------+



      • #4
        Generally, you can bootstrap results from kappa2 to obtain CIs. Edit: kappa2 also has a jackknife option that will give you CIs. However, I wouldn't recommend using kappa2 for weighted estimation when some observations are missing. I cannot remember exactly how kappa2 handles missing values, but there seems to be an issue. Here's a quick example:
        Code:
        . version 14
        
        . webuse p615b
        
        . set seed 20250711
        
        . forvalues r = 1/5 {
          2.     replace rater`r' = . in `=runiformint(1,10)'
          3. }
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        
        . kapwgt mine 1 \ .75 1 \ 0 .75 1
        
        . kappa2 rater1-rater5 , wgt(mine)
        Ratings weighted by:
           1.0000   0.7500   0.0000
           0.7500   1.0000   0.7500
           0.0000   0.7500   1.0000
        
        
        +--------------------------------------------------+
        |      AGREEMENT |    Po         Pe          K     |
        |----------------+---------------------------------|
        |    pairwise we |  .6966667   .8603704  -1.172414 |
        +--------------------------------------------------+
        That certainly doesn't seem plausible. Compare with
        Code:
        . kappaetc rater1-rater5 , wgt(mine)
        
        Interrater agreement                             Number of subjects =      10
        (weighted analysis)                        Ratings per subject: min =       3
                                                                        avg =     4.5
                                                                        max =       5
                                                Number of rating categories =       3
        ------------------------------------------------------------------------------
                             |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
        ---------------------+--------------------------------------------------------
           Percent Agreement |  0.6967    0.0778   8.95   0.000     0.5206     0.8728
        Brennan and Prediger |  0.0900    0.2335   0.39   0.709    -0.4383     0.6183
        Cohen/Conger's Kappa |  0.3089    0.1351   2.29   0.048     0.0032     0.6146
         Scott/Fleiss' Kappa |  0.2272    0.1792   1.27   0.236    -0.1780     0.6325
                   Gwet's AC |  0.1473    0.1956   0.75   0.471    -0.2951     0.5897
        Krippendorff's Alpha |  0.2295    0.1766   1.30   0.226    -0.1700     0.6290
        ------------------------------------------------------------------------------
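
        For completeness, here is what the bootstrap approach mentioned above could look like. This is only a sketch: kappa2 is community-contributed and I have not checked which results it leaves behind in r(), so the r(kappa) name below is an assumption; run return list after kappa2 to see what is actually stored and adjust the wrapper accordingly.
        Code:
        * sketch only -- r(kappa) is an assumption; check -return list- after kappa2
        capture program drop boot_kappa2
        program boot_kappa2, rclass
            kappa2 rater1-rater5 , wgt(mine)
            return scalar k = r(kappa)
        end
        
        bootstrap k = r(k), reps(1000) seed(20250711): boot_kappa2
        Given the missing-value issue demonstrated above, though, bootstrapping kappa2 with user-defined weights would only reproduce biased estimates many times over.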

        Note that category-specific kappa values are based on binary ratings, so weights don't make sense. In fact, you can replicate category-specific kappas with the respective weighting matrix. Watch:
        Code:
        . webuse p615b
        
        . kap rater1-rater5
        
        There are 5 raters per subject:
        
                 Outcome |    Kappa          Z     Prob>Z
        -----------------+-------------------------------
                       1 |    0.2917       2.92    0.0018
                       2 |    0.6711       6.71    0.0000
                       3 |    0.3490       3.49    0.0002
        -----------------+-------------------------------
                combined |    0.4179       5.83    0.0000
        
        . // replicate category 1 vs. rest
        . kapwgt one_vs_rest 1 \ 0 1 \ 0 1 1
        
        . kappaetc rater1-rater5 , wgt(one_vs_rest) showw
        
        Interrater agreement                             Number of subjects =      10
        (weighted analysis)                             Ratings per subject =       5
                                                Number of rating categories =       3
        ------------------------------------------------------------------------------
                             |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
        ---------------------+--------------------------------------------------------
           Percent Agreement |  0.6600    0.0792   8.34   0.000     0.4809     0.8391
        Brennan and Prediger |  0.2350    0.1781   1.32   0.220    -0.1679     0.6379
        Cohen/Conger's Kappa |  0.3333    0.1368   2.44   0.038     0.0240     0.6427
         Scott/Fleiss' Kappa |  0.2917    0.1639   1.78   0.109    -0.0790     0.6624
                   Gwet's AC |  0.2544    0.1695   1.50   0.168    -0.1290     0.6378
        Krippendorff's Alpha |  0.3058    0.1639   1.87   0.095    -0.0649     0.6765
        ------------------------------------------------------------------------------
        
        Weighting matrix (one_vs_rest weights)
          1.0000  0.0000  0.0000
          0.0000  1.0000  1.0000
          0.0000  1.0000  1.0000
        Last edited by daniel klein; 11 Jul 2025, 05:25.



        • #5
          Hi Daniel,
          Thanks so much for your further help, much appreciated. Of course it doesn't make sense to generate outcome-specific weighted Kappas; my bad (I'm an epidemiologist, not a statistician; can you tell?). The nested-loops code for obtaining the CIs for the outcome-specific unweighted estimates worked, by the way, so thank you for that!

          My understanding was that kappa2 was developed specifically for 'unbalanced' designs where not all raters rate all the subjects, as is the case in my dataset. Reference paper here: http://www.idescat.cat/sort/questiio....8.Abraira.pdf
          But I completely see your point with the example you gave. The weighted Kappa from kappa2 in my dataset makes sense (0.57 vs. unweighted 0.44); using kappaetc with the weights I get 0.59, so not that different.

          thanks again!



          • #6
            Took me quite a while to figure this one out. There's a bug in kappa2. The command yields incorrect results when there are missing values and user-defined weights.

            The following may seem cryptic, but it is hopefully useful, especially for the authors or maintainers of kappa2, who should be notified.

            The problem is with the code in the file kappaAux.ado (or kappaaux.ado). The code uses tokenize early on (line 8) to store the specified variable names in the numbered local macros 1, 2, ... But these local macros aren't referenced until hundreds of lines later (lines 277 and 281). Unfortunately, with user-defined weights, the code calls parse in line 169, before the numbered macros are used, which wipes them out. Stata doesn't throw a syntax error because ``l''[`p'] (line 277) then evaluates to [`p'], which evaluates to [1], [2], ..., which are just the numerals 1, 2, ... As a result, the conditional statements in the nested loops always fail silently, yielding inflated values for the expected proportion of agreement.
            
            Frankly, using numbered local macros across hundreds of lines of code like this is really bad programming style. Hopefully, this example helps others to improve their programming.
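            
            To see the macro-wiping mechanism in isolation, here is a minimal illustration. This is not code from kappaAux.ado; it merely demonstrates how a later call to tokenize (or parse) overwrites the numbered local macros:
            Code:
            tokenize "rater1 rater2 rater3"  // numbered macros now hold the names
            display "`1' `3'"                // rater1 rater3
            tokenize "something else"        // a later parsing step redefines them
            display "`1' `3'"                // "something" plus an empty macro 3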
            Last edited by daniel klein; 11 Jul 2025, 14:56.
