
  • Confidence intervals for Kappa for each rating category?

    Hello,
    I am calculating Kappas to assess inter-rater agreement on specific features in radiographs; there are 6 raters, 5 possible outcomes, some missing data (i.e. not all raters rated all radiographs). Using

    kap rater1 rater2 rater3 rater4 rater5 rater6

    produces a nice table with Kappas for each rating category as well as a combined Kappa. I then use the 'kapci' command to produce a confidence interval for the overall/combined Kappa. So far so good!

    However, I'd also like to calculate confidence intervals for the Kappas for each rating category - does anyone know how to do that, or if that is even possible? I've drawn a blank searching the documentation and this forum. I am aware of kappaetc and kappa2 commands but neither seem to have an option to calculate the rating-specific CIs.

    I am calculating both unweighted and weighted Kappas, in case that matters.

    Many thanks,
    Kristien

  • #2
    Originally posted by Kristien Verheyen:
    I am calculating Kappas to assess inter-rater agreement [...] [u]sing

    kap rater1 rater2 rater3 rater4 rater5 rater6

    [...]

    I am calculating both unweighted and weighted Kappas, in case that matters.
    Out of curiosity, how do you do that? Does kap support weighted estimation for multiple raters in Stata 19? Up to Stata 18, kap did not allow any options with more than two raters, and the current syntax diagram does not indicate any change to this behavior.

    Anyway, regarding your question: unfortunately, kap does not return the category-specific kappa values, so you cannot bootstrap them. You're right that kappaetc (from SSC, I suppose) does not compute category-specific kappas either. However, you can get them with a nested loop. Here is an example:
    Code:
    version 18
    
    webuse rvary2
    
    forvalues c = 1/3 { // loop over categories
        
        preserve
        
        forvalues r = 1/5 { // loop over raters
            
            replace rater`r' = (rater`r'==`c') if !mi(rater`r')
            
        }
        
        kappaetc rater1 rater2 rater3 rater4 rater5
        
        restore
        
    }
    The relevant output:
    Code:
    Interrater agreement                             Number of subjects =      10
                                               Ratings per subject: min =       3
                                                                    avg =     4.7
                                                                    max =       5
                                            Number of rating categories =       2
    ------------------------------------------------------------------------------
                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
    ---------------------+--------------------------------------------------------
       Percent Agreement |  0.6233    0.0866   7.19   0.000     0.4273     0.8193
    Brennan and Prediger |  0.2467    0.1733   1.42   0.188    -0.1453     0.6387
    Cohen/Conger's Kappa |  0.2688    0.1518   1.77   0.110    -0.0745     0.6121
     Scott/Fleiss' Kappa |  0.2260    0.1776   1.27   0.235    -0.1757     0.6277
               Gwet's AC |  0.2662    0.1831   1.45   0.180    -0.1480     0.6805
    Krippendorff's Alpha |  0.2759    0.1761   1.57   0.152    -0.1224     0.6742
    ------------------------------------------------------------------------------
    
    (output omitted)
    
    Interrater agreement                             Number of subjects =      10
                                               Ratings per subject: min =       3
                                                                    avg =     4.7
                                                                    max =       5
                                            Number of rating categories =       2
    ------------------------------------------------------------------------------
                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
    ---------------------+--------------------------------------------------------
       Percent Agreement |  0.8700    0.0667  13.03   0.000     0.7190     1.0000
    Brennan and Prediger |  0.7400    0.1335   5.54   0.000     0.4380     1.0000
    Cohen/Conger's Kappa |  0.6435    0.0685   9.39   0.000     0.4885     0.7985
     Scott/Fleiss' Kappa |  0.6384    0.0633  10.08   0.000     0.4952     0.7817
               Gwet's AC |  0.7970    0.1441   5.53   0.000     0.4711     1.0000
    Krippendorff's Alpha |  0.6515    0.0597  10.91   0.000     0.5165     0.7865
    ------------------------------------------------------------------------------
    Confidence intervals are clipped at the upper limit.
    
    (output omitted)
    
    Interrater agreement                             Number of subjects =      10
                                               Ratings per subject: min =       3
                                                                    avg =     4.7
                                                                    max =       5
                                            Number of rating categories =       2
    ------------------------------------------------------------------------------
                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
    ---------------------+--------------------------------------------------------
       Percent Agreement |  0.6733    0.0929   7.25   0.000     0.4631     0.8836
    Brennan and Prediger |  0.3467    0.1859   1.87   0.095    -0.0738     0.7671
    Cohen/Conger's Kappa |  0.3140    0.1744   1.80   0.105    -0.0805     0.7086
     Scott/Fleiss' Kappa |  0.2788    0.1943   1.43   0.185    -0.1608     0.7185
               Gwet's AC |  0.4028    0.2058   1.96   0.082    -0.0627     0.8683
    Krippendorff's Alpha |  0.3044    0.2037   1.49   0.169    -0.1564     0.7653
    ------------------------------------------------------------------------------



    • #3
      Hi Dan,
      Thanks so much for your response, I've no experience with nested loops but will give it a go!

      I use Stata 14 and get the weighted Kappa for multiple raters, multiple outcomes, and missing data using the kappa2 command. It doesn't give Kappas for each outcome category, though; just a combined Kappa estimate and no confidence interval. If anyone can advise on how to get a CI for this weighted Kappa, I'd be most grateful. Example output from kappa2:

      . kappa2 Rater1BD Rater2BD Rater3BD Rater4BD Rater5BD Rater6BD, wgt(mine)

      Ratings weighted by:
      1.0000 0.7500 0.5000 0.0000 0.0000
      0.7500 1.0000 0.7500 0.5000 0.0000
      0.5000 0.7500 1.0000 0.7500 0.0000
      0.0000 0.5000 0.7500 1.0000 0.0000
      0.0000 0.0000 0.0000 0.0000 1.0000


      +--------------------------------------------------+
      |      AGREEMENT |    Po         Pe          K     |
      |----------------+---------------------------------|
      |    pairwise we |      .876   .7112301   .5705923 |
      +--------------------------------------------------+



      • #4
        Generally, you can bootstrap results from kappa2 to obtain CIs. Edit: kappa2 also has a jackknife option that will give you CIs. However, I wouldn't recommend using kappa2 for weighted estimation when some observations are missing. I cannot remember exactly how kappa2 handles missing values, but there seems to be an issue. Here's a quick example:
        Code:
        . version 14
        
        . webuse p615b
        
        . set seed 20250711
        
        . forvalues r = 1/5 {
          2.     replace rater`r' = . in `=runiformint(1,10)'
          3. }
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        (1 real change made, 1 to missing)
        
        . kapwgt mine 1 \ .75 1 \ 0 .75 1
        
        . kappa2 rater1-rater5 , wgt(mine)
        Ratings weighted by:
           1.0000   0.7500   0.0000
           0.7500   1.0000   0.7500
           0.0000   0.7500   1.0000
        
        
        +--------------------------------------------------+
        |      AGREEMENT |    Po         Pe          K     |
        |----------------+---------------------------------|
        |    pairwise we |  .6966667   .8603704  -1.172414 |
        +--------------------------------------------------+
        That certainly doesn't seem plausible. Compare with
        Code:
        . kappaetc rater1-rater5 , wgt(mine)
        
        Interrater agreement                             Number of subjects =      10
        (weighted analysis)                        Ratings per subject: min =       3
                                                                        avg =     4.5
                                                                        max =       5
                                                Number of rating categories =       3
        ------------------------------------------------------------------------------
                             |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
        ---------------------+--------------------------------------------------------
           Percent Agreement |  0.6967    0.0778   8.95   0.000     0.5206     0.8728
        Brennan and Prediger |  0.0900    0.2335   0.39   0.709    -0.4383     0.6183
        Cohen/Conger's Kappa |  0.3089    0.1351   2.29   0.048     0.0032     0.6146
         Scott/Fleiss' Kappa |  0.2272    0.1792   1.27   0.236    -0.1780     0.6325
                   Gwet's AC |  0.1473    0.1956   0.75   0.471    -0.2951     0.5897
        Krippendorff's Alpha |  0.2295    0.1766   1.30   0.226    -0.1700     0.6290
        ------------------------------------------------------------------------------
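
        For completeness, here is what the bootstrap approach mentioned above could look like. This is only a sketch: kappa2 is community-contributed and I have not checked which results it leaves behind in r(), so the r(kappa) name below is an assumption; run return list after kappa2 to see what is actually stored and adjust the wrapper accordingly.
        Code:
        * sketch only -- r(kappa) is an assumption; check -return list- after kappa2
        capture program drop boot_kappa2
        program boot_kappa2, rclass
            kappa2 rater1-rater5 , wgt(mine)
            return scalar k = r(kappa)
        end
        
        bootstrap k = r(k), reps(1000) seed(20250711): boot_kappa2
        Given the missing-value issue demonstrated above, though, bootstrapping kappa2 with user-defined weights would only reproduce biased estimates many times over.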

        Note that category-specific kappa values are based on binary ratings, so weights don't make sense. In fact, you can replicate category-specific kappas with the respective weighting matrix. Watch:
        Code:
        . webuse p615b
        
        . kap rater1-rater5
        
        There are 5 raters per subject:
        
                 Outcome |    Kappa          Z     Prob>Z
        -----------------+-------------------------------
                       1 |    0.2917       2.92    0.0018
                       2 |    0.6711       6.71    0.0000
                       3 |    0.3490       3.49    0.0002
        -----------------+-------------------------------
                combined |    0.4179       5.83    0.0000
        
        . // replicate category 1 vs. rest
        . kapwgt one_vs_rest 1 \ 0 1 \ 0 1 1
        
        . kappaetc rater1-rater5 , wgt(one_vs_rest) showw
        
        Interrater agreement                             Number of subjects =      10
        (weighted analysis)                             Ratings per subject =       5
                                                Number of rating categories =       3
        ------------------------------------------------------------------------------
                             |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
        ---------------------+--------------------------------------------------------
           Percent Agreement |  0.6600    0.0792   8.34   0.000     0.4809     0.8391
        Brennan and Prediger |  0.2350    0.1781   1.32   0.220    -0.1679     0.6379
        Cohen/Conger's Kappa |  0.3333    0.1368   2.44   0.038     0.0240     0.6427
         Scott/Fleiss' Kappa |  0.2917    0.1639   1.78   0.109    -0.0790     0.6624
                   Gwet's AC |  0.2544    0.1695   1.50   0.168    -0.1290     0.6378
        Krippendorff's Alpha |  0.3058    0.1639   1.87   0.095    -0.0649     0.6765
        ------------------------------------------------------------------------------
        
        Weighting matrix (one_vs_rest weights)
          1.0000  0.0000  0.0000
          0.0000  1.0000  1.0000
          0.0000  1.0000  1.0000
        Last edited by daniel klein; 11 Jul 2025, 05:25.



        • #5
          Hi Daniel,
          Thanks so much for your further help, much appreciated. Of course it doesn't make sense to generate outcome-specific weighted Kappas; my bad (I'm an epidemiologist, not a statistician; can you tell?). The nested-loops code for obtaining the CIs for the outcome-specific unweighted estimates worked, by the way, so thank you for that!

          My understanding was that kappa2 was developed specifically for 'unbalanced' designs where not all raters rate all the subjects, as is the case in my dataset. Reference paper here: http://www.idescat.cat/sort/questiio....8.Abraira.pdf
          But I completely see your point with the example you gave. The weighted Kappa from kappa2 in my dataset makes sense (0.57 vs. unweighted 0.44); using kappaetc with the weights I get 0.59, so not that different.

          thanks again!



          • #6
            Took me quite a while to figure this one out. There's a bug in kappa2. The command yields incorrect results when there are missing values and user-defined weights.

            The following may seem cryptic, but it is hopefully useful, especially for the authors or maintainers of kappa2, who should be notified.

            The problem is with the code in the file kappaAux.ado (or kappaaux.ado). The code uses tokenize early on (line 8) to store the specified variable names in the numbered local macros 1, 2, ... But these local macros aren't referenced until hundreds of lines later (lines 277 and 281). Unfortunately, with user-defined weights, the code calls parse in line 169, before the numbered macros are used, which wipes them out. Stata doesn't throw a syntax error because ``l''[`p'] (line 277) then evaluates to [`p'], which evaluates to [1], [2], ..., which are just the numerals 1, 2, ... As a result, the conditional statements in the nested loops always fail silently, yielding inflated values for the expected proportion of agreement.
            
            Frankly, using numbered local macros across hundreds of lines of code like this is really bad programming style. Hopefully, this example helps others to improve their programming.
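            
            To see the macro-wiping mechanism in isolation, here is a minimal illustration. This is not code from kappaAux.ado; it merely demonstrates how a later call to tokenize (or parse) overwrites the numbered local macros:
            Code:
            tokenize "rater1 rater2 rater3"  // numbered macros now hold the names
            display "`1' `3'"                // rater1 rater3
            tokenize "something else"        // a later parsing step redefines them
            display "`1' `3'"                // "something" plus an empty macro 3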
            Last edited by daniel klein; 11 Jul 2025, 14:56.
