  • Kappa statistics homogeneity in STATA: how to fix?

    Dear all,
    I don't know how to fix this.

    I am doing an inter-rater agreement study in which 2 operators evaluate a certain number of patients and classify them with a dichotomous outcome (positive/negative).

    This takes place in three different hospitals (2 operators, different for each hospital, classify a certain number of patients belonging to that hospital).

    I don't know whether a single Kappa statistic can be calculated that accounts for this multicenter design.

    However, I know that there are methods to evaluate the homogeneity of the three Kappa statistics.

    Do you have any idea how to do this in Stata?

    Thank you so much in advance!

    Gianfranco

  • #2
    These designs assume that raters are exchangeable. If that holds, you should not have to adjust for center effects, because all variability is attributed to raters and subjects. You could look at the kappa for all raters and see how it compares qualitatively to the kappas calculated for each hospital site alone. If they are roughly similar, you can argue that there are no meaningful site-specific effects.
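
    As a minimal sketch of that comparison (not a definitive analysis), the following uses Stata's built-in kap command on the rate2 example dataset. The dichotomization cutoff and the hospital variable are assumptions made purely for illustration; with real data, hospital would be recorded rather than generated.

    ```stata
    // illustration only: example data shipped with Stata
    webuse rate2, clear

    // collapse the four rating categories into a binary outcome (assumed cutoff)
    recode rada radb (1/2 = 0) (3/4 = 1)

    // hypothetical site indicator: three artificial "hospitals"
    generate byte hospital = 1 + (_n > 28) + (_n > 57)

    // overall kappa, ignoring site
    kap rada radb

    // site-specific kappas, one per hospital
    bysort hospital: kap rada radb
    ```

    The point estimates can then be compared informally across sites, keeping in mind that estimates from small per-site samples will be imprecise.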

    I might also try to estimate the ICC from a two-way mixed model, once ignoring and once adding a fixed effect of center, to observe whether these two ICC estimates are similar.
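
    One way this could look in Stata is sketched below, assuming long-format data with one row per rating. It fits a linear mixed model with a random intercept for patient, which is a simplification of the two-way model mentioned above and treats the 0/1 rating as if it were continuous. The example data, the binary recoding, and the variables patient and hospital are assumptions for illustration.

    ```stata
    // illustration only: example data shipped with Stata
    webuse rate2, clear
    recode rada radb (1/2 = 0) (3/4 = 1)

    // hypothetical identifiers: one row per patient, three artificial sites
    generate patient = _n
    generate byte hospital = 1 + (_n > 28) + (_n > 57)

    // reshape to one row per rating
    rename (rada radb) (rating1 rating2)
    reshape long rating, i(patient) j(rater)

    // ICC ignoring center
    mixed rating || patient:
    estat icc

    // residual ICC after adding a fixed effect of center
    mixed rating i.hospital || patient:
    estat icc
    ```

    If the two ICC estimates are close, that is informal evidence that center contributes little to the between-patient variability.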

    • #3
      Thanks a lot, Leonardo Guizzetti.

      Do I understand correctly that if the required sample size is, say, 45 patients, I assign 15 to each of the 3 hospitals, label the two raters at each hospital "rater A" and "rater B", and calculate the global Kappa?
      It doesn't matter who is A and who is B, correct?

      Regarding the ICC in a mixed-effects model, I wonder whether it is appropriate to use the ICC for a dichotomous variable.

      Thank you again.

      • #4
        Originally posted by Gianfranco Di Gennaro
        I understand that if I have, say, a required sample size of 45 patients, I distribute 15 to each of 3 hospitals, and have the two rater operators of each hospital identified as "rater A" and "rater B", and calculate the global Kappa?
        It doesn't matter who is A and who is B, correct?

        If I understand you correctly, each subject within any hospital is rated by both raters at that same hospital. Then yes, you would compute a global (overall) kappa in which the raters are numbered 1 to 6. I hesitate to call them rater A and rater B within each hospital, because each rater needs a unique identifier; otherwise raters from different hospitals would be wrongly treated as the same person and combined in the modeling. A conditional, or hospital-specific, kappa then uses data from only the two raters at that hospital.
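
        A small sketch of how unique rater identifiers could be built, assuming long-format data in which each rater is recorded only as "A" or "B" within a hospital. The variable names hospital, rater, and rater_id are hypothetical.

        ```stata
        // toy data: three hospitals, raters labelled A/B within each hospital
        clear
        input byte hospital str1 rater
        1 "A"
        1 "B"
        2 "A"
        2 "B"
        3 "A"
        3 "B"
        end

        // one distinct integer per hospital-by-rater combination (1 to 6)
        egen rater_id = group(hospital rater), label

        list, sepby(hospital)
        ```

        With this numbering, rater "A" at hospital 1 and rater "A" at hospital 2 are kept apart.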

        Originally posted by Gianfranco Di Gennaro
        Regarding ICC in a mixed effect model, I wonder if it is adequate to use ICC for a dichotomous variable.

        Yes it is, as the weighted kappa (with quadratic weights) is known to be (asymptotically) equivalent to the ICC. See the citation below.

        Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3), 613-619.

        • #5
          Thank you again Leonardo Guizzetti

          I have just one last question.
          Do you think it's appropriate to compute a sample size (number of subjects to be rated) based on only two raters and then split that sample across the three hospitals?
          Thanks again. All the best!
          Gianfranco


          • #6
            Originally posted by Leonardo Guizzetti
            If I understand you correctly, each subject within any hospital is rated by both raters at that same hospital. Then yes, you would compute a global (overall) kappa where raters are identified from 1 to 6 (I hesitate to call them each rater A and B to ensure that they have a unique identifier, or else they will incorrectly be misidentified and combined in the modeling)

            If you are talking about Cohen's Kappa, then yes, each rater must be identified. In Fleiss' Kappa (which reduces to Scott's Pi in the two-rater case), referenced later, raters are interchangeable. Here is a quick example, using kappaetc (SJ or SSC):

            Code:
            // setup
            webuse rate2
            
            // mimic binary ratings
            recode rada radb (1/2 = 0) (3/4 = 1)
            
            // mimic hospital 1
            generate rater1 = rada in 1/28
            generate rater2 = radb in 1/28
            
            // mimic hospital 2
            generate rater3 = rada in 29/57
            generate rater4 = radb in 29/57
            
            // mimic hospital 3
            generate rater5 = rada in 58/L
            generate rater6 = radb in 58/L
            
            // overall
            kappaetc rada radb
            
            // separated by hospital
            kappaetc rater1-rater6

            The (relevant) output is

            Code:
            . // overall
            . kappaetc rada radb
            
            Interrater agreement                             Number of subjects =      85
                                                            Ratings per subject =       2
                                                    Number of rating categories =       2
            ------------------------------------------------------------------------------
                                 |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
            ---------------------+--------------------------------------------------------
               Percent Agreement |  0.8471    0.0393  21.57   0.000     0.7690     0.9252
            Brennan and Prediger |  0.6941    0.0785   8.84   0.000     0.5379     0.8503
            Cohen/Conger's Kappa |  0.6347    0.0888   7.14   0.000     0.4580     0.8114
                Scott/Fleiss' Pi |  0.6273    0.0943   6.65   0.000     0.4397     0.8148
                       Gwet's AC |  0.7406    0.0721  10.27   0.000     0.5973     0.8840
            Krippendorff's Alpha |  0.6294    0.0943   6.68   0.000     0.4419     0.8170
            ------------------------------------------------------------------------------
            
            . 
            . // separated by hospital
            . kappaetc rater1-rater6
            
            Interrater agreement                             Number of subjects =      85
                                                            Ratings per subject =       2
                                                    Number of rating categories =       2
            ------------------------------------------------------------------------------
                                 |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
            ---------------------+--------------------------------------------------------
               Percent Agreement |  0.8471    0.0393  21.57   0.000     0.7690     0.9252
            Brennan and Prediger |  0.6941    0.0785   8.84   0.000     0.5379     0.8503
            Cohen/Conger's Kappa |  0.6638    0.0922   7.20   0.000     0.4805     0.8471
             Scott/Fleiss' Kappa |  0.6273    0.0943   6.65   0.000     0.4397     0.8148
                       Gwet's AC |  0.7406    0.0721  10.27   0.000     0.5973     0.8840
            Krippendorff's Alpha |  0.6294    0.0943   6.68   0.000     0.4419     0.8170
            ------------------------------------------------------------------------------

            Note how Cohen's Kappa changes when we estimate agreement among 6 raters instead of pooling the ratings into two raters. Note also how Fleiss' Kappa (and all other coefficients) is the same in both scenarios.

            As for the equivalence of Kappa and ICC, see this post. Note however that the equality is shown for quadratically weighted Kappa. With binary ratings, quadratic weights reproduce the unweighted Kappa and the equivalence no longer holds. I am not saying that this is necessarily a problem.
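
            To see why quadratic weights add nothing with two categories: the quadratic weight for categories i and j out of k is 1 - ((i - j)/(k - 1))^2, which for k = 2 equals 1 on the diagonal and 0 off it, that is, the unweighted (identity) weights. A quick check, as a sketch using the built-in kap command with its prerecorded wgt(w2) weights (the setup is repeated so the snippet runs on its own):

            ```stata
            // same example data and binary recoding as above
            webuse rate2, clear
            recode rada radb (1/2 = 0) (3/4 = 1)

            // unweighted Cohen's kappa
            kap rada radb

            // quadratically weighted kappa: weights 1 - ((i-j)/(k-1))^2
            kap rada radb, wgt(w2)
            ```

            Both calls should report the same kappa, because with two categories the weight matrix is the identity.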

            I have not looked into the homogeneity of agreement coefficients.
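
            For readers who want a starting point on the homogeneity question from the original post, one commonly described approach is an inverse-variance weighted chi-square test (given, for example, in Fleiss's textbook Statistical Methods for Rates and Proportions). As a sketch, with \hat{\kappa}_g and SE_g denoting the estimate and its standard error at hospital g = 1, ..., G (here G = 3), and assuming independent samples large enough for a normal approximation:

            ```latex
            \bar{\kappa} = \frac{\sum_{g=1}^{G} \hat{\kappa}_g / \mathrm{SE}_g^2}{\sum_{g=1}^{G} 1 / \mathrm{SE}_g^2},
            \qquad
            \chi^2_{\mathrm{hom}} = \sum_{g=1}^{G} \frac{(\hat{\kappa}_g - \bar{\kappa})^2}{\mathrm{SE}_g^2}
            ```

            Under the hypothesis of equal kappas, the statistic is referred to a chi-square distribution with G - 1 degrees of freedom (2 here). The standard errors should be ones that are valid for a nonzero kappa, not those computed under the null hypothesis of no agreement, and with only about 15 patients per hospital the approximation may be rough.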

            • #7
              No, I don't think that method will suffice for sample size. The design aspects that matter for sample size are the number of raters, subjects and average number of ratings per subject.

              In practice, it's not unusual to examine raters at different locations (schools, hospitals, whatever). Heterogeneity due to location is usually a secondary concern, if it is a concern at all.
