
  • Is Kappa statistic correct?

    Hello,

    I have 30 patients and want to compare the correlation between two measures of physical activity (accelerometer - objective, and questionnaire - subjective). I have first run Spearman's rank correlation coefficient. Is it also appropriate to run the Kappa statistic?
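
    As a minimal sketch of that first step (accel and quest are placeholder variable names for the accelerometer and questionnaire measures):

    Code:
    * Spearman's rank correlation between the two measures
    spearman accel quest, stats(rho p)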

  • #2
    You started the exact same topic a couple of days ago and have not gotten any answers. There might be various reasons for that. One might be that you do not provide example data and, in general, give little other information. From what you describe, I think that the objective and subjective measures are based on completely different scales, which probably makes it pretty hard to define the "agreement" between them on which Kappa is based.



    • #3
      Apologies, I wasn't sure what data to provide - would it be the output of the Spearman correlation?

      They are kind of based on different scales. So for example, the questionnaire asks how many minutes you spend walking each day for 7 days and then gives a mean, and the accelerometer also provides a mean over the same time period of minutes walked, where walking is based on cut points that are derived from energy expenditure.

      On this basis, do you feel that Spearman correlation remains the most appropriate measure for assessing the correlation between the measures, and that agreement would be hard to define, so Kappa is not appropriate?



      • #4
        Thanks for providing more background information. If both measures are based on the same units (minutes per day, in this case), then defining agreement is not a problem. However, Kappa is usually applied to a finite number of possible rating categories, and 'minutes per day' (although finite) might not qualify as such. You could still calculate a Kappa coefficient using appropriate weights for (dis-)agreement that reflect the level of measurement (interval or ratio in this case). Remember to specify the option absolute with Stata's kap command, or use kappaetc (SSC), to obtain appropriately weighted Kappa coefficients.* The quadratically weighted Kappa will be close to the ICC, which might be the more appropriate measure for what you have. You can get the latter from kappaetc, too.


        * Stata's kap command cannot be used for non-integer values. More precisely, the weighted Kappa coefficient that kap produces for non-integer values is not appropriate, because the weights are based on integer values; the weights should be based on the observed (non-integer) ratings.

        kappaetc produces other coefficients, such as PABAK and Gwet's AC, which depend on the number of rating categories. These coefficients are probably not appropriate when the possible rating categories are not fixed and known in advance.
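
        A minimal sketch of the kappaetc approach described above (objective and subjective are placeholder names for the two measures; kappaetc is community-contributed):

        Code:
        * install once, if needed
        ssc install kappaetc

        * quadratically weighted Kappa, with weights based on the observed ratings
        kappaetc objective subjective, wgt(quadratic)

        * two-way random-effects ICC for comparison
        kappaetc objective subjective, icc(random)
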
        Last edited by daniel klein; 20 Jan 2021, 04:36. Reason: Stata's -kap- with option -absolute-



        • #5
          I am not deeply familiar with Bland-Altman, but my understanding is that it shows the agreement between two continuous variables, often the same physiological parameter measured by two different instruments. Maybe this is worth investigating instead of Kappa.

          https://www.stata.com/meeting/uk19/s...k19_newson.pdf
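
          If Bland-Altman seems appropriate, here is a minimal hand-rolled sketch (m1 and m2 are placeholder names for two measures in the same units; the 1.96 multiplier gives the conventional 95% limits of agreement):

          Code:
          * Bland-Altman plot: difference against mean, with the bias line
          * and 95% limits of agreement
          generate diff = m1 - m2
          generate avg  = (m1 + m2) / 2
          quietly summarize diff
          local bias   = r(mean)
          local loa_lo = r(mean) - 1.96*r(sd)
          local loa_hi = r(mean) + 1.96*r(sd)
          twoway scatter diff avg ,                                   ///
              yline(`bias') yline(`loa_lo' `loa_hi', lpattern(dash))  ///
              ytitle("Difference (m1 - m2)") xtitle("Mean of m1 and m2")
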
          Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

          When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



          • #6
            Just a note: basic Bland-Altman plots are also implemented in kappaetc.



            • #7
              https://www.statalist.org/forums/for...nk-correlation overlaps with this thread.



              • #8
                Originally posted by Joe Tuckles View Post
                Apologies I wasn't sure what data to provide - would it be the output of the spearman correlation?

                They are kind of based on different scales. So for example, the questionnaire asks how many minutes you spend walking each day for 7 days and then gives a mean, and the accelerometer also provides a mean over the same time period of minutes walked, where walking is based on cut points that are derived from energy expenditure.

                On this basis do you feel spearman correlation continues to be the most appropriate measure for assessing correlation between the measures, and agreement would be hard to define and therefore Kappa is not appropriate?
                Joe, the fact that you are reporting 7-day means for both variables suggests that you believe they have (at least approximate) interval scale properties. If so, I wonder why you are not looking at the ordinary Pearson correlation and some variety of intra-class correlation. Bear in mind that weighted kappa (with quadratic weights) is equivalent to one common form of ICC (according to Norman & Streiner in a couple of their books). Ah, yes...I just found a relevant quote in this article:

                Norman and Streiner (2008) show that using a weighted kappa with quadratic weights for ordinal scales is identical to a two-way mixed, single-measures, consistency ICC, and the two may be substituted interchangeably. This interchangeability poses a specific advantage when three or more coders are used in a study, since ICCs can accommodate three or more coders whereas weighted kappa can only accommodate two coders (Norman & Streiner, 2008).
                By the way, I would be sure to inspect the scatter-plot too. (But I assume you have already done that.)
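
                As a sketch of those two checks (objective and subjective stand in for the two measures):

                Code:
                * Pearson correlation with significance level, and the scatter-plot
                pwcorr objective subjective, sig obs
                scatter objective subjective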

                HTH.
                --
                Bruce Weaver
                Email: [email protected]
                Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                Version: Stata/MP 18.0 (Windows)



                • #9
                  Originally posted by Bruce Weaver View Post
                  Bear in mind that weighted kappa (with quadratic weights) is equivalent to one common form of ICC (according to Norman & Streiner in a couple of their books).
                  To elaborate on this, note that Fleiss and Cohen (1973) were (to the best of my knowledge) the first to mathematically show that quadratically weighted Kappa is equivalent to the two-way random-effects ICC (for single ratings). By 'equivalent', the authors mean that quadratically weighted Kappa will converge to the ICC(2, 1) as the sample size (of subjects) goes to infinity. Here is a demonstration in Stata:

                  Code:
                  // example data
                  webuse judges , clear
                  
                  // reshape for -kappaetc-
                  quietly reshape wide rating , i(target) j(judge)
                  
                  // weighted kappa (for judges 1 and 2)
                  kappaetc rating1 rating2 , wgt(quadratic)
                  
                  // two-way random-effects model
                  kappaetc rating1 rating2 , icc(random)
                  
                  // replicate weighted kappa in small samples
                  scalar denominator = r(sigma2_s) + r(sigma2_r) + r(sigma2_e)
                  scalar smallsample = ( 1/(r(N)-1) *(r(sigma2_r) + r(sigma2_e)) )
                  display "kappa = " r(sigma2_s) / ( denominator + smallsample )
                  which yields

                  Code:
                  . // weighted kappa (for judges 1 and 2)
                  . kappaetc rating1 rating2 , wgt(quadratic)
                  
                  Interrater agreement                             Number of subjects =       6
                  (weighted analysis)                             Ratings per subject =       2
                                                          Number of rating categories =       9
                  ------------------------------------------------------------------------------
                                       |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
                  ---------------------+--------------------------------------------------------
                     Percent Agreement |  0.6564    0.0642  10.23   0.000     0.4914     0.8214
                  Brennan and Prediger | -0.6577    0.3096  -2.12   0.087    -1.0000     0.1382
                  Cohen/Conger's Kappa |  0.1070    0.0613   1.74   0.142    -0.0507     0.2646
                      Scott/Fleiss' Pi | -0.5620    0.1748  -3.22   0.024    -1.0000    -0.1127
                             Gwet's AC | -0.5643    0.3503  -1.61   0.168    -1.0000     0.3362
                  Krippendorff's Alpha | -0.4318    0.1748  -2.47   0.056    -0.8811     0.0175
                  ------------------------------------------------------------------------------
                  Confidence intervals are clipped at the lower limit.
                  
                  .
                  . // two-way random-effects model
                  . kappaetc rating1 rating2 , icc(random)
                  
                  Interrater reliability                           Number of subjects =       6
                  Two-way random-effects model                    Ratings per subject =       2
                  ------------------------------------------------------------------------------
                                 |   Coef.     F     df1     df2      P>F   [95% Conf. Interval]
                  ---------------+--------------------------------------------------------------
                        ICC(2,1) |  0.1257   6.85     5.00    5.00   0.027    0.0000     0.5999
                  ---------------+--------------------------------------------------------------
                         sigma_s |  1.4142
                         sigma_r |  3.6378
                         sigma_e |  0.8266
                  ------------------------------------------------------------------------------
                  Confidence interval is clipped at the lower limit.
                  
                  .
                  . // replicate weighted kappa in small samples
                  . scalar denominator = r(sigma2_s) + r(sigma2_r) + r(sigma2_e)
                  
                  . scalar smallsample = ( 1/(r(N)-1) *(r(sigma2_r) + r(sigma2_e)) )
                  
                  . display "kappa = " r(sigma2_s) / ( denominator + smallsample )
                  kappa = .10695187

                  Contrary to the statement by Norman and Streiner (2008), weighted Kappa can also accommodate three or more raters (see Conger, 1980). Here is the code (output omitted).

                  Code:
                  // works for multiple raters, too
                  kappaetc rating1 rating2 rating3 rating4 , wgt(quadratic)
                  kappaetc rating1 rating2 rating3 rating4 , icc(random)
                  scalar denominator = r(sigma2_s) + r(sigma2_r) + r(sigma2_e)
                  scalar smallsample = ( 1/(r(N)-1) *(r(sigma2_r) + r(sigma2_e)) )
                  display "kappa = " r(sigma2_s) / ( denominator + smallsample )


                  Conger, A. J. 1980. Integration and Generalization of Kappa for Multiple Raters. Psychological Bulletin, 88, 322-328.

                  Fleiss, J. L., Cohen, J. 1973. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. Educational and Psychological Measurement, 33, 613-619.

                  Norman, G. R., Streiner, D. L. 2008. Biostatistics: The bare essentials. BC Decker: Hamilton, Ontario.
                  Last edited by daniel klein; 20 Jan 2021, 11:37. Reason: add year to publication; spelling



                  • #10
                    Thank you for your help!

                    I have run the code, although I am wondering whether it is correct, as the terminology of 'raters' confuses me.

                    Code:
                    . kappaetc objective subjective, wgt(quadratic)
                    
                    Interrater agreement                             Number of subjects =      28
                    (weighted analysis)                        Ratings per subject: min =       1
                                                                                    avg =  1.9286
                                                                                    max =       2
                                                            Number of rating categories =      46
                    ------------------------------------------------------------------------------
                                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
                    ---------------------+--------------------------------------------------------
                       Percent Agreement |  0.9400    0.0526  17.87   0.000     0.8321     1.0000
                    Brennan and Prediger |  0.3267    0.1785   1.83   0.078    -0.0395     0.6929
                    Cohen/Conger's Kappa |  0.3583    0.1338   2.68   0.012     0.0838     0.6328
                        Scott/Fleiss' Pi |  0.3240    0.1507   2.15   0.041     0.0148     0.6332
                               Gwet's AC |  0.3480    0.1759   1.98   0.058    -0.0130     0.7091
                    Krippendorff's Alpha |  0.3501    0.1465   2.39   0.025     0.0484     0.6518
                    ------------------------------------------------------------------------------
                    Confidence interval is clipped at the upper limit.
                    I cannot seem to run the Bland-Altman plots with wgt(quadratic) specified, but I can run them without that part.



                    • #11
                      The literature on which kappaetc is based is mostly concerned with inter-rater agreement; hence the terminology. objective and subjective could be raters, coders, judges, devices, ...

                      The Bland-Altman plot is used for interval-level data; there is no need to have weights for partial (dis-)agreement.

                      Have you also estimated the ICC? You do not often see Kappa for 46 "categories", and in small samples Kappa and ICC might not be close at all.



                      • #12
                        Thanks for clarifying, that is helpful! I am not sure where the 46 categories come from. Should I have added
                        Code:
                        noabsolute
                        ? I cannot see anything in the help section about
                        Code:
                        absolute
                        As for the ICC, I am unsure from the description in the help section whether my model is oneway, random, or mixed. Can anyone advise?



                        • #13
                          There are 46 distinct values in the two variables objective and subjective. The weights for disagreement should be based on these values. Do not use noabsolute; noabsolute bases the weights on the ordered integers 1, 2, ..., 46.
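
                          As a quick check on where that number comes from, one could count the distinct values in the pooled ratings; a sketch using the community-contributed distinct command (SSC):

                          Code:
                          * stack both measures into one variable and count its distinct values
                          preserve
                          stack objective subjective, into(rating) clear
                          distinct rating
                          restore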

                          From what you describe, I would think that the mixed ICC best fits your situation. The mixed ICC treats the raters (i.e., the accelerometer and the questionnaire) as fixed and the patients as a random sample.



                          • #14
                            Thank you that produces this output:

                            Code:
                            . kappaetc objective subjective, icc (mixed)
                            
                            Interrater reliability                           Number of subjects =      28
                            Two-way mixed-effects model                Ratings per subject: min =       1
                                                                                            avg =  1.9286
                                                                                            max =       2
                            ------------------------------------------------------------------------------
                                           |   Coef.     F     df1     df2      P>F   [95% Conf. Interval]
                            ---------------+--------------------------------------------------------------
                                  ICC(3,1) |  0.3602   2.26    27.00   27.00   0.019    0.0224     0.6601
                            ---------------+--------------------------------------------------------------
                                   sigma_s |125.7491
                                   sigma_e |167.6074
                            ------------------------------------------------------------------------------
                            Note: F test and confidence intervals are based on methods for complete data.
                            Based on this, can you advise what would be most appropriate to report in a paper? Would it be the ICC coefficient of 0.3602 with its confidence interval, alongside some of the output produced earlier:

                            Code:
                            . kappaetc objective subjective, wgt(quadratic)
                            
                            Interrater agreement                             Number of subjects =      28
                            (weighted analysis)                        Ratings per subject: min =       1
                                                                                            avg =  1.9286
                                                                                            max =       2
                                                                    Number of rating categories =      46
                            ------------------------------------------------------------------------------
                                                 |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
                            ---------------------+--------------------------------------------------------
                               Percent Agreement |  0.9400    0.0526  17.87   0.000     0.8321     1.0000
                            Brennan and Prediger |  0.3267    0.1785   1.83   0.078    -0.0395     0.6929
                            Cohen/Conger's Kappa |  0.3583    0.1338   2.68   0.012     0.0838     0.6328
                                Scott/Fleiss' Pi |  0.3240    0.1507   2.15   0.041     0.0148     0.6332
                                       Gwet's AC |  0.3480    0.1759   1.98   0.058    -0.0130     0.7091
                            Krippendorff's Alpha |  0.3501    0.1465   2.39   0.025     0.0484     0.6518
                            ------------------------------------------------------------------------------
                            Confidence interval is clipped at the upper limit.
                            I am not sure what is most useful there. The percent agreement coefficient with its p-value and confidence interval?



                            • #15
                              I am sorry, I cannot help here. What to report depends on the research questions and, perhaps as important, on the audience you are writing for. An (additional) graphical representation of the data might well be interesting to a broader audience.

