
  • Is Kappa statistic correct?

    Hello,

    I have 30 patients and want to compare the correlation between two measures of physical activity (accelerometer - objective, and questionnaire - subjective). I have first run Spearman's rank correlation coefficient. Is it also appropriate to run the Kappa statistic?
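
    As a minimal sketch of that first step (accel and quest are placeholder variable names for the accelerometer and questionnaire measures):

    Code:
    * Spearman's rank correlation between the two measures
    spearman accel quest, stats(rho p)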

  • #2
    You started the exact same topic a couple of days ago and have not gotten any answers. There might be various reasons for that. One might be that you do not provide example data and, in general, give little other information. From what you describe, I think that the objective and subjective measures are based on completely different scales, which probably makes it pretty hard to define the "agreement" between them on which Kappa is based.



    • #3
      Apologies, I wasn't sure what data to provide - would it be the output of the Spearman correlation?

      They are kind of based on different scales. So for example, the questionnaire asks how many minutes you spend walking each day for 7 days and then gives a mean, and the accelerometer also provides a mean over the same time period of minutes walked, where walking is based on cut points that are derived from energy expenditure.

      On this basis, do you feel that Spearman correlation remains the most appropriate measure for assessing the correlation between the measures, and that agreement would be hard to define, so Kappa is not appropriate?



      • #4
        Thanks for providing more background information. If both measures are based on the same units (minutes per day, in this case), then defining agreement is not a problem. However, Kappa is usually applied to a finite number of possible rating categories, and 'minutes per day' (although finite) might not qualify as such. You could still calculate a Kappa coefficient using appropriate weights for (dis-)agreement that reflect the level of measurement (interval or ratio in this case). Remember to specify the option absolute with Stata's kap command, or use kappaetc (SSC), to obtain appropriately weighted Kappa coefficients.* The quadratically weighted Kappa will be close to the ICC, which might be the more appropriate measure for what you have. You can get the latter from kappaetc, too.


        * Stata's kap command cannot be used for non-integer values. More precisely, the weighted Kappa coefficient that kap produces for non-integer values is not appropriate, because the weights are based on integer values; the weights should be based on the observed (non-integer) ratings.

        kappaetc produces other coefficients, such as PABAK and Gwet's AC, which depend on the number of rating categories. These coefficients are probably not appropriate when the possible rating categories are not fixed and known in advance.
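
        A minimal sketch of the kappaetc approach described above (objective and subjective are placeholder names for the two measures; kappaetc is community-contributed):

        Code:
        * install once, if needed
        ssc install kappaetc

        * quadratically weighted Kappa, with weights based on the observed ratings
        kappaetc objective subjective, wgt(quadratic)

        * two-way random-effects ICC for comparison
        kappaetc objective subjective, icc(random)
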
        Last edited by daniel klein; 20 Jan 2021, 04:36. Reason: Stata's -kap- with option -absolute-



        • #5
          I am not deeply familiar with Bland-Altman, but my understanding is that it shows the agreement between two continuous variables, often the same physiological parameter measured by two different instruments. Maybe this is worth investigating instead of Kappa.

          https://www.stata.com/meeting/uk19/s...k19_newson.pdf
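
          If Bland-Altman seems appropriate, here is a minimal hand-rolled sketch (m1 and m2 are placeholder names for two measures in the same units; the 1.96 multiplier gives the conventional 95% limits of agreement):

          Code:
          * Bland-Altman plot: difference against mean, with the bias line
          * and 95% limits of agreement
          generate diff = m1 - m2
          generate avg  = (m1 + m2) / 2
          quietly summarize diff
          local bias   = r(mean)
          local loa_lo = r(mean) - 1.96*r(sd)
          local loa_hi = r(mean) + 1.96*r(sd)
          twoway scatter diff avg ,                                   ///
              yline(`bias') yline(`loa_lo' `loa_hi', lpattern(dash))  ///
              ytitle("Difference (m1 - m2)") xtitle("Mean of m1 and m2")
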
          Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

          When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



          • #6
            Just a note: basic Bland-Altman plots are also implemented in kappaetc.



            • #7
              https://www.statalist.org/forums/for...nk-correlation overlaps with this thread.



              • #8
                Originally posted by Joe Tuckles View Post
                Apologies I wasn't sure what data to provide - would it be the output of the spearman correlation?

                They are kind of based on different scales. So for example, the questionnaire asks how many minutes you spend walking each day for 7 days and then gives a mean, and the accelerometer also provides a mean over the same time period of minutes walked, where walking is based on cut points that are derived from energy expenditure.

                On this basis do you feel spearman correlation continues to be the most appropriate measure for assessing correlation between the measures, and agreement would be hard to define and therefore Kappa is not appropriate?
                Joe, the fact that you are reporting 7-day means for both variables suggests that you believe they have (at least approximate) interval scale properties. If so, I wonder why you are not looking at the ordinary Pearson correlation and some variety of intra-class correlation. Bear in mind that weighted kappa (with quadratic weights) is equivalent to one common form of ICC (according to Norman & Streiner in a couple of their books). Ah, yes...I just found a relevant quote in this article:

                Norman and Streiner (2008) show that using a weighted kappa with quadratic weights for ordinal scales is identical to a two-way mixed, single-measures, consistency ICC, and the two may be substituted interchangeably. This interchangeability poses a specific advantage when three or more coders are used in a study, since ICCs can accommodate three or more coders whereas weighted kappa can only accommodate two coders (Norman & Streiner, 2008).
                By the way, I would be sure to inspect the scatter-plot too. (But I assume you have already done that.)
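
                As a sketch of those two checks (objective and subjective stand in for the two measures):

                Code:
                * Pearson correlation with significance level, and the scatter-plot
                pwcorr objective subjective, sig obs
                scatter objective subjective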

                HTH.
                --
                Bruce Weaver
                Email: [email protected]
                Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                Version: Stata/MP 18.0 (Windows)



                • #9
                  Originally posted by Bruce Weaver View Post
                  Bear in mind that weighted kappa (with quadratic weights) is equivalent to one common form of ICC (according to Norman & Streiner in a couple of their books).
                  To elaborate on this, note that Fleiss and Cohen (1973) were (to the best of my knowledge) the first to mathematically show that quadratically weighted Kappa is equivalent to the two-way random-effects ICC (for single ratings). By 'equivalent', the authors mean that quadratically weighted Kappa will converge to the ICC(2, 1) as the sample size (of subjects) goes to infinity. Here is a demonstration in Stata:

                  Code:
                  // example data
                  webuse judges , clear
                  
                  // reshape for -kappaetc-
                  quietly reshape wide rating , i(target) j(judge)
                  
                  // weighted kappa (for judges 1 and 2)
                  kappaetc rating1 rating2 , wgt(quadratic)
                  
                  // two-way random-effects model
                  kappaetc rating1 rating2 , icc(random)
                  
                  // replicate weighted kappa in small samples
                  scalar denominator = r(sigma2_s) + r(sigma2_r) + r(sigma2_e)
                  scalar smallsample = ( 1/(r(N)-1) *(r(sigma2_r) + r(sigma2_e)) )
                  display "kappa = " r(sigma2_s) / ( denominator + smallsample )
                  which yields

                  Code:
                  . // weighted kappa (for judges 1 and 2)
                  . kappaetc rating1 rating2 , wgt(quadratic)
                  
                  Interrater agreement                             Number of subjects =       6
                  (weighted analysis)                             Ratings per subject =       2
                                                          Number of rating categories =       9
                  ------------------------------------------------------------------------------
                                       |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
                  ---------------------+--------------------------------------------------------
                     Percent Agreement |  0.6564    0.0642  10.23   0.000     0.4914     0.8214
                  Brennan and Prediger | -0.6577    0.3096  -2.12   0.087    -1.0000     0.1382
                  Cohen/Conger's Kappa |  0.1070    0.0613   1.74   0.142    -0.0507     0.2646
                      Scott/Fleiss' Pi | -0.5620    0.1748  -3.22   0.024    -1.0000    -0.1127
                             Gwet's AC | -0.5643    0.3503  -1.61   0.168    -1.0000     0.3362
                  Krippendorff's Alpha | -0.4318    0.1748  -2.47   0.056    -0.8811     0.0175
                  ------------------------------------------------------------------------------
                  Confidence intervals are clipped at the lower limit.
                  
                  .
                  . // two-way random-effects model
                  . kappaetc rating1 rating2 , icc(random)
                  
                  Interrater reliability                           Number of subjects =       6
                  Two-way random-effects model                    Ratings per subject =       2
                  ------------------------------------------------------------------------------
                                 |   Coef.     F     df1     df2      P>F   [95% Conf. Interval]
                  ---------------+--------------------------------------------------------------
                        ICC(2,1) |  0.1257   6.85     5.00    5.00   0.027    0.0000     0.5999
                  ---------------+--------------------------------------------------------------
                         sigma_s |  1.4142
                         sigma_r |  3.6378
                         sigma_e |  0.8266
                  ------------------------------------------------------------------------------
                  Confidence interval is clipped at the lower limit.
                  
                  .
                  . // replicate weighted kappa in small samples
                  . scalar denominator = r(sigma2_s) + r(sigma2_r) + r(sigma2_e)
                  
                  . scalar smallsample = ( 1/(r(N)-1) *(r(sigma2_r) + r(sigma2_e)) )
                  
                  . display "kappa = " r(sigma2_s) / ( denominator + smallsample )
                  kappa = .10695187

                  Contrary to the statement by Norman and Streiner (2008), weighted Kappa can also accommodate three or more raters (see Conger, 1980). Here is the code (output omitted).

                  Code:
                  // works for multiple raters, too
                  kappaetc rating1 rating2 rating3 rating4 , wgt(quadratic)
                  kappaetc rating1 rating2 rating3 rating4 , icc(random)
                  scalar denominator = r(sigma2_s) + r(sigma2_r) + r(sigma2_e)
                  scalar smallsample = ( 1/(r(N)-1) *(r(sigma2_r) + r(sigma2_e)) )
                  display "kappa = " r(sigma2_s) / ( denominator + smallsample )


                  Conger, A. J. 1980. Integration and Generalization of Kappa for Multiple Raters. Psychological Bulletin, 88, 322-328.

                  Fleiss, J. L., Cohen, J. 1973. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. Educational and Psychological Measurement, 33, 613-619.

                  Norman, G. R., Streiner, D. L. 2008. Biostatistics: The bare essentials. BC Decker: Hamilton, Ontario.
                  Last edited by daniel klein; 20 Jan 2021, 11:37. Reason: add year to publication; spelling



                  • #10
                    Thank you for your help!

                    I have run the code, although I am wondering whether it is correct, as the terminology of 'raters' confuses me.

                    Code:
                    . kappaetc objective subjective, wgt(quadratic)
                    
                    Interrater agreement                             Number of subjects =      28
                    (weighted analysis)                        Ratings per subject: min =       1
                                                                                    avg =  1.9286
                                                                                    max =       2
                                                            Number of rating categories =      46
                    ------------------------------------------------------------------------------
                                         |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
                    ---------------------+--------------------------------------------------------
                       Percent Agreement |  0.9400    0.0526  17.87   0.000     0.8321     1.0000
                    Brennan and Prediger |  0.3267    0.1785   1.83   0.078    -0.0395     0.6929
                    Cohen/Conger's Kappa |  0.3583    0.1338   2.68   0.012     0.0838     0.6328
                        Scott/Fleiss' Pi |  0.3240    0.1507   2.15   0.041     0.0148     0.6332
                               Gwet's AC |  0.3480    0.1759   1.98   0.058    -0.0130     0.7091
                    Krippendorff's Alpha |  0.3501    0.1465   2.39   0.025     0.0484     0.6518
                    ------------------------------------------------------------------------------
                    Confidence interval is clipped at the upper limit.
                    I cannot seem to run the Bland-Altman plots with wgt(quadratic) specified, but I can run them without that part.



                    • #11
                      The literature on which kappaetc is based is mostly concerned with inter-rater agreement; hence the terminology. objective and subjective could be raters, coders, judges, devices, ...

                      The Bland-Altman plot is used for interval-level data; there is no need to have weights for partial (dis-)agreement.

                      Have you also estimated the ICC? You do not often see Kappa for 46 "categories", and in small samples Kappa and ICC might not be close at all.



                      • #12
                        Thanks for clarifying, that is helpful! I am not sure where the 46 categories come from. Should I have added
                        Code:
                        noabsolute
                        ? I cannot see anything in the help section about
                        Code:
                        absolute
                        As for the ICC, I am unsure from the description in the help section whether my model is oneway, random, or mixed. Can anyone advise?



                        • #13
                          There are 46 distinct values in the two variables objective and subjective. The weights for disagreement should be based on these values. Do not use noabsolute; noabsolute bases the weights on the ordered integers 1, 2, ..., 46.
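
                          As a quick check on where that number comes from, one could count the distinct values in the pooled ratings; a sketch using the community-contributed distinct command (SSC):

                          Code:
                          * stack both measures into one variable and count its distinct values
                          preserve
                          stack objective subjective, into(rating) clear
                          distinct rating
                          restore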

                          From what you describe, I would think that the mixed ICC best fits your situation. The mixed ICC treats the raters (i.e., the accelerometer and the questionnaire) as fixed and the patients as a random sample.



                          • #14
                            Thank you that produces this output:

                            Code:
                            . kappaetc objective subjective, icc (mixed)
                            
                            Interrater reliability                           Number of subjects =      28
                            Two-way mixed-effects model                Ratings per subject: min =       1
                                                                                            avg =  1.9286
                                                                                            max =       2
                            ------------------------------------------------------------------------------
                                           |   Coef.     F     df1     df2      P>F   [95% Conf. Interval]
                            ---------------+--------------------------------------------------------------
                                  ICC(3,1) |  0.3602   2.26    27.00   27.00   0.019    0.0224     0.6601
                            ---------------+--------------------------------------------------------------
                                   sigma_s |125.7491
                                   sigma_e |167.6074
                            ------------------------------------------------------------------------------
                            Note: F test and confidence intervals are based on methods for complete data.
                            Based on this, can you advise what would be most appropriate to report in a paper? Would it be the ICC coefficient of 0.3602 with its confidence interval, alongside some of the output produced earlier:

                            Code:
                            . kappaetc objective subjective, wgt(quadratic)
                            
                            Interrater agreement                             Number of subjects =      28
                            (weighted analysis)                        Ratings per subject: min =       1
                                                                                            avg =  1.9286
                                                                                            max =       2
                                                                    Number of rating categories =      46
                            ------------------------------------------------------------------------------
                                                 |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
                            ---------------------+--------------------------------------------------------
                               Percent Agreement |  0.9400    0.0526  17.87   0.000     0.8321     1.0000
                            Brennan and Prediger |  0.3267    0.1785   1.83   0.078    -0.0395     0.6929
                            Cohen/Conger's Kappa |  0.3583    0.1338   2.68   0.012     0.0838     0.6328
                                Scott/Fleiss' Pi |  0.3240    0.1507   2.15   0.041     0.0148     0.6332
                                       Gwet's AC |  0.3480    0.1759   1.98   0.058    -0.0130     0.7091
                            Krippendorff's Alpha |  0.3501    0.1465   2.39   0.025     0.0484     0.6518
                            ------------------------------------------------------------------------------
                            Confidence interval is clipped at the upper limit.
                            I am not sure what is most useful there. The percent agreement coefficient with its p-value and confidence interval?



                            • #15
                              I am sorry, I cannot help here. What to report depends on the research questions and, perhaps as important, on the audience you are writing for. An (additional) graphical representation of the data might well be interesting to a broader audience.

