  • Kappaetc weighting

    Hi,

    I am currently trying to measure inter-rater reliability for a set of data, shown below. The raters were asked to rate physical suffering on a scale (1 = unknown, 2 = nil, 3 = mild, 4 = moderate, 5 = severe).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(vignette rater14 rater18 rater29 rater39 rater42 rater47 rater59 rater60 rater64)
     1 3 3 1 3 1 3 3 2 3
     2 5 5 4 5 4 1 5 4 4
     3 3 3 3 3 2 3 3 3 3
     4 5 5 4 5 5 4 5 4 5
     5 3 3 3 4 . 3 3 3 3
     6 3 4 3 5 . 3 4 3 3
     7 5 5 4 5 . 4 4 . 5
     8 3 3 3 4 . 3 3 . 3
     9 3 4 3 3 . 3 4 . 4
    10 4 4 2 4 . 2 3 . 4
    11 3 4 5 4 . 3 4 . 3
    12 4 4 4 4 . 3 3 . 3
    13 5 5 5 5 . 5 4 . 5
    14 5 5 5 5 . 5 4 . 5
    end

    I am looking to use a weighted kappa to determine the inter-rater reliability; however, I am unsure which weighting to use. I think I understand the difference between linear and quadratic (with linear punishing the difference by being off by any number of categories the same, while quadratic weighting means the penalties increase).

    I am unsure about the ordinal weighting suggested in the 'help kappaetc' file, and the mathematical explanation is a little over my head.

    Could someone please explain how the ordinal weighting option differs from linear and quadratic, and which might be most appropriate for my data? I do not think I want to use linear weighting, as I do want harsher penalties if there is disagreement ranging from 2 (nil) to 5 (severe); however, I am unsure what the difference between ordinal and quadratic weighting is.

    On the above data (commands sketched below):
    If using ordinal weights, the Fleiss kappa = 0.4916.
    If using quadratic weights, the Fleiss kappa = 0.5272.
    If using linear weights, the Fleiss kappa = 0.4275.
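
    These come from running kappaetc over the rater variables with each weighting option in turn, roughly as follows (the wgt() keywords are my reading of the help file):

    Code:
    kappaetc rater14-rater64, wgt(ordinal)
    kappaetc rater14-rater64, wgt(quadratic)
    kappaetc rater14-rater64, wgt(linear)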

    Thanks a lot,
    Olivia

  • #2
    Olivia

    I will mention that kappaetc is from SSC or SJ (both versions are the same at this time), just so others know which program you are using.
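
    For anyone who wants to follow along, the SSC version can be installed with

    Code:
    ssc install kappaetc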

    Starting with the difference between linear and quadratic weights for (dis)agreement, I am not sure your understanding is correct. You say

    Originally posted by Olivia Helen View Post
    [...] linear [weights] punishing the difference by being off by any number of categories the same, while quadratic weighting means the penalties increase).
    The "penalties" for disagreement increases (conversely: the degree of agreement decreases) with the distance between rating categories with both linear and quadratic weights; the change is linear in the first case and quadratic in the second. Therefore, your (preliminary) conslusion that

    Originally posted by Olivia Helen View Post
    [...] I do not think I want to use linear weighting, as I do want harsher penalities if there is disagreement
    is probably wrong. Loosely speaking, quadratic weights make the categories more alike, that is, penalize disagreement less than linear weights. Using the five rating categories 1, 2, ..., 5, the rating-pair (2, 5) would be assigned a quadratic weight of 1-(2-5)^2/(5-1)^2 = 0.4375, while the linear weight would only be 1-abs(2-5)/abs(5-1) = 0.25. Remember: the diagonal, that is, full agreement, is assigned a weight of 1. Thus, contrary to your statement, the penalty for disagreement would be more severe with linear weights.
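
    You can verify those two numbers with plain arithmetic in Stata (nothing here is specific to kappaetc):

    Code:
    display 1 - (2-5)^2/(5-1)^2      // quadratic weight for the pair (2, 5): .4375
    display 1 - abs(2-5)/abs(5-1)    // linear weight for the pair (2, 5): .25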

    Originally posted by Olivia Helen View Post
    I am unsure of the ordinal weighting suggested in the 'help kappaetc' file and the mathematical explanation is a little over my head. Could someone please explain how the ordinal weighting option differs from linear and quadratic?
    The help file for kappaetc gives the (simplified) mathematical definition of the weights; it does not really explain the underlying ideas; for that, see the cited literature (Gwet 2014: 91-92). Forget the mathematical formulas for a minute. The basic idea behind ordinal weights is to account for the number of possible disagreements between two ratings. For 5 rating categories, there are 10 possible pairs of disagreement: (1, 2), (1, 3), ..., (2, 3), ..., (3, 4), ..., (4, 5). For the two rating categories 2 and 5, there are six possible pairs of disagreement: (2, 3), (2, 4), ..., (4, 5). It turns out that the number of possible pairs formed by the categories lying between (and including) two ratings, k and l, can be calculated with the combinatorial function as comb(abs(k-l)+1, 2). In your example, the ordinal weight for the pair (2, 5) is 1-comb(abs(5-2)+1, 2)/comb(5, 2) = 1-6/10 = 0.4.
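
    You can check this number in Stata, too; comb() is Stata's combinatorial function:

    Code:
    display 1 - comb(abs(5-2)+1, 2)/comb(5, 2)    // ordinal weight for the pair (2, 5): .4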

    You can specify the showweights option with kappaetc to display the full weighting matrix.
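
    For example, a call along the following lines (see the help file for the exact spelling of the weighting options) displays the ordinal weighting matrix together with the results:

    Code:
    kappaetc rater14-rater64, wgt(ordinal) showweights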

    Turning to your last question

    Originally posted by Olivia Helen View Post
    And which [weights] might be most appropriate for my data?
    I would like to stress that, in my view, the choice to weight (dis)agreement and the way the weights are constructed are ultimately the researcher's (subjective) decision. There is no mathematical proof or statistical test that we can rely on. You will have to think about what your data mean and try to convince others that using a certain weighting scheme makes more sense than another.

    I would argue that your first rating category, "unknown", appears to mess up the ordinal structure of the ratings. Is "unknown" more or less than "nil"? You will have to find a way to deal with that.
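
    One possibility, if it is substantively defensible in your setting, is to treat "unknown" as a missing rating rather than as a point on the severity scale; a minimal sketch:

    Code:
    * treat "unknown" (category 1) as missing, then assess agreement
    * on the remaining ordered categories nil < mild < moderate < severe
    mvdecode rater*, mv(1)
    kappaetc rater14-rater64, wgt(quadratic)

    Whether that makes sense depends on what "unknown" means in your vignettes; it is only one of several options.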

    I hope this helps.
    Daniel


    Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
    Last edited by daniel klein; 19 Sep 2019, 02:25.
