  • Agreement between many raters with many different subjects against gold standard rater

    Hi All,
    Any tips on how to analyze the level of agreement when
    1) there are many raters, say students randomly selected from a larger population
    2) the raters examine many different subjects randomly selected from a larger population of interest; not all subjects are rated by every rater, and some subjects are rated by more than one rater
    3) the score is binary
    4) ratings are compared against a single gold-standard rater who evaluated all subjects

    I am interested in the level of agreement between the students (as representative of the student body in general) and the gold standard.
    Thanks in advance!!
    Mark

    Data would look something like:
    student_id  subject  rating  gold_stand_rating
    1           1        0       0
    1           2        1       1
    2           1        0
    2           3        0       0
    2           4        1       0
    3           1        0
    4           3        0
    4           5        1       1
    4           6        1       0
    4           7        0       0
    5           2        0
    5           5        0

  • #2
    Assuming you want some kappa-like statistic, having binary ratings is not a problem; neither are many raters. Missing ratings are not necessarily a problem either, provided that the ratings are missing at random. Given your example dataset (better to use dataex in the future), I would nevertheless recommend inspecting the number of subjects that have at least two ratings and, thus, actually contribute to the observed agreement.
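
    For instance, a quick tabulation along these lines (a minimal sketch; run it on the long-format example data below, before any reshaping) shows how many student ratings each subject has:

    Code:
    // count the number of student ratings per subject
    preserve
    bysort subject : gen n_raters = _N
    by subject : keep if _n == 1
    tabulate n_raters
    restore
    Subjects rated by only one student do not contribute to the agreement among the students, although they still enter any comparison with the gold standard.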

    Here is one way to get the dataset into shape for estimating kappa-like coefficients.

    Code:
    // example data
    clear
    input student_id subject rating    gold_stand_rating
    1    1    0    0
    1    2    1    1
    2    1    0    .
    2    3    0    0
    2    4    1    0
    3    1    0    .
    4    3    0    .
    4    5    1    1
    4    6    1    0
    4    7    0    0
    5    2    0    .
    5    5    0    .
    end
    
    // fill in gold-standard for all subjects
    bysort subject (gold_stand_rating) : ///
        assert mi(gold_stand_rating) if _n > 1
    bysort subject (gold_stand_rating) : ///
        replace gold_stand_rating = gold_stand_rating[1]
    
    // reshape the dataset    
    reshape wide rating , i(subject) j(student_id)
    You can then use official Stata's kap command

    Code:
    kap rating*
    You can also use kappaetc (SSC)

    Code:
    kappaetc rating* , se(unconditional)
    The results will differ because the two commands treat missing values differently.


    How to deal with the gold-standard rating is, to a large extent, a substantive question. Gwet (2014) discusses the concept of 'validity coefficients'. The basic idea is that agreement only counts as such if the raters also agree with the gold standard. Whether that is the concept you are after, I cannot tell. You could probably also evaluate the extent to which each of the raters agrees with the gold standard.
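
    For the latter, here is one rough sketch (assuming it is run on the long-format data after the gold-standard ratings have been filled in, i.e., before the reshape above) that compares each student's ratings with the gold standard separately:

    Code:
    // per-student agreement with the gold standard (long format)
    levelsof student_id , local(students)
    foreach s of local students {
        display as text _n "student `s'"
        kap rating gold_stand_rating if student_id == `s'
    }
    With only a handful of subjects per student, as in the example data, these per-student kappas will be very unstable; a simple percent agreement (an indicator for rating == gold_stand_rating, averaged by student) might be more informative.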


    Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
    Last edited by daniel klein; 11 Sep 2020, 11:15.



    • #3
      Hi Dan, thanks for your note, much appreciated--
