  • Agreement between many raters with many different subjects against gold standard rater

    Hi All,
    Any tips on how to analyze the level of agreement when
    1) there are many raters, say students randomly selected from a larger population
    2) the raters examine many different subjects randomly selected from a larger population of interest; not all subjects are rated by every rater, and some subjects are rated by more than one rater
    3) the score is binary
    4) ratings are compared against a single gold-standard rater who evaluated all subjects

    I am interested in the level of agreement between the students (as representative of the student body in general) and the gold standard.
    Thanks in advance!!
    Mark

    Data would look something like:
    student_id  subject  rating  gold_stand_rating
    1           1        0       0
    1           2        1       1
    2           1        0
    2           3        0       0
    2           4        1       0
    3           1        0
    4           3        0
    4           5        1       1
    4           6        1       0
    4           7        0       0
    5           2        0
    5           5        0

  • #2
    Assuming you want some kappa-like statistic, having binary ratings is not a problem; neither are many raters. Missing ratings are not necessarily a problem either, provided that the ratings are missing at random. Given your example dataset (better to use dataex in the future), I would nevertheless recommend inspecting the number of subjects that have at least two ratings and, thus, actually contribute to the observed agreement.
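
    For instance, a quick tabulation along these lines (a minimal sketch; run it on the long-format example data below, before any reshaping) shows how many student ratings each subject has:

    Code:
    // count the number of student ratings per subject
    preserve
    bysort subject : gen n_raters = _N
    by subject : keep if _n == 1
    tabulate n_raters
    restore
    Subjects rated by only one student do not contribute to the agreement among the students, although they still enter any comparison with the gold standard.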

    Here is one way to get the dataset into shape for estimating kappa-like coefficients.

    Code:
    // example data
    clear
    input student_id subject rating    gold_stand_rating
    1    1    0    0
    1    2    1    1
    2    1    0    .
    2    3    0    0
    2    4    1    0
    3    1    0    .
    4    3    0    .
    4    5    1    1
    4    6    1    0
    4    7    0    0
    5    2    0    .
    5    5    0    .
    end
    
    // fill in gold-standard for all subjects
    bysort subject (gold_stand_rating) : ///
        assert mi(gold_stand_rating) if _n > 1
    bysort subject (gold_stand_rating) : ///
        replace gold_stand_rating = gold_stand_rating[1]
    
    // reshape the dataset    
    reshape wide rating , i(subject) j(student_id)
    You can then use official Stata's kap command

    Code:
    kap rating*
    You can also use kappaetc (SSC)

    Code:
    kappaetc rating* , se(unconditional)
    The results will differ because the two commands treat missing values differently.


    How to deal with the gold-standard rating is, to a large extent, a substantive question. Gwet (2014) discusses the concept of 'validity coefficients'. The basic idea is that agreement only counts as such if the raters also agree with the gold standard. Whether that is the concept you are after, I cannot tell. You could probably also evaluate the extent to which each of the raters agrees with the gold standard.
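
    For the latter, here is one rough sketch (assuming it is run on the long-format data after the gold-standard ratings have been filled in, i.e., before the reshape above) that compares each student's ratings with the gold standard separately:

    Code:
    // per-student agreement with the gold standard (long format)
    levelsof student_id , local(students)
    foreach s of local students {
        display as text _n "student `s'"
        kap rating gold_stand_rating if student_id == `s'
    }
    With only a handful of subjects per student, as in the example data, these per-student kappas will be very unstable; a simple percent agreement (an indicator for rating == gold_stand_rating, averaged by student) might be more informative.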


    Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
    Last edited by daniel klein; 11 Sep 2020, 11:15.



    • #3
      Hi Dan, thanks for your note, much appreciated--
