  • Percent Correct / Percent Agreement

    Hello,

    I have a data set from which I am calculating inter-rater reliability. I have run kappaetc and have what I need from that output. However, I would like to create two new variables called percentcorrect, which represents the percent of raters that scored each item correctly, and percentagree, which represents the percent agreement per item. The item scores are binary and the correct answers are binary. For example, if the item should have been rated yes, then the answer code is 1. No is 0. I have already generated the percentcorrect and percentagree variables which are currently equal to 0.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str7 item byte(raterw raterg raterd ratert raterbe raterbr raterr correctscore) float(percentcorrect percentagree)
    "R.8.01" 0 1 1 1 1 1 1 1 0 0
    "R.8.02" 0 1 1 0 0 0 1 0 0 0
    "R.8.03" 1 1 1 1 1 1 1 1 0 0
    "R.9.03" 1 1 1 1 1 1 1 1 0 0
    "R.2.01" 1 1 1 1 1 1 1 1 0 0
    end
    My question:

    Is there a way to calculate the percentcorrect and percentagree variables using my current data input? I feel there should be a simple way, and I am hoping to learn something new and possibly generalizable to future analyses.

    OR

    Am I right in my thinking that I should create new binary variables for each rater that indicate whether or not they scored the item accurately to use to create the percentcorrect variable? If so, do you have a suggestion for the percentagree variable at the item level?

    Thanks so much! Sometimes I look at something too long and make it FAR more complicated than it should be.

  • #2
    I'm not sure what you mean by percent agreement. With just two raters it clearly means the percent of items on which both raters gave the same response, but it is less clear what it means with a larger number of raters. In the code below, I presume that you mean: consider all possible (unordered, non-identical)* pairs of raters, and calculate the percent of those pairs in which the two paired raters gave the same response.

    Note that the calculations are best done with the data in long layout, not wide. In the end, I have restored the results to wide layout. But you should think ahead about what you will be doing next with this data. Most Stata data management and analysis commands work best, or only, with long data. So unless you know that you will be doing things that Stata does better with wide data, you would be best advised to omit that final -reshape wide- command and keep the data in long layout.

    Code:
    rename rater* response*
    reshape long response, i(item) j(rater) string
    
    preserve
    keep item rater response
    rename (rater response) =_U
    tempfile holding
    save `holding'
    restore
    
    by item (rater), sort: egen percent_correct = mean(response == correctscore)
    replace percent_correct = 100*percent_correct
    preserve
    
    joinby item using `holding'
    keep if rater > rater_U
    by item (rater rater_U), sort: egen percent_agreement = mean(response == response_U)
    replace percent_agreement = 100*percent_agreement
    by item: keep if _n == 1
    keep item percent_agreement
    save `holding', replace
    
    restore
    merge m:1 item using `holding', assert(match) nogenerate
    reshape wide
    rename response* rater*
    *By unordered, non-identical pairs, I mean that we do not count a "pair" where both members of the pair are the same rater, and we do not consider the pair X, Y to differ from the pair Y, X.
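    For readers who want a quick sanity check on this pairwise definition outside Stata, here is a minimal Python sketch (not part of the thread's Stata code) applied to the example data. With k of n binary ratings equal to 1, the number of agreeing pairs is C(k,2) + C(n-k,2) out of C(n,2) possible pairs.

    ```python
    from math import comb

    # Example data from the thread: 7 binary ratings per item plus the correct score.
    items = {
        "R.8.01": ([0, 1, 1, 1, 1, 1, 1], 1),
        "R.8.02": ([0, 1, 1, 0, 0, 0, 1], 0),
        "R.8.03": ([1, 1, 1, 1, 1, 1, 1], 1),
        "R.9.03": ([1, 1, 1, 1, 1, 1, 1], 1),
        "R.2.01": ([1, 1, 1, 1, 1, 1, 1], 1),
    }

    def percent_correct(ratings, correct):
        # Percent of raters whose rating matches the correct score.
        return 100 * sum(r == correct for r in ratings) / len(ratings)

    def percent_agreement(ratings):
        # Percent of unordered, non-identical rater pairs giving the same rating.
        n, k = len(ratings), sum(ratings)
        return 100 * (comb(k, 2) + comb(n - k, 2)) / comb(n, 2)

    for name, (ratings, correct) in items.items():
        print(name, round(percent_correct(ratings, correct), 2),
              round(percent_agreement(ratings), 2))
    ```

    Item R.8.01, for example, has 6 of 7 correct ratings (85.71 percent correct) and (C(6,2)+C(1,2))/C(7,2) = 15/21, roughly 71.43 percent agreement, which is the same figure the -joinby- approach produces.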

    • #3
      Originally posted by Allison Cusano:
      Is there a way to calculate the percentcorrect and percentagree variables using my current data input. I feel like there should be a simple way ...
      So, the percentage of correct ratings is indeed simple (assuming no missing ratings):
      Code:
      egen number_of_yes = rowtotal(raterw raterg raterd ratert raterbe raterbr raterr)
      * dividing a count out of 7 by .07 is the same as multiplying by 100/7
      generate percent_correct = cond(correctscore==1, number_of_yes, 7-number_of_yes) / .07
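      As a quick cross-check of the `/ .07` shorthand (a sketch in Python, not Stata): dividing a count out of 7 by .07 should give exactly the same result as the more transparent `100 * count / 7`.

      ```python
      # For each (number_of_yes, correctscore) case, compare the /.07 shorthand
      # with the explicit percent formula for 7 raters.
      for yes, correct in [(6, 1), (3, 0), (7, 1)]:
          n_correct = yes if correct == 1 else 7 - yes
          shorthand = n_correct / 0.07       # the trick used in the Stata line
          explicit = 100 * n_correct / 7     # the same quantity, written out
          assert abs(shorthand - explicit) < 1e-9
          print(round(shorthand, 2))
      ```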
      As Clyde Schechter points out, for more than two raters, there are different ways to define percent agreement. You mention kappaetc (preferably from SSC but fine from SJ). You can use the store() option to put the subject-level percent agreement into r(). You can then shift these results into a variable:
      Code:
      kappaetc raterw raterg raterd ratert raterbe raterbr raterr , store(my_r_results)
      matrix percent_agreement = r(b_istar)[1..5,1]
      svmat percent_agreement
      Because you know the "correct" rating category of the subjects, you might want what Gwet (2014, 324f.) calls "validity" coefficients rather than reliability coefficients. This is partly implemented, albeit not documented, in kappaetc via the acm() option. ACM is short for "absolute category membership" (Gwet 2014, 312). The basic idea is that agreement only counts as such if it is the correct absolute category membership. With your example data, I get
      Code:
      . kappaetc raterw raterg raterd ratert raterbe raterbr raterr , acm(correctscore)
      
      Interrater agreement                             Number of subjects =       5
      ( ACM analysis)                                 Ratings per subject =       7
                                              Number of rating categories =       2
      ------------------------------------------------------------------------------
                           |   Coef.  Std. Err.    t    P>|t|   [95% Conf. Interval]
      ---------------------+--------------------------------------------------------
         Percent Agreement |  0.8000    0.1400   5.72   0.005     0.4114     1.0000
      Brennan and Prediger |  0.7333    0.1866   3.93   0.017     0.2152     1.0000
                 Gwet's AC |  0.7721    0.1798   4.29   0.013     0.2728     1.0000
      ------------------------------------------------------------------------------
      Confidence intervals are clipped at the upper limit.
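      The 0.8000 in the first row can be reproduced by hand under one reading of Gwet's description (my interpretation, not documented kappaetc behavior): under ACM, a pair of raters counts as agreeing only when both chose the correct category. With c of 7 raters correct on an item, that is C(c,2) of C(7,2) pairs, averaged over items. A Python sketch on the example data:

      ```python
      from math import comb

      # (ratings, correct score) for the five example items
      items = [
          ([0, 1, 1, 1, 1, 1, 1], 1),
          ([0, 1, 1, 0, 0, 0, 1], 0),
          ([1, 1, 1, 1, 1, 1, 1], 1),
          ([1, 1, 1, 1, 1, 1, 1], 1),
          ([1, 1, 1, 1, 1, 1, 1], 1),
      ]

      def acm_agreement(ratings, correct):
          # A pair "agrees" only if both raters picked the correct category.
          c = sum(r == correct for r in ratings)
          return comb(c, 2) / comb(len(ratings), 2)

      overall = sum(acm_agreement(r, c) for r, c in items) / len(items)
      print(round(overall, 4))
      ```

      The per-item values are 15/21, 6/21, 1, 1, and 1, which average to 0.8, matching the reported Percent Agreement.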

      A final remark: with binary ratings and 7 raters, the minimum possible observed agreement (regardless of correct or incorrect category) is 4 out of 7 raters choosing the same category, which is 57%. Keep this in mind when interpreting the percent agreement.
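      That 4-out-of-7 floor is a pigeonhole fact: with two categories, the larger category always holds at least ceil(7/2) = 4 raters. A one-line Python enumeration confirms it:

      ```python
      # With 7 binary ratings and k "yes" votes, the modal category holds
      # max(k, 7 - k) raters; over all possible k this never drops below 4.
      floor = min(max(k, 7 - k) for k in range(8))
      print(floor)  # 4, i.e. about 57% of the 7 raters
      ```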



      Gwet, K. L. 2014. Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
      Last edited by daniel klein; 24 Nov 2024, 12:02.

      • #4
        Thank you Clyde and Daniel. Clyde, this is all I need to do with this data set. I reshaped it prior to analysis to run kappa. The code you shared makes sense and will definitely be helpful as additional, more complex iterations of this work come in. Daniel, thank you for bringing up the acm option; I will apply it to the full data set. You bring up a good point about binary ratings with 7 raters. This has been a discussion, as there are 49 total items across these same 7 raters and the team I am working with wants to achieve, at minimum, 90% agreement. Technically doable? Yes. I have explained the challenge and still we persist!

        Thank you again for your help.

        Allison
