  • kappaetc: Testing for significance of a difference in before / after intervention

    I am using kappaetc (from SSC) for inter-rater reliability, comparing diagnostic agreement before vs. after a diagnostic skills training event. I'd like to ask for advice on how to test the significance of the difference between those two measurements, before vs. after training.

    kappaetc returns both the coefficients and standard errors as matrices.
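
    For reference, the stored matrices can be inspected after estimation. A minimal sketch, with Rater1-Rater7 as placeholder variable names:

    Code:
    kappaetc Rater1-Rater7
    return list        // shows all stored results, including the matrices
    matrix list r(b)   // the coefficient matrix (used again in #13 below)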



  • #2
    You can compare two correlated ICCs using the method described by Donner and Zou.

    Donner, A., & Zou, G. (2002). Testing the equality of dependent intraclass correlation coefficients. Journal of the Royal Statistical Society: Series D (The Statistician), 51(3), 367–379. https://doi.org/10.1111/1467-9884.00324

    Alternatively, you could build an appropriate regression model that incorporates time as a fixed effect and perform the usual Wald-type test.
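
    A minimal sketch of that regression approach, assuming a fully long dataset with one row per rating and hypothetical variables rating (numeric), subject, rater, and time:

    Code:
    * crossed random effects for subjects and raters; time as a fixed effect
    mixed rating i.time || _all: R.subject || _all: R.rater
    * Wald-type test of the time effect
    testparm i.time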

    Comment


    • #3
      You might be interested in Gwet's (2016) discussion of comparing agreement coefficients, such as Cohen's kappa. The approach basically boils down to a paired t-test; this is implemented in kappaetc:

      Code:
      help kappaetc ttest
      The implementation might fail for large datasets, i.e., many observations.

      Gwet, K. L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement, 76, 609–637.
      Last edited by daniel klein; 13 Jun 2020, 03:14.

      Comment


      • #4
        Thank you, Daniel, for the -kappaetc ttest- tool.
        I have implemented it using a dataset with 7 raters who assessed diagnoses before and after a diagnostic training course. The code is:

        Code:
        kappaetc Rater1-Rater7 if reading=="before", store(before)
        kappaetc Rater1-Rater7 if reading=="after", store(after)
        kappaetc before == after, ttest
        In the output I show here, would 0.221 represent the (nonsignificant) p-value for the comparison between the two ICCs?
        Attached file: kappaetc ttest for Statalist.pdf

        And... thank you also Leonardo, for the statistical paper.

        Comment


        • #5
          Originally posted by Michael McCulloch View Post
          In the output I show here, 0.221 would represent the (nonsignificant) p-value for the comparison between the two ICCs?
          Note that Gwet's AC is not the ICC; nor are any of the other agreement coefficients. A (quadratically) weighted Cohen's kappa should be close to the ICC, but it is not the same thing. You do not provide many details about your research, so it is hard to comment on whether you want agreement coefficients or the ICC. In general, you could use agreement coefficients if the rating categories are known in advance and/or categorical in nature.
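
          As an aside, if the ICC itself were wanted and the ratings were coded numerically, official Stata's -icc- command could compute it from long-form data. A hypothetical sketch, with rating and rater as placeholder variable names:

          Code:
          * two-way mixed-effects, consistency-of-agreement ICC
          icc rating subject rater if reading == "before", mixed consistency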


          Originally posted by Michael McCulloch View Post
          I have implemented the tool using a dataset with 7 raters, who assessed diagnosis before and after a diagnostic training course.
          The syntax seems correct. The output (presenting it in code delimiters instead of a PDF would be preferred) suggests that you have only 7 subjects that were rated. Note that the standard errors for the agreement coefficients, as well as the paired t-test, are based on large-sample approximations. You might want to consider a different approach, such as bootstrapping.


          Comment


          • #6
            Sorry Daniel for the ICC typo. I chose Gwet's AC because of some missing data we found.

            Thank you for pointing out the small-sample issue. Our study had 7 raters, and we compared before- vs. after-training diagnostic assessments of 7 subjects. How would bootstrapping be approached, given the reporting provided by kappaetc?

            Comment


            • #7
              And yes, rating categories were specified in advance, and all categorical.

              Comment


              • #8
                If there are missing ratings, the ICC as Daniel has implemented it (based on Gwet's methods) follows from a generalization of ANOVA, and so would be appropriate.

                A quadratically weighted kappa (for an ordinal outcome) has been shown to be asymptotically identical to an ICC, though small differences will be found in small samples or with missing data.

                Nevertheless, Daniel's advice about a bootstrap is worthwhile. It's possible to perform cluster resampling, where the cluster is the target of the rating, which preserves the correlation structure of the data.

                Comment


                • #9
                  Thank you, Leonardo. I believe I can implement this with code such as:

                  Code:
                  bootstrapbootstrap exp_list, cluster(varlist) : kappaetc
                  However, may I ask: I'm not sure I understand what you mean when you say
                  "the cluster is the target of the rating."

                  Comment


                  • #10
                    Obviously, correction:
                    Code:
                    bootstrap exp_list, cluster(varlist) : kappaetc

                    Comment


                    • #11
                      And if my raters are clinicians, and I'm comparing Gwet's AC before vs. after a diagnostic skill teaching intervention, then could the binary "Before/After" variable be the cluster?

                      Comment


                      • #12
                        Originally posted by Michael McCulloch View Post
                        And if my raters are clinicians, and I'm comparing Gwet's AC before vs. after a diagnostic skill teaching intervention, then could the binary "Before/After" variable be the cluster?
                        I can't speak to Gwet's AC because I'm not familiar enough with it. That aside, the usual setup I encounter is physicians (i.e., judges) evaluating patients (i.e., subjects). You have the same judges in your setup, and presumably they are also making ratings about patients (or diagnostic images), right? So the patient would be the cluster level. Time wouldn't work here because it would imply only two clusters.
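
                        To illustrate cluster resampling (a hypothetical sketch, assuming long data with one row per subject per reading, as later posted in #14): drawing whole subject clusters keeps both readings of a drawn subject together.

                        Code:
                        preserve
                        set seed 2020
                        * resample subjects with replacement; idcluster() tags each drawn copy
                        bsample, cluster(subject) idcluster(bs_id)
                        sort bs_id reading
                        list bs_id subject reading in 1/6, sepby(bs_id)
                        restore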

                        Comment


                        • #13
                          Let us be very clear about the setup. I understand that there are 7 raters/judges/clinicians and 7 subjects (patients or whatever). The 7 subjects are rated twice, i.e., the raters rate the same group of 7 subjects twice. I further understand that the dataset contains 7 variables representing the raters and one variable representing the time, i.e., before or after training. There are 14 observations, representing the 7 subjects at two points in time.

                          Given this setup, I would first reshape the (small) dataset to wide form so that the first seven variables, say Rater1_1-Rater7_1, represent the ratings before training and the last seven variables, say Rater1_2-Rater7_2, represent the ratings after training. This could be along the lines of

                          Code:
                          rename (Rater#) (Rater#_)
                          unab Raters : Rater1_-Rater7_
                          reshape wide `Raters' , i(id) j(time)
                          assuming id identifies the 7 subjects and time identifies the 2 points in time.

                          We could then simply bootstrap from the 7 observations. Here is a sketch:

                          Code:
                          capture program drop kappaetc_bs
                          program kappaetc_bs , rclass
                              tempname before after diff
                              // agreement coefficients before training
                              kappaetc Rater1_1-Rater7_1
                              matrix `before' = r(b)
                              // agreement coefficients after training
                              kappaetc Rater1_2-Rater7_2
                              matrix `after' = r(b)
                              // return the difference in coefficients for -bootstrap-
                              matrix `diff' = `before' - `after'
                              return matrix diff = `diff'
                          end
                          We could then run

                          Code:
                          // el(r(diff), 1, 5) extracts the 5th coefficient difference from r(b),
                          // presumably Gwet's AC (hence the name delta_ac)
                          bootstrap delta_ac = el(r(diff), 1, 5) , reps(500) : kappaetc_bs
                          estat bootstrap , all
                          Note that this would be easier with example data to play with.

                          Hope that helps.
                          Last edited by daniel klein; 14 Jun 2020, 02:11.

                          Comment


                          • #14
                            Thank you, Daniel, for suggesting the reshape as a first step. I have implemented it, and a data example is below.
                            I'm starting to work through the -capture program- code sketch, which is new to me.

                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input byte subject str18(depth1 depth2 depth3 depth4 depth5 depth6 depth7) str6 reading
                            51 "Neither"  "Neither"  "Deep"     "Neither" "Neither"  "Deep"     "Deep"     "after"
                            52 "Neither"  "Neither"  "Floating" "Neither" "Neither"  "Neither"  "Neither"  "after"
                            53 "Neither"  "Neither"  "Floating" "Neither" "Neither"  "Neither"  "Neither"  "after"
                            54 "Neither"  "Neither"  "Floating" "Neither" "Neither"  "Neither"  "Neither"  "after"
                            55 "Neither"  ""         "Neither"  "Neither" "Neither"  "Floating" "Floating" "after"
                            56 "Neither"  "Neither"  "Neither"  "Neither" "Neither"  "Floating" "Deep"     "after"
                            57 "Neither"  "Deep"     "Floating" "Neither" "Neither"  "Neither"  "Floating" "after"
                            51 "Floating" ""         "Deep"     "Neither" "Neither"  "Neither"  "Deep"     "before"
                            52 "Floating" "Floating" "Neither"  "Neither" "Floating" "Neither"  "Neither"  "before"
                            53 "Floating" "Floating" "Neither"  "Neither" "Floating" "Neither"  "Floating" "before"
                            54 "Floating" "Deep"     "Deep"     "Neither" "Floating" "Floating" "Floating" "before"
                            55 "Deep"     "Deep"     "Deep"     "Neither" "Floating" "Deep"     "Neither"  "before"
                            56 "Floating" "Floating" "Deep"     "Deep"    "Neither"  "Deep"     "Deep"     "before"
                            57 "Deep"     "Floating" "Neither"  "Neither" "Floating" "Deep"     "Neither"  "before"
                            end
                            
                            sort reading  subject
                            l subject reading depth1 depth2 depth3, noo sepby(reading)


                            Comment


                            • #15
                              NB:
                              The diagnostic training is in palpation skills, with 3 categorical answers for raters to choose from: Floating, Deep, or Neither.
                              Variable subject identifies the 7 subjects, with IDs 51-57.
                              Variable reading indicates whether ratings were provided before or after the diagnostic skills training intervention.
                              Variables depth1-depth7 record the choices of raters 1 through 7, with depth1 being the rating choice of rater 1, depth2 of rater 2, and so on.
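
                              Given these definitions, a minimal sketch adapting the reshape from #13 to this dataset, mapping the string variable reading to a numeric time (1 = before, 2 = after) so that the kappaetc_bs program from #13 applies unchanged:

                              Code:
                              * code reading numerically so reshape creates Rater#_1 / Rater#_2
                              generate byte time = cond(reading == "before", 1, 2)
                              drop reading
                              rename (depth#) (Rater#_)
                              unab Raters : Rater1_-Rater7_
                              reshape wide `Raters' , i(subject) j(time)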
                              Last edited by Michael McCulloch; 14 Jun 2020, 20:38.

                              Comment
