
  • Intra-rater reliability vs. test-retest reliability

    Dear Stata-list members,

    I have two questions:
    1. What, if any, is the difference between test-retest reliability and intra-rater reliability? The terms often seem to be used interchangeably in the literature, but there is no precise explanation of the salient differences. Is it, for example, that intra-rater reliability is about agreement/consistency in the ratings of each rater taken separately, while test-retest reliability does not take account of the rater(s) but examines overall agreement/consistency between two measurements made by the same set of raters?
    2. I have a dataset in which 3 raters have each rated the same 30 videotaped meetings on 11 dimensions using 7-point ordinal scales at two time points, 3 months apart. The data are nested. One form of nesting could be the following: dimensions nested in raters nested in films nested in time. What indicator should I use to measure test-retest/intra-rater reliability in this case, and is there a Stata command that would help me implement it?

    Your help in answering these questions would be much appreciated.

    Best wishes,

    Siddhartha

  • #2
    This is a good question. You can use the same statistical method to measure test-retest and inter-rater reliability, but they are not the same concept. I am not sure that I have seen papers that confuse the two.

    Basically, reliability in psychometrics is a test's signal-to-noise ratio. The assumption is that for any test, a person has a true score, which we only observe with error (both random and systematic). Reliability is the variance of the true score divided by the variance of the true score plus the error variance. We cannot observe the true score, but we can estimate this ratio.
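    In symbols, that is the standard classical-test-theory expression, with \sigma^2_T the true-score variance and \sigma^2_E the error variance:

    \[ \text{reliability} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} \]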

    If you give the same people a test or instrument, and then you give them the same instrument a short time later, the level of the latent trait should not have changed much in the interim. So you can take the Pearson correlation of the two scores, or the intraclass correlation from a repeated-measures ANOVA or a hierarchical linear model, as an estimate of test-retest reliability. Test-retest reliability is basically the temporal stability of the instrument. This concept isn't usually applied (I think) to lab tests, but blood pressure is an example of something with low test-retest reliability because it fluctuates over the course of the day. The best practice there is to take several measures during the day and average them.
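    To make those estimates concrete, here is a minimal Stata sketch; the variable names (score1 and score2 in wide form, score and subject in long form with one row per subject-occasion) are hypothetical:

    [CODE]
    * Pearson correlation of the two administrations (wide form: one row per subject)
    pwcorr score1 score2, sig

    * Intraclass correlation from a one-way random-effects model
    * (long form: one row per subject-occasion)
    icc score subject

    * The same idea via a hierarchical (multilevel) linear model
    mixed score || subject:
    estat icc
    [/CODE]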

    Similarly, if you have two different raters rate someone on the same instrument at the same time, that should also constitute a form of reliability. That's test-retest reliability.

    Basically, you can think of these two types of reliability as two stress tests that you should put an instrument through, if possible. Inter-rater reliability might not always be applicable, especially if you are giving someone a self-administered instrument (e.g. having someone self-report on a depression scale). If raters are rating on a binary or ordinal scale, kappa is also an appropriate measure; a short example follows below. I don't often see Bland-Altman plots used to evaluate inter-rater reliability, but when you think about it, they are applicable to anything with a continuous output. I think they are used more for lab tests (e.g. two lab tests for the same physiological parameter).
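    For instance, with two raters scoring the same subjects on an ordinal scale, kappa in Stata could look like this (rater1 and rater2 are hypothetical variables holding each rater's score):

    [CODE]
    * Unweighted kappa for two raters' ratings of the same subjects
    kap rater1 rater2

    * Weighted kappa, usually preferable for ordinal scales
    kap rater1 rater2, wgt(w)
    [/CODE]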
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Similarly, if you have two different raters rate someone on the same instrument at the same time, that should also constitute a form of reliability. That's test-retest reliability.
      I think Weiwen Ng means inter-rater reliability here.

      This concept isn't usually applied (I think) to lab tests,
      Actually, it is. There is measurement error in lab tests, just as there is in all measurement. For whatever reason, for lab tests, test-retest studies are usually summarized as the coefficient of variation in a series of repeat measurements rather than as an intraclass correlation. But it is a parameter that is evaluated for most laboratory tests before they come into general use. There is, by the way, another form of test-retest reliability that is often applied to lab tests: the reproducibility when the same specimen is retested after a long time interval. This gets at the stability of the analyte in stored specimens (as well as potential issues of instrumentation drift). It has practical importance for interpreting results of lab tests performed on specimens that, for whatever reason, were delayed in reaching the lab, or perhaps not stored in ideal conditions before being analyzed. Or it can be the basis for a lab declining to process a specimen that is too old, or inform protocols for how frequently to recalibrate equipment.
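      As a small illustration of that coefficient-of-variation summary, assuming hypothetical variables result (the measured value) and specimen (identifying repeat measurements of the same specimen):

      [CODE]
      * Coefficient of variation (sd/mean) of repeat measurements, per specimen
      tabstat result, by(specimen) statistics(mean sd cv)

      * Or by hand for a single specimen
      summarize result if specimen == 1
      display "CV = " %6.4f r(sd)/r(mean)
      [/CODE]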



      • #4
        Originally posted by Clyde Schechter
        I think Weiwen Ng means inter-rater reliability here.
        Very likely. However, Siddhartha Baviskar asks about intra-rater reliability.



        • #5
          daniel klein Good point. I read #1 a bit too quickly!



          • #6
            Yes, I did mean intRA-rater reliability, not inter-rater reliability.

            Weiwen Ng, following up on your examples of how to measure test-retest reliability: what if my scale is ordinal and doesn't fulfill the assumptions required to use Pearson's r or an ICC variant, e.g. the scores aren't normally distributed?
            Best wishes,

            Siddhartha



            • #7
              Originally posted by Clyde Schechter
              I think Weiwen Ng means inter-rater reliability here.
              A couple of points. I did mean inter-rater reliability. I had got some wires crossed when I was typing. Clyde is correct here.

              I did not notice that Siddhartha typed "intra-rater reliability". My brain skipped over that and interpreted it as inter-rater, i.e. do two different raters using the same scale produce similar results?

              I had not actually heard of intra-rater reliability as a standard term before this. However, this paper distinguishes inter- and intra-rater reliability as well as test-retest reliability. It says that intra-rater reliability

              reflects the variation of data measured by 1 rater across 2 or more trials
              That could overlap with test-retest reliability, and they say this about test-retest:

              It reflects the variation in measurements taken by an instrument on the same subject under the same conditions. It is generally indicative of reliability in situations when raters are not involved or rater effect is neglectable, such as self-report survey instrument.
              The distinction between that and test-retest reliability is still not clear to me. I don't know to what extent intra-rater reliability is a widespread concept.

              Originally posted by Siddhartha Baviskar
              Yes, I did mean intRA-rater reliability, not inter-rater reliability.

              Weiwen Ng, following up on your examples of how to measure test-retest reliability: what if my scale is ordinal and doesn't fulfill the assumptions required to use Pearson's r or an ICC variant, e.g. the scores aren't normally distributed?
              Whatever you are defining as intra-rater reliability, it seems like either kappa or the ICC from a hierarchical ordered logistic regression would do (the command is meologit).
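              Applied to the data described in #1, a hedged sketch of both options; the variable names (rating, rater, meeting, dimension, time) are hypothetical, and the data are assumed to be in long form with one row per rater-meeting-dimension-time:

              [CODE]
              * Weighted kappa of the two occasions, one rater at a time
              reshape wide rating, i(rater meeting dimension) j(time)
              kap rating1 rating2 if rater == 1, wgt(w)

              * ICC on the latent scale from a multilevel ordered logit, again per rater
              reshape long rating, i(rater meeting dimension) j(time)
              meologit rating i.time if rater == 1 || meeting:
              estat icc
              [/CODE]

              A fuller model could also treat dimension as a further random effect, but the basic idea is the same.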



              • #8
                There is another aspect of intra-rater reliability that, in some situations, distinguishes it from test-retest reliability. Measurements generally involve an object of measurement, an apparatus that interacts with the object of measurement, and then a rater who determines the state of the apparatus and reports a result. For example, an x-ray machine interacts with a patient to create an image, and a radiologist interprets the image and reports a diagnosis. The term test-retest reliability is generic: the test is repeated on the same object of measurement, usually within a short time, but other things may vary as well: the patient may be x-rayed again on a different machine, and another radiologist reads that new image. The overall reproducibility of the diagnosis is then analyzed to estimate test-retest reliability. The object of measurement is held constant, but other aspects of the process vary, and the average consistency (measured by an ICC, kappa, or some other statistic) of measurements of the same object defines the test-retest reliability.

                More technically, total variance is partitioned into object, apparatus, and rater components, and the proportion of variance attributable to the object is the test-retest reliability. (In some cases there is added complexity, as the apparatus may also interact with aspects of the environment during the measurement, and that might be yet another variance component to take into account. Or the apparatus may be conceived of as having several independent components--for example, in histopathology there is the process of staining the tissue specimens, and then the microscope, with the pathologist scanning the microscopic images as the rater.) The key to test-retest reliability is that it is the proportion of variance attributable solely to the variation in objects of measurement.

                By contrast, the term intra-rater reliability, in this context, implies that everything but the rater is held constant. A radiologist is asked to re-read the same films, and the consistency of the reported diagnosis is analyzed. Again in technical terms, in this case the sampling for the analysis is revised so that the only non-zero variance components are due to the rater and the object of measurement: variance attributable to the apparatus and environment is fixed at zero by the experimental design (not by the analysis of the data, but by the experimental design).

                Inter-rater reliability is yet a different thing. You show the same images to a bunch of radiologists who read them independently. Each radiologist only reads each image once. In this case variance due to apparatus is zero, and the variance of objects measured is identical across the radiologists. The proportion of variation that is not attributable to the radiologist is the inter-rater reliability.
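                A minimal Stata sketch of the variance-partitioning idea in this post, assuming a continuous measurement y and hypothetical identifiers rater and object, with the two treated as crossed random effects:

                [CODE]
                * Crossed random effects: both rater and object contribute variance
                mixed y || _all: R.rater || object:

                * Reliability-type quantities are then ratios of the estimated variance
                * components, e.g. var(object) / [var(object) + var(rater) + var(residual)],
                * computed by hand from the -mixed- output
                [/CODE]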



                • #9
                  Thank you both for clarifying the concepts and the appropriate methods for measuring them. I will get back with the results.
                  Best wishes,

                  Siddhartha

