  • Difficulty conceptualizing a nested model

    This is a statistics question, though I am most interested in a solution that I could implement relatively easily in Stata.

    Some colleagues would like to investigate factors affecting the occurrence of certain types of errors. The setup is basically that in the course of data-entry-like work, people sit down to do a session during which they enter multiple pieces of data into an application. It is in the nature of the application that it is possible to make a "frame-shift" error, in which you at some point start entering data into the wrong "cells," and once you do that, all of your subsequent data is improperly entered for the rest of the session. If you close the application and start again, you are not doomed to continue the error: you can start out "right" and continue "right" from there. The frame-shift error may occur with the very first entry, or anywhere during the session including the last entry. Or, as in most sessions, no error occurs at all. (Errors occur in fewer than 1 of every 1,000 sessions.)

    Sessions vary greatly in the number of items entered. The number of items entered may be a risk factor for the occurrence of these errors, but in any case the number of items per session cannot be controlled or modified. My colleagues want to set up a series of randomized controlled experiments in which various aspects of the software interface or other aspects of the work environment are modified, to see if the frequency of such errors can be reduced.

    Actually, we have been doing some experiments with this for several years now, and hope to do more. The problem lies in the presentation of the results. All of the interventions we have studied are applied at the person level or the session level. It isn't feasible to apply any interventions at the level of the individual entry. (Or at least we haven't thought of any.) So we have always analyzed our data with the data-entry session as the unit of analysis, and reported our results in terms of the effect on the proportion of sessions that contain an error. But the consumers of our research rightly point out that from a practical perspective, each erroneous entry carries disutility. A session with 2 entries and 1 error is equivalent to a session with 20 entries and 1 error, and each of these is less of a problem than a session with 3 entries and 2 errors: yet we have counted each of these as a single error-containing session. So we want to be able to present our results with the entry as the unit of analysis.
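
    For reference, the session-level analyses we have been running are roughly of the following form; the variable names here are hypothetical and the exact specifications have varied:

    * one row per session
    *   error_session : 1 if the session contains a frame-shift error, 0 otherwise
    *   treatment     : indicator for the randomized intervention
    *   n_entries     : number of items entered in the session
    *   operator      : identifier for the person doing the data entry
    logit error_session i.treatment c.n_entries, vce(cluster operator)
    * or, with a random intercept for operator instead of clustered SEs:
    melogit error_session i.treatment c.n_entries || operator: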

    The problem is with the nesting structure. We have entries nested in sessions nested in operators (who are in turn nested in other groups that are not relevant to this discussion, at least I don't think they are). The difficulty is that the error status of individual entries within a session is clearly not independent: you go along up to a point without error, and then the rest of the session is all error. This doesn't fit any correlation structure I know how to use in a random-effects model. It is clearly not exchangeable, nor autoregressive, etc.

    I'd appreciate any help on how to model this. Thanks in advance.

  • #2
    You have thought about this for years whereas the problem never occurred to me until you mentioned it -- but since nobody else has said anything I will toss out a few ideas.

    Are you sure you need to go to the entry level? Why not code each session as made a mistake / didn't make a mistake? And when mistakes do occur, maybe code them by type, e.g. one-shot mistake vs. frame-shift mistake.

    It sounds like once you have made the mistake, you are doomed the rest of the way? So maybe set it up as a survival analysis where you see what affects how long before you make the fatal mistake, if you ever do make it. In any event, I am not sure what the point is of analyzing additional entries once you know they are virtually certain to be errors. After you are sure a patient has died, you don't keep checking his pulse, do you?
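
    If it helps, here is a rough sketch of how that might be set up in Stata -- I am making up the variable names, and you will know the right specification better than I do:

    * one row per session, with made-up variables:
    *   error_entry : entry number at which the frame-shift error occurred (missing if none)
    *   n_entries   : total number of entries in the session
    *   operator    : the person doing the data entry
    gen byte failed = !missing(error_entry)
    gen time = cond(failed, error_entry, n_entries)   // error-free sessions censored at the last entry
    stset time, failure(failed)
    stcox i.treatment, vce(cluster operator)
    * or a parametric alternative: streg i.treatment, distribution(weibull)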

    I am sure you have thought about this much more than I have and you may have tried what I said 5 years ago. If this doesn't help, maybe I will at least have bumped the thread so somebody with better ideas notices it. ;-)
    Last edited by Richard Williams; 29 Jul 2014, 22:46.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Thanks for suggesting the survival analysis approach. I hadn't thought of that. I'll wrestle with it and see if I can make it work.



      • #4
        Hi Clyde, this is a tricky one. Richard's survival-analysis suggestion is a good idea and opens another line of analysis. Some thoughts.

        The probability of making a mistake depends on the number of entries done in each session. Therefore, if you're going to look at the entry as the unit of analysis, each entry does not have the same probability of being wrong, so I guess you would have to control for how many entries came before it in the session.

        The probability of there being a mistake in a session also depends on the number of entries per session, so each session does not have the same probability of a mistake, but this can be controlled for by including the number of entries in the session as a covariate.
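
        One way to make both points concrete (with made-up variable names, and really just a discrete-time version of Richard's survival idea) is to keep each session's entries only up to and including the first frame-shift error, and let the entry's position in the session enter as a covariate:

        * one row per entry, with made-up variables:
        *   entry_pos   : position of the entry within its session
        *   error_entry : entry at which the frame shift occurred (missing if none),
        *                 repeated on every row of the session
        *   operator    : the person doing the data entry
        drop if !missing(error_entry) & entry_pos > error_entry
        gen byte first_err = (entry_pos == error_entry)
        cloglog first_err i.treatment c.entry_pos, vce(cluster operator)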

        Now in your analysis there are two perspectives: that of the actual process itself, and that of the mess (disutility). From the process perspective there is just one thing to consider: a frame-shift error happened or it didn't happen. As Richard picked up, once you make a frame-shift error in an entry, the rest of the entries are errors (or is there a way you can make a second frame-shift error in the same session? Because that opens up another can of worms: either two errors make a right, or you totally mess up). Thus it is not logical to consider a session with 1 error any differently from a session with 20 errors when they are caused by a single frame-shift error. They are equivalent. From the disutility perspective it's clear that they are different.

        The question, then, is what the purpose of the analysis is. If the purpose is to try to reduce the occurrence of frame-shift errors, I think the perspective you should take is that of the process itself, and consider that a frame-shift error is the same whether it occurs in the second entry or in the last entry, even though the former would leave many more erroneous entries in the session than the latter and is therefore obviously more harmful (more disutility).

        From my thoughts in the last paragraph, I believe that the appropriate unit of analysis is the session. You can group the observations into those of early occurrence (where the frame-shift error happens in an early entry) and those of late occurrence (where it happens in a late entry) to see if there are any differences in their characteristics. However, this separation may be endogenous.

        You can follow two lines of analysis: the first is what affects the probability of there being a frame-shift mistake at all; the second is, given that there was a frame-shift mistake, what affects the entry at which it happens. The first is a binary model (a panel one, surely), and the second is a duration (survival) model on the subset of the data where there actually were errors. This lets you see how to reduce the chance of there being a frame-shift error, and how to make it happen as late as possible when it does, and thus reduce the number of actual bad entries in the whole data.
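
        In Stata terms, and again with made-up variable names, the two pieces might look roughly like this:

        * (1) does the session contain a frame-shift error at all?
        *     (random-effects panel logit, with operator as the panel variable)
        xtset operator
        xtlogit error_session i.treatment c.n_entries, re

        * (2) among the sessions that do contain an error, how late does it occur?
        *     (all of these sessions "fail", so there is no censoring here)
        stset error_entry if error_session == 1, failure(error_session)
        streg i.treatment, distribution(weibull)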

        I'm sorry if I didn't get into the nesting, but as my mind started to go, the unit-of-analysis dilemma took over.
        Last edited by Alfonso Sánchez-Peñalver; 30 Jul 2014, 09:39.
        Alfonso Sanchez-Penalver

