This is a statistics question, though I am most interested in a solution that I could implement relatively easily in Stata.
Some colleagues would like to investigate factors affecting the occurrence of certain types of errors. The setup is basically that in the course of data-entry-like work, people sit down to do a session during which they enter multiple pieces of data into an application. It is in the nature of the application that it is possible to make a "frame-shift" error, in which you at some point start entering data into the wrong "cells," and once you do that, all of your subsequent data is improperly entered for the rest of the session. If you close the application and start again, you are not doomed to continue the error: you can start out "right" and continue "right" from there. The "frame-shift" error may occur with the very first entry, or anywhere during the session, including at the last entry. Or, as in most sessions, no error occurs at all. (Errors occur in fewer than 1 of every 1,000 sessions.) Sessions vary greatly in the number of items entered. The number of items entered may be a risk factor for the occurrence of these errors, but, in any case, the number of items per session cannot be controlled or modified. My colleagues want to set up a series of randomized controlled experiments in which various aspects of the software interface or other aspects of the work environment are modified, to see if the frequency of such errors can be reduced.
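To make the data structure concrete, here is a minimal sketch in Stata (toy data, hypothetical variable names) of the process just described: each session has a random length, a fraction of sessions get a "frame-shift" point, and every entry from that point to the end of the session is in error. The 10% shift rate is deliberately inflated so the toy data contains some errors; in the real data it is below 1 in 1,000 sessions.

```
* Toy simulation of the frame-shift process (hypothetical variable names)
clear
set seed 12345
set obs 200                              // 200 simulated sessions
gen session = _n
gen n_items = 1 + floor(runiform()*50)   // session lengths vary
expand n_items                           // one row per entry
bysort session: gen entry = _n
* In a fraction of sessions, pick the entry at which the frame shift occurs
bysort session: gen shift_at = cond(runiform() < 0.10, 1 + floor(runiform()*n_items), .) if _n == 1
bysort session: replace shift_at = shift_at[1]
* Every entry from the shift point to the end of the session is in error
gen byte error = !missing(shift_at) & entry >= shift_at
```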
Actually, we have been doing some experiments with this for several years now, and hope to do more. The problem lies in the presentation of the results. All of the interventions we have studied are applied at the person level or the session level. It isn't feasible to apply any interventions at the level of the individual entry. (Or at least we haven't thought of any.) So we have always analyzed our data with the data-entry session as the unit of analysis, and reported our results in terms of the effect on the proportion of sessions that contain an error. But the consumers of our research rightly point out that, from a practical perspective, each erroneous entry carries disutility. A session with 2 entries and 1 error is equivalent to a session with 20 entries and 1 error, and each of these is less of a problem than a session with 3 entries and 2 errors; yet we have counted all three as a single error-containing session. So we want to be able to present our results with the entry as the unit of analysis.
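For reference, here is a hedged sketch of the session-level analysis we have been running (hypothetical variable names: operator, session, treatment, error, with one row per entry and error coded 0/1). Each session is collapsed to an indicator for whether it contains any error, and that indicator is modelled with a mixed-effects logistic regression with a random intercept for operator:

```
* Session as the unit of analysis: does the session contain any error?
preserve
collapse (max) any_error = error (count) n_entries = error, by(operator session treatment)
melogit any_error i.treatment c.n_entries || operator: , or
restore
```

It is the entry-level analogue of this that I don't know how to specify, because of the within-session dependence described next.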
The problem is the nesting structure. We have entries nested in sessions nested in operators (who are in turn nested in other groups that, I think, are not relevant to this discussion). The difficulty is that the error status of individual entries within a session is clearly not independent: you go along without error up to some point, and then the rest of the session is all error. This doesn't fit any correlation structure I know how to use in a random-effects model. It is clearly neither exchangeable nor autoregressive, etc.
I'd appreciate any help on how to model this. Thanks in advance.