
  • Dealing with replicates / duplicates - Your experience

    Dear all,

I know there are many professional, very experienced Stata users here who deal with a lot of datasets.

I wanted your opinion on what people do with a large number of duplicates. I have a database of around 800,000 observations.

    Dataset 1:
    Has PATIENTID, OPERATIONID and a number of other variables (eg. admissiondate, diagnosis, lots of other variables)

    Dataset 2:
    I would like to merge Dataset 1 with Dataset 2, which has OPERATIONID, gender and age in common with it.


    Before merging, I looked for duplicates of OperationID in Dataset 1 and found around 8,000 observations duplicated.
    I used the following code

    bys OperationID (patientid vte diagnosis etc): gen duplicates = cond(_N==1,0,_n)

    (Please note: etc stands for all the other variables in my dataset.)
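For tagging and counting duplicates, Stata's built-in duplicates commands may be simpler than the bysort approach. A sketch; variable names such as patientid are taken from the post and assumed to exist in the dataset:

```stata
* Summary table of how many observations share an OperationID
duplicates report OperationID

* dup = number of *other* observations with the same OperationID
* (0 means unique) - an alternative to the cond(_N==1,0,_n) line above
duplicates tag OperationID, gen(dup)

* Inspect only the problematic groups
sort OperationID
list OperationID patientid if dup > 0, sepby(OperationID)
```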

    I then inspected the groups where an OperationID appears more than once.

    I found, for example, that the data coder:
    - entered the same OperationID for two different patients, or
    - entries sharing the same OperationID had different lengths of stay.

    I wanted to ask all the Stata experts who, I'm sure, have worked with large datasets and found duplicates:

    How do you normally deal with duplicates? Do you just keep the unique observations and move on?
    Or do you create a separate dataset containing the duplicates (where n > 1), inspect them, drop accordingly, and plan to append it back to the unique dataset before merging?

    The other thing I was about to try is to merge Dataset 1 with Dataset 2 (which mainly concerns the operation) using OperationID + Age + Gender as the key.
    However, since there are multiple entries with the same OperationID in Dataset 1, I assume Stata will come back with 'variable OPERATIONID does not uniquely identify observations in the master data'.
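One way to make the merge behaviour explicit is to test the key first and pick the merge type accordingly. A sketch, assuming Dataset 2 is saved as dataset2.dta and is unique on OperationID (the file names are illustrative):

```stata
use dataset1, clear

* isid errors out if OperationID is not a unique key; capture traps that error
capture isid OperationID
if _rc {
    display "OperationID does not uniquely identify observations in Dataset 1"
    * m:1 merge: several master rows may match one operation-level row
    merge m:1 OperationID using dataset2
}
else {
    merge 1:1 OperationID using dataset2
}

* Always check the match results before proceeding
tab _merge
```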

  • #2
    It really depends on how the data is structured. It is very common to have duplicates when you have multiple IDs. In your case I guess it is perfectly reasonable (though undesirable from the patient's point of view) that the same patient has multiple operations, so the same patient can appear multiple times in the same dataset. It would also be perfectly reasonable to code operationid as 1 for the first observed operation for that patient, 2 for the second observed operation for that patient, etc. As a consequence you would expect many duplicates in operationid (especially 1s, fewer 2s, even fewer 3s, etc.).

    This is not a problem: in that case operationid on its own is not supposed to identify an operation; it is operationid and patientid together that identify a particular operation. If that is the structure, then the problem is your second dataset. You would probably have to go back and talk to your data provider, because in that case your data does not contain the necessary information.

    I don't know if your data is structured this way. I actually don't think it is: 8,000 duplicates out of 800,000 observations is too few for my scenario. Every dataset is different; figuring this out often involves some very careful reading of the documentation. Sometimes that is not enough, and you have to talk to colleagues who have worked with that data, and if that does not work, to the people who created it.

    Bottom line: you should not think of this as a Stata problem but as a data problem, and often the problem is not so much the data as your understanding of the data. Don't start typing commands in Stata and force the data to look the way you think it should. In situations like yours I start with the assumption that the data is right and that I just misunderstood what it is supposed to look like. So I close Stata, open the documentation, and start reading again. If I am lucky, I have colleagues who have worked with that data before, and I go buy them a coffee and have a chat. If that is not the case, be prepared for a very long and frustrating process. If you are unlucky, it could easily take weeks, or, if you need to discuss this with the data producers, even months. On the other hand, once you are done you are that colleague who knows how to work with the data, and you can expect a lot of free coffees in the future...
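The structural check described here (operationid meaningful only within patientid) can be sketched in a line or two; run it on Dataset 1 with the variable names from the original post:

```stata
* Succeeds silently if the pair is a unique key; errors out otherwise
isid patientid operationid

* Or, to see the offending groups rather than an error:
duplicates report patientid operationid
```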
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------
