Dear all,
I know there are many experienced Stata users here who work with large datasets.
I would like your opinion on how people deal with a large number of replicates (duplicate records). I have a dataset of around 800,000 observations.
Dataset 1:
Has PATIENTID, OPERATIONID and a number of other variables (e.g. admissiondate, diagnosis and many others).
Dataset 2:
I would like to merge Dataset 1 with Dataset 2, which has OPERATIONID, gender and age in common with it.
Before merging, I looked for replicates of OperationID in Dataset 1 and found around 8,000 replicated observations.
I used the following code
bys OperationID (patientid vte diagnosis etc): gen duplicates = cond(_N==1,0,_n)
Please note that "etc" stands for all the remaining variables in my dataset.
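For comparison, Stata's built-in duplicates command can produce a similar tagging. The following is only a minimal sketch on a made-up toy dataset: the values, the variable los (length of stay) and the name dup_n are assumptions for illustration and are not taken from the real data.

```stata
* Toy data: hypothetical values, not the real dataset
clear
input long OperationID long patientid byte los
101 1 3
102 2 5
102 3 5
103 4 2
103 4 7
end

* Summarise how many observations share an OperationID
duplicates report OperationID

* dup_n = number of OTHER observations with the same OperationID
* (0 means the OperationID occurs only once)
duplicates tag OperationID, generate(dup_n)

* Inspect the replicated groups side by side
sort OperationID patientid
list if dup_n > 0, sepby(OperationID)
```

Unlike the cond(_N==1,0,_n) approach, duplicates tag gives every member of a replicated group the same value, so dup_n > 0 selects whole groups rather than only the surplus copies.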
I then inspected the replicates, i.e. the observations where duplicates > 1.
I found, for example, that the data coder had
- entered the same OperationID for 2 different patients
- or entered observations with the same OperationID but different lengths of stay
I wanted to ask the Stata experts here, who I am sure have worked with large datasets before and come across replicates:
How do you normally deal with the replicates? Do you just keep the unique observations and move on?
Or do you create another dataset with the replicates where n>1, inspect them and drop accordingly, with a plan to append the cleaned records to the unique dataset and then merge?
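The split, inspect and append workflow described above could be sketched roughly as follows. This is a hedged illustration on toy data, not a recommendation for which records to drop: the values, the variable los, the name dup_n and the tempfile names are all assumptions.

```stata
* Toy data: hypothetical values, not the real dataset
clear
input long OperationID long patientid byte los
101 1 3
102 2 5
102 3 5
103 4 2
103 4 7
104 5 4
104 5 4
end

duplicates tag OperationID, generate(dup_n)
tempfile unique review

* Set the replicated observations aside for inspection
preserve
keep if dup_n > 0
drop dup_n
save `review'
restore

* Keep the observations whose OperationID occurs once
keep if dup_n == 0
drop dup_n
save `unique'

* Review step: only exact copies on ALL variables are dropped here;
* conflicting records (different patient or length of stay) still
* need a case-by-case decision
use `review', clear
duplicates drop

* Recombine and check which OperationIDs remain unresolved
append using `unique'
duplicates report OperationID
```

Running this as one do-file keeps the temporary files alive until the end; the final duplicates report shows how many conflicting OperationIDs are left to resolve by hand.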
The other thing I was about to try is to merge Dataset 1 with Dataset 2, which is mainly a dataset concerning the operation, using OperationID + Age + Gender.
However, I know there are multiple entries with the same OperationID in Dataset 1, so I assume Stata will come back with 'variable OPERATIONID does not uniquely identify observations in the master data'.
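Whether the key is unique can be checked before merging. The sketch below uses two made-up toy datasets (all values and the tempfile name ops are assumptions) and merges on OperationID only, for brevity.

```stata
* Toy version of Dataset 2: one row per operation (hypothetical values)
clear
input long OperationID byte age str1 gender
101 54 "F"
102 61 "M"
103 47 "F"
end
tempfile ops
save `ops'

* Toy version of Dataset 1: OperationID 102 appears for two patients
clear
input long OperationID long patientid
101 1
102 2
102 3
103 4
end

* isid exits with an error if OperationID is not unique;
* capture noisily shows the message without stopping the do-file
capture noisily isid OperationID

* merge 1:1 would stop with the "does not uniquely identify" error.
* merge m:1 is allowed because OperationID is unique in the using data
merge m:1 OperationID using `ops'
tabulate _merge
```

Note that m:1 only makes the merge run: both patients recorded under OperationID 102 receive the same operation record, so a miscoded ID still has to be corrected in Dataset 1 first.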