Unduplicating data and creating a new dataset with distinct observations

Vanessa Garcia

Join Date: Jun 2022

Posts: 16
#1

Unduplicating data and creating a new dataset with distinct observations

11 Apr 2023, 20:42

Hello,

I have a dataset that contains duplicate observations. I would like to create and save a new dataset with only distinct observations. Any advice on this would be greatly appreciated. Thank you in advance!
Joseph Coveney

Join Date: Apr 2014

Posts: 4400
#2

11 Apr 2023, 22:27

The command sequence

Code:

use dataset // archival copy
duplicates drop _all, force
save new_dataset

might be close to what you're looking for.

You can find out more information with

Code:

help duplicates
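
Before dropping anything, it can help to see how many exact copies exist. The following is a minimal sketch on a made-up three-row dataset (the variables name, phone, and q1 and the helper variable dup are illustrative assumptions, not taken from the actual survey file). With no variable list, duplicates treats two observations as duplicates only if they match on every variable.

```stata
clear
input str10 name phone q1
"Ana" 5551 3
"Ana" 5551 3
"Ben" 5552 4
end

* Count observations by number of copies
duplicates report

* dup = number of other exact copies of each observation
duplicates tag, generate(dup)
list if dup > 0
drop dup

* No varlist: drops exact duplicates across all variables
duplicates drop
```

In this toy example only the second "Ana" row is dropped, leaving two observations.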
Vanessa Garcia

Join Date: Jun 2022

Posts: 16
#3

12 Apr 2023, 06:01

Thank you Joseph! Will this command drop only exact duplicates across all variables?

Apologies that I wasn’t clear and didn’t provide more context in my initial post. After the survey launched, some participants took it more than once (they were supposed to complete it only once), and their responses were not all identical because some were missing. Before dropping the duplicates, I would like to replace the missing responses with the complete ones, and then drop the duplicate entries. I have name and phone to identify duplicates. I was also wondering whether collapse (firstnm) would help with this, but I’m not too familiar with it. Any help is greatly appreciated, thank you again!
Andrew Musau

Join Date: Oct 2014

Posts: 10188
#4

12 Apr 2023, 06:25

and not all responses were the same as some were missing.

As long as this is the case and you don't have different nonmissing responses, then collapse (firstnm) will work.

Code:

ds id, not
collapse (firstnm) `r(varlist)', by(id)

where you replace "id" with the respondent identifier.
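
As an illustration, here is a small sketch that uses name and phone together as the respondent identifier, as described in #3. The toy data and the question variables q1 and q2 are assumptions made up for the example. Each respondent's rows are merged into one, taking the first nonmissing value of every other variable.

```stata
clear
input str10 name phone q1 q2
"Ana" 5551 3 .
"Ana" 5551 . 2
"Ben" 5552 4 1
end

* List every variable except the identifiers; result is left in r(varlist)
ds name phone, not

* Keep the first nonmissing value of each listed variable per respondent
collapse (firstnm) `r(varlist)', by(name phone)
list
```

Here the two partial "Ana" rows combine into a single row with q1 = 3 and q2 = 2.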
Vanessa Garcia

Join Date: Jun 2022

Posts: 16
#5

12 Apr 2023, 08:07

Great, thanks so much, Andrew! For future reference, I’m curious: if a respondent has different nonmissing responses to the same question, what could be done in that case?
Andrew Musau

Join Date: Oct 2014

Posts: 10188
#7

12 Apr 2023, 08:41

Survey responses are not incentive-compatible, so you are just hoping that the conclusions from the descriptive study mostly reflect the truth. For this reason, you cannot guarantee that even those without duplicate entries answered their questionnaires truthfully. With this in mind, you may:

1. Keep the realistic value (e.g., if one value is impossible and a second value is realistic, delete the impossible value).
2. If all values appear realistic, choose one of the values randomly.

I would not advocate averaging, as there is only one truth. One can make an argument for picking either the first value or the last value (someone realized that they made an error and corrected it subsequently, or someone was embarrassed by their correct response and chose a response that they thought was less embarrassing, and so on). That is why randomization may make sense.

Code:

ds id, not
*RANDOM ORDER
gen randomid= rnormal()
sort id randomid
collapse (firstnm) `r(varlist)', by(id)
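
To see which respondents actually have conflicting answers before picking at random, something like the following sketch may be useful. It assumes a numeric identifier id and a single question q1 in a made-up dataset; the helper variables q1_min, q1_max, conflict, and u and the seed value are arbitrary choices for illustration. Setting a seed is an addition here so that the random pick can be reproduced.

```stata
clear
input id q1
1 3
1 .
2 4
2 5
end

* Flag respondents whose nonmissing answers to q1 disagree
* (egen min() and max() ignore missing values)
bysort id: egen q1_min = min(q1)
bysort id: egen q1_max = max(q1)
gen byte conflict = q1_min != q1_max
list id q1 if conflict
drop q1_min q1_max conflict

* Random pick among the candidate rows, reproducible via the seed
set seed 12345
gen double u = runiform()
sort id u
collapse (firstnm) q1, by(id)
```

In this toy data only respondent 2 is flagged. After the collapse, respondent 1 keeps the value 3 and respondent 2 keeps either 4 or 5, depending on the random draw.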
Vanessa Garcia

Join Date: Jun 2022

Posts: 16
#8

12 Apr 2023, 14:18

This is very insightful and makes sense, thank you, again, Andrew!
