  • Unduplicating data and creating a new dataset with distinct observations

    Hello,

    I have a dataset that contains duplicate observations. I would like to create and save a new dataset with only distinct observations. Any advice on this would be greatly appreciated. Thank you in advance!

  • #2
    The command sequence
    Code:
    use dataset // archival copy
    duplicates drop _all, force
    save new_dataset
    might be close to what you're looking for.

    You can find out more information with
    Code:
    help duplicates
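As a rough sketch of what the duplicates subcommands do, the toy example below first inspects the duplicates and then drops them. The variables name, phone and q1 are hypothetical and stand in for whatever the real dataset contains. Without a varlist, duplicates drop removes only rows that are identical on every variable.

```stata
* Toy data; name, phone and q1 are hypothetical variables for illustration
clear
input str10 name str8 phone q1
"Ann" "555-0001" 3
"Ann" "555-0001" 3
"Bob" "555-0002" 4
"Bob" "555-0002" .
end

* Count how many observations are exact copies on all variables
duplicates report

* Flag repeated name/phone pairs: dup = number of extra copies
duplicates tag name phone, generate(dup)
list, sepby(name)

* Drop only rows that are identical on every variable
duplicates drop
```

Here the two Ann rows are exact copies, so one is dropped. The two Bob rows differ on q1, so both are kept.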


    • #3
      Thank you Joseph! Will this command drop only exact duplicates across all variables?

      Apologies that I wasn’t clear and didn’t provide more context in my initial post. After launching a survey, some participants took the survey more than once (they were only supposed to complete it once) and not all responses were the same as some were missing. Before dropping the duplicates, I would like to replace the missing responses with the complete responses, and then drop the duplicate entries. I have name and phone to identify duplicates. I was also wondering whether collapse (firstnm) would help with this, but I’m not too familiar with it. Any help is greatly appreciated, thank you again!


      • #4
        "and not all responses were the same as some were missing."
        As long as this is the case and you don't have different nonmissing responses, then collapse (firstnm) will work.

        Code:
        ds id, not
        collapse (firstnm) `r(varlist)', by(id)
        where you replace "id" with the respondent identifier.
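A minimal sketch of the same idea using name and phone as the identifier, as described in #3. The data and the variables q1 and q2 are made up for illustration. For each variable, collapse (firstnm) keeps the first nonmissing value within each name/phone group.

```stata
* Toy data; q1 and q2 are hypothetical survey items
clear
input str10 name str8 phone q1 q2
"Ann" "555-0001" 3 .
"Ann" "555-0001" . 5
"Bob" "555-0002" 4 2
end

* List every variable except the identifiers, then collapse over them
ds name phone, not
collapse (firstnm) `r(varlist)', by(name phone)
list
```

Ann's two partial responses are combined into one complete row, and Bob's single row is unchanged.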


        • #5
          Great, thanks so much, Andrew! For future reference, I’m curious: if the same respondent gave different nonmissing responses, what could be done in that case?



            • #7
              Survey responses are not incentive-compatible, so you are just hoping that the conclusions from the descriptive study mostly reflect the truth. For this reason, you cannot guarantee that even respondents without duplicate entries answered truthfully. With this in mind, you may:

              1. Keep the realistic value (e.g., if one value is impossible and a second value is realistic, delete the impossible value).
              2. If all values appear realistic, choose one of the values randomly.
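One possible way to sketch option 1 is to set impossible values to missing before collapsing, so that collapse (firstnm) skips them. The variable age and the plausible range of 18 to 99 are assumptions made only for this illustration.

```stata
* Toy data; age and its plausible range are hypothetical
clear
input id age
1 34
1 340
2 51
end

* Treat values outside the assumed plausible range as missing
replace age = . if !inrange(age, 18, 99)

* The first nonmissing value per id is now a realistic one
collapse (firstnm) age, by(id)
list
```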

              I would not advocate averaging, as there is only one truth. One can make an argument for picking either the first value or the last value (someone realized that they had made an error and corrected it subsequently, or someone was embarrassed by their correct response and chose one that they thought was less embarrassing, and so on). Because either argument is plausible, randomization may make sense.

              Code:
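              ds id, not
              local vars `r(varlist)'
              * Random order within id; the seed value is arbitrary, set only for reproducibility
              set seed 12345
              gen randomid = rnormal()
              sort id randomid
              collapse (firstnm) `vars', by(id)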
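Before choosing any of these rules, it may help to see which respondents actually have conflicting nonmissing answers. The sketch below flags them for a single hypothetical item q1. The names q1_min, q1_max and conflict are introduced here for illustration only.

```stata
* Toy data; q1 is a hypothetical survey item
clear
input id q1
1 3
1 .
2 4
2 5
3 2
end

* egen's min() and max() ignore missing values within each id
bysort id: egen q1_min = min(q1)
bysort id: egen q1_max = max(q1)

* A respondent has a conflict when the nonmissing answers differ
gen byte conflict = q1_min != q1_max
list, sepby(id)
```

Only id 2 is flagged here. Id 1 has one answer and one missing value, which collapse (firstnm) alone resolves.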


              • #8
                This is very insightful and makes sense, thank you, again, Andrew!
