Need Help to Remove Duplicates

kotey dzanie

Join Date: Jul 2022
Posts: 26

20 Jul 2022, 14:20

Hi Everyone,

Please, I need help removing duplicates in Stata. I have decided to post again because I didn't do a good job of explaining the problem in the original post. I have a dataset which contains over 200 observations. What I want to do is remove the duplicates and retain a single value for each ID number. For an ID with both Yes and null values, I want to keep the Yes. For an ID with both No and null values, I want to keep the No. For an ID with only null values, I want to keep just one null.

ID Number	Value
D711	null
D711	null
D711	null
D711	Yes
D714	No
D714	null
D714	null
D715	Yes
D715	null
D722	null
D722	null
D729	No
D729	null
D722	Yes
D723	null
D723	null
D723	null
D728	null
D728	null
D728	null

Andrew Musau

Join Date: Oct 2014
Posts: 10481

20 Jul 2022, 15:14

Is "Value" a string variable? If so, you just have to keep the first observation for each ID after sorting.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4(idnumber value)
"D711" "null"
"D711" "null"
"D711" "null"
"D711" "Yes"
"D714" "No"  
"D714" "null"
"D714" "null"
"D715" "Yes"
"D715" "null"
"D722" "null"
"D722" "null"
"D729" "No"  
"D729" "null"
"D722" "Yes"
"D723" "null"
"D723" "null"
"D723" "null"
"D728" "null"
"D728" "null"
"D728" "null"
end

bys idnumber (value): keep if _n==1

Res.:

Code:

. l, sep(0)

     +------------------+
     | idnumber   value |
     |------------------|
  1. |     D711     Yes |
  2. |     D714      No |
  3. |     D715     Yes |
  4. |     D722     Yes |
  5. |     D723    null |
  6. |     D728    null |
  7. |     D729      No |
     +------------------+

For such a problem, it is important to present a data example using dataex as details on the variable type matter. See FAQ Advice #12.
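
A note on why the one-line sort above is enough: Stata sorts strings by byte value, so uppercase letters come before lowercase ones, and "No" and "Yes" both sort ahead of "null". If relying on that ordering feels fragile, the sketch below makes the preference explicit. It assumes that value is a string coded exactly as "Yes", "No", or "null", and the helper variable isnull is a hypothetical name introduced only for illustration.

```stata
* Sketch only: assumes value is a string coded exactly "Yes", "No", or "null"
clear
input str4 idnumber str4 value
"D711" "null"
"D711" "Yes"
"D714" "No"
"D714" "null"
"D723" "null"
"D723" "null"
end

* isnull is a hypothetical helper: 0 for Yes/No, 1 for null
gen byte isnull = (value == "null")

* Within each ID, non-null rows sort first; keep only the first row
bysort idnumber (isnull): keep if _n == 1
drop isnull

list, sep(0)
```

This variant does not check for an ID that has both a Yes and a No; in that case which of the two is kept is arbitrary.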

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#3

20 Jul 2022, 15:26

This question is fairly straightforward if we can assume that all values of the variable value are coded as Yes, No, or null, with no typographical variations. Since text data is often unreliable in this way, the first command below verifies this assumption. There is also the question of what to do if some idnumber has both a Yes and a No response among his/her observations. On the assumption that this type of contradictory response is unacceptable, the third command verifies that this does not occur (or aborts with an error message if it does). Finally the last command retains the one desired observation.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str5 idnumber str4 value
"D711 " "null"
"D711 " "null"
"D711 " "null"
"D711 " "Yes"
"D714 " "No"
"D714 " "null"
"D714 " "null"
"D715 " "Yes"
"D715 " "null"
"D722 " "null"
"D722 " "null"
"D729 " "No"
"D729 " "null"
"D722 " "Yes"
"D723 " "null"
"D723 " "null"
"D723 " "null"
"D728 " "null"
"D728 " "null"
"D728 " "null"
end

assert inlist(value, "null", "Yes", "No")
gen byte preference = inlist(value, "Yes", "No")
by idnumber preference, sort: assert value[1] == value[_N]
by idnumber (preference): keep if _n == _N

In your earlier post on this topic, I asked you to use the -dataex- command to show your example data. Because you did not do this, I cannot be sure that value is actually, as I have assumed for the code, a string variable. If it is not, then the code will produce only error messages and we will have both wasted our time. In the future, please help those who try to help you: use the -dataex- command, and no other means, for showing example data. It is the only way to assure that all of the necessary information about the data is provided in a way that can be used to develop and test code.

Added: Crossed with #2.
kotey dzanie

Join Date: Jul 2022

Posts: 26
#4

20 Jul 2022, 17:50

Thank you Clyde and Andrew. The codes worked.