Sorting duplicates in household IDS

Zuhumnan Dapel

Join Date: Sep 2014
Posts: 392

20 Oct 2019, 16:27

Dear All,
I've got repeated observations with the same IDs. I want to keep only the first of the repeated observations for each ID. For example, in the sample below, I want to keep only two observations: serial number 1 and serial number 31. I will appreciate any help. Thanks.
----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float n str42 HHID
 1 "10100101111008"
 2 "10100101111008"
 3 "10100101111008"
 4 "10100101111008"
 5 "10100101111008"
 6 "10100101111008"
 7 "10100101111008"
 8 "10100101111008"
 9 "10100101111008"
10 "10100101111008"
11 "10100101111008"
12 "10100101111008"
13 "10100101111008"
14 "10100101111008"
15 "10100101111008"
16 "10100101111008"
17 "10100101111008"
18 "10100101111008"
19 "10100101111008"
20 "10100101111008"
21 "10100101111008"
22 "10100101111008"
23 "10100101111008"
24 "10100101111008"
25 "10100101111008"
26 "10100101111008"
27 "10100101111008"
28 "10100101111008"
29 "10100101111008"
30 "10100101111008"
31 "10100101111009"
32 "10100101111009"
33 "10100101111009"
34 "10100101111009"
35 "10100101111009"
36 "10100101111009"
37 "10100101111009"
38 "10100101111009"
39 "10100101111009"
40 "10100101111009"
41 "10100101111009"
42 "10100101111009"
43 "10100101111009"
44 "10100101111009"
45 "10100101111009"
46 "10100101111009"
47 "10100101111009"
48 "10100101111009"
49 "10100101111009"
50 "10100101111009"
51 "10100101111009"
52 "10100101111009"
53 "10100101111009"
54 "10100101111009"
55 "10100101111009"
56 "10100101111009"
57 "10100101111009"
58 "10100101111009"
59 "10100101111009"
60 "10100101111009"
end

------------------ copy up to and including the previous line ------------------

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

20 Oct 2019, 17:00

I assume that the variable n is an "observation number" within the dataset. Using the by: prefix requires the dataset to be sorted, and Stata to know it is sorted. So we sort it by HHID, and having n as the second key (in parentheses, so it is used to sort but not to form groups) ensures that within each HHID the observations stay in the order they originally were in.

Note carefully that the if clause uses the automatic variable _n, which will be the observation number within each by: group. Don't confuse it with the variable n in your data.

Code:

. by HHID (n), sort: keep if _n==1
(58 observations deleted)

. list, clean

        n             HHID
  1.    1   10100101111008
  2.   31   10100101111009
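
To make the difference between _n and n easier to see, here is a minimal sketch on a small made-up dataset (five observations invented for illustration, not taken from the original data). The variable name seq is hypothetical; it simply stores what _n evaluates to under the by: prefix before anything is dropped.

```stata
* Hypothetical toy data: three observations for one household, two for another
clear
input float n str42 HHID
1 "10100101111008"
2 "10100101111008"
3 "10100101111008"
4 "10100101111009"
5 "10100101111009"
end

* Under by:, _n restarts at 1 within each HHID group;
* seq (an illustrative name) records that within-group position
by HHID (n), sort: generate seq = _n

* Keeping seq == 1 selects the same rows as keep if _n==1 under the same by: prefix
keep if seq == 1
list, clean
```

In this toy example the surviving rows should be those with n equal to 1 and 4, that is, the first observation of each household, while seq is 1 in both.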