  • Sorting duplicates in household IDs

    Dear All,
    I have repeated observations with the same IDs, and I want to keep only the first observation for each ID. For example, in the sample below I want to keep only two observations: serial number 1 and serial number 31. I would appreciate any help. Thanks.
    ----------------------- copy starting from the next line -----------------------

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float n str42 HHID
     1 "10100101111008"
     2 "10100101111008"
     3 "10100101111008"
     4 "10100101111008"
     5 "10100101111008"
     6 "10100101111008"
     7 "10100101111008"
     8 "10100101111008"
     9 "10100101111008"
    10 "10100101111008"
    11 "10100101111008"
    12 "10100101111008"
    13 "10100101111008"
    14 "10100101111008"
    15 "10100101111008"
    16 "10100101111008"
    17 "10100101111008"
    18 "10100101111008"
    19 "10100101111008"
    20 "10100101111008"
    21 "10100101111008"
    22 "10100101111008"
    23 "10100101111008"
    24 "10100101111008"
    25 "10100101111008"
    26 "10100101111008"
    27 "10100101111008"
    28 "10100101111008"
    29 "10100101111008"
    30 "10100101111008"
    31 "10100101111009"
    32 "10100101111009"
    33 "10100101111009"
    34 "10100101111009"
    35 "10100101111009"
    36 "10100101111009"
    37 "10100101111009"
    38 "10100101111009"
    39 "10100101111009"
    40 "10100101111009"
    41 "10100101111009"
    42 "10100101111009"
    43 "10100101111009"
    44 "10100101111009"
    45 "10100101111009"
    46 "10100101111009"
    47 "10100101111009"
    48 "10100101111009"
    49 "10100101111009"
    50 "10100101111009"
    51 "10100101111009"
    52 "10100101111009"
    53 "10100101111009"
    54 "10100101111009"
    55 "10100101111009"
    56 "10100101111009"
    57 "10100101111009"
    58 "10100101111009"
    59 "10100101111009"
    60 "10100101111009"
    end
    ------------------ copy up to and including the previous line ------------------



  • #2
    I assume that the variable n is an "observation number" within the dataset. The by: prefix requires the dataset to be sorted, and Stata to know that it is sorted. So we sort by HHID, with n as a second key (in parentheses, so it is used for sorting but not for forming groups). This ensures that within each HHID the observations remain in their original order.

    Note that the if clause uses the built-in variable _n, which under by: is the observation number within each by-group. Don't confuse it with the variable n in your data.

    Code:
    . by HHID (n), sort: keep if _n==1
    (58 observations deleted)
    
    . list, clean
    
            n             HHID  
      1.    1   10100101111008  
      2.   31   10100101111009
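
    If you would rather inspect the result before deleting anything, here is a minimal sketch of a non-destructive variant. It assumes you start again from the original example data and that n records the original order. The variable names seq and first are hypothetical and chosen only for illustration. seq makes the difference between _n and n visible, and isid exits with an error if HHID does not uniquely identify the remaining observations.

    Code:
    * Start from the original dataex example (all 60 observations).
    * seq is the within-HHID observation number (_n under by:), not the variable n.
    by HHID (n), sort: generate long seq = _n
    * Flag the first observation in each HHID; nothing is dropped yet.
    generate byte first = (seq == 1)
    * Compare n with seq around the boundary between the two households.
    list n seq first HHID if inlist(n, 1, 2, 30, 31, 32), clean
    * Once satisfied, keep the flagged observations and verify uniqueness.
    keep if first
    isid HHID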
