Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing duplicate values within groups or across variables

    I'm having trouble removing duplicates on address and retaining all unique values from the duplicates.

    I'd like to take a dataset like this:
    ID Name Address Phone
    35 Blue 13 Washington St 888-555-1111
    33 Blue 13 Washington St 888-555-1111
    32 Green 13 Washington St 888-555-2222
    33 Red 13 Washington St 888-555-1111

    And convert it to this where I keep only the unique values for each address:
    ID Name Address Phone ID2 ID3 Name2 Name3 Phone2
    35 Blue 13 Washington St 888-555-1111 33 32 Green Red 888-555-2222

    I can get this far, but I don't know how to remove the duplicate values across ID-ID4, Name-Name4, Phone-Phone4:
    ID Name Address Phone ID2 ID3 ID4 Name2 Name3 Name4 Phone2 Phone3 Phone4
    35 Blue 13 Washington St 888-555-1111 33 32 33 Blue Green Red 888-555-1111 888-555-2222 888-555-1111

    The number of duplicates on Address will vary as well as for the other variables. The duplicates will be in different orders too. For example, another set of duplicate values within the dataset might look like this:
    ID Name Address Phone
    61 Jack 1 Franklin Ave 999-444-333
    61 Mary 1 Franklin Ave 222-555-666
    62 Mary 1 Franklin Ave 777-333-222
    64 Donald 1 Franklin Ave 222-555-666
    61 Harry 1 Franklin Ave 222-555-666
    63 Harry 1 Franklin Ave 777-333-222
    64 Jack 1 Franklin Ave 777-333-222

    Thanks!

  • #2
    If no variables other than ID Name Address and Phone are needed, you could do this:

    Code:
    keep ID Name Address Phone
    duplicates drop
    by Address, sort: gen _j = _n
    reshape wide ID Name Phone, i(Address) j(_j)
    That said, why do you want to do this? The vast majority of analyses you might do with this data in Stata will be far easier if you keep your data in long layout.

    Comment


    • #3
      Thank you for your code. I may not have explained my issue properly. I ran your code and still have the issue where the duplicate ID, Name, and Phone show up in separate variables whereas I want to keep only the unique values. Using the example above, your code gives me this, where Name and Name2 are still duplicates, ID2 and ID4 are still duplicates, etc:
      ID Name Address Phone ID2 ID3 ID4 Name2 Name3 Name4 Phone2 Phone3 Phone4
      35 Blue 13 Washington St 888-555-1111 33 32 33 Blue Green Red 888-555-1111 888-555-2222 888-555-1111

      What I would like is this where each variable for this observation contains a unique value:
      ID Name Address Phone ID2 ID3 Name2 Name3 Phone2
      35 Blue 13 Washington St 888-555-1111 33 32 Green Red 888-555-2222

      Is there a next step I can take after running your code to delete duplicate values across variables ID-ID4, Name-Name4, Phone-Phone4?

      To answer your question, I won't be doing any further analysis on the dataset. I use STATA to clean up the file and then I'll be exporting it to geocode the locations. Once they are mapped I use the additional variables to display the information in a pop up window for each location. I don't want to display duplicate values of ID, Name or Phone numbers in the pop up window.

      Thank you!

      Comment


      • #4
        OK. I misunderstood you to want to eliminate duplicates triples of (ID Name and Phone), but you actually want only distinct values of ID, Name, and Phone separately?

        Then I think you have to do each of those variables separately and merge the results:

        Code:
        keep ID Name Address Phone
        duplicates drop
        
        foreach v of varlist ID Name Phone {
            preserve
            keep Address `v'
            duplicates drop
            by Address, sort: gen _j == _n
            tempfile `v'
            save ``v''
            restore
        }
        
        keep Address
        duplicates drop
        isid Address, sort
        foreach v in ID Name Phone {
            merge 1:m Address using ``v'', nogenerate assert(match master) sorted
            reshape wide `v', i(Address) j(_j)
        }
        Lacking a suitable data set, I have not tested the above code, but I think it will work.

        Comment

        Working...
        X