Removing duplicate values within groups or across variables

Jenna Khan

Join Date: May 2015
Posts: 9

Removing duplicate values within groups or across variables

27 May 2015, 14:50

I'm having trouble removing duplicates on address and retaining all unique values from the duplicates.

I'd like to take a dataset like this:

ID	Name	Address	Phone
35	Blue	13 Washington St	888-555-1111
33	Blue	13 Washington St	888-555-1111
32	Green	13 Washington St	888-555-2222
33	Red	13 Washington St	888-555-1111

And convert it to this where I keep only the unique values for each address:

ID	Name	Address	Phone	ID2	ID3	Name2	Name3	Phone2
35	Blue	13 Washington St	888-555-1111	33	32	Green	Red	888-555-2222

I can get this far, but I don't know how to remove the duplicate values across ID-ID4, Name-Name4, Phone-Phone4:

Name

Address

Phone

ID2

ID3

ID4

Name2

Name3

Name4

Phone2

Phone3

Phone4

Blue

13 Washington St

888-555-1111

Blue

Green

Red

888-555-1111

888-555-2222

888-555-1111

The number of duplicates on Address will vary as well as for the other variables. The duplicates will be in different orders too. For example, another set of duplicate values within the dataset might look like this:

ID	Name	Address	Phone
61	Jack	1 Franklin Ave	999-444-333
61	Mary	1 Franklin Ave	222-555-666
62	Mary	1 Franklin Ave	777-333-222
64	Donald	1 Franklin Ave	222-555-666
61	Harry	1 Franklin Ave	222-555-666
63	Harry	1 Franklin Ave	777-333-222
64	Jack	1 Franklin Ave	777-333-222

Thanks!

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

27 May 2015, 15:00

If no variables other than ID Name Address and Phone are needed, you could do this:

Code:

keep ID Name Address Phone duplicates drop by Address, sort: gen _j = _n reshape wide ID Name Phone, i(Address) j(_j)

That said, why do you want to do this? The vast majority of analyses you might do with this data in Stata will be far easier if you keep your data in long layout.
Comment
Jenna Khan

Join Date: May 2015

Posts: 9
#3

27 May 2015, 16:24

Thank you for your code. I may not have explained my issue properly. I ran your code and still have the issue where the duplicate ID, Name, and Phone show up in separate variables whereas I want to keep only the unique values. Using the example above, your code gives me this, where Name and Name2 are still duplicates, ID2 and ID4 are still duplicates, etc:

ID Name Address Phone ID2 ID3 ID4 Name2 Name3 Name4 Phone2 Phone3 Phone4

35 Blue 13 Washington St 888-555-1111 33 32 33 Blue Green Red 888-555-1111 888-555-2222 888-555-1111

What I would like is this where each variable for this observation contains a unique value:
ID Name Address Phone ID2 ID3 Name2 Name3 Phone2

35 Blue 13 Washington St 888-555-1111 33 32 Green Red 888-555-2222

Is there a next step I can take after running your code to delete duplicate values across variables ID-ID4, Name-Name4, Phone-Phone4?

To answer your question, I won't be doing any further analysis on the dataset. I use STATA to clean up the file and then I'll be exporting it to geocode the locations. Once they are mapped I use the additional variables to display the information in a pop up window for each location. I don't want to display duplicate values of ID, Name or Phone numbers in the pop up window.

Thank you!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#4

27 May 2015, 16:47

OK. I misunderstood you to want to eliminate duplicates triples of (ID Name and Phone), but you actually want only distinct values of ID, Name, and Phone separately?

Then I think you have to do each of those variables separately and merge the results:

Code:

keep ID Name Address Phone duplicates drop foreach v of varlist ID Name Phone { preserve keep Address `v' duplicates drop by Address, sort: gen _j == _n tempfile `v' save ``v'' restore } keep Address duplicates drop isid Address, sort foreach v in ID Name Phone { merge 1:m Address using ``v'', nogenerate assert(match master) sorted reshape wide `v', i(Address) j(_j) }

Lacking a suitable data set, I have not tested the above code, but I think it will work.
Comment

Announcement

Removing duplicate values within groups or across variables

Comment

Comment

Comment