Delete the duplicates by column

Priver JM

Join Date: Feb 2019
Posts: 30

Delete the duplicates by column

04 Sep 2019, 11:00

Hi, I have more than 3,000 dta files with same format and this is one of the file.dta below.
As you can see, this is a wide type data set and I want to get rid of same value by each column. For your understanding, Var1 only contains 5 digit number value and the rest of cell is empty. For Var2, it only has date/month/year in the column and the rest is empty as well.
Var5 and Var7 has the same value, which does not have to be seen more than one so, I need only one 360 ad 1 for them.
The rest of variables (Var3, Var4, Var6 ...) have binary (0 or 1) values but I need only one value for each one. So, if there "1" in the Var3 column, I do not need to see zero. Otherwise, there should be one "0" by each column.
I have been trying to use "duplicates", "carryforward" option, however, it is hard to eliminate them with holding all of the conditions that I mentioned.

Thank you in advance!

Current state of data

ID	Var1	Var2	Var3	Var4	Var5	Var6	Var7	...	Var99
1			0	0	360	0	1		0
2			0	0	360	1	1		0
3		31/01/1948	0	0	360	0	1		0
4	23789		1	0	360	0	1		1
		12/12/2014	0	0	360	0	1		1
	12345		0	0	360	0	1		1
		24/07/1999	0	0	360	0	1		0
...	98765		0	0	360	0	1		0
3000			0	0	360	1	1		0

It should be like this.

ID	Var1	Var2	Var3	Var4	Var5	Var6	Var7	...	Var99
1	23789	31/01/1948	1	0	360	1	1		1
2	12345	12/12/2014
3	98765	24/07/1999

Last edited by Priver JM; 04 Sep 2019, 11:09.

Tags: categorical, data, loop, string

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

04 Sep 2019, 11:48

Not only is this one of the most bizarre data layouts I have ever seen, your description of it does not agree with what you show in the tables. Var1 does not have any five digit numbers. Var5 clearly does not have the same values as Var7.

Most confusing: how do you know that the 12345 value of Var1 belongs, in the final result, with ID 2: there is nothing in the data you show that suggests that. Similarly, it is unclear why 24/907/1999 is to be linked with ID 3 instead of one of the others. I cannot make any sense of "The rest of variables (Var3, Var4, Var6 ...) have binary (0 or 1) values but I need only one value for each one. So, if there "1" in the Var3 column, I do not need to see zero. Otherwise, there should be one "0" by each column."

I suggest you post back with a clearer explanation of what you have, what you want, and how Stata is supposed to decide which row's values get placed together with which ID. And please clarify what you mean about Var3.
1 like
Comment
Priver JM

Join Date: Feb 2019

Posts: 30
#3

04 Sep 2019, 12:07

I apologize for making a confusion. As you mentioned, there is no relationship between ID and Variables so please ignore the ID. I just generated the ID to see if I can sort the variables, but it seems not working. Each variables are made from the first column as Var1 to Var99, if a certain keyword is found. For example, if there is a keyword "apple" in the first column, "1" is filled in the Var1 at the certain row position and otherwise it will be "0".

But, this process made a current situation that is not able to see all of result at a glance since each column has 3,000 rows. That's why I asked this issue. I hope this makes clear to explain to you.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

04 Sep 2019, 12:29

I'm sorry, but I don't understand any better at this point. None of my questions in #2 have been answered.
Comment
Priver JM

Join Date: Feb 2019

Posts: 30
#5

04 Sep 2019, 12:41

1) Var1 does not have any five digit numbers; the value ranges [00001] to [99999]; Numeric values should be kept in the column. Missing values are to be deleted.
2) Var5 and Var7 has same value within the column respectively. It needs to have one "360" and one "1" in each row.
3) Var3 has "1" and "0"; if there is "1", I need to delete all of the "0" in the column. If there is not "1", I want to see only one "0" in the column.

Hope this helps understanding!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

04 Sep 2019, 12:45

I'm sorry, but I still don't understand this at all. Perhaps somebody else does and can help you.
Comment
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#7

04 Sep 2019, 20:18

I also found your description, as well as the logic and reason for the output, unclear. Below is just a wild guess, targeting at your described output in #1.

Code:

tempfile Input Output save `Input', replace drop ID Var1 Var2 collapse (max) * save `Output', replace use `Input', clear keep Var2 drop if missing(Var2) merge 1:1 _n using `Output', nogen save `Output', replace use `Input', clear keep Var1 drop if missing(Var1) merge 1:1 _n using `Output', nogen

By the way, in the future, please use -dataex- to give out a small sample of your data. Read more detail on FAQ, section 12.2.
Comment

Announcement

Delete the duplicates by column

Comment

Comment

Comment

Comment

Comment

Comment