finding duplicates in multiple variables and counting the one that are only duplicate in 1 variable

Svenja Laming

Join Date: Nov 2018

Posts: 4
#1


19 Nov 2018, 11:31

Hello

I have this kind of dataset

HHID herd_size ls_code exotic
1 1 3 0
2 3 3 1
3 2 6 0
1 3 3 1
2 6 5 1
2 1 3 0
3 1 4 1

First, I would like a list of all the observations that have the same HHID number and the same ls_code and, if possible, the same herd_size,
so that I can drop those duplicates.
Second, I would like to sum up the herd_sizes for the ones that have the same HHID and the same ls_code but a different herd_size.
Once I have those sums, I would like to drop the rest of the duplicate HHID numbers.

Can someone help me with this? Until now I haven't figured out how to do it.

Thanks a lot
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

19 Nov 2018, 11:50

First, I would like a list of all the observations that have the same HHID number and the same ls_code and, if possible, the same herd_size,
so that I can drop those duplicates.

Code:

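* Example data
clear
input float(HHID herd_size ls_code exotic)
1 1 3 0
2 3 3 1
3 2 6 0
1 3 3 1
2 6 5 1
2 1 3 0
3 1 4 1
end

* flag = number of surplus copies of each observation (0 if unique)
duplicates tag HHID ls_code herd_size, gen(flag)
* list the HHIDs that have duplicates
tab HHID if flag
* keep one observation from each set of duplicates
duplicates drop HHID ls_code herd_size, force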

Note that the force option is necessary here because the observations still may differ on the variable exotic. Note also that this means that this information is being lost, and you should probably drop that variable to avoid confusion later.
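If you want to see how much information the force option would discard, a minimal sketch along the following lines could be run before the -duplicates drop- command. The variable name exotic_differs is just an illustrative choice, not something from your data.

```stata
* Sketch: flag groups of duplicates whose members disagree on exotic.
* Within each HHID/ls_code/herd_size group the data are sorted by exotic,
* so the first and last values differ only if exotic is not constant.
bysort HHID ls_code herd_size (exotic): gen byte exotic_differs = exotic[1] != exotic[_N]

* Show the affected observations, one block per group
list HHID ls_code herd_size exotic if exotic_differs, sepby(HHID ls_code herd_size)
```

If nothing is listed, exotic is constant within every set of duplicates and nothing is lost by dropping them.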

Second, I would like to sum up the herd_sizes for the ones that have the same HHID and the same ls_code but a different herd_size.

What you say here is different from the title of your post, so I can't tell if what you want is a count of these, or the total of the herd sizes. The code below gives both.

Code:

by HHID ls_code, sort: egen count_of_obs = count(herd_size)
by HHID ls_code, sort: egen total_herd_size = total(herd_size)

Once I have those sums, I would like to drop the rest of the duplicate HHID numbers.

OK. At this point, the variable herd_size will be meaningless, because the herd sizes will be different depending on which one out of each set of duplicates is retained. So, to avoid confusion, I drop that variable first.

Code:

drop herd_size
by HHID ls_code: keep if _n == 1
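As a possible alternative, and not what the code above does, the count, the total, and the reduction to one observation per HHID and ls_code could be obtained in a single step with -collapse-. This is only a sketch: it would be run in place of the -egen- and -keep- commands, while herd_size is still in the data, and it discards every variable not named in the command, including exotic.

```stata
* Sketch: one observation per HHID/ls_code, with the total and the count
* of herd_size. Variable names match those used above.
collapse (sum) total_herd_size = herd_size ///
    (count) count_of_obs = herd_size, by(HHID ls_code)

list, sepby(HHID)
```

With the example data this should leave five observations: for instance, HHID 1 with ls_code 3 has a total_herd_size of 4 and a count_of_obs of 2.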

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Svenja Laming

Join Date: Nov 2018

Posts: 4
#3

20 Nov 2018, 08:23

Thanks, it helped me a lot.
I can now identify some of the duplicates.