  • Dropping duplicates based on their smallest/highest count

    Hi all, I have a "gravity" dataset with tariffs and trade by year. There are instances when, for example, tariff data for commodity 4311 (SITC rev. 3) are available for 2007, 2004 and 2006, but trade takes place only in 2008. The TRAINS database therefore uses 2008 as the closest trade year for these three tariff years. So the trade takes place in 2008, but there is no tariff set for 4311 in that year; instead there are three other tariff years (2007, 2006 and 2004).
    What I want to do is keep, among duplicate panelids, the observation with the smallest absolute value of "tradeyear - tariffyear" (the variable dif in my dataset). In other words, I want to keep the smallest dif within each (duplicate) panelid.
    I attached a screenshot of an example duplicate panelid (i.e. I want to keep observation #4, since dif is smallest within its panelid, etc.).

    Appreciate it.
    [Attachment: sample.png (108.1 KB), screenshot of an example duplicate panelid]

  • #2
    Jakhongir:
    you may want to try something along the lines of the following toy-example:
    Code:
    set obs 4                         // create 4 empty observations
    g id=1 in 1/2                     // observations 1-2 form the first group
    replace id=2 in 3/4               // observations 3-4 form the second group
    g diff=runiform()*10              // fake "difference" values
    bysort id: egen flag=min(diff)    // smallest diff within each id
    keep if diff==flag                // keep only the row(s) at that minimum
    drop flag
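    Applied to the variable names that appear in the question (panelid, tradeyear and tariffyear; these are assumptions about the actual dataset), the same idea might look like this sketch:
    Code:
    gen diff = abs(tradeyear - tariffyear)    // absolute gap between trade year and tariff year
    bysort panelid: egen flag = min(diff)     // smallest gap within each panelid
    keep if diff == flag
    drop flag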
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Consider this variation on Carlo's code:

      Code:
      set obs 4
      g id=1 in 1/2
      replace id=2 in 3/4
      g diff=runiform()*10
      bysort id (diff): keep if _n == 1
      The main idea is to sort the values within each group so that the smallest value of your variable always comes first within the group; then keep just those first observations.
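      As a sketch (panelid, tradeyear and tariffyear are assumed variable names taken from the question), the same one-liner on the real data could look like:
      Code:
      gen diff = abs(tradeyear - tariffyear)     // absolute gap between trade year and tariff year (assumed variables)
      bysort panelid (diff): keep if _n == 1     // within panelid, sorted so the smallest gap comes first
      Note that if two tariff years are equally close, this keeps an arbitrary one of them; add further variables inside the parentheses to break ties deterministically.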



      • #4
        Nick:
        thanks for the variation.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thank you for your suggestions.
          I am sorry, I think I did not formulate my question in the right way.
          The fact is I have over 140,000 duplicates (less than 6% of my data), and they arise for a variety of reasons.
          The commands you provided did not do what I needed, but thanks to your suggestions I was able to solve the problem.
          What I basically did was the following:
          Code:
          bysort panelid: egen flag=min(diff)          // smallest diff within each panelid
          duplicates tag panelid tradeyear, gen(dup)   // dup>0 flags duplicated panelid-tradeyear pairs
          drop if dup>0 & diff!=flag                   // drop duplicates that are not at the minimum diff
          The steps were a little more complicated, since there were many types of duplicates (duplicates with different import values only, duplicates with different tariff rates only, both, etc.).
          I do appreciate your help!
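          As a final check, a sketch assuming panelid and tradeyear should uniquely identify the cleaned data (adjust the variable list to your own panel):
          Code:
          duplicates report panelid tradeyear   // should report no surplus observations
          isid panelid tradeyear                // errors out if any duplicates remain
          This flags anything left over, for example ties where two tariff years are equally close.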

