  • Dropping duplicates based on their smallest/highest count

    Hi all, I have a "gravity" dataset with tariffs and trade by year. There are instances when, for example, tariff data for commodity 4311 (SITC rev. 3) are available for 2007, 2004 and 2006, but trade takes place only in 2008. The TRAINS database therefore uses 2008 as the closest trade year for these three tariff years. So the trade takes place in 2008, but there is no tariff set for 4311 in that year; instead there are three other tariff years (2007, 2006 and 2004).
    What I want to do is keep, among duplicate panelids, the observation with the smallest absolute value of "tradeyear - tariffyear" (the variable dif in my dataset). In other words, I want to keep the smallest dif within each (duplicate) panelid.
    I attached a screenshot of an example duplicate panelid (i.e. I want to keep observation #4, since dif is smallest within its panelid, etc.).

    Appreciate it.
    [Attachment: sample.png (108.1 KB), screenshot of an example duplicate panelid]

  • #2
    Jakhongir:
    you may want to try something along the lines of the following toy-example:
    Code:
    set obs 4                         // create 4 empty observations
    g id=1 in 1/2                     // observations 1-2 form the first group
    replace id=2 in 3/4               // observations 3-4 form the second group
    g diff=runiform()*10              // fake "difference" values
    bysort id: egen flag=min(diff)    // smallest diff within each id
    keep if diff==flag                // keep only the row(s) at that minimum
    drop flag
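    Applied to the variable names that appear in the question (panelid, tradeyear and tariffyear; these are assumptions about the actual dataset), the same idea might look like this sketch:
    Code:
    gen diff = abs(tradeyear - tariffyear)    // absolute gap between trade year and tariff year
    bysort panelid: egen flag = min(diff)     // smallest gap within each panelid
    keep if diff == flag
    drop flag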
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Consider this variation on Carlo's code:

      Code:
      set obs 4
      g id=1 in 1/2
      replace id=2 in 3/4
      g diff=runiform()*10
      bysort id (diff): keep if _n == 1
      The main idea is to sort the values within each group so that the smallest value of your variable always comes first within the group; then keep just those first observations.
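      As a sketch (panelid, tradeyear and tariffyear are assumed variable names taken from the question), the same one-liner on the real data could look like:
      Code:
      gen diff = abs(tradeyear - tariffyear)     // absolute gap between trade year and tariff year (assumed variables)
      bysort panelid (diff): keep if _n == 1     // within panelid, sorted so the smallest gap comes first
      Note that if two tariff years are equally close, this keeps an arbitrary one of them; add further variables inside the parentheses to break ties deterministically.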



      • #4
        Nick:
        thanks for the variation.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thank you for your suggestions.
          I am sorry, I think I did not formulate my question in the right way.
          The fact is I have over 140,000 duplicates (less than 6% of my data), and they arise for a variety of reasons.
          The commands you provided did not do what I needed, but thanks to your suggestions I was able to solve the problem.
          What I basically did was the following:
          Code:
          bysort panelid: egen flag=min(diff)          // smallest diff within each panelid
          duplicates tag panelid tradeyear, gen(dup)   // dup>0 flags duplicated panelid-tradeyear pairs
          drop if dup>0 & diff!=flag                   // drop duplicates that are not at the minimum diff
          The steps were a little more complicated, since there were many types of duplicates (duplicates with different import values only, duplicates with different tariff rates only, both, etc.).
          I do appreciate your help!
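          As a final check, a sketch assuming panelid and tradeyear should uniquely identify the cleaned data (adjust the variable list to your own panel):
          Code:
          duplicates report panelid tradeyear   // should report no surplus observations
          isid panelid tradeyear                // errors out if any duplicates remain
          This flags anything left over, for example ties where two tariff years are equally close.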

