  • Dropping observations with the same ID, but keeping the first

    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int ROUND_DATE long(COM_NAME STAGE)
    22592 2 2
    22592 2 2
    22420 3 3
    22420 3 3
    22818 3 3
    22818 3 3
    22818 3 3
    22818 3 3
    22818 3 3
    19205 4 6
    21192 5 2
    21192 5 2
    22195 5 3
    22195 5 3
    22195 5 3
    22195 5 3
    22712 5 3
    22712 5 3
    22712 5 3
    22712 5 3
    19926 6 4
    22834 7 2
    22834 7 2
    22834 7 2
    22834 7 2
    22834 7 2
    22834 7 2
    15809 8 1
    18995 9 2
    20515 9 3
    20515 9 3
    21444 9 3
    22362 9 4
    22362 9 4
    22362 9 4
    end
    [/CODE]

    In this dataset, COM_NAME is the ID variable. It was originally a string, and I encoded it to make it numeric. As you can see, several observations share the same COM_NAME (ID).
    What I need to do is keep only the first (oldest) observation for each COM_NAME.

    Could you please give me some guidance on how to do this?

    Many thanks

  • #2
    I used this code:

    sort COM_NAME ROUND_DATE
    by COM_NAME: keep if _n==1

    and I think it works properly.

    Thanks



    • #3
      That’s fine. It could also be done in one line:

      Code:
      bysort COM_NAME (ROUND_DATE): keep if _n ==1£
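
      As a minimal sketch of what that line does, here are a few rows picked from the example in #1 (the variable seq is a hypothetical helper, not part of the original data). Within each COM_NAME group the observations are ordered by ROUND_DATE and _n starts again at 1, so _n == 1 marks the earliest round.

      ```stata
      * A few rows from the example in #1, chosen for illustration only
      clear
      input int ROUND_DATE long(COM_NAME STAGE)
      22592 2 2
      22592 2 2
      22420 3 3
      22818 3 3
      21192 5 2
      22195 5 3
      22712 5 3
      19205 4 6
      end

      * Sort by company, then by date within company;
      * _n is the observation number within each company
      bysort COM_NAME (ROUND_DATE): generate long seq = _n
      list, sepby(COM_NAME)

      * Keep the earliest round for each company
      keep if seq == 1
      ```

      If several observations in a company share the earliest ROUND_DATE, which of them comes first is arbitrary. That is harmless when the tied rows are identical, as they are here.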



      • #4
        I think something is wrong with both pieces of code. I can't say what, but the result looks odd.

        I know that my data has 25,890 unique company IDs in the original file. However, when I used the code above, it kept only 18,354 observations!

        Why would this happen? I don't know exactly how Stata works when using sort and _n. Could you help with this? Why doesn't the code keep an observation for every company?



        • #5
          In #3 there was a stray character £ -- sorry about that -- but otherwise the principle is clear here. The code in #2 and #3 will keep one observation only for each distinct company name. So "something wrong with both codes" needs to be substantiated with specific examples of puzzling results. In particular, how do you know that you have 25890 unique (meaning distinct) companies?
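
          One way to check the number of distinct companies, sketched here on a few rows from the example in #1 and meant to be run on the full data before anything is dropped. The variable name first_tag is a hypothetical helper. The count of tagged observations should equal the number of observations that the code in #2 and #3 keeps, provided COM_NAME has no missing values.

          ```stata
          * A few rows from the example in #1, chosen for illustration only
          clear
          input int ROUND_DATE long(COM_NAME STAGE)
          22592 2 2
          22592 2 2
          22420 3 3
          22818 3 3
          19205 4 6
          21192 5 2
          22195 5 3
          end

          * Tag exactly one observation per distinct COM_NAME, then count the tags
          egen byte first_tag = tag(COM_NAME)
          count if first_tag

          * Cross-check: how many copies of each COM_NAME are there?
          duplicates report COM_NAME
          ```

          If the count on the full data is 18,354 rather than 25,890, the larger figure was probably obtained some other way, and it is worth checking where it came from.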
