Dropping observations with the same ID, but keeping the first

Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#1

02 Mar 2023, 06:36

[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input int ROUND_DATE long(COM_NAME STAGE)
22592 2 2
22592 2 2
22420 3 3
22420 3 3
22818 3 3
22818 3 3
22818 3 3
22818 3 3
22818 3 3
19205 4 6
21192 5 2
21192 5 2
22195 5 3
22195 5 3
22195 5 3
22195 5 3
22712 5 3
22712 5 3
22712 5 3
22712 5 3
19926 6 4
22834 7 2
22834 7 2
22834 7 2
22834 7 2
22834 7 2
22834 7 2
15809 8 1
18995 9 2
20515 9 3
20515 9 3
21444 9 3
22362 9 4
22362 9 4
22362 9 4
end
[/CODE]

In this dataset, COM_NAME is the ID variable. It was originally a string, which I encoded to make it numeric. As you can see, there are several observations with the same COM_NAME (ID).
What I need to do is keep only the first (oldest) observation for each COM_NAME.

Could you please provide guidance on how to do this?

Many thanks
Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#2

02 Mar 2023, 06:51

I used this code

sort COM_NAME ROUND_DATE
by COM_NAME: keep if _n==1

and I think it works properly.

Thanks
Nick Cox

Join Date: Mar 2014

Posts: 36058
#3

02 Mar 2023, 08:01

That’s fine. It could also be done in one line by

Code:

bysort COM_NAME (ROUND_DATE): keep if _n ==1£
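
As a minimal sketch of what this command does, here it is applied to a handful of rows taken from the example in #1 and deliberately entered out of order. The variable in parentheses is used for sorting within each COM_NAME but not for defining the groups, so _n counts observations within each company from the earliest ROUND_DATE onwards.

```stata
clear
input int ROUND_DATE long(COM_NAME STAGE)
22362 9 4
18995 9 2
20515 9 3
22195 5 3
21192 5 2
19205 4 6
end

* Sort by COM_NAME, then by ROUND_DATE within each COM_NAME.
* Under by:, _n is the observation number within the group,
* so _n == 1 picks the earliest round for each company.
bysort COM_NAME (ROUND_DATE): keep if _n == 1

* One observation per company remains (IDs 4, 5 and 9)
list, noobs
```

If two observations for a company share the same earliest ROUND_DATE, which of them is kept is not determined by this sort. That does not matter when the tied observations are identical, as they appear to be in the example in #1.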
Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#4

03 Mar 2023, 05:09

I think there is something wrong with both versions of the code, although I can't tell what; something is odd.

I know that my data has 25,890 unique company IDs in the original file. However, when I used the code above, it kept only 18,354 observations!

Why would this happen? I don't know exactly how Stata works when using sort and _n. Could you help with this? Why doesn't the code keep an observation for every company?
Nick Cox

Join Date: Mar 2014

Posts: 36058
#5

03 Mar 2023, 05:27

In #3 there was a stray character £ -- sorry about that -- but otherwise the principle is clear here. The code in #2 and #3 will keep one observation only for each distinct company name. So "something wrong with both codes" needs to be substantiated with specific examples of puzzling results. In particular, how do you know that you have 25890 unique (meaning distinct) companies?
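
One way to check that number, sketched here on a few rows from #1 rather than on the full file: the tag() function of egen marks exactly one observation for each distinct value of COM_NAME, so counting the tagged observations gives the number of distinct companies. The variable name first is made up for this illustration. Run on the original data before anything is dropped, this count should equal the number of observations the code in #2 or #3 keeps.

```stata
clear
input int ROUND_DATE long(COM_NAME STAGE)
22592 2 2
22592 2 2
22420 3 3
22818 3 3
19205 4 6
end

* first is 1 for exactly one observation per distinct COM_NAME, 0 otherwise
* (the name first is hypothetical, chosen for this sketch)
egen byte first = tag(COM_NAME)

* The count of tagged observations is the number of distinct companies;
* in this small extract the distinct IDs are 2, 3 and 4
count if first
```

If the count on the full data is 18,354 rather than 25,890, then the question becomes where the figure 25,890 came from.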
