Removing earlier dates all across my dataset

Mohammad Raza

Join Date: Apr 2015

Posts: 15
#1

Removing earlier dates all across my dataset

15 Oct 2015, 09:07

Hello everyone,
I have a fairly large dataset (~35000 observations) that is arranged as such:

Patient ID Date Variable X

1 1/12/2000 00:00:00 100

1 1/10/1992 00:00:00 120

2 2/15/2003 00:00:00 98

2 1/8/2005 00:00:00 79

My question is, if I want to keep only the EARLIEST date for every patient and drop all later dates, how can I go about doing that efficiently?

Thank you,
MQR
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35688
#2

15 Oct 2015, 09:20

Code:

bysort ID (Date) : keep if _n == 1
Comment
Mohammad Raza

Join Date: Apr 2015

Posts: 15
#3

15 Oct 2015, 15:33

Thank you for getting back to me Nick! This didn't work, unfortunately - but it did set me off on the right foot. Here's a sample of the data:

Name ID Date

Joe 1 5/30/1995 00:00:00

Joe 1 6/17/1994 00:00:00

Al 2 6/3/1997 00:00:00

Al 2 5/8/1997 00:00:00

Before -keep- I actually decided to do this:

Code:

bysort ID Date: gen x = _n

This actually led to 'x' being 1 for all. Probably because they were all unique values?

Any other advice/input would be highly appreciated. Have been struggling for the greater part of my day with this.
Comment
Mohammad Raza

Join Date: Apr 2015

Posts: 15
#4

15 Oct 2015, 15:51

On the flip-side,

Code:

sort ID Date

followed by

Code:

bysort mrn: gen x = _n

seems to have given me what i need!
Awesome.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35688
#5

15 Oct 2015, 16:33

Mohammad:

I have to say that I am at a loss to know why my code "did not work". You don't explain why not.

It is possible I interpreted your post #1 incorrectly. Your title says "removing earlier dates"; your body of text says "keep only the EARLIEST date"; one of these must be wrong. But I took your emphasis to be decisive.

Nor can I make sense of #4. This refers to an unexplained variable mrn and drops or keeps nothing.

This example is modelled on your data example in #1 and my code in #2 and will be seen to keep only the EARLIEST date.

Code:

clear input ID str9 SDate 1 "1/12/2000" 1 "1/10/1992" 2 "2/15/2003" 2 "1/8/2005" end gen Date = daily(SDate, "MDY") format Date %td bysort ID (Date) : keep if _n == 1 list +----------------------------+ | ID SDate Date | |----------------------------| 1. | 1 1/10/1992 10jan1992 | 2. | 2 2/15/2003 15feb2003 | +----------------------------+
Comment
Mohammad Raza

Join Date: Apr 2015

Posts: 15
#6

15 Oct 2015, 17:04

I apologise, let me clear a few things from the previous posts:
- When you said "bysort ID (Date): keep if _n == 1" , i thought "(Date)" meant my date variable, without the parenthesis. That was causing the issue.
- 'mrn' was actually Patient ID, I was a bit excited that it worked and didn't pay attention while posting. I eventually did keep the earlier dates.

However I'm still struggling with the next step which is similar in nature. If you're willing to help, I can post it here? Otherwise I can post it in the general forum.

Thanks again, you've been very helpful.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35688
#7

15 Oct 2015, 17:11

The parentheses have a vital syntactic role. See the help on by.

This is the General forum and naturally you may post other questions if you have them. Who answers is contingent!
Comment

Mohammad Raza

Join Date: Apr 2015
Posts: 15

15 Oct 2015, 17:26

Just looked up the parenthesis in help -by-. Very cool! Learn something new everyday.
Here's my current problem. I have two columns containing dates. edate contains the earlier dates that I managed to get through your help and merged into a new document. zdate contains those dates as well as additional dates. I want to drop all observations that don't match with edate. Please see example below:

Name	ID	zdate	edate
Allen	1	9/11/2002	9/11/2002
Allen	1	9/11/2002	9/11/2002
Allen	1	9/11/2002	9/11/2002
Allen	1	5/5/2003	9/11/2002
Allen	1	5/5/2003	9/11/2002
Joan	2	6/17/1994	6/17/1994
Joan	2	5/30/1995	6/17/1994
Joan	2	5/30/1995	6/17/1994
Joan	2	5/30/1995	6/17/1994

So in the above example, I would want to keep all observations for Allen that coincide with edate (which would be the first three), and then keep just the first observation for Joan which coincides with edate, and so on for the rest of the data.

Thank you all for your help!

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35688
#9

15 Oct 2015, 17:29

Code:

drop if zdate != edate * or keep if zdate == edate
Comment
Mohammad Raza

Join Date: Apr 2015

Posts: 15
#10

16 Oct 2015, 15:30

Brilliant. worked like a charm.
Thank you very much Nick!
Comment

Patient ID	Date	Variable X
1	1/12/2000 00:00:00	100
1	1/10/1992 00:00:00	120
2	2/15/2003 00:00:00	98
2	1/8/2005 00:00:00	79

Name	ID	Date
Joe	1	5/30/1995 00:00:00
Joe	1	6/17/1994 00:00:00
Al	2	6/3/1997 00:00:00
Al	2	5/8/1997 00:00:00

Announcement