Need help with dataset with multiple dates

Ushma Agarwal

Join Date: Nov 2020

Posts: 37
#1

Need help with dataset with multiple dates

08 May 2021, 14:40

Hi I am working with a dataset which ahs multiple dates for the same diagnosis for a patient and I only want to keep the observations with the earliest date and drop the rest. I am going to explain a bit how the dataset is so that it can be understandable.

ID DIAGNOSIS DATE
1 CKD 2/9/2000
1 CKD 4/7/2008
1 CHF 2/5/2005
1 CHF 5/5/2009

What I want it to like is to have a column of diagnosis and only the first date.
Please help me out. Thanks

Ushma
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

08 May 2021, 14:49

With a tableau like you show, which does not include metadata that is critical to answering your specific question, it is impossible to know whether your variable DATE is, in fact, a Stata internal format date variable, or if it is some string that, to human eyes, looks like a date, or if it is some other integer with value labels that look like dates. If it is anything other than a Stata internal format date variable, the first thing you need to do is convert it to that. Once that is done:

Code:

by ID (DATE), sort: keep if _n == 1

In the future, it is best not to make people who want to help you guess about your data. The helpful way to show example data and avoid questions like this, is to use the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Ushma Agarwal

Join Date: Nov 2020

Posts: 37
#3

08 May 2021, 14:54

Hi
Thank you so much. I wanted to clarify one more thing. Can I also do it for each diagnosis separately? Like can I use the same code with a condition like if CKD =1 ?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

08 May 2021, 15:00

If you want it separately for each diagnosis it's:

Code:

by ID DIAGNOSIS (DATE), sort: keep if _n == 1

Last edited by Clyde Schechter; 08 May 2021, 15:01. Reason: correct error in code
Comment
Ushma Agarwal

Join Date: Nov 2020

Posts: 37
#5

08 May 2021, 15:09

I am sorry if its a bit confusing. What I tried to do is have a separate column for each diagnosis and I want to keep only the first date for each one of them
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#6

08 May 2021, 15:42

That requires some additional steps beyond what is given in #4.

Code:

by ID, sort: gen seq = _n reshape wide DIAGNOSIS DATE, i(id) j(seq)

That said, be careful what you wish for. I don't know what you plan to do with the data, but most things in Stata are easier (or only possible) with the data in the long layout that you have after the code in #4. Putting the diagnoses and dates "side by side" makes the data easier for human eyes to grasp, but harder for Stata to work with. There are only a few things in Stata that work best with this kind of data laid out wide. So, I've shown you how to do what you ask, but I suggest you reconsider whether it is wise to do so.
Comment
Ushma Agarwal

Join Date: Nov 2020

Posts: 37
#7

08 May 2021, 16:16

Clyde Schechter
The data looks like this right now after using the code:
clear
input long mrn str11 diagnosis1 long date_n1 str11 diagnosis2 long date_n2 str11 diagnosis3 long date_n3
1558 "CKD" 18883 "ESKD" 18898 "HLD" 21193
6394 "CAD" 20217 "CHF" 20108 "CKD" 19606
22243 "CKD" 17098 "ESKD" 19901 "HLD" 20395
30383 "CKD" 13696 "ESKD" 15888 "HTN" 14828
30490 "CAD" 18430 "CHF" 19667 "HTN" 13818
33638 "CKD" 15505 "ESKD" 17065 "HLD" 15979
44044 "CKD" 16218 "ESKD" 16771 "HLD" 21094
52431 "CAD" 14004 "CHF" 14661 "CKD" 13892

My problem is that diagnosis1 diagnosis2 are not the same for each mrn I wanted it to be the same.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#8

08 May 2021, 17:32

Well, unless everybody in the data set has all exactly the same diagnoses, you can't have that. I'm not talking about any limitation of Stata: if you think it through, there is no possible organization of the data that satisfies those criteria. But you can come close. Start out with your original data, as it was in #1. Then do this:

Code:

by id diagnosis (date), sort: keep if _n == 1

You will now have a data set that looks like the -dataex- in the code below, to which you can apply:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long mrn str11 diagnosis long date 1558 "CKD" 18883 1558 "ESKD" 18898 1558 "HLD" 21193 6394 "CAD" 20217 6394 "CHF" 20108 6394 "CKD" 19606 22243 "CKD" 17098 22243 "ESKD" 19901 22243 "HLD" 20395 30383 "CKD" 13696 30383 "ESKD" 15888 30383 "HTN" 14828 30490 "CAD" 18430 30490 "CHF" 19667 30490 "HTN" 13818 33638 "CKD" 15505 33638 "ESKD" 17065 33638 "HLD" 15979 44044 "CKD" 16218 44044 "ESKD" 16771 44044 "HLD" 21094 52431 "CAD" 14004 52431 "CHF" 14661 52431 "CKD" 13892 end format %td date replace diagnosis = strtoname(diagnosis) rename date _date reshape wide @_date, i(mrn) j(diagnosis) string

Note: This code runs fairly smoothly if all the values of the diagnosis variable are well-behaved. I have taken the precaution of applying the strtoname() function to deal with the possibility that some of them have embedded spaces or non-alphanumeric characters. The problem this doesn't solve is if any of the diagnosis variable values is longer than 27 characters. If that happens, when -reshape- tries to create a new variable to correspond to the date of that diagnosis, it will break because of the 32 character limit on variable names. If it's just off by a few characters you could replace _date by _dt, or just dt, or just d as the case may be in the last two lines of the code. If any of the entries in diagnosis are already more than 32 characters, then you will have to change that outright.

Again, I will emphasize that you shouldn't go for this layout unless you have a clear understanding why you need it. It will likely prove an albatross around your neck if you do further data management or analysis in Stata.
Comment

Announcement

Need help with dataset with multiple dates

Comment

Comment

Comment

Comment

Comment

Comment

Comment