How to Keep ID's with earliest dates without Diagnosis and ID's with earliest dates with Diagnosis

Guest
#1

How to Keep ID's with earliest dates without Diagnosis and ID's with earliest dates with Diagnosis

18 Jan 2019, 12:14

I have a dataset with almost 3 million patientadmissions over a timespan of several years. Every ID has several observations, because every patient may have had several admissions or registrations at various departments, where in some deparment-registrations they have the diagnosis that i am interested in and in some they do not. Some patients do not have any registered diagnosis at all, but i still need them to contribute to the analysis, as not having the outcome. One patient ID can appear many times because of registration a various departments and therefore the same ID, with the same diagnosis can appear twice or more, because he/she was registered at several departments.

I therefore need to keep all ID's that have a diagnosis code (the earliest registered) + i simultaneously need to keep all ID's that doesn't have a diagnosis code (the earliest registered)

Example:
ID: 0000012354 Diagnosis Code: D862 Admissiondate: 04 mar 13
ID: 0000012354 Diagnosis Code: . Admissiondate: 07 jun 14
ID: 0000012354 Diagnosis Code: . Admissiondate: 08 aug 14
ID: 0000012354 Diagnosis Code: C425 Admissiondate: 21 dec 14
ID: 0000043567 Diagnosis Code: . Admissiondate: 03 jan 13
ID: 0000043567 Diagnosis Code: G700 Admissiondate: 16 dec 14
ID: 0000043567 Diagnosis Code: G700 Admissiondate: 21 dec 14
ID: 0000093231 Diagnosis Code: C243 Admissiondate: 01 may 16
ID: 0000074333 Diagnosis Code: . Admissiondate: 01 may 16
ID: 0000074333 Diagnosis Code: . Admissiondate: 01 may 17
ID: 0000074333 Diagnosis Code: . Admissiondate: 01 may 18
ID: 0000074333 Diagnosis Code: . Admissiondate: 07 may 18

Hope you can help again, thank you
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

18 Jan 2019, 12:30

Your "example" is clearly not a Stata data set. If your data are not yet imported to Stata, it is premature to ask for help with code as the details of a solution may well depend on the details of the Stata data set (including metadata). Please read the Forum FAQ for helpful advice on the best ways to show information. In the case of example data, it is by using the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

That said, let me assume that your data set will have 3 variables: id, diagnosis_code, and admission_date, the last of these being a numeric Stata internal format date variable. I am not entirely clear from your description of the problem whether you want the first observation associated with a each patient or, more subtlely, you want the first such observation that has a diagnosis code if there is any such, and just the first observation overall if none have a diagnosis code. I will assume it's the latter. That can be done with:

Code:

gen byte no_dx_code = missing(diagnosis_code) by id (no_dx_code admission_date), sort: keep if _n == 1

Now, there is another difficulty with the way you have formulated the problem. Even in the example you provide, it is possible for a person to have more than one observation with the same date, and the diagnosis codes for those dates do not necessarily agree. The above code is indeterminate in this regard: it will select one of those observations randomly and arbitrarily. Consequently each time you run this code, you will encounter some variability in the retained diagnosis codes. If this is not acceptable, then you need to refine your problem statement by providing some rule for which observation to select when there is more than one observation for a patient on the same date and there are conflicting daignosis codes.

Finally, I will reiterate that this code was written for imaginary data, so it may not work with what you actually have.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

18 Jan 2019, 12:35

Your example data is not helpful for developing code.
"Diagnosis Code" is not a valid Stata variable name.

The diagnosis code appears to be a string variable, but missing values are shown with the "." code Stata uses for a numeric missing value, so perhaps diagnosis code is instead a numeric variable with value labels.

Admissiondate could be a Stata Internal Format daily date value with a %td-style format, or it could be a string variable.

Even the best descriptions of data are no substitute for an actual example of the data, and this was not the best. As I showed, there are many ways your data might be organized that are consistent with what you showed, and each would require a somewhat different approach. In order to get a helpful response, you need to show some example data that better reflects your data.

This is done by using the dataex command to do this. If you are running version 15.1 or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from descriptions like yours. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use dataex.

More advice on making effective use of Statalist can be found in the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, especially in sections 9-12 on how to best pose your question.
Comment

David Benson

Join Date: Oct 2018
Posts: 489

18 Jan 2019, 17:08

Guest, as both Clyde and William mentioned, come back once you have your data in Stata. Once you've done that, it is *much* easier for others to help you if you share your data using Stata's dataex command. If you need help using dataex to share data, there is a YouTube tutorial here. (I made the video too long. Just watch the first 6 minutes at 2x speed

)

Code:

dataex id diagnosis_code date_admitted // data shared via -dataex-.
clear
input long id str4 diagnosis_code float date_admitted
12354 "D682" 19421
12354 ""     19881
12354 ""     19943
12354 "C425" 20078
43567 ""     19361
43567 "G700" 20073
43567 "G700" 20078
74333 ""     20575
74333 ""     20940
74333 ""     21305
74333 ""     21311
93231 "C243" 20575
end
format %td date_admitted

sort id date_admitted
bysort id (date_admitted): gen n = _n
bysort id (date_admitted): gen total_visits = _N
gen has_diag = 1 if diagnosis_code!=""

* Creating date patient first has a diagnosis
sort id has_diag date_admitted
bys id: gen date_first_diag = date_admitted[1] if has_diag==1
format date_first_diag %td

* Does patient ever have a diagnosis (among any of their visits?)
bysort id: gen ever_diag = (has_diag[1]==1)
bysort id: replace date_first_diag= date_first_diag[1] if ever_diag==1
sort id date_admitted

. list, sepby(id) abbrev(17) noobs

  +----------------------------------------------------------------------------------------------------+
  |    id   diagnosis_code   date_admitted   n   total_visits   has_diag   date_first_diag   ever_diag |
  |----------------------------------------------------------------------------------------------------|
  | 12354             D682       04mar2013   1              4          1         04mar2013           1 |
  | 12354                        07jun2014   2              4          .         04mar2013           1 |
  | 12354                        08aug2014   3              4          .         04mar2013           1 |
  | 12354             C425       21dec2014   4              4          1         04mar2013           1 |
  |----------------------------------------------------------------------------------------------------|
  | 43567                        03jan2013   1              3          .         16dec2014           1 |
  | 43567             G700       16dec2014   2              3          1         16dec2014           1 |
  | 43567             G700       21dec2014   3              3          1         16dec2014           1 |
  |----------------------------------------------------------------------------------------------------|
  | 74333                        01may2016   1              4          .                 .           0 |
  | 74333                        01may2017   2              4          .                 .           0 |
  | 74333                        01may2018   3              4          .                 .           0 |
  | 74333                        07may2018   4              4          .                 .           0 |
  |----------------------------------------------------------------------------------------------------|
  | 93231             C243       01may2016   1              1          1         01may2016           1 |
  +----------------------------------------------------------------------------------------------------+


* NOTE: if I sort by id has_diag date_admitted it looks like this:

  . list, sepby(id) abbrev(17)

     +----------------------------------------------------------------------------------------------------+
     |    id   diagnosis_code   date_admitted   n   total_visits   has_diag   date_first_diag   ever_diag |
     |----------------------------------------------------------------------------------------------------|
  1. | 12354             D682       04mar2013   1              4          1         04mar2013           1 |
  2. | 12354             C425       21dec2014   4              4          1         04mar2013           1 |
  3. | 12354                        07jun2014   2              4          .                 .           1 |
  4. | 12354                        08aug2014   3              4          .                 .           1 |
     |----------------------------------------------------------------------------------------------------|
  5. | 43567             G700       16dec2014   2              3          1         16dec2014           1 |
  6. | 43567             G700       21dec2014   3              3          1         16dec2014           1 |
  7. | 43567                        03jan2013   1              3          .                 .           1 |
     |----------------------------------------------------------------------------------------------------|
  8. | 74333                        01may2016   1              4          .                 .           0 |
  9. | 74333                        01may2017   2              4          .                 .           0 |
 10. | 74333                        01may2018   3              4          .                 .           0 |
 11. | 74333                        07may2018   4              4          .                 .           0 |
     |----------------------------------------------------------------------------------------------------|
 12. | 93231             C243       01may2016   1              1          1         01may2016           1 |
     +----------------------------------------------------------------------------------------------------+

You can then use Clyde's code to decide you want to delete the data (keep if n==1).

Last edited by sladmin; 28 Jan 2019, 15:08. Reason: anonymize original poster

Comment

Guest
#5

20 Jan 2019, 14:26

Thank you so much Guys! This was really helpful. I might return with a follow up question later, if that is ok
Comment

Announcement