Data cleaning

Sama Noroozi

Join Date: Apr 2023
Posts: 3

11 Apr 2023, 00:17

Hello everyone,
I have a very large dataset, in which each unique id has multiple columns (different codes) and different dates. I'm trying to figure out a way to extract ids with desired codes and their oldest dates, and then calculate the time difference between codes for each id, and I'm lost

any suggestions on how should I do it? I'd really appreciate your help.
a small sample of my dataset:
id code date

1	00190	20160204
1	00300	20180110
1	00670	20160114
1	01420	20160115
1	01480	20160419
1	13132	20160108
1	13132	20180110
1	13133	20160108
2	13133	20180110
2	00190	20160204
2	00300	20160419
2	00670	20160114
2	01420	20160114
2	01480	20160114
2	13132	20160114
2	13132	20160114
2	00670	20180112

Attached Files

Last edited by Sama Noroozi; 11 Apr 2023, 00:26.

Tags: None

Joseph Coveney

Join Date: Apr 2014

Posts: 4452
#2

11 Apr 2023, 02:07

Originally posted by Sama Noroozi View Post

. . . a way to extract ids with desired codes

If you have few codes

Code:

tempfile dataset quietly save `dataset' keep if inlist(code, <comma separated list of desired codes>) contract id, freq(discard) merge 1:m id using `dataset', assert(match using) keep(match)

If you have many codes, then you'd probably want to do this with a separate dataset that contains the desired codes (one observation per desired code) and, starting from that dataset in memory

Code:

merge 1:m code using `dataset', keep(match) // temporary file created as shown above contract id, freq(discard) merge 1:m id using `dataset', keep(match)

and their oldest dates

Code:

input byte id str5 code str10 date . . . end generate int dt1 = date(date, "YMD") format dt1 %tdCY-N-D quietly bysort id code (dt1): keep if _n == 1

Note: because your dates' format complies with the international standard (ISO-8601), you don't actually need to convert your dates to Stata's internal format in order to do this first step. But, because you'll need to convert them for the next step (below), I went ahead and did the conversion upfront.

. . . and then calculate the time difference between codes for each id

Code:

quietly bysort id (dt1): generate int del = dt - dt[_n-1]
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4452
#3

11 Apr 2023, 02:12

Note that that's the oldest date for each distinct code within id, and the time differences between distinct codes within id. I assume that that's what you were seeking.
Comment
Sama Noroozi

Join Date: Apr 2023

Posts: 3
#4

15 Apr 2023, 23:25

thank you so much for your quick response. I'm sorry, I had to explain my data better. My initial inclusion criteria were ICD codes for a specific disease, and my goal is to look at the incidence of another disease in this population and calculate the time difference between the oldest encounters of these two diagnoses. there are 63,373 unique patient IDs with 37,176,440 observations for 10 variables (of which I only will use date and ICD code). each unique id had multiple encounters with the date and their ICD codes (that can be the same ICD codes with different dates). additionally, each of these two diagnoses has multiple ICD codes (meaning different patients might have different codes, but have the same diagnosis). I also have another big file with the same patient IDs for surgeries (with their own specific codes), and I'm trying to figure out how to merge these files as well, cause I also want to know the time difference between the oldest dates of surgery of the first disease and diagnosis of the second disease. I'm a novice in STATA, but trying to find my way! Thank you so much for your time and your huge help. let me know if I need to provide more information. thanks
Comment
Sama Noroozi

Join Date: Apr 2023

Posts: 3
#5

16 Apr 2023, 02:52

attached is the do file I have prepared so far.
Attached Files

CTS dx.do (1.3 KB, 1 view)

Last edited by Sama Noroozi; 16 Apr 2023, 03:09.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4452
#6

16 Apr 2023, 18:36

Originally posted by Sama Noroozi View Post

My initial inclusion criteria were ICD codes for a specific disease, and my goal is to look at the incidence of another disease in this population and calculate the time difference between the oldest encounters of these two diagnoses. there are 63,373 unique patient IDs with 37,176,440 observations for 10 variables (of which I only will use date and ICD code). each unique id had multiple encounters with the date and their ICD codes (that can be the same ICD codes with different dates). additionally, each of these two diagnoses has multiple ICD codes (meaning different patients might have different codes, but have the same diagnosis). I also have another big file with the same patient IDs for surgeries (with their own specific codes), and I'm trying to figure out how to merge these files as well, cause I also want to know the time difference between the oldest dates of surgery of the first disease and diagnosis of the second disease.

I'm afraid that I don't follow your description and so I doubt that I can be of much practical help.

The description strikes me, though, as trying to accomplish too much at once and as a result the endeavor has becomed jumbled. So, about the only suggestion I could make here is to divide and conquer, breaking the thing into smaller, more cognitively manageable tasks, and tackling each one by one.

You can always save the intermediate datasets that result from these component tasks, and have things come together when companion-task components are ready. For now, you can save the intermediate datasets as permanent datasets, and when you have everything worked out, convert them into temporary datasets (using tempfile) during your final code refactoring in order to leave a cleaner working environment behind when your production code completes.
1 like
Comment

Announcement

Data cleaning

Comment

Comment

Comment

Comment

Comment