deleting duplicates in rows across variables.

hannah mitchell

Join Date: Mar 2018

Posts: 18
#1

deleting duplicates in rows across variables.

03 Mar 2021, 12:29

I have a dataset and want to delete duplicate observations for diagnoses.

Each row I have a unique patient ID - and then diagnosis variables diag_code_1 - diag_code_x (up to 800 for some patiends). diag_code_1 is a combination of 2-5 numbers and letters representing an ICD9 code. The majority of these variables are repeats. What is the best code to use to delete all diagnosis duplicates for each row of the table leaving only unique diagnoses.

Many thanks - very grateful for any help with this.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

03 Mar 2021, 12:59

So this task, like so much data management in Stata, is easy to do if you -reshape- the data to long layout. You do not show example data, so I have to imagine what your data set looks like. This code will show you the general approach, it may or may not actually work in your data, depending on whether I have successfully read your mind.

Code:

reshape long diag_code_, i(patient_id) j(_j) by patient_id diag_code_ (_j), sort: keep if _n == 1

At that point you will have a data set with one observation per distinct diag_code_ value per patient. It is probably best to leave the data in this long arrangement since whatever you plan to do next with the data is probably easiest this way. But if you have a compelling reason to go back to the original wide layout, you can do so as follows:

Code:

by patient_id (_j), sort: replace _j = _n reshape wide

When asking for help with code, always show example data. When showing example data, always use -dataex-. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

hannah mitchell

Join Date: Mar 2018
Posts: 18

08 Mar 2021, 12:30

Hello

Thank you so much for your help with this Clyde, I have a further question.

So I have returned to an original version of the data and am trying to reshape long to wide. I want my table to have one row for each ID and record and for this row to contain (in no specific order) all of the diag_code for that ID row combination.

I had originally used the command

reshape wide diag_cd_1 diag_cd_2 diag_cd_3 diag_cd_4 diag_cd_5 diag_cd_6 diag_cd_7 diag_cd_8 diag_cd_9, i(idrecord) j(obs)

However my dataset is very large and this kept crashing the server.

Is there a more efficient way to do this command? Any variables which are missing data are unimportant to me so can be deleted.
This is my data:

Code:

input str36 id double record str39 idrecord str8(diag_cd_1 diag_cd_2 diag_cd_3 diag_cd_4 diag_cd_5 diag_cd_6 diag_cd_7 diag_cd_8 diag_cd_9)
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "4280"  "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 1 "33333333EEjDjPf_1" "5859"  "40390" "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "2512"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "5856"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "5856"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "5856"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "5856"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "5856"  ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "25001" ""      "" "" "" "" "" "" ""
"33333333EEjDjPf" 2 "33333333EEjDjPf_2" "5856"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 1 "33333333gEEPfrg_1" "486"   ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 1 "33333333gEEPfrg_1" "5856"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 1 "33333333gEEPfrg_1" "7862"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 1 "33333333gEEPfrg_1" "5856"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 1 "33333333gEEPfrg_1" "78609" ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 1 "33333333gEEPfrg_1" "7862"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 1 "33333333gEEPfrg_1" "7862"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 2 "33333333gEEPfrg_2" "78650" ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 2 "33333333gEEPfrg_2" "4293"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 2 "33333333gEEPfrg_2" "78930" ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 2 "33333333gEEPfrg_2" "5750"  ""      "" "" "" "" "" "" ""
"33333333gEEPfrg" 3 "33333333gEEPfrg_3" "5856"  ""      "" "" "" "" "" "" ""
"33333333jPfr3E8" 1 "33333333jPfr3E8_1" "6826"  ""      "" "" "" "" "" "" ""
"33333333jPfr3E8" 1 "33333333jPfr3E8_1" "6826"  "5856"  "" "" "" "" "" "" ""
"33333333jPfr3E8" 1 "33333333jPfr3E8_1" "7823"  ""      "" "" "" "" "" "" ""
"33333333jPfr3E8" 1 "33333333jPfr3E8_1" "6826"  "40391" "" "" "" "" "" "" ""
"33333333jPfr3E8" 2 "33333333jPfr3E8_2" "7931"  ""      "" "" "" "" "" "" ""
"33333333jPfr3E8" 2 "33333333jPfr3E8_2" "78959" ""      "" "" "" "" "" "" ""
"33333333jPfr3E8" 2 "33333333jPfr3E8_2" "5693"  ""      "" "" "" "" "" "" ""
"33333333jPfr3E8" 2 "33333333jPfr3E8_2" "7863"  ""      "" "" "" "" "" "" ""
"33333333jPfr3E8" 2 "33333333jPfr3E8_2" "7863"  "5856"  "" "" "" "" "" "" ""
"33333333jPfr3E8" 2 "33333333jPfr3E8_2" "40391" "5856"  "" "" "" "" "" "" ""
"33333333jPfr3E8" 2 "33333333jPfr3E8_2" "7863"  "7580"  "" "" "" "" "" "" ""
"3333333833K83fg" 1 "3333333833K83fg_1" "5289"  "7840"  "" "" "" "" "" "" ""
"3333333833K83fg" 1 "3333333833K83fg_1" "0542"  "V420"  "" "" "" "" "" "" ""
"3333333833K83fg" 1 "3333333833K83fg_1" "5289"  "7840"  "" "" "" "" "" "" ""
"3333333833K83fg" 1 "3333333833K83fg_1" "0542"  "V420"  "" "" "" "" "" "" ""
"3333333838ff38f" 1 "3333333838ff38f_1" "5856"  ""      "" "" "" "" "" "" ""
"3333333838ff38f" 1 "3333333838ff38f_1" "5856"  ""      "" "" "" "" "" "" ""
"3333333838ff38f" 1 "3333333838ff38f_1" "5856"  ""      "" "" "" "" "" "" ""
"333333383DPDKEP" 1 "333333383DPDKEP_1" "2851"  "4019"  "" "" "" "" "" "" ""
"333333383DPDKEP" 1 "333333383DPDKEP_1" "6238"  "25040" "" "" "" "" "" "" ""
"333333383DPDKEP" 2 "333333383DPDKEP_2" "99656" ""      "" "" "" "" "" "" ""
"333333383DPDKEP" 2 "333333383DPDKEP_2" "55320" ""      "" "" "" "" "" "" ""
"333333383DPDKEP" 2 "333333383DPDKEP_2" "99656" ""      "" "" "" "" "" "" ""
"333333383DPDKEP" 2 "333333383DPDKEP_2" "5856"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "V5844" "V420"  "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "5849"  "V420"  "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "5849"  "V420"  "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "V5844" "V420"  "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "5859"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "5849"  "V420"  "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "5859"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "V582"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "V5844" "V420"  "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "5859"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 1 "333333383DgEr8f_1" "5849"  "V420"  "" "" "" "" "" "" ""
"333333383DgEr8f" 2 "333333383DgEr8f_2" "78903" "59970" "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "31401" ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "5856"  "5859"  "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "78900" "28419" "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "5859"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "40391" "28419" "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "5859"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "591"   "78959" "" "" "" "" "" "" ""
"333333383DgEr8f" 3 "333333383DgEr8f_3" "4019"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 4 "333333383DgEr8f_4" "31401" ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 4 "333333383DgEr8f_4" "4019"  "5859"  "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "78039" "78097" "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "31401" ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "4293"  "5856"  "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "4293"  "5856"  "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "99681" "5856"  "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "4019"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "V812"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "5859"  "V4511" "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "78097" "5859"  "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "4019"  ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "4293"  "5856"  "" "" "" "" "" "" ""
"333333383DgEr8f" 5 "333333383DgEr8f_5" "78039" ""      "" "" "" "" "" "" ""
"333333383DgEr8f" 6 "333333383DgEr8f_6" "5859"  "V4511" "" "" "" "" "" "" ""
"333333383DgEr8f" 6 "333333383DgEr8f_6" "3453"  "4019"  "" "" "" "" "" "" ""
"333333383DgEr8f" 6 "333333383DgEr8f_6" "5859"  "V4511" "" "" "" "" "" "" ""
end

Really grateful for any help with this.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

09 Mar 2021, 10:40

First, are you sure that you're actually crashing the server? -reshape wide- on a large data set can take a very long time and Stata will be unresponsive during that time. Have you tried letting it run overnight? You might be surprised to find it runs. Is it actually halting with an error message saying that it's out of memory or something like that? If it just seems to be hung up, you can't be sure it's not really running.

That said, what you have is already a hybrid between long and wide. I have some questions about it. In the example you show here, the variables id and id record contain mostly the same information, and the separate parts can be recovered from idrecord alone. So I would drop id and record and just keep idrecord. You can always recreate id and record after you're done if you need them. Next, in the example you show all the diag_cd* variables have missing values except diag_cd1 and diag_cd2. If that's true throughout the whole data set, get rid of all those empty diag_cd3-diagcd800 variables: they have no information but they take up memory and processing time.

Next, there is this variable idrecord. I don't know what it is and what you want to do with it. The same value can appear more than once with the same id (see, for example record

I'm going to assume now that in the full data set the diag_cd* variables are not completely empty, so you can't just get rid of them. Then I think the first step is to actually first go completely long and get rid of all the missing diag_cd values. Then go fully wide in a single step from there.

Code:

drop id record gen long obs_no = _n reshape long diag_cd_, i(obs_no) drop if missing(diag_cd_) by idrecord, sort: replace _j = _n drop obs_no reshape wide diag_cd, i(idrecord) j(_j)

If that still crashes your server, then I would do that as far as drop obs_no. Then, I would break up the data into smaller subsets, chunks consisting, say of those observations with _j between 1 and 100, _j between 101 and 200,... _j between 701 and 800. I would then reshape each of those subsets wide, and then merge those results altogether.
1 like
Comment
hannah mitchell

Join Date: Mar 2018

Posts: 18
#5

14 Mar 2021, 17:04

Thank you so much this is very helpful.

This seems to work to get one row for each idrecord value - however there are up to 1,400 diag_cd_ variables for some records the majority of which are duplicates - what is the best way to drop these duplicates but preserve the idrecord.

I had tried this

however then I loose around 2000 idrecord values - presumably ones which have no diagnoses. Really grateful for any suggestions of best code which drops any duplicate diagnoses. Thanks!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#6

14 Mar 2021, 19:21

The following code, adapted from #4 will do this:

Code:

drop id record gen long obs_no = _n reshape long diag_cd_, i(obs_no) drop if missing(diag_cd_) by idrecord diag_cd_, sort: keep if _n == 1 by idrecord, sort: replace _j = _n drop obs_no reshape wide diag_cd, i(idrecord) j(_j)

By the way, I see the code you show in #5 refers to prcdr_cd instead of diag_cd. It's the same either way. Perhaps you are doing both of these?
Comment
hannah mitchell

Join Date: Mar 2018

Posts: 18
#7

16 Mar 2021, 19:55

Thank you so much! This seems to work much better. Apologies I had put prcdr_cd instead of diag_cd in the last post but am doing both in the same way.

The only issue with this code is that I am loose a small amount of idrecords with this reshaping (I presume because some have no procedure or dx codes). Is there a way to avoid doing this.

I am really appreciative of your help with this.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#8

16 Mar 2021, 20:03

Replace the -drop if missing(diag_cd_) command with:

Code:

by idrecord (diag_cd_), sort: drop if missing(diag_cd_) & _n > 1
Comment
hannah mitchell

Join Date: Mar 2018

Posts: 18
#9

17 Mar 2021, 10:40

Thank you so much for your help, I am incredibly grateful.
Comment

Announcement