Help with long data and counting unique variables by dates

Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#1

Help with long data and counting unique variables by dates

06 May 2019, 11:41

Hi,

I am in desperate need of help! I tried to show my data with dataex but given the size (25 million observations) there are a lot of limitations so I will show the best example below.

id sequence_ date_procedure procedure_1 procedure_2
1 942 3/3/2011 1 .
1 942 3/3/2011 . .
1 942 3/3/2011 1 1
1 952 5/5/2012 . 1
1 952 5/5/2012 . 1
1 952 5/5/2012 . .

the code for sequence was : egen sequence=group(date_procedure)

procedure 1 and procedure 2 can occur on the same or different dates and are based on hcpcs codes in medicare. There may be more than one code for the procedure that is listed which is why procedure can occur twice in the same date.

dates are %tdnn/dd/CCYY format

I want to do two things:
1. I want to count the total number of unique procedures that a patient received. I am having a hard time figuring out how to do this based since the data is in long format and the same procedure may be listed multiple times due to repeat HCPCS codes.

2. I want the date of the second procedure that patients received.

I apologize if this is not a great format to put this question in. Please let me know if I can improve the post or clarify what I need.

Thank you in advance
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30174

06 May 2019, 12:39

Actually, your data are not in a fully long layout; they are in a hybrid layout that is half-long and half-wide. And, in fact, the way to easily do what you need is to reshape it to fully long. Then, if there is a good reason to return to the original hybrid layout once these calculations are done, you can.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int sequence_ float date_procedure byte(procedure_1 procedure_2)
1 942 18689 1 .
1 942 18689 . .
1 942 18689 1 1
1 952 19118 . 1
1 952 19118 . 1
1 952 19118 . .
end
format date_procedure %tdnn/dd/CCYY

//    GO TO FULLY LONG LAYOUT
gen long obs_no = _n
reshape long procedure, i(obs_no) string

//    COUNT NUMBER OF DISTINCT PROCEDURES EACH PATIENT HAS
by id procedure, sort: gen long procedure_count = (_n == 1) if !missing(procedure)
by id: replace procedure_count = sum(procedure_count)
by id: replace procedure_count = procedure_count[_N]

//    FIND SECOND PROCEDURE DATE
by id date_procedure, sort: gen long date_order = (_n == 1) if !missing(date_procedure)
by id: replace date_order = sum(date_order)
by id: egen second_procedure_date = max(cond(date_order == 2, date_procedure, .))
format second_procedure_date %tdnn/dd/CCYY

//    IF THERE IS A GOOD REASON TO GO BACK TO ORIGINAL LAYOUT:
reshape wide

Note: -dataex- accepts both a varlist and -if- and -in- conditions, so you could have shown your example using -dataex- by

Code:

dataex id sequence_ date_procedure procedure_1 procedure_2 if/in whatever

where you replace if/in whatever by an appropriate if or in condition (or both) that selects those particular observations. And that would have been faster for you to do than creating a tableau, and faster for me to do than going through the steps needed to import your tableau to Stata.

Comment

Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#3

06 May 2019, 15:35

Hi Clyde,

I tried and tried to get it into long format based on your code but was not successful. Your dataex code did help (I apologize for making things harder for you). If you wouldn't mind helping again I would appreciate it.

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(id seq_ date_bil ercp_s ptbd sbb) 2 933 16273 378 0 0 2 933 16273 379 0 0 2 933 16273 380 0 0 2 933 16273 381 0 0 2 942 16282 394 0 0 2 942 16282 0 399 0 2 949 16289 0 489 0 2 949 16289 0 503 0 2 976 16316 0 553 0 end format %tdnn/dd/CCYY date_bil
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#4

06 May 2019, 16:04

Please do not, in posts here, say non-specifically that something "did not work" or was "unsuccessful." How is anybody supposed to imagine what went wrong and in what way it did not work?

The data you post here (thank you for using -dataex-) is different from what you showed in #1. (Another reason not to make up a data tableau: if the real data are different, the code you are given may not work.) There is a new variable, called sbb, that I cannot relate to anything in the original example. But it appears that date_bil is what you previously called date_procedure, ercp_s is what you previously called procedure_1, and ptbd is what you previously called procedure_2. The new names for the procedure variables are rather hostile to the way -reshape- works, so renaming them is critical to getting this code to work. As long as we're doing that anyway, we may as well go all the way and simply rename all of the variables to correspond to the names in the original example. Then we can apply that code directly--it worked properly with the example data in #1, and it will work properly again if we do that. So, here we go:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(id seq_ date_bil ercp_s ptbd sbb) 2 933 16273 378 0 0 2 933 16273 379 0 0 2 933 16273 380 0 0 2 933 16273 381 0 0 2 942 16282 394 0 0 2 942 16282 0 399 0 2 949 16289 0 489 0 2 949 16289 0 503 0 2 976 16316 0 553 0 end format %tdnn/dd/CCYY date_bil rename date_bil date_procedure rename ercp_s procedure_1 rename ptbd procedure_2 // GO TO FULLY LONG LAYOUT gen long obs_no = _n reshape long procedure, i(obs_no) string // COUNT NUMBER OF DISTINCT PROCEDURES EACH PATIENT HAS by id procedure, sort: gen long procedure_count = (_n == 1) if !missing(procedure) by id: replace procedure_count = sum(procedure_count) by id: replace procedure_count = procedure_count[_N] // FIND SECOND PROCEDURE DATE by id date_procedure, sort: gen long date_order = (_n == 1) if !missing(date_procedure) by id: replace date_order = sum(date_order) by id: egen second_procedure_date = max(cond(date_order == 2, date_procedure, .)) format second_procedure_date %tdnn/dd/CCYY // IF THERE IS A GOOD REASON TO GO BACK TO ORIGINAL LAYOUT & VARIABLE NAMES: reshape wide rename date_procedure date_bil rename procedure_1 ercp_s rename procedure_2 ptbd

It produces correct results with this example data. (One caveat: I notice that in variables ercp_s and ptbd, most of the codes are 3 digit numbers, but some are 0. If 0 is not a valid code but instead represents a flag for "nothing" or "missing" or something like that, then this code will not be quite right, because it counts 0 as a valid procedure code. If that is the case, change

Code:

by id procedure, sort: gen long procedure_count = (_n == 1) if !missing(procedure) // AND by id date_procedure, sort: gen long date_order = (_n == 1) if !missing(date_procedure)

to

Code:

by id procedure, sort: gen long procedure_count = (_n == 1) if !missing(procedure) /// & procedure != 0 // AND gsort id date_procedure -procedure by id date_procedure: gen long date_order = (_n == 1) if !missing(date_procedure) /// & procedure != 0 // RESPECTIVELY.

The code also ignores the variable sbb. If sbb is, in fact, a third procedure code variable, then near the top of the code, where the first group of -rename- commands is found, then add another one: -rename sbb procedure_3-. And at the end of the code add another command: -rename procedure_3 sbb-

Added: If 0 is a code for "no procedure" or "missing" or the like, you can make the changes in the code I suggested for that. But actually, it would be better just at the beginning, after the first set of -rename- commands to add:

Code:

mvdecode procedure*, mv(0)

which changes all of those 0's to system missing values and not make the changes I recommended above for dealing with 0 procedure codes. The reason I say that is that in Stata, it is usually a recipe for problems to use actual number as a code for missing values. Stata has both system missing and extended missing value built-in codes which it automatically treats as missing values, and this results in the correct handling in most situations. When you use an actual number as a proxy for missing values, you have to constantly add conditions to your commands that exclude observations that have that number in them. Doing that consistently over more than a few lines of code is beyond the capability of most humans, so you end up with incorrect code and garbage results. Use Stata's built-in missing values, not numbers, for missing values. So even though I have shown you how to modify the code to accommodate those zeroes, it is really not a good idea to do that: those zeroes are likely to get you in trouble later anyhow.

Last edited by Clyde Schechter; 06 May 2019, 16:20.
1 like
Comment
Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#5

06 May 2019, 18:30

Hi Clyde,
thank you so much and sorry for the prior unhelpful response. The code worked! I appreciate all your help.
Comment
Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#6

07 May 2019, 17:27

Hi Clyde,

Another question--sorry.

This code worked very well:
// GO TO FULLY LONG LAYOUT gen long obs_no = _n reshape long procedure, i(obs_no) string **I was able to get my data into long format. I had trouble here: //mvdecode procedure*, mv(0) //by id procedure, sort: gen long procedure_count = (_n == 1) if !missing(procedure)
**Specifically, my data set uses multiple hcpcs codes to identify a procedure. For example, procedure _1 in my example below has four different hcpcs codes that identify it. However, only one procedure occurred on that date. The total count for below should be 5. Any suggestions?

Thank you in advance

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float id long procedure_count str2 _j float(date_procedure procedure) 2 9 "_1" 16273 378 2 9 "_1" 16273 379 2 9 "_1" 16273 380 2 9 "_1" 16273 382 2 9 "_2" 16282 390 2 9 "_1" 16282 402 2 9 "_2" 16289 502 2 9 "_2" 16289 503 2 9 "_2" 16316 550 2 9 "_3" 16316 . 2 9 "_1" 16316 . 2 9 "_1" 16289 . 2 9 "_2" 16282 . 2 9 "_3" 16273 . 2 9 "_3" 16282 . 2 9 "_2" 16273 . 2 9 "_3" 16289 . 2 9 "_3" 16273 . 2 9 "_1" 16282 . 2 9 "_3" 16282 . 2 9 "_3" 16273 . 2 9 "_2" 16273 . 2 9 "_3" 16273 . 2 9 "_2" 16273 . 2 9 "_1" 16289 . 2 9 "_3" 16289 . 2 9 "_2" 16273 . end format %tdnn/dd/CCYY date_procedure
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#7

07 May 2019, 20:57

Well, you show 9 different codes which you say correspond actually to only 5 different procedures. To make the code work for that would require information that tells which codes correspond to the same procedure and which are different. In effect, one would have to create a new variable that maps all the variant hcpcs codes into a single encoding of the procedures. Otherwise, I don't see any way to know that the count is 5 rather than any number between 1 and 9: information that identifies codes that are the same procedure is crucial.

I have no experience working with hcpcs codes. Looking at a file that I downloaded from the Centers from Medicare and Medicaid Services (CMS), https://www.cms.gov/Medicare/Coding/...Dir=descending, it is difficult for me to relate it to your situation. First, all of the codes shown in that file begin with one or more letters--there are no purely numeric codes in it. In addition, while on a casual look I see many highly similar or closely related procedures, there are very few codes whose descriptions are actually the same. So it doesn't seem to me that this file would be helpful in identifying when two of your codes refer to the same procedure. Do you have some other source for this information?
Comment
Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#8

08 May 2019, 05:03

Hi Clyde,

Thank you for the message. I do have a variable for each of the procedures (1,2,3) that codes the variant hcpcs codes into one single code. These codes are based on timing of the procedure (after a diagnosis) and which one occurred immediately after the procedure.

My code for one procedure looks like this:
by PATIENT_ID: egen bil_before = min(cond((ercp_s!=0 | ptbd!=0 | sbb!=0) & (date_bil<date_dx & date_bil!=.), seq_o,.)) **excludes patients who had a procedure before diagnosis. ERCP_s is made from multiple hcpcs codes**
by PATIENT_ID: egen first_ERCP = min(cond((ercp_s & ercp_s !=.) & (bil_before !=1), seq_,.)) **first ERCP procedure**
by PATIENT_ID: egen ercp_only = min(cond((first_ERCP !=. & ercp_ptbd == . & ercp_sbb== . & sbb_ptbd == .) & (first_ERCP<first_PTBD) & (first_ERCP<first_SBB) & (bil_before !=1), seq_,.)) **excludes patients who had multiple procedures on the same date and finds the first procedure done for each patient as long as it happened after their diagnosis**

ercp_s is the variable I made from the hcpcs codes (same with ptbd and sbb).

I do not know if showing you this is helpful (probably is not). I am wondering if I can just do a series of codes to find the total number but I don't know how to do it across all observations with my data.

I have hcpcs information attached to this thread. Thank you for any help.

Attached Files

CMS_Procedure_Codes.xlsx (185.2 KB, 1 view)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#9

08 May 2019, 13:22

I can't really follow the code you show in #8. But if you have a way of crosswalking the various hcpcs codes that all correspond to the same procedure into a single code, then the simplest way to proceed is to create new variables that do that. So for example, if 378, 379, 380 and 382 all are codes for the same procedure, and if 502 and 503 also represent the same procedure, create a new variable where 379, 380, and 382 are all replaced by 378, and where 503 is replaced by 502. Then run the code I gave you earlier using these new variables instead of the original hcpcs codes. The code I wrote counts the number of distinct non-missing values per id. So it will give you what you want if you provide it with variables where all of the "synonyms" have been replaced by a single common value.
Comment
Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#10

10 May 2019, 06:15

Hi Clyde,

I just wanted to follow up. Thank you for your last post. It worked!

**Obtain count of procedures**
foreach var of varlist seq_{
gen e = ercp_s if ercp_s !=0 & `var'==`var'[_n-1]
replace e = ercp_s if ercp_s!=0 & e == . & `var'==`var'[_n]
}
order PATIENT_ID e ercp_s seq_o seq_ date_bil
egen proc_1 = min(e) if e!=., by(seq_)
order PATIENT_ID e proc_1
*****************************************

foreach var of varlist seq_{
gen p = ptbd if ptbd !=0 & `var'==`var'[_n-1]
replace p = ptbd if ptbd!=0 & p == . & `var'==`var'[_n]
}
order PATIENT_ID p ptbd seq_o seq_ date_bil
egen proc_2 = min(p) if p!=., by(seq_)
order PATIENT_ID p proc_2
***********************************************

foreach var of varlist seq_{
gen s = sbb if sbb !=0 & `var'==`var'[_n-1]
replace s = sbb if sbb!=0 & p == . & `var'==`var'[_n]
}

order PATIENT_ID s sbb seq_o seq_ date_bil
egen proc_3 = min(s) if s!=., by(seq_)
order PATIENT_ID s proc_3

Once I made a variable that corresponds to the same procedure during a given date, your code worked. I appreciate your help.
Comment

Announcement