Counting number of specific observations in a group of variables by unique id

Dan Feldman

Join Date: Oct 2019

Posts: 7
#1

Counting number of specific observations in a group of variables by unique id

10 Oct 2019, 11:59

Hi all- first post. I'm using Stata 15.1 on a Mac. I'm working with health claims data with the goal to produce a patient's multi-morbidity count using 1 or 2 generated variables. I've attached toy data below to help.

claim = claim number
id = patient id
dx1 - dx4 = medical diagnosis

For each patient id, I want Stata to scan variables dx1 dx2 dx3 dx4 across all of the patient's claims and return a count of the number of specific diagnoses that appear at least once.

For example, let's say diagnoses A, B, and C are of interest. Patient 1 has been given an A, B, and C diagnosis at least once, so I would want the command to return the value of "3" for patient 1. Patient 2 has been given the diagnosis of A and B at least once, but not C, so I would want the command to return the value of "2".

I've looked through a number of egen possibilities, including tag, but none of them seem to help. Any suggestions?

Code:

clear input str2 claim float id str1(dx1 dx2 dx3 dx4) "1" 1 "C" "T" "E" "B" "2" 1 "C" "U" "E" "B" "3" 1 "C" "F" "E" "" "4" 1 "B" "A" "E" "" "5" 1 "B" "A" "" "" "6" 1 "C" "E" "" "" "7" 1 "D" "" "" "" "8" 1 "D" "" "" "" "9" 2 "D" "B" "T" "E" "10" 2 "D" "B" "T" "E" "11" 2 "D" "T" "E" "" "12" 2 "A" "T" "" "" "13" 2 "B" "T" "" "" "14" 2 "B" "" "" "" "15" 2 "B" "" "" "" end
Tags: None

Jenny Williams

Join Date: Sep 2017
Posts: 35

10 Oct 2019, 12:43

Since the diagnosis columns are not sorted in any order, I think the easiest would be to do this:

Code:

/* claim level */

generate _temp_a=1 if dx1=="A" | dx2=="A" | dx3=="A" | dx4=="A"
generate _temp_b=1 if dx1=="B" | dx2=="B" | dx3=="B" | dx4=="B"
generate _temp_c=1 if dx1=="C" | dx2=="C" | dx3=="C" | dx4=="C"

/* patient level */

foreach var in _temp_a _temp_b _temp_c {
egen `var'_patient=sum(`var'), by(id)
replace `var'_patient=1 if `var'_patient>1 & `var'_patient!=.
}

/* final categories */

generate dxcategory=3 if _temp_a_patient==1 & _temp_b_patient==1 & _temp_c_patient==1
replace dxcategory=2 if _temp_a_patient==1 & _temp_b_patient==1 & _temp_c_patient!=1

Last edited by Jenny Williams; 10 Oct 2019, 12:55.

Comment

Dan Feldman

Join Date: Oct 2019

Posts: 7
#3

10 Oct 2019, 13:18

Thanks Jenny - that works. Since I have 28 different diagnoses that can be considered in the multi-morbidity total, I wonder if your final line of code could simplify to:

Code:

egen dxcategory = rowtotal(_has_a_patient _has_b_patient _has_c_patient)

It seems to give the same result.
Comment
Jenny Williams

Join Date: Sep 2017

Posts: 35
#4

10 Oct 2019, 13:51

Absolutely.

One side caution is that using "egen rowtotal" will allow missing values in the count, whereas using "dxcategory=_temp_a_patient + _temp_b_patient + _temp_c_patient" will only count rows where the all of the variables are non-missing.

You can also save some space in the first line of code (if the codes for each condition are simple letters or coded numeric values):

Code:

foreach condition in A B C D E F Z { generate _temp_`condition'=1 if dx1=="`condition'" | dx2=="`condition'" | dx3=="`condition'" | dx4=="`condition'" | dx5=="`condition'" | dx6=="`condition'" /*...repeating all the way to dx28*/ }

Last edited by Jenny Williams; 10 Oct 2019, 14:03.
Comment

Wouter Wakker

Join Date: Nov 2018
Posts: 621

10 Oct 2019, 14:02

There is a small error in Jenny's code.

Code:

replace `var'_patient=1 if `var'_patient>1 & `var'_patient!=.

should be

Code:

replace `var'_patient=1 if `var'_patient>0 & `var'_patient!=.

You could also do it like this. It's based on the same idea but the code is a bit simpler.

Code:

gen dx_all = dx1 + dx2 + dx3 + dx4

gen count = 0
foreach n in A B C {
    bysort id: egen `n' = max(strpos(dx_all, "`n'")>0)
    replace count = count + `n'
    drop `n'
    }

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35646
#6

10 Oct 2019, 14:08

There is a dedicated egen function to do this in egenmore (SSC). Written up in my rowwise column in the Stata Journal.
Comment
Dan Feldman

Join Date: Oct 2019

Posts: 7
#7

10 Oct 2019, 19:03

Jenny, Wouter, and Nick: Thank you all for your input, I think I have enough to move forward with my work and I'm very pleased wit hthe result of my first Statalist post. A few last comments if I may:

Wouter: I think Jenny had it right to begin with using:

Code:

replace `var'_patient=1 if `var'_patient>1 & `var'_patient!=.

This is because the preceding line of code creates variable that return either a 1, an integer >1, or a 0.
Since, in this context, integers >1 have the same meaning as true 1's, the code only needs to replace numbers >1.

Jenny:

Code:

egen dxcategory = rowtotal(_temp_a_patient _temp_b_patient _temp_c_patient)

Rowtotal works because the "patient level" code that generated the column totals for each patient returns a "0" when it totals a column containing all missing values. In other words, temp_a_patient _temp_b_patient _temp_c_patient do not contain missing values, even if _temp_a _temp_b _temp_c do. At least that's how it's executing on my end. Nick: Thanks for the reference!
Comment
Wouter Wakker

Join Date: Nov 2018

Posts: 621
#8

11 Oct 2019, 02:48

Wouter: I think Jenny had it right to begin with

This is true indeed, I stand corrected.

Glad you found a solution.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35646

11 Oct 2019, 03:01

Sorry: #6 is not pertinent. I misread the problem badly,

To compensate, here is another way to think about it. You might want to pepper the code with list commands if it's not clear what is happening.

Code:

clear
input str2 claim float id str1(dx1 dx2 dx3 dx4)
"1"  1 "C" "T" "E" "B"
"2"  1 "C" "U" "E" "B"
"3"  1 "C" "F" "E" ""
"4"  1 "B" "A" "E" ""
"5"  1 "B" "A" ""  ""
"6"  1 "C" "E" ""  ""
"7"  1 "D" ""  ""  ""
"8"  1 "D" ""  ""  ""
"9"  2 "D" "B" "T" "E"
"10" 2 "D" "B" "T" "E"
"11" 2 "D" "T" "E" ""
"12" 2 "A" "T" ""  ""
"13" 2 "B" "T" ""  ""
"14" 2 "B" ""  ""  ""
"15" 2 "B" ""  ""  ""
end

save dx 

reshape long dx, i(id claim) j(which)
drop if !inlist(dx, "A", "B", "C")
bysort id dx : drop if _n > 1 
by id: gen wanted = _N 
collapse wanted, by(id)


merge 1:m id  using dx 

tabdisp id, c(wanted)

Comment

Dan Feldman

Join Date: Oct 2019

Posts: 7
#10

11 Oct 2019, 08:53

Thanks Nick -

Also, I also found a good solution at, of all places, a Stata FAQ:

HTML Code:

https://www.stata.com/support/faqs/data-management/check-for-range-of-icd-codes/

Looks like this is the exact code I need, with a few minor modifications, and looks a lot like Jenny's. The icd9 command in Stata is very helpful for this kind of work!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35646
#11

11 Oct 2019, 08:54

I wouldn't have thought of that!
Comment

Announcement