
  • How to deal with overlapping obs in healthcare data

    Hi all,

    I have two data sets on healthcare costs during the past 12 months: one at the individual level and the other at the household level. I want to merge the two, but I run into an issue of overlapping observations in the individual data. Specifically, the individual data contain information on: 1) the healthcare centers (public or private) that individuals visit for treatment; 2) the services they use (outpatient or inpatient); 3) costs incurred by outpatient visits; and 4) costs incurred by inpatient visits. There can therefore be cases where an individual used healthcare services more than once during the last 12 months, at different healthcare centers and for different services. For example, suppose someone went to the hospital three times in a given year: the 1st and 2nd time he used outpatient services at public healthcare centers, but the 3rd time he was hospitalized at a private healthcare center (note: this exact pattern may not appear in my example data, but similar things can happen). That creates challenges in generating a unique identifier for each individual.

    My question is how to generate a unique identifier for each individual so that I can merge the individual data set into the household one without dropping any variables in both data sets. Any help is highly appreciated.

    Note: prid comid hhid uniquely identify observations in the household data.
    prid comid hhid invid should uniquely identify observations in the individual data but, because of the overlapping issue, they do not.

    Individual data
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int prid long(comid hhid invid opcost ipcost) float hcenter byte service
    101 1010103 14101 1410101 1000     0 1 0
    101 1010103 14102 1410201  500     0 1 0
    101 1010103 14102 1410201  700     0 0 0
    101 1010103 14103 1410301  500     0 1 0
    101 1010103 14103 1410301    0  1000 1 1
    101 1010103 14103 1410303    0  3000 1 1
    101 1010103 14103 1410303  300     0 0 0
    101 1010103 14104 1410401 1500     0 1 0
    101 1010109  2002  200201   90     0 1 0
    101 1010109  2011  201101  200     0 0 0
    101 1010109  2011  201102  210     0 0 0
    101 1010109  2011  201103  500     0 1 0
    101 1010109  2011  201104  300     0 0 0
    101 1010109  2013  201302  150     0 1 0
    101 1010109  2014  201403   20     0 1 0
    101 1010109  2014  201403  480     0 0 0
    101 1010109  2014  201403  800     0 0 0
    101 1010109  2014  201404    .   400 1 1
    101 1010109  2019  201901    .   200 1 1
    101 1010109  2020  202002    .  5000 1 1
    101 1010109  2021  202102  700     0 0 0
    101 1010109  2021  202102  500     0 0 0
    101 1010109  2021  202104  300     0 0 0
    101 1010109  2022  202201    0     0 1 0
    101 1010115  5006  500601 2000     0 1 0
    101 1010115  5006  500603   15     0 1 0
    101 1010115  5006  500604   15     0 1 0
    101 1010115  5007  500701    .   300 1 1
    101 1010115  5007  500701    .   400 1 1
    101 1010115  5008  500801    .   315 1 1
    101 1010115  5008  500801    .   210 1 1
    101 1010115  5008  500801    .   315 1 1
    101 1010115  5008  500801    .   315 1 1
    101 1010115  5008  500802  200     0 1 0
    101 1010115  5010  501001  400     0 1 0
    101 1010115  5010  501001    .   500 1 1
    101 1010115  5011  501101 3000     0 1 0
    101 1010115  5011  501102 1500     0 1 0
    101 1010115  5012  501204  195     0 1 0
    101 1010115  5012  501204  220     0 1 0
    101 1010115  5013  501304    .   600 1 1
    101 1010115  5013  501304    .   200 1 1
    101 1010115  5015  501501    .   100 1 1
    101 1010115  5015  501501    .   120 1 1
    101 1010115  5015  501501    .   130 1 1
    101 1010115  5015  501503  280     0 1 0
    101 1010115  5016  501601  335     0 1 0
    101 1010115  5016  501603  335     0 1 0
    101 1010115  5016  501604  335     0 1 0
    101 1010115  5016  501605  335     0 1 0
    101 1010115  5016  501606  440     0 1 0
    101 1010115  5020  502002    .  1100 1 1
    101 1010115  5020  502002 1000     0 0 0
    101 1010115  5020  502003  150     0 0 0
    101 1010115  5020  502003    .  1500 1 1
    101 1010115  5020  502003   30     0 1 0
    101 1010115  5020  502004   30     0 1 0
    101 1010115  5020  502004    .  1500 1 1
    101 1010115  5020  502004  150     0 0 0
    101 1010115  5020  502005   30     0 1 0
    101 1010115  5103  510301    0  1000 1 1
    101 1010115  5106  510601   80     0 1 0
    101 1010115  5107  510702  200     0 1 0
    101 1010115  5108  510801   80     0 1 0
    101 1010115  5108  510802   30     0 1 0
    101 1010115  5109  510901  200     0 1 0
    101 1010117 27101 2710101   50     0 1 0
    101 1010117 27101 2710102  110     0 1 0
    101 1010117 27103 2710301  300     0 0 0
    101 1010117 27103 2710304  550     0 0 0
    101 1010117 27104 2710401   20     0 1 0
    101 1010117 27104 2710402   40     0 1 0
    101 1010117 27105 2710501  300     0 1 0
    101 1010117 27105 2710503 1150     0 0 0
    101 1010117 27105 2710505  850     0 0 0
    101 1010117 27106 2710602  500     0 1 0
    101 1010123  5101  510101    0  2500 1 1
    101 1010123  5102  510201  500     0 0 0
    101 1010123  5102  510202 2000     0 0 0
    101 1010123  5104  510401   80     0 1 0
    101 1010303  5101  510101  100     0 1 0
    101 1010303  5101  510102  100     0 1 0
    101 1010311  1102  110205  500     0 1 0
    101 1010311  1102  110206  600     0 1 0
    101 1010311  1104  110401    0  1000 1 1
    101 1010313 13001 1300102  400     0 1 0
    101 1010313 13002 1300202 2000     0 1 0
    101 1010313 13004 1300403  500     0 0 0
    101 1010313 13006 1300602    .  4000 1 1
    101 1010313 13009 1300901  100     0 1 0
    101 1010313 13012 1301201  100     0 0 0
    101 1010313 13013 1301302  200     0 1 0
    101 1010313 13014 1301402  500     0 1 0
    101 1010313 13016 1301603    . 35000 1 1
    101 1010313 13019 1301902    .   120 1 1
    101 1010313 13101 1310102  200     0 1 0
    101 1010313 13103 1310301    0  3000 1 1
    101 1010313 13103 1310302    0  2000 1 1
    101 1010313 13105 1310504    0   500 1 1
    101 1010503  4101  410103    0   200 1 1
    end
    label values hcenter hcenter
    label values service service
    Household data
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int prid long(comid hhid topcost tipcost)
    101 1010103 14101 1000     0
    101 1010103 14102 1200     0
    101 1010103 14103  800  4000
    101 1010103 14104 1500     0
    101 1010103 14105    0     0
    101 1010109  2001    0     0
    101 1010109  2002   90     0
    101 1010109  2004    0     0
    101 1010109  2006    0     0
    101 1010109  2007    0     0
    101 1010109  2008    0     0
    101 1010109  2009    0     0
    101 1010109  2010    0     0
    101 1010109  2011 1210     0
    101 1010109  2012    0     0
    101 1010109  2013  150     0
    101 1010109  2014 1300   400
    101 1010109  2015    0     0
    101 1010109  2016    0     0
    101 1010109  2019    0   200
    101 1010109  2020    0  5000
    101 1010109  2021 1500     0
    101 1010109  2022    0     0
    101 1010109  2023    0     0
    101 1010109  2024    0     0
    101 1010115  5001    0     0
    101 1010115  5002    0     0
    101 1010115  5003    0     0
    101 1010115  5004    0     0
    101 1010115  5005    0     0
    101 1010115  5006 2030     0
    101 1010115  5007    0   700
    101 1010115  5008  200  1155
    101 1010115  5009    0     0
    101 1010115  5010  400   500
    101 1010115  5011 4500     0
    101 1010115  5012  415     0
    101 1010115  5013    0   800
    101 1010115  5014    0     0
    101 1010115  5015  280   350
    101 1010115  5016 1780     0
    101 1010115  5017    0     0
    101 1010115  5018    0     0
    101 1010115  5019    0     0
    101 1010115  5020 1390  4100
    101 1010115  5103    0  1000
    101 1010115  5106   80     0
    101 1010115  5107  200     0
    101 1010115  5108  110     0
    101 1010115  5109  200     0
    101 1010117 27101  160     0
    101 1010117 27103  850     0
    101 1010117 27104   60     0
    101 1010117 27105 2300     0
    101 1010117 27106  500     0
    101 1010123  5101    0  2500
    101 1010123  5102 2500     0
    101 1010123  5103    0     0
    101 1010123  5104   80     0
    101 1010123  5105    0     0
    101 1010303  5101  200     0
    101 1010303  5102    0     0
    101 1010303  5103    0     0
    101 1010303  5104    0     0
    101 1010303  5105    0     0
    101 1010311  1101    0     0
    101 1010311  1102 1100     0
    101 1010311  1103    0     0
    101 1010311  1104    0  1000
    101 1010311  1105    0     0
    101 1010313 13001  400     0
    101 1010313 13002 2000     0
    101 1010313 13003    0     0
    101 1010313 13004  500     0
    101 1010313 13005    0     0
    101 1010313 13006    0  4000
    101 1010313 13007    0     0
    101 1010313 13008    0     0
    101 1010313 13009  100     0
    101 1010313 13010    0     0
    101 1010313 13011    0     0
    101 1010313 13012  100     0
    101 1010313 13013  200     0
    101 1010313 13014  500     0
    101 1010313 13015    0     0
    101 1010313 13016    0 35000
    101 1010313 13017    0     0
    101 1010313 13018    0     0
    101 1010313 13019    0   120
    101 1010313 13020    0     0
    101 1010313 13101  200     0
    101 1010313 13102    0     0
    101 1010313 13103    0  5000
    101 1010313 13104    0     0
    101 1010313 13105    0   500
    101 1010503  4101    0   200
    101 1010503  4103    0     0
    101 1010503  4104    0     0
    101 1010503  4106  200     0
    101 1010503  4107    0     0
    end
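    For checking keys like the ones described above, -isid- confirms (or refutes) unique identification directly, and -duplicates report- shows how many repeats there are. A minimal sketch, assuming the two example data sets have been saved as individual.dta and household.dta (hypothetical file names):

    Code:
    use household, clear
    isid prid comid hhid                          // no error: these uniquely identify households
    use individual, clear
    capture noisily isid prid comid hhid invid    // errors here because of the repeated visits
    duplicates report prid comid hhid invid       // tallies how many observations share a key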

  • #2
    Dear Professor Clyde Schechter,

    Do you have any suggestions on how to deal with such a type of data structure?

    Thank you.



    • #3
      Well, you actually are raising two separate issues. One is how to combine the two data sets, and the other is how to create a unique identifier for observations in the individual level data set. But you don't need a unique identifier in the individual data set in order to do this merge:

      Code:
      use individual_dataset, clear
      merge m:1 prid comid hhid using household_dataset
      When doing a -merge-, you only need one of the data sets to be uniquely identified by the linking variables. 1:m and m:1 merges are valid and accepted by the -merge- command.

      Now, you might need a unique identifier in the individual data set for some other reason. There is no apparent other variable that would help create a unique identifier in the example data you show. So the simplest resort is to create a sequential variable within the prid comid hhid invid groups:

      Code:
      use individual_dataset, clear
      by prid comid hhid invid, sort: gen seq = _n
      Now prid comid hhid invid seq uniquely identify observations.

      Another way to create a unique identifier, but one that will not "play" with the usual linking variables is to just do a sequential variable for all observations:
      Code:
      gen long seq = _n
      Please note: it is not a good idea to address a post to a particular responder. There may be others who would be able to respond to your question sooner than I get to it, or provide a better response. By addressing it to me, you discourage them from even thinking about it. So it is in your best interests when posting a question to not mention any particular person. The only exception might be if you are responding to something that I specifically said and your question seeks an explanation of what I meant; then it makes sense to address your question specifically to me. But even then, sometimes somebody else might be able to respond sooner or better.
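      As a sanity check on either approach, -isid- and a tabulation of _merge make the result explicit (a sketch using the same hypothetical file names as above):

      Code:
      use individual_dataset, clear
      by prid comid hhid invid, sort: gen seq = _n
      isid prid comid hhid invid seq               // errors out if the key is still not unique
      merge m:1 prid comid hhid using household_dataset
      tab _merge                                   // shows matched vs. unmatched observations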



      • #4
        Jason,

        I also work in health services research. You referred to the first dataset as an individual one. However, the unit of observation there is not the individual. You actually have a claim or encounter as the unit of observation, i.e. each row refers to one time that the person involved had an outpatient or inpatient visit.

        What's your ultimate goal? You could take the first dataset, and summarize it to the household level with a count of total outpatient visits, count of inpatient visits, total outpatient spending, and total inpatient spending. That looks something like:

        Code:
        gen inpatient_count = ipcost > 0 & ipcost != .
        gen outpatient_count = opcost > 0 & opcost != .
        collapse (sum) outpatient_count inpatient_count opcost ipcost, by(prid comid hhid)
        That ignores whatever the variables hcenter and service do.

        This request needs clarification:

        merge the individual data set into the household one, without dropping any variables in both data sets.
        If the individual data has a many to one relationship with the household data, then after a merge, you basically still have individual data, but you could add household characteristics. In the example presented, there aren't any household characteristics aside from total inpatient and total outpatient cost. Is that what you want? If you took your claim/encounter data and summarized it (the collapse command is the equivalent of PROC SUMMARY in SAS if you're familiar with that) to the individual level, then you could get each person's total spending and count of services.
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Dear Professor Clyde Schechter,

          Thank you so much for your help. I appreciated it. In addition, I will keep in mind your suggestions regarding "mentioning" someone in my future posts.



          • #6
            Dear Weiwen Ng,

            Thank you for your response. Glad to hear that you are also working on health service issues.

            "You actually have a claim or encounter as the unit of observation, i.e. each row refers to one time that the person involved had an outpatient or inpatient visit" - you are right about this.

            The -collapse- command is what I was thinking of, and I had tried a similar approach to yours in #4. However, I also want to keep the hcenter and service variables because I need them for my later analyses, and it seems that -collapse- may not be suitable here (as several scenarios may occur; please see #1). Please correct me if I am wrong.

            Do you have any suggestions for generating the sum of ipcost and opcost while keeping both hcenter and service, and then merging that information into the household data?

            Thank you.
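            One possible alternative to -collapse- for the question in #6 (not something suggested in the thread, just a sketch using the variable names from #1): -egen, total()- within by-groups adds the sums as new variables while keeping every row, so hcenter and service survive intact.

            Code:
            use individual, clear
            * per-individual totals, added as new columns (missing costs count as 0)
            bysort prid comid hhid invid: egen tot_opcost = total(opcost)
            bysort prid comid hhid invid: egen tot_ipcost = total(ipcost)
            * every visit row is retained, so hcenter and service remain available
            merge m:1 prid comid hhid using household, nogenerate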



            • #7
              I think I found a way to handle my issue. For the hcenter variable, my approach is to first generate the total number of visits to public and to private health centers. Then I generate a variable based on those totals indicating whether an individual visited a public center, a private center, or a combination of the two.

              For ipcost and opcost, I used a similar approach to yours in #4 and then merged the two data sets together. However, someone may well have a better approach than mine.

              For now, I think I may not need the service variable.

              Code:
              use individual, clear
              preserve
              * Flag visits to public and to private health centers
              gen pub  = hcenter == 1
              gen priv = hcenter == 0
              * Count public and private visits per individual
              collapse (sum) pub priv, by(prid comid hhid invid)
              gen wanted = .
                  replace wanted = 1 if pub >  0 & priv == 0
                  replace wanted = 0 if pub == 0 & priv >  0
                  replace wanted = 2 if pub >  0 & priv >  0

                  la var wanted "Healthcare centers"
                  la define wanted 0 "Private center" 1 "Public center" 2 "Both public and private centers"
                  la values wanted wanted
                  tempfile temp
              save `temp'
              restore

              // For ipcost and opcost
                  collapse (sum) ipcost opcost, by(prid comid hhid invid)
              tempfile temp1
              save `temp1'

              // Merge the 2 data sets
              use `temp', clear
                   merge 1:1 prid comid hhid invid using `temp1', keep(master match) nogenerate

