
  • merge, joinby, or... nothing?

    I use Stata 13.1 on a Mac and have obtained data from the National Patient Registry in Denmark, which means that the data come in multiple datasets. Each observation has a recordid which is unique to the hospital contact, i.e. if you were admitted to hospital, all the tests that you were exposed to during the admission would belong to that one contact. Only one of the tables also includes a personid uniquely identifying each person in the dataset.
    Table  Contents                             Examples
    Adm    Administrative information           Date of admission, discharge, primary diagnosis, hospital department, and personid
    Sur    Surgical procedures                  Date of surgery, surgical procedures, type of procedure, additional procedures, and hospital department
    Dit    Diagnostic procedures and treatment  Diagnostic tests, diagnostic procedures, type of procedure, additional procedures, and hospital department
    Vit    Vital status                         Date of birth, vital status of the person
    Dia    Diagnosis for contact                Temporary (or permanent) diagnosis, type of diagnosis, additional diagnoses, and hospital department
    I believe the format of the data is long, i.e. each participant can be registered on several occasions, with each occasion constituting one observation.
    Now, my dependent variables are the diagnoses and survival, and my independent variables are the diagnostic and surgical procedures that each individual was exposed to. So, in order to conduct the regression analyses, I really need the data to be in one file. But the only common variable is recordid, which is not unique. And according to the Stata Manual,
    First, if you think you need to perform an m:m merge, then we suspect you are wrong.
    And I understand why. But it is not completely clear to me whether joinby has the same unattractive features (being dependent on the current sort order and potentially ruining my data). I have tried to use joinby (see below), but I am not sure whether it did something different from what an m:m merge would do.

    joinby recordid using "filename", unmatched(using)

    Finally, my question is whether I can use joinby as above. Or should I try to reshape the data to wide to obtain unique ids and then merge 1:1? Or is there perhaps some other solution that I have not considered? Or will I have to drop the thought of regressions?

  • #2
    I can imagine a variety of data structures, so you will need to be clearer about how the dataset you want will ultimately look, and details about the analysis. What is the unit of analysis (person or diagnosis or recordid)? I assume individuals can have multiple diagnoses both over time and at a single visit. Do you want to retain a panel structure so that each individual has multiple observations? Or do you want all the information in wide format?

    Also, I think I need more info about the dataset. Does recordid uniquely identify the data in the ADM dataset, or is recordid repeated? Does the VIT dataset contain personid and is it collected for each visit? It may be that there are multiple variables that identify the datasets, and that they may vary for each merge.

    The more information you can provide about the final dataset and the current datasets, the better. Ultimately, the way you want the dataset to look should be informed by the way you conceptualize your analysis.
    Last edited by Carole J. Wilson; 02 Mar 2016, 07:17. Reason: Edited to request more info about the data
    Stata/MP 14.1 (64-bit x86-64)
    Revision 19 May 2016
    Win 8.1



    • #3
      I agree that more detail on what you want is needed. But, taking a bit of a stab in the dark: this sounds to me as though you would want to -append- the various files that have only a recordid. Then you could spread the personid to all the related recordid observations with: bysort recordid (personid): replace personid = personid[1]
      This trick presumes that your personid is numeric, so that missing values sort high.

      You allude to some general issue with sort order, and I don't know what that is, but you could capture the current sort order with generate int seqnum = _n
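
      A minimal sketch of that approach (the file names adm.dta, sur.dta, dit.dta, and dia.dta are illustrative, and personid is assumed to be numeric):

      Code:
      * adm.dta is assumed to be the file that carries personid
      use adm, clear
      append using sur dit dia      // stack the files that share only recordid
      generate long seqnum = _n     // capture the current order, in case it matters later
      bysort recordid (personid): replace personid = personid[1]   // spread personid to every observation of the contact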


      This gives a file structure that is long, but -reshape- could make it wide, with personid as the observation identifier, if that's what you need.


      Hope this is of some use.



      • #4
        It may be that the previous responses have answered your question. If not, I suggest that you post some examples of the data from each of the data sets you want to combine, and then show us what you would want the combined data set to look like. I realize that the real data is confidential and you cannot show us any of that, but you can make up things that look similar for the purpose of showing how the data is laid out.

        As for whether -joinby- depends on the sort order, it does not. What -joinby varlist- does is pair each observation in the master data set with particular values of varlist with every observation in the using data set that has the same values of varlist. As such, the sort orders of the data sets make no difference at all. One consequence of this is that, in general, the result of -joinby- will be a data set with many more observations than either the master or the using.

        -merge m:m varlist- works very differently: it tries to pair up observations in the master data set with particular values of varlist with observations from the using data set having those same values, but it does so by pairing them up in the order they appear in the data sets, and if it runs out of values in one of the data sets, it just copies the last available observation over again to fill in. This is clearly sort-order dependent and, unless the data sets have been carefully matched up and sorted in advance, it produces nonsensical results. The results of -merge m:m- will have the same number of observations as the larger of the master or using data set.
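
        A toy illustration of the -joinby- pairing behavior (invented two-observation data sets, not from the registry):

        Code:
        * id 1 appears twice in each data set; joinby forms every pairing (2 x 2 = 4 observations)
        clear
        input id str2 b
        1 "b1"
        1 "b2"
        end
        tempfile second
        save `second'

        clear
        input id str2 a
        1 "a1"
        1 "a2"
        end
        joinby id using `second'
        list, sepby(id)     // 4 rows: every a paired with every b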

        Hope this helps.
        Last edited by Clyde Schechter; 02 Mar 2016, 08:30. Reason: Use slightly clearer language.



        • #5
          Dear all,
          Thank you so much for your kind replies in spite of my not having provided sufficient information. I have tried to answer your questions below and inserted examples of data in which I swapped values between observations, so although the values are real, they have been manipulated. Please let me know if this is not allowed!

          @Carole
          In ADM, the personid AND recordid are both unique. In the remaining datasets, none of the variables are unique.
          Individuals may have more than one diagnosis per recordid and over time.
          Unit of analysis: one person
          I want my data to be organized like this:
          personid recordid diag1 hospitaldepartment admissiondate dischargedate diag2 diagnostic diagnosticdate surgery surgerydate vitalstatus vitaldate ...

          @Mike
          Your suggestion might work! I will try that - thanks a lot!

          @Clyde
          I hope the presentation above of how I would like the data organized is what you requested?
          Thanks for explaining the -joinby- command. That is clearly not the way to go. I guess -merge m:m- was already ruled out.

          Please note that all values are strings, as that was the format I received them in. I change those to something useful when working with the data. Also, I concatenate hospital and department to get hospitaldepartment.
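
          For example (a sketch; the separator in punct() is an arbitrary choice):

          Code:
          * combine the two string identifiers into one variable
          egen hospitaldepartment = concat(hospital department), punct("-")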

          ADMINISTRATIVE DATA

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str8  personid str23 recordid str8 diag1 str5(hospital department) str10(admissiondate dischargedate)
          "153260" "10000006595" "DC619"  "1401" "089" "2011-10-07" "2011-10-09"
          "148242" "10000011257" "DZ081J" "1401" "089" "2010-01-29" "NULL"
          "2041" "10000030438" "DC329"  "1301" "729" "2012-04-10" "NULL"
          "100937" "10000030809" "DI341"  "1301" "039" "2010-03-01" "2010-03-11"
          "77687" "10000032140" "DN979"  "1301" "525" "2012-08-31" "2012-09-10"
          "130847" "10000032561" "DZ031J" "1401" "089" "2009-02-28" "NULL"
          "20887" "10000033276" "NULL"   "1301" "039" "2013-05-28" "NULL"
          "89207" "10000034413" "DC619"  "1401" "089" "2009-02-28" "2009-03-01"
          "39870" "10000034592" "DM060"  "1401" "239" "2009-02-28" "2009-02-28"
          "77846" "10000035037" "DZ031J" "1401" "089" "2009-02-28" "NULL"
          end
          DIAGNOSIS FOR CONTACT
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str11 recordid str8 diag2
          "9785211016"  "DC443" 
          "9792782657"  "DT689" 
          "9499467565"  "DZ080R"
          "9972283065"  "DS525" 
          "9790254183"  "DZ080" 
          "9979473438"  "DE119" 
          "9811987379"  "DZ031B"
          "9844116886"  "DZ016" 
          "9804951482"  "DZ489" 
          "10182134667" "DM171" 
          end
          DIAGNOSTIC PROCEDURES AND TREATMENTS
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str11 recordid str9 diagnostic str6(department hospital) str10 diagnosticdate
          "9531224701" "UXRG10"  "040" "4202" "2013-03-17"
          "9531224701" "ZZ0172X" "446" "4202" "2013-03-17"
          "9531224701" "UXRG15"  "040" "4202" "2013-03-17"
          "9531224701" "UXRG15"  "040" "4202" "2013-03-17"
          "9727163145" "UXCC00"  "021" "7005" "2013-03-22"
          "9934470933" "AAF22"   "A0L" "3800" "2013-01-29"
          "9934470933" "ZZ2100"  "A0L" "3800" "2012-10-09"
          "9934470933" "BVAA33A" "A0L" "3800" "2012-10-18"
          "9934470933" "AAF22"   "A0L" "3800" "2012-10-12"
          "9934470933" "AAF22"   "A0L" "3800" "2012-11-01"
          end
          SURGICAL PROCEDURES
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str11 recordid str8 surgery str6(department hospital) str10 surgerydate
          "9088836149"  "KUDB22" "S6D" "3800" "2012-11-14"
          "10022501608" "KUJF32" "07D" "6006" "2013-05-21"
          "9738929178"  "KNBU02" "321" "6630" "2013-02-18"
          "10271931116" "KFNG05" "031" "8001" "2013-06-14"
          "9759947898"  "KYQF10" "669" "1301" "2013-04-22"
          "9666849305"  "KUDB22" "239" "8001" "2013-02-08"
          "9181198726"  "KUDB22" "41B" "6650" "2013-01-25"
          "9506206495"  "KUJD05" "H0G" "3800" "2013-02-22"
          "9034690922"  "KFNG05" "039" "1301" "2013-01-02"
          "9803640274"  "KNGF11" "48A" "5501" "2013-01-08"
          end

          VITAL STATUS
          I can handle this one, as the observations are unique for each individual.

          Thanks again! I hope this informs you better in case the experiment with -append- does not work. I will keep you updated.



          • #6
            So, taken literally, your desired endpoint is not compatible with the data sets you have. The problem, which you have already recognized, is that you have multiple observations of diagnostic procedures and treatments for the same recordid. Your data don't show multiple observations on surgical procedures for the same recordid, but it is difficult to imagine that there wouldn't be at least some instances of that.

            So you need to make a decision. I see three paths.

            1. You can expand the final single record per recordid to include multiple diagnostic procedures and treatments, so it would look like
            personid recordid diag1 hospitaldepartment admissiondate dischargedate diag2 diagnostic1 diagnostic1_department diagnostic1_hospital diagnostic1_date diagnostic2 diagnostic2_department diagnostic2_hospital diagnostic2_date diagnostic3 diagnostic3_department diagnostic3_hospital diagnostic3_date ... surgery surgerydate vitalstatus vitaldate ...

            and it may be that more than three series of diagnostic procedure information will be involved, and there may also be multiple surgery series as well. If you decide this is what you want, you will need to -reshape wide- the diagnostic procedures and possibly the surgical procedures data sets, and then combine all of the data sets with a series of 1:1 merges. This approach will give you a very wide data set.
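
            A sketch of that route for the diagnostic procedures file (the file names dit.dta and adm.dta are illustrative; the generated variable names differ slightly from the layout above):

            Code:
            use dit, clear
            bysort recordid (diagnosticdate): generate seq = _n      // number the procedures within each contact
            reshape wide diagnostic department hospital diagnosticdate, i(recordid) j(seq)
            tempfile dit_wide
            save `dit_wide'

            use adm, clear                                   // one observation per recordid
            merge 1:1 recordid using `dit_wide', nogenerate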

            2. You may want to pick a single diagnostic procedure from the complete list for each recordid. Perhaps the first, or the last, or one at random, or perhaps give preference to certain particular procedures. This is complicated, and doesn't fit with my experience in health services research, but depending on what you're doing, it might be suitable. If so, it is just a matter of selecting the one observation for each recordid from each data set, and then a series of 1:1 merges.
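
            For instance, keeping the earliest procedure per recordid might look like this (a sketch; "earliest" stands in for whatever selection rule you actually use, and the file names are illustrative):

            Code:
            use dit, clear
            bysort recordid (diagnosticdate): keep if _n == 1        // one (here, the earliest) procedure per contact
            tempfile dit_one
            save `dit_one'

            use adm, clear
            merge 1:1 recordid using `dit_one', nogenerate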

            3. You can organize the data with each diagnostic procedure (and each surgery, if there are several) in a separate record, with the constant information (recordid, diag1, diag2, admission and discharge dates, vital status and vital status date) repeated in each observation. This would be the long layout of your data. To create this, you would begin by renaming the variables in the diagnostic procedures and surgery data sets to be the same, e.g. recordid procedure department hospital procedure_date, and add another variable, procedure_type, with different codes for diagnostic procedures and surgery. Then append those two data sets to each other and save the appended result. Then do a 1:1 merge of the administrative and diagnosis for contact data sets, and then do a 1:m merge of that with the appended diagnostic/surgery file.
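
            A sketch of that sequence, with illustrative file names (adm.dta, dia.dta, dit.dta, sur.dta) and with the procedure-level hospital and department renamed so they do not clash with the admission-level ones:

            Code:
            use dit, clear
            rename (diagnostic diagnosticdate) (procedure procedure_date)
            rename (department hospital) (procedure_department procedure_hospital)
            generate procedure_type = "diagnostic"
            tempfile proc
            save `proc'

            use sur, clear
            rename (surgery surgerydate) (procedure procedure_date)
            rename (department hospital) (procedure_department procedure_hospital)
            generate procedure_type = "surgery"
            append using `proc'
            tempfile procedures
            save `procedures'

            use adm, clear
            merge 1:1 recordid using dia, nogenerate          // assumes one diag2 per recordid; otherwise use 1:m
            merge 1:m recordid using `procedures', nogenerate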

            Which of these approaches will work best for you depends on how you will be using this data set once you have it. If you plan a lot of graphical work, a wide layout may be preferable. For most analyses, however, the long layout usually makes life easier.



            • #7
              Dear Clyde,

              Thank you so much! Now everything makes sense. What you don't know is that I have already spent quite some time (and used two different methods) identifying the 'most relevant' diagnostic and surgical procedures. I just didn't know how to get from that to the analyses I wanted to conduct. But your suggestion #2 seems to fit that. Only one aspect now worries me. You wrote that it:

              doesn't fit with my experience in health services research
              So I will go forward with both suggestions #2 and #3, just to be safe, because I think I need the long layout (at least I know that I prefer long over wide for analyses), and both #2 and #3 seem to provide that. Also, trying two methods will improve my Stata skills, which always seems to be a useful investment of time.
