Inner joining a table in Stata

Nghia Nguyen

Join Date: Feb 2017
Posts: 19

Inner joining a table in Stata

09 Feb 2017, 13:58

Creating a new thread for this question:

I was wondering if it would be possible to join a table in Stata onto itself if the data format is already in a long format. Instead of making two copies/tables of the raw data and trying to do a one-to-many merge, is there a good program/function to use to merge a table onto itself and allows for mismatched values to be in separate columns. My example data is below:

ID	Admission Date	DX1	DX2	DX3
1	1/1/2011	573.0	574.0	575.0
1	2/1/2011	573.0	.	571.0
1	3/1/2011	573.0	.	.
2	1/1/2011	573.0	574.0
2	3/1/2011	573.0	574.0	575.0
3	1/1/2011	573.0	.	575.0
3	2/1/2011	573.0	571.0	.
3	3/1/2011	572.0	574.0	.
3	4/1/2011	571.0	574.0	575.0

I would like to create a wide format table with each row denoting an observation. If DX1, DX2, and DX3 values will not match in the new wide format, then will there be new columns to account for mismatches (eg, DX1, DX2, DX3, DX1', DX2', DX3')?

Thank you in advance for the help!

Tags: None

Mike Lacy

Join Date: Apr 2014

Posts: 2411
#2

09 Feb 2017, 14:42

I find myself unable to follow your verbal description, and can't tell what you want to merge to what. You show an example, though not in a form that's likely to help people who would like to help you. See item 12.2 of the FAQ regarding how to present data in this forum. The kind of example I'd find useful would be one showing an example of what the data would be before the change you want, and what it would be after what you want.
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

09 Feb 2017, 14:48

You had an earlier post where you were essentially trying to flag a claim if it had a malnutrition diagnosis. If this post is related to that, then I assume this is related to basically finding out who - not what claims, but which people - have had a malnutrition diagnosis. If that's right, then why do you feel like you need to reshape the file?

How I would normally do something like this is:

1) Flag all claims for the desired diagnosis. Flags can be binary.

2) Collapse the claims by person:

Code:

collapse (max) malnutrition_dx, by(ID) save dx_file.dta, replace

3) Merge that collapsed claim file to your enrollment/beneficiary characteristics/whatever file. So, open that file, then:

Code:

merge 1:1 ID using dx_file.dta

If I'm off base and you really need to make a wide file, then it's not clear to me what you need, and this might be much more easily done in SQL (which I do not know).

Last edited by Weiwen Ng; 09 Feb 2017, 14:52.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment
Nghia Nguyen

Join Date: Feb 2017

Posts: 19
#4

09 Feb 2017, 19:16

Thanks for your suggestions, everyone! My apologies about the code. Will read the FAQ and download the ssc file and submit my code.

Weiwen, I have one master database that I have to subset and partition to flag different events. My primary outcome is hospital readmission. When I was trying to identify malnutrition, I am using that variable as an inclusion criterion. This databases has millions of observations, so I wanted to find a faster way to flag variables.
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

09 Feb 2017, 20:13

OK, but the question remains, why do you need to reshape this into a wide dataset? This is going to be a bit of a challenge, because the distribution of the number of stays per person is likely to be very right skewed, i.e. most people in the inpatient data will have just 1, most people in the denominator if you're merging it to a general population will have 0, but you'll have a few people with 10 stays and maybe even a person or 2 with 100. Reshaping the data wide without knowing how many columns you are going to reshape to is hard, and the number of columns is going to be determined by the number of unique diagnoses that the sickest person got. I mean, maybe this is possible in SQL, but I am struggling to think of how it could be done in Stata.

I'm also struggling to think about why what you probably need can't be accomplished with the data in long format as they are. It might help if you say what your higher-level goal is with this request, and people can see if they can think of a more efficient way to do it.

Speaking of efficiency, I'm not an expert on optimizing programming speed, but in my experience, hospital claims data can be pretty wide, and it can be worth it to eliminate any variables you don't need before flagging diagnostic codes. The dataset's size, after all, is determined by the number of rows * columns. You can't minimize the number of rows until you flag stuff, but you can minimize the number of columns.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment
Nghia Nguyen

Join Date: Feb 2017

Posts: 19
#6

09 Feb 2017, 22:35

I see - your point is well taken! I will have to be very selective with my inclusion criteria and outcomes.
Comment
Nghia Nguyen

Join Date: Feb 2017

Posts: 19
#7

09 Feb 2017, 23:27

Hi Mike,

I downloaded ssc install dataex and have formatted my dataset (values changed for privacy purposes) as seen below:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str6 nrd_visitlink long nrd_daystoevent str5(dx1 dx2 dx3) "000011" 17609 "8900" "920" "78009" "000099" 15782 "4019" "27800" "3542" "000099" 15828 "43491" "3572" "2724" "0000AA" 14265 "42731" "78057" "53081" "0000AA" 14287 "42731" "7231" "V1582" "0000CC" 15664 "5750" "3051" "27800" "0000DD" 19796 "43411" "4439" "41401" "0000DD" 19825 "43491" "2724" "2689" "0000DD" 19889 "4580" "4439" "2689" "0000EE" 11726 "193" "V0382" "53081" end

So I have a dataset with ~14,000,000 variables and ~120 columns. The nrd_visitlink is the unique patient ID that links hospital records across multiple hospital admissions to keep track of a single patient that has multiple hospital admissions. The nrd_daystoevent is a variable that captures the number of days between hospitalizations. Dx1-Dx3 are columns for ICD9 codes. Dx1 is the primary diagnosis that a patient was admitted to the hospital for. Dx2 and Dx3 are their secondary and tertiary diagnoses, respectively. The dataset is currently in long format. I would like to create an index admission variable (the first hospitalization) followed by a readmission variable (the second or third hospitalization) for a given observation. For example, patient "0000DD" has 3 hospital admissions. The index admission for this patient is with an nrd_daystoevent "19796." Subsequent readmissions for the same patient should nrd_daystoevent values greater than "19796." I would like to create a new dataset, in wide format, where each row is a single observation, and not repeated observations like the above dataset. In this new dataset, there will be an index variable (for initial hospitalization) and 1st readmission (variable) and 2nd readmission (variable). For EACH admission, there is an associated primary, secondary, and third diagnosis. The dataset is set up to allow researchers to identify the reasons for the initial/index hospitalization admission (variables Dx1-Dx3) and the reasons for the SUBSEQUENT admissions (variables Dx1-Dx3). My goal is to examine the rates of 30-day readmissions and the diagnoses/reason associated with the readmissions.

I hope that was more clear and thank you again, everyone!
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#8

10 Feb 2017, 08:29

Oh. Then this is more tractable, although it's still a challenge. People shouldn't tend to have a hundred readmissions. One crude way to do this is reshape wide. Using the data you have:

Code:

sort nrd_visitlink nrd_daystoevent by nrd_visitlink: gen n = _n reshape wide dx* nrd_daystoevent, i(nrd_visitlink) j(n)

dx12 is dx1 for the 2nd stay. Anyone who didn't have a 2nd stay has missing dx12. That will get you a dataset that's amenable to visual inspection. However, that could get tedious for a large dataset.

I can't think of a simple way to de-duplicate diagnoses over multiple stays, which is clearly important to you. Hopefully others can chime in. Like I've said, I suspect this is more easily done in SQL, and if you know an SQL programmer or a SAS programmer who knows SQL (SAS has a native SQL interface), you may have better luck.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

10 Feb 2017, 09:01

There seem to be a number of possible issues in your objectives. It appears that 0000DD has two admissions, the first with one 30-day readmission, and the second with no readmisson. Combining all of those on a single wide record would seem to pose problems. Further, you do not identify what happens with a patient who is admitted on days 15000, 15020, 15040, and 15060. Since 15040 is less than 30 days from 15020, does it also count as a readmission? Or since 15020 is a readmission rather than an initial admission, will 15040 have to begin a new admission?

Issues such as these have been discussed at length on Statalist and elsewhere, under the general topic of identifying spells. Unfortunately, the Stata Journal article I would like to reference, http://www.stata-journal.com/article...article=dm0079 by Nick Cox , appears to be recent enough that it is not yet available at no cost. Searching the Statalist archives should lead you to discussions that informed the article. If after doing so you have further questions, I'd recommend starting a new topic with "spells" in the title, since the experts in spells are apparently not looking at this topic about "inner joining"

In general, I think you're headed in the wrong direction with a wide layout, at least initially given the difficult of identifying spells. I think a truly long layout would be where you would want to start from. Starting with the sample data you posted above (thank you for using dataex) and reshaping as follows leaves a dataset that should be tractable for what you need to accomplish. If it's too large (3 times 14,000,000 observations) break it into pieces. Certainly, split off these identifers and diagnosis codes from the remainder of the 120 variables and work with them separately, later merging the other variables back onto your results.

Code:

reshape long dx, i(nrd_visitlink nrd_daystoevent)  j(dx_num)
list, sepby(nrd_visitlink)

Code:

     +--------------------------------------------------+
     | nrd_visitlink   nrd_daystoevent   dx_num      dx |
     |--------------------------------------------------|
  1. |        000011             17609        1    8900 |
  2. |        000011             17609        2     920 |
  3. |        000011             17609        3   78009 |
     |--------------------------------------------------|
  4. |        000099             15782        1    4019 |
  5. |        000099             15782        2   27800 |
  6. |        000099             15782        3    3542 |
  7. |        000099             15828        1   43491 |
  8. |        000099             15828        2    3572 |
  9. |        000099             15828        3    2724 |
     |--------------------------------------------------|
 10. |        0000AA             14265        1   42731 |
 11. |        0000AA             14265        2   78057 |
 12. |        0000AA             14265        3   53081 |
 13. |        0000AA             14287        1   42731 |
 14. |        0000AA             14287        2    7231 |
 15. |        0000AA             14287        3   V1582 |
     |--------------------------------------------------|
 16. |        0000CC             15664        1    5750 |
 17. |        0000CC             15664        2    3051 |
 18. |        0000CC             15664        3   27800 |
     |--------------------------------------------------|
 19. |        0000DD             19796        1   43411 |
 20. |        0000DD             19796        2    4439 |
 21. |        0000DD             19796        3   41401 |
 22. |        0000DD             19825        1   43491 |
 23. |        0000DD             19825        2    2724 |
 24. |        0000DD             19825        3    2689 |
 25. |        0000DD             19889        1    4580 |
 26. |        0000DD             19889        2    4439 |
 27. |        0000DD             19889        3    2689 |
     |--------------------------------------------------|
 28. |        0000EE             11726        1     193 |
 29. |        0000EE             11726        2   V0382 |
 30. |        0000EE             11726        3   53081 |
     +--------------------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35580
#10

10 Feb 2017, 09:58

William is right. My vocabulary, for example, does not include "inner join" as hitherto I have escaped courses and literature in which it is used.

Thanks for the mention. The article cited requires subscription to the SJ or one-off payment until 2018 Q1. But spells are discussed in http://www.stata-journal.com/article...article=dm0029 and tsspell (SSC) implements the principles of the latter.
1 like
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#11

10 Feb 2017, 11:03

[QUOTE=William Lisowski;n1373737]There seem to be a number of possible issues in your objectives. It appears that 0000DD has two admissions, the first with one 30-day readmission, and the second with no readmisson. Combining all of those on a single wide record would seem to pose problems. Further, you do not identify what happens with a patient who is admitted on days 15000, 15020, 15040, and 15060. Since 15040 is less than 30 days from 15020, does it also count as a readmission? Or since 15020 is a readmission rather than an initial admission, will 15040 have to begin a new admission?

Issues such as these have been discussed at length on Statalist and elsewhere, under the general topic of identifying spells. Unfortunately, the Stata Journal article I would like to reference, http://www.stata-journal.com/article...article=dm0079 by Nick Cox , appears to be recent enough that it is not yet available at no cost. Searching the Statalist archives should lead you to discussions that informed the article. If after doing so you have further questions, I'd recommend starting a new topic with "spells" in the title, since the experts in spells are apparently not looking at this topic about "inner joining"

In general, I think you're headed in the wrong direction with a wide layout, at least initially given the difficult of identifying spells. I think a truly long layout would be where you would want to start from. Starting with the sample data you posted above (thank you for using dataex) and reshaping as follows leaves a dataset that should be tractable for what you need to accomplish. If it's too large (3 times 14,000,000 observations) break it into pieces. Certainly, split off these identifers and diagnosis codes from the remainder of the 120 variables and work with them separately, later merging the other variables back onto your results.
.../QUOTE]

This is why I was trying to push the OP away from his initial request, but stated a lot better. I didn't know you could specify multiple i variables in reshape long either, although of course I guess it makes sense.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
1 like
Comment
Nghia Nguyen

Join Date: Feb 2017

Posts: 19
#12

11 Feb 2017, 14:31

Thank you, everyone!
Comment

Announcement

Inner joining a table in Stata

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment