sorting data

Jessica Harding

Join Date: Sep 2014
Posts: 22

09 Dec 2018, 20:10

Hi Stata Forum,

I have a long dataset of hospitalization data. I am looking at number of lower limb amputations, by year.
For each year, I want to select only the highest level of amputation per person.

For example, if one individual (user ID=3) has four amputations in total between 2000 and 2015 (one in 2000, one in 2003, and then two in 2005) I want to include the amputation in 2000, the one in 2003, and then only the highest amputation in 2005. So, if in 2005, they had a toe AND a foot amputation, i only want to include the foot amputation.

I can sort the data (code below), so that i can take the FIRST amputation for each year (taking simply dup==1). But i cannot work out how to take the HIGHEST amputation for each year.

sort id date _year
gen dup=1
replace dup = dup[_n-1]+1 if id==id[_n-1] & _year==_year[_n-1]

*each amputation is coded as no(=0) or year(=1)

I apologize for the table below (but i am having difficulty installing the ado file (directory issues) to do it any other way).

id	date	year	toe	foot	ankle	leg	dup
3	03feb2000	2000	1	0	0	0	1
3	06jun2003	2003	1	0	0	0	1
3	02jan2005	2005	1	0	0	0	1
3	21feb2005	2005	0	1	0	0	2
4	30nov2008	2008	0	1	0	0	1
5	12aug2000	2000	1	0	0	0	1
5	11sept2010	2010	1	0	0	0	1
5	19sept2010	2010	0	1	0	0	2
5	8oct2010	2010	0	0	0	1	3

Any suggestions?

Many thanks
Jess

Tags: None

David Benson

Join Date: Oct 2018
Posts: 489

09 Dec 2018, 20:44

With your data as you currently have it (and currently sorted), you could simply do:

Code:

by id year: keep if dup==_N

Code:

* When I pasted in your data, date came in as a string, so I created date2 as a date
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id double date int year byte(toe foot ankle leg dup)
3 14643 2000 1 0 0 0 1
3 15862 2003 1 0 0 0 1
3 16438 2005 1 0 0 0 1
3 16488 2005 0 1 0 0 2
4 17866 2008 0 1 0 0 1
5 14834 2000 1 0 0 0 1
5 18516 2010 1 0 0 0 1
5 18524 2010 0 1 0 0 2
5 18543 2010 0 0 0 1 3
end
format %td date

Some other ideas:

Code:

* Another way to create some summary data for each patient
sort id date _year
bysort id (year): gen count = _N
bysort id (year): gen visit = _n
bysort id year: gen visit_that_yr = _n   // essentially re-creates your "dup" variable

* Just labeling things
label var visit "Cumulative number of amputations"
label var visit_that_yr "Cum number of amputations that year"
label var count "Total number of amputations patient had in data"

. list id date year toe foot ankle leg dup visit visit_that_yr count, sepby(id) noobs

  +-----------------------------------------------------------------------------------+
  | id        date   year   toe   foot   ankle   leg   dup   visit   visit_~r   count |
  |-----------------------------------------------------------------------------------|
  |  3   03feb2000   2000     1      0       0     0     1       1          1       4 |
  |  3   06jun2003   2003     1      0       0     0     1       2          1       4 |
  |  3   02jan2005   2005     1      0       0     0     1       3          1       4 |
  |  3   21feb2005   2005     0      1       0     0     2       4          2       4 |
  |-----------------------------------------------------------------------------------|
  |  4   30nov2008   2008     0      1       0     0     1       1          1       1 |
  |-----------------------------------------------------------------------------------|
  |  5   12aug2000   2000     1      0       0     0     1       1          1       4 |
  |  5   11sep2010   2010     1      0       0     0     1       2          1       4 |
  |  5   19sep2010   2010     0      1       0     0     2       3          2       4 |
  |  5   08oct2010   2010     0      0       0     1     3       4          3       4 |
  +-----------------------------------------------------------------------------------+


* Keeping latest obs (by date) within a given year
by id year: keep if dup==_N
* (3 observations deleted)

. list id date year toe foot ankle leg dup visit visit_that_yr count, sepby(id) noobs

  +-----------------------------------------------------------------------------------+
  | id        date   year   toe   foot   ankle   leg   dup   visit   visit_~r   count |
  |-----------------------------------------------------------------------------------|
  |  3   03feb2000   2000     1      0       0     0     1       1          1       4 |
  |  3   06jun2003   2003     1      0       0     0     1       2          1       4 |
  |  3   21feb2005   2005     0      1       0     0     2       4          2       4 |
  |-----------------------------------------------------------------------------------|
  |  4   30nov2008   2008     0      1       0     0     1       1          1       1 |
  |-----------------------------------------------------------------------------------|
  |  5   12aug2000   2000     1      0       0     0     1       1          1       4 |
  |  5   08oct2010   2010     0      0       0     1     3       4          3       4 |
  +-----------------------------------------------------------------------------------+

Last edited by David Benson; 09 Dec 2018, 20:48.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#3

09 Dec 2018, 21:00

David Benson's solutions are fine when the data is already sorted with the highest amputation in a given year in the last observation for that year. But I assume that in Jessica Harding's real data, that will not be the case.

The solution requires a few steps. First, as with most things in Stata, this is best done in long layout. Next, we need to establish a height ranking of the locations of the amputation. This step is achieved with the -label define- and -encode- commands. Then we sort on location within id and year and pick the last observation.

Finally, if there is a good reason to do so, she can go back to the original wide layout. Though it is likely that whatever else she wants to do to analyze this data it, too, will be better done in long layout.

I have appropriated David Benson's -dataex- in this code and I remind Jessica Harding that we ask that people always use -dataex- when showing example data.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte id double date int year byte(toe foot ankle leg dup) 3 14643 2000 1 0 0 0 1 3 15862 2003 1 0 0 0 1 3 16438 2005 1 0 0 0 1 3 16488 2005 0 1 0 0 2 4 17866 2008 0 1 0 0 1 5 14834 2000 1 0 0 0 1 5 18516 2010 1 0 0 0 1 5 18524 2010 0 1 0 0 2 5 18543 2010 0 0 0 1 3 end format %td date rename (toe foot ankle leg) amputated= reshape long amputated, i(id date dup) j(_j) string label define location 1 toe 2 foot 3 ankle 4 leg encode _j, gen(location) label(location) // KEEP ONLY HIGHEST LOCATION IN EACH YEAR drop if !amputated by id year (location), sort: keep if _n == _N // GO BACK TO ORIGINAL LAYOUT IF THERE IS SOME COMPELLING REASON TO DO SO reshape wide mvencode amputated*, mv(0) rename amputated* *

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
1 like
Comment
Jessica Harding

Join Date: Sep 2014

Posts: 22
#4

09 Dec 2018, 21:19

Hi clyde,
the data is already in long format. not sure how to apply your suggested code to data that is already long.
jess
Comment
David Benson

Join Date: Oct 2018

Posts: 489
#5

09 Dec 2018, 21:32

Jessica,

Some of the confusion may be due to what you mean by "highest" amputation. I interpreted that to mean "occurring latest (by date) in a given year." Clyde interpreted that to mean "Occurring highest (by physical location) on the body", hence leg > ankle > foot > toe

Also, do you ever have instances where a patient had multiple amputations on the same date? And if so, how did you want to handle those?

And, BTW, more than once using this data I wanted to say, "Yikes!" Patient #5 had 3 amputations in 2010!

Last edited by David Benson; 09 Dec 2018, 21:38.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#6

09 Dec 2018, 21:43

"Yikes!" Patient #5 had 3 amputations in 2010!

I guess my comment here is not really appropriate for a statistical and Stata forum, but I'll go ahead with it any way.

That is what vascular disease does to people. When you have advanced artery rot, your body parts get chipped away from you bit by bit in fairly rapid succession. The wages of smoking and diabetes (and a few other causes that are less common) is death by a thousand surgical cuts.
2 likes
Comment
Jessica Harding

Join Date: Sep 2014

Posts: 22
#7

09 Dec 2018, 22:15

Sorry David - yes i meant highest by level of amputation, knee being higher than foot etc etc. Sometimes there are planned amputations whereby they plan to take a toe and then the foot in sequential operations - which s why i only want to record the highest level for that year.

These are people with kidney disease and diabetes - so rates of amputations are VERY high!
Comment
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#8

10 Dec 2018, 03:31

For me, the sharp argument has been pointed out by Clyde at #3: "Then we sort on location within id and year and pick the last observation". It then goes along with my wondering, since the target could be touched with 1-line codes following exactly with this argument

Code:

bys id year (leg ankle foot toe): keep if _n==_N

.
1 like
Comment

Announcement

sorting data

Comment

Comment

Comment

Comment

Comment

Comment

Comment