Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • sorting data

    Hi Stata Forum,

    I have a long dataset of hospitalization data. I am looking at number of lower limb amputations, by year.
    For each year, I want to select only the highest level of amputation per person.

    For example, if one individual (user ID=3) has four amputations in total between 2000 and 2015 (one in 2000, one in 2003, and then two in 2005) I want to include the amputation in 2000, the one in 2003, and then only the highest amputation in 2005. So, if in 2005, they had a toe AND a foot amputation, i only want to include the foot amputation.

    I can sort the data (code below), so that i can take the FIRST amputation for each year (taking simply dup==1). But i cannot work out how to take the HIGHEST amputation for each year.

    sort id date _year
    gen dup=1
    replace dup = dup[_n-1]+1 if id==id[_n-1] & _year==_year[_n-1]


    *each amputation is coded as no(=0) or year(=1)

    I apologize for the table below (but i am having difficulty installing the ado file (directory issues) to do it any other way).

    id date year toe foot ankle leg dup
    3 03feb2000 2000 1 0 0 0 1
    3 06jun2003 2003 1 0 0 0 1
    3 02jan2005 2005 1 0 0 0 1
    3 21feb2005 2005 0 1 0 0 2
    4 30nov2008 2008 0 1 0 0 1
    5 12aug2000 2000 1 0 0 0 1
    5 11sept2010 2010 1 0 0 0 1
    5 19sept2010 2010 0 1 0 0 2
    5 8oct2010 2010 0 0 0 1 3

    Any suggestions?

    Many thanks
    Jess






  • #2
    With your data as you currently have it (and currently sorted), you could simply do:
    Code:
    by id year: keep if dup==_N

    Code:
    * When I pasted in your data, date came in as a string, so I created date2 as a date
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id double date int year byte(toe foot ankle leg dup)
    3 14643 2000 1 0 0 0 1
    3 15862 2003 1 0 0 0 1
    3 16438 2005 1 0 0 0 1
    3 16488 2005 0 1 0 0 2
    4 17866 2008 0 1 0 0 1
    5 14834 2000 1 0 0 0 1
    5 18516 2010 1 0 0 0 1
    5 18524 2010 0 1 0 0 2
    5 18543 2010 0 0 0 1 3
    end
    format %td date

    Some other ideas:
    Code:
    * Another way to create some summary data for each patient
    sort id date _year
    bysort id (year): gen count = _N
    bysort id (year): gen visit = _n
    bysort id year: gen visit_that_yr = _n   // essentially re-creates your "dup" variable
    
    * Just labeling things
    label var visit "Cumulative number of amputations"
    label var visit_that_yr "Cum number of amputations that year"
    label var count "Total number of amputations patient had in data"
    
    . list id date year toe foot ankle leg dup visit visit_that_yr count, sepby(id) noobs
    
      +-----------------------------------------------------------------------------------+
      | id        date   year   toe   foot   ankle   leg   dup   visit   visit_~r   count |
      |-----------------------------------------------------------------------------------|
      |  3   03feb2000   2000     1      0       0     0     1       1          1       4 |
      |  3   06jun2003   2003     1      0       0     0     1       2          1       4 |
      |  3   02jan2005   2005     1      0       0     0     1       3          1       4 |
      |  3   21feb2005   2005     0      1       0     0     2       4          2       4 |
      |-----------------------------------------------------------------------------------|
      |  4   30nov2008   2008     0      1       0     0     1       1          1       1 |
      |-----------------------------------------------------------------------------------|
      |  5   12aug2000   2000     1      0       0     0     1       1          1       4 |
      |  5   11sep2010   2010     1      0       0     0     1       2          1       4 |
      |  5   19sep2010   2010     0      1       0     0     2       3          2       4 |
      |  5   08oct2010   2010     0      0       0     1     3       4          3       4 |
      +-----------------------------------------------------------------------------------+
    
    
    * Keeping latest obs (by date) within a given year
    by id year: keep if dup==_N
    * (3 observations deleted)
    
    . list id date year toe foot ankle leg dup visit visit_that_yr count, sepby(id) noobs
    
      +-----------------------------------------------------------------------------------+
      | id        date   year   toe   foot   ankle   leg   dup   visit   visit_~r   count |
      |-----------------------------------------------------------------------------------|
      |  3   03feb2000   2000     1      0       0     0     1       1          1       4 |
      |  3   06jun2003   2003     1      0       0     0     1       2          1       4 |
      |  3   21feb2005   2005     0      1       0     0     2       4          2       4 |
      |-----------------------------------------------------------------------------------|
      |  4   30nov2008   2008     0      1       0     0     1       1          1       1 |
      |-----------------------------------------------------------------------------------|
      |  5   12aug2000   2000     1      0       0     0     1       1          1       4 |
      |  5   08oct2010   2010     0      0       0     1     3       4          3       4 |
      +-----------------------------------------------------------------------------------+
    Last edited by David Benson; 09 Dec 2018, 20:48.

    Comment


    • #3
      David Benson's solutions are fine when the data is already sorted with the highest amputation in a given year in the last observation for that year. But I assume that in Jessica Harding's real data, that will not be the case.

      The solution requires a few steps. First, as with most things in Stata, this is best done in long layout. Next, we need to establish a height ranking of the locations of the amputation. This step is achieved with the -label define- and -encode- commands. Then we sort on location within id and year and pick the last observation.

      Finally, if there is a good reason to do so, she can go back to the original wide layout. Though it is likely that whatever else she wants to do to analyze this data it, too, will be better done in long layout.

      I have appropriated David Benson's -dataex- in this code and I remind Jessica Harding that we ask that people always use -dataex- when showing example data.

      Code:
       * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte id double date int year byte(toe foot ankle leg dup)
      3 14643 2000 1 0 0 0 1
      3 15862 2003 1 0 0 0 1
      3 16438 2005 1 0 0 0 1
      3 16488 2005 0 1 0 0 2
      4 17866 2008 0 1 0 0 1
      5 14834 2000 1 0 0 0 1
      5 18516 2010 1 0 0 0 1
      5 18524 2010 0 1 0 0 2
      5 18543 2010 0 0 0 1 3
      end
      format %td date
      
      rename (toe foot ankle leg) amputated=
      reshape long amputated, i(id date dup) j(_j) string
      label define location 1 toe 2 foot 3 ankle 4 leg
      encode _j, gen(location) label(location)
      
      //    KEEP ONLY HIGHEST LOCATION IN EACH YEAR
      drop if !amputated
      by id year (location), sort: keep if _n == _N
      
      //    GO BACK TO ORIGINAL LAYOUT IF THERE IS SOME COMPELLING REASON TO DO SO
      reshape wide
      mvencode amputated*, mv(0)
      rename amputated* *
      In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      When asking for help with code, always show example data. When showing example data, always use -dataex-.

      Comment


      • #4
        Hi clyde,
        the data is already in long format. not sure how to apply your suggested code to data that is already long.
        jess

        Comment


        • #5
          Jessica,

          Some of the confusion may be due to what you mean by "highest" amputation. I interpreted that to mean "occurring latest (by date) in a given year." Clyde interpreted that to mean "Occurring highest (by physical location) on the body", hence leg > ankle > foot > toe

          Also, do you ever have instances where a patient had multiple amputations on the same date? And if so, how did you want to handle those?

          And, BTW, more than once using this data I wanted to say, "Yikes!" Patient #5 had 3 amputations in 2010!
          Last edited by David Benson; 09 Dec 2018, 21:38.

          Comment


          • #6
            "Yikes!" Patient #5 had 3 amputations in 2010!
            I guess my comment here is not really appropriate for a statistical and Stata forum, but I'll go ahead with it any way.

            That is what vascular disease does to people. When you have advanced artery rot, your body parts get chipped away from you bit by bit in fairly rapid succession. The wages of smoking and diabetes (and a few other causes that are less common) is death by a thousand surgical cuts.

            Comment


            • #7
              Sorry David - yes i meant highest by level of amputation, knee being higher than foot etc etc. Sometimes there are planned amputations whereby they plan to take a toe and then the foot in sequential operations - which s why i only want to record the highest level for that year.

              These are people with kidney disease and diabetes - so rates of amputations are VERY high!

              Comment


              • #8
                For me, the sharp argument has been pointed out by Clyde at #3: "Then we sort on location within id and year and pick the last observation". It then goes along with my wondering, since the target could be touched with 1-line codes following exactly with this argument
                Code:
                bys id year (leg ankle foot toe): keep if _n==_N
                .

                Comment

                Working...
                X