Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying nonconsecutive runs of variables

    I am currently working on a dataset of physicians, with variables year* identifying the years they were registered in a database:

    ID year1 year2 year3 year4 year5
    1 1995 1996 1997 1998 1999
    2 1995 1997 1998 1999 2000
    3 1997 2001 2002 2003 2004
    4 2003 2005 2006 2010 2011

    I am trying to generate a series of variables to identify how many times they dropped out, as well as the intervals in which these physicians are not registered in that database:

    ID missed yearmissedbeg1 yearmissedend1 yearmissedbeg2 yearmissedend2
    1 0 . . . .
    2 1 1996 1996 . .
    3 1 1998 2000 . .
    4 2 2004 2004 2007 2009

    I have read the STATA FAQ pages on tsset, but it seems I would have to reshape long my database in order for that solution to work, which would not be ideal as I have +17,000 observations listed by 16 years, as well as 40 other variables for the physicians.

    Any help would be appreciated.

  • #2
    So the first obstacle is that your data is in wide layout. The majority of Stata commands work most easily (or only at all) with data in long layout. The next step, after reshaping to wide, is to find spells of consecutive years of registration. Then the gaps between them run between the end of one spell and the beginning of the next (non-inclusive). To get to your final result, you can -reshape- back to wide.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id int(year1 year2 year3 year4 year5)
    1 1995 1996 1997 1998 1999
    2 1995 1997 1998 1999 2000
    3 1997 2001 2002 2003 2004
    4 2003 2005 2006 2010 2011
    end
    
    reshape long year, i(id) j(_j)
    
    //    IDENTIFY CONSECUTIVE SPELLS
    by id (_j), sort: gen spell = sum(year != year[_n-1]+1)
    
    //    REDUCE TO ONE OBSERVATION PER SPELL
    collapse (min) spell_start = year (max) spell_end = year, by(id spell)
    
    //    CHARACTERIZE GAPS BETWEEN SPELLS WITHIN ID
    by id (spell): gen year_missed_begin = spell_end + 1 if _n != _N
    by id (spell): gen year_missed_end = spell_start[_n+1] - 1 if _n != _N
    by id (spell): gen missed = _N - 1
    drop spell_*
    drop if missing(year_missed_begin) & missed > 0
    
    //    SKIP THIS FINAL STEP UNLESS YOU HAVE A COMPELLING REASON
    //    TO GO BACK TO WIDE LAYOUT
    reshape wide year_missed_begin year_missed_end, i(id) j(spell)
    Now, whatever analysis you have planned from this point on is also likely to be best done with (or absolutely require) the data in long layout. So unless you specifically know that you are going to be doing one of those relatively uncommon things that is best done in wide layout, I encourage you to skip the final -reshape wide- and leave your results long.

    Comment


    • #3
      You might also look at routines specifically designed for spells - see newspell for example.

      Comment

      Working...
      X