Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Commands depending on the next row

    My data is employment spell data, pid being person id, syear being the start of the spell and eyear being the end.

    As you can see, many spell dates seem to overlap- e.g. many begin with 1996 and end in different years.

    What command can I use to keep the last line (syear=1996 eyear=2008) but delete the other spell observations for which syear=1996?

    So basically I want to keep the observation if the next observation's syear is NOT the same, and the next year's pid IS the same, to get rid of the overlaps.

    Here is an example of the data:

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long pid int(syear eyear)
    1170296 1989 1991
    1170296 1989 1991
    1170296 1991 1992
    1170296 1991 1993
    1170296 1992 1994
    1170296 1994 1994
    1170296 1991 1995
    1170296 1995 1995
    1170296 1995 1995
    1170296 1995 1996
    1170296 1996 1996
    1170296 1966 1997
    1170296 1996 1998
    1170296 1996 1999
    1170296 1996 2000
    1170296 1996 2001
    1170296 1996 2002
    1170296 1996 2003
    1170296 1996 2004
    1170296 1996 2005
    1170296 1996 2006
    1170296 1996 2007
    1170296 1996 2008
    1145567 1993 1995
    1145567 1997 1998
    1145567 1997 1999
    end
    [/CODE]

    So eg. for person 1145567 I would like to delete the 1997-1998 observation.

  • #2
    Well, I'm not entirely clear on what you want. The following code will eliminate the 1997-1998 observation for 1145567:

    Code:
    by pid syear (eyear), sort: keep if _n == _N
    But what about 1170296? The above code will eliminate several of its observations. Basically it looks at all observations with the same start year and keeps only the one with the last end year. But in 1170296, we see that even after those are removed we are left with:

    Code:
            pid   syear   eyear  
        1170296    1966    1997  
        1170296    1989    1991  
        1170296    1991    1995  
        1170296    1992    1994  
        1170296    1994    1994  
        1170296    1995    1996  
        1170296    1996    2008
    Each of these surviving observations has a different start year. Yet collectively, what they amount to is a record of continuous employment starting in 1966 and continuing through 2008. Do you want to reduce that to a single observation with start year 1966 and end year 2008? If so, it's a bit more complicated:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long pid int(syear eyear)
    1170296 1989 1991
    1170296 1989 1991
    1170296 1991 1992
    1170296 1991 1993
    1170296 1992 1994
    1170296 1994 1994
    1170296 1991 1995
    1170296 1995 1995
    1170296 1995 1995
    1170296 1995 1996
    1170296 1996 1996
    1170296 1966 1997
    1170296 1996 1998
    1170296 1996 1999
    1170296 1996 2000
    1170296 1996 2001
    1170296 1996 2002
    1170296 1996 2003
    1170296 1996 2004
    1170296 1996 2005
    1170296 1996 2006
    1170296 1996 2007
    1170296 1996 2008
    1145567 1993 1995
    1145567 1997 1998
    1145567 1997 1999
    end
    
    //    VALIDATE YEAR VARIABLES
    assert syear <= eyear & !missing(syear, eyear)
    
    //    GO TO LONG LAYOUT AND IDENTIFY
    //    EMPLOYMENT STATE IN EACH YEAR
    gen obs_no = _n
    reshape long @year, i(obs_no) j(event) string
    gsort pid year -event // IF START & END IN SAME YEAR, END COMES LAST
    by pid: gen state = sum((event == "s") - (event == "e"))
    
    //    MARK SPELLS OF EMPLOYMENT
    //    A SPELL STARTS WHEN STATE TRANSITIONS FROM
    //    0 TO 1, OR FIRST OBSERVATION
    by pid (year): gen spell = sum((state == 1) & (state[_n-1] == 0 | _n == 1))
    
    //    KEEP START AND END OF SPELL
    by pid spell (year), sort: keep if _n == 1 | _n == _N
    
    //    RESTORE SINGLE RECORD PER SPELL
    drop obs_no state
    reshape wide @year, i(pid spell) j(event) string
    order eyear, after(syear)

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      Well, I'm not entirely clear on what you want. The following code will eliminate the 1997-1998 observation for 1145567:

      Code:
      by pid syear (eyear), sort: keep if _n == _N
      But what about 1170296? The above code will eliminate several of its observations. Basically it looks at all observations with the same start year and keeps only the one with the last end year. But in 1170296, we see that even after those are removed we are left with:

      Code:
      pid syear eyear
      1170296 1966 1997
      1170296 1989 1991
      1170296 1991 1995
      1170296 1992 1994
      1170296 1994 1994
      1170296 1995 1996
      1170296 1996 2008
      Each of these surviving observations has a different start year. Yet collectively, what they amount to is a record of continuous employment starting in 1966 and continuing through 2008. Do you want to reduce that to a single observation with start year 1966 and end year 2008? If so, it's a bit more complicated:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long pid int(syear eyear)
      1170296 1989 1991
      1170296 1989 1991
      1170296 1991 1992
      1170296 1991 1993
      1170296 1992 1994
      1170296 1994 1994
      1170296 1991 1995
      1170296 1995 1995
      1170296 1995 1995
      1170296 1995 1996
      1170296 1996 1996
      1170296 1966 1997
      1170296 1996 1998
      1170296 1996 1999
      1170296 1996 2000
      1170296 1996 2001
      1170296 1996 2002
      1170296 1996 2003
      1170296 1996 2004
      1170296 1996 2005
      1170296 1996 2006
      1170296 1996 2007
      1170296 1996 2008
      1145567 1993 1995
      1145567 1997 1998
      1145567 1997 1999
      end
      
      // VALIDATE YEAR VARIABLES
      assert syear <= eyear & !missing(syear, eyear)
      
      // GO TO LONG LAYOUT AND IDENTIFY
      // EMPLOYMENT STATE IN EACH YEAR
      gen obs_no = _n
      reshape long @year, i(obs_no) j(event) string
      gsort pid year -event // IF START & END IN SAME YEAR, END COMES LAST
      by pid: gen state = sum((event == "s") - (event == "e"))
      
      // MARK SPELLS OF EMPLOYMENT
      // A SPELL STARTS WHEN STATE TRANSITIONS FROM
      // 0 TO 1, OR FIRST OBSERVATION
      by pid (year): gen spell = sum((state == 1) & (state[_n-1] == 0 | _n == 1))
      
      // KEEP START AND END OF SPELL
      by pid spell (year), sort: keep if _n == 1 | _n == _N
      
      // RESTORE SINGLE RECORD PER SPELL
      drop obs_no state
      reshape wide @year, i(pid spell) j(event) string
      order eyear, after(syear)
      Yes you are right - that would be ideal.

      But would this work if there were gaps in employment?

      In this case, the spells overlap. But if instead there were gaps in spells,

      e.g. 1987-1992, 1990-1994, 1990-1995, 2000-2007

      Would those commands create:

      1987 - 1995
      2000-2007

      ?

      Comment


      • #4
        But would this work if there were gaps in employment?
        Yes. In the case of pid 1145567 it produces two separate spells.

        And I believe it will work in general. In any case, try it. If the results are not correct, post back and I'll try to fix it. But I believe this will work generally. The only requirements are that the start year in each observation must be the same as or earlier than the end year, and neither start nor end year can be missing. But those assumptions are checked in the -assert- command, so if your data aren't suitable, Stata will stop and let you know that and won't go on to produce wrong answers.

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          Yes. In the case of pid 1145567 it produces two separate spells.

          And I believe it will work in general. In any case, try it. If the results are not correct, post back and I'll try to fix it. But I believe this will work generally. The only requirements are that the start year in each observation must be the same as or earlier than the end year, and neither start nor end year can be missing. But those assumptions are checked in the -assert- command, so if your data aren't suitable, Stata will stop and let you know that and won't go on to produce wrong answers.
          After the first line it comes up with the error "assertion is false"...

          Comment


          • #6
            Can I then perhaps drop the person - so drop by pid - if the assumption doesn't hold?

            And how would I do that?
            Last edited by Ella Ki; 09 Apr 2017, 09:24.

            Comment


            • #7
              You can do that but you probably shouldn't. Why do you have any observations where the end year precedes the start year or one of them is missing? That is most likely a data error and you should fix it, not drop it. Go back and run

              Code:
              browse if missing(syear, eyear) | eyear < syear  // OR USE LIST IF YOU PREFER
              to find out which observations these are.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                You can do that but you probably shouldn't. Why do you have any observations where the end year precedes the start year or one of them is missing? That is most likely a data error and you should fix it, not drop it. Go back and run

                Code:
                browse if missing(syear, eyear) | eyear < syear // OR USE LIST IF YOU PREFER
                to find out which observations these are.
                Would

                bysort pid : keep if year <=eyear & !missing(syear,eyear)

                achieve that?

                And I have realised why the assumption didn't hold! It's because missing end years were recorded as -9. That explains it!
                Last edited by Ella Ki; 09 Apr 2017, 11:52.

                Comment


                • #9
                  Code:
                  bysort pid : keep if syear <=eyear & !missing(syear,eyear)
                  will eliminate all data on that pid if any of his/her observations with questionable data. Whether that approach is preferable to eliminating just the particular observations with questionable data I leave to your judgment in light of how you will use the results and your research goals.

                  I'm glad that the invalid observations are all due to missing values coded as -9; that's better than some of the alternatives.

                  General lesson: unlike some other statistical packages that encourage and support the use of missing value coding with 9's, -9's, etc., Stata doesn't. When you create data sets for analysis, one step in the process should be looking for these situations and correcting them by converting those numeric codes to Stata missing values (either . or the extend missing values .a through .z). The -mvdecode- command makes this quick and easy. Leaving missing values coded with 9's, -9's, etc,. leads to problems sooner or later when working in Stata.

                  Comment


                  • #10
                    Originally posted by Clyde Schechter View Post
                    Code:
                    bysort pid : keep if syear <=eyear & !missing(syear,eyear)
                    will eliminate all data on that pid if any of his/her observations with questionable data. Whether that approach is preferable to eliminating just the particular observations with questionable data I leave to your judgment in light of how you will use the results and your research goals.

                    I'm glad that the invalid observations are all due to missing values coded as -9; that's better than some of the alternatives.

                    General lesson: unlike some other statistical packages that encourage and support the use of missing value coding with 9's, -9's, etc., Stata doesn't. When you create data sets for analysis, one step in the process should be looking for these situations and correcting them by converting those numeric codes to Stata missing values (either . or the extend missing values .a through .z). The -mvdecode- command makes this quick and easy. Leaving missing values coded with 9's, -9's, etc,. leads to problems sooner or later when working in Stata.
                    Thanks a lot Clyde! I really appreciate the help.

                    I will make sure to look for minus values as missing values in the future.

                    Comment

                    Working...
                    X