Commands depending on the next row

Ella Ki

Join Date: Mar 2017

Posts: 39
#1

Commands depending on the next row

08 Apr 2017, 11:12

My data is employment spell data, pid being person id, syear being the start of the spell and eyear being the end.

As you can see, many spell dates seem to overlap- e.g. many begin with 1996 and end in different years.

What command can I use to keep the last line (syear=1996 eyear=2008) but delete the other spell observations for which syear=1996?

So basically I want to keep the observation if the next observation's syear is NOT the same, and the next year's pid IS the same, to get rid of the overlaps.

Here is an example of the data:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid int(syear eyear)
1170296 1989 1991
1170296 1989 1991
1170296 1991 1992
1170296 1991 1993
1170296 1992 1994
1170296 1994 1994
1170296 1991 1995
1170296 1995 1995
1170296 1995 1995
1170296 1995 1996
1170296 1996 1996
1170296 1966 1997
1170296 1996 1998
1170296 1996 1999
1170296 1996 2000
1170296 1996 2001
1170296 1996 2002
1170296 1996 2003
1170296 1996 2004
1170296 1996 2005
1170296 1996 2006
1170296 1996 2007
1170296 1996 2008
1145567 1993 1995
1145567 1997 1998
1145567 1997 1999
end
[/CODE]

So eg. for person 1145567 I would like to delete the 1997-1998 observation.
Tags: command, overlap, Spell

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

08 Apr 2017, 12:03

Well, I'm not entirely clear on what you want. The following code will eliminate the 1997-1998 observation for 1145567:

Code:

by pid syear (eyear), sort: keep if _n == _N

But what about 1170296? The above code will eliminate several of its observations. Basically it looks at all observations with the same start year and keeps only the one with the last end year. But in 1170296, we see that even after those are removed we are left with:

Code:

        pid   syear   eyear  
    1170296    1966    1997  
    1170296    1989    1991  
    1170296    1991    1995  
    1170296    1992    1994  
    1170296    1994    1994  
    1170296    1995    1996  
    1170296    1996    2008

Each of these surviving observations has a different start year. Yet collectively, what they amount to is a record of continuous employment starting in 1966 and continuing through 2008. Do you want to reduce that to a single observation with start year 1966 and end year 2008? If so, it's a bit more complicated:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid int(syear eyear)
1170296 1989 1991
1170296 1989 1991
1170296 1991 1992
1170296 1991 1993
1170296 1992 1994
1170296 1994 1994
1170296 1991 1995
1170296 1995 1995
1170296 1995 1995
1170296 1995 1996
1170296 1996 1996
1170296 1966 1997
1170296 1996 1998
1170296 1996 1999
1170296 1996 2000
1170296 1996 2001
1170296 1996 2002
1170296 1996 2003
1170296 1996 2004
1170296 1996 2005
1170296 1996 2006
1170296 1996 2007
1170296 1996 2008
1145567 1993 1995
1145567 1997 1998
1145567 1997 1999
end

//    VALIDATE YEAR VARIABLES
assert syear <= eyear & !missing(syear, eyear)

//    GO TO LONG LAYOUT AND IDENTIFY
//    EMPLOYMENT STATE IN EACH YEAR
gen obs_no = _n
reshape long @year, i(obs_no) j(event) string
gsort pid year -event // IF START & END IN SAME YEAR, END COMES LAST
by pid: gen state = sum((event == "s") - (event == "e"))

//    MARK SPELLS OF EMPLOYMENT
//    A SPELL STARTS WHEN STATE TRANSITIONS FROM
//    0 TO 1, OR FIRST OBSERVATION
by pid (year): gen spell = sum((state == 1) & (state[_n-1] == 0 | _n == 1))

//    KEEP START AND END OF SPELL
by pid spell (year), sort: keep if _n == 1 | _n == _N

//    RESTORE SINGLE RECORD PER SPELL
drop obs_no state
reshape wide @year, i(pid spell) j(event) string
order eyear, after(syear)

Comment

Ella Ki

Join Date: Mar 2017
Posts: 39

08 Apr 2017, 16:45

Originally posted by Clyde Schechter View Post

Well, I'm not entirely clear on what you want. The following code will eliminate the 1997-1998 observation for 1145567:

Code:

by pid syear (eyear), sort: keep if _n == _N

Code:

pid syear eyear
1170296 1966 1997
1170296 1989 1991
1170296 1991 1995
1170296 1992 1994
1170296 1994 1994
1170296 1995 1996
1170296 1996 2008

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid int(syear eyear)
1170296 1989 1991
1170296 1989 1991
1170296 1991 1992
1170296 1991 1993
1170296 1992 1994
1170296 1994 1994
1170296 1991 1995
1170296 1995 1995
1170296 1995 1995
1170296 1995 1996
1170296 1996 1996
1170296 1966 1997
1170296 1996 1998
1170296 1996 1999
1170296 1996 2000
1170296 1996 2001
1170296 1996 2002
1170296 1996 2003
1170296 1996 2004
1170296 1996 2005
1170296 1996 2006
1170296 1996 2007
1170296 1996 2008
1145567 1993 1995
1145567 1997 1998
1145567 1997 1999
end

// VALIDATE YEAR VARIABLES
assert syear <= eyear & !missing(syear, eyear)

// GO TO LONG LAYOUT AND IDENTIFY
// EMPLOYMENT STATE IN EACH YEAR
gen obs_no = _n
reshape long @year, i(obs_no) j(event) string
gsort pid year -event // IF START & END IN SAME YEAR, END COMES LAST
by pid: gen state = sum((event == "s") - (event == "e"))

// MARK SPELLS OF EMPLOYMENT
// A SPELL STARTS WHEN STATE TRANSITIONS FROM
// 0 TO 1, OR FIRST OBSERVATION
by pid (year): gen spell = sum((state == 1) & (state[_n-1] == 0 | _n == 1))

// KEEP START AND END OF SPELL
by pid spell (year), sort: keep if _n == 1 | _n == _N

// RESTORE SINGLE RECORD PER SPELL
drop obs_no state
reshape wide @year, i(pid spell) j(event) string
order eyear, after(syear)

Yes you are right - that would be ideal.

But would this work if there were gaps in employment?

In this case, the spells overlap. But if instead there were gaps in spells,

e.g. 1987-1992, 1990-1994, 1990-1995, 2000-2007

Would those commands create:

1987 - 1995
2000-2007

?

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

08 Apr 2017, 18:04

But would this work if there were gaps in employment?

Yes. In the case of pid 1145567 it produces two separate spells.

And I believe it will work in general. In any case, try it. If the results are not correct, post back and I'll try to fix it. But I believe this will work generally. The only requirements are that the start year in each observation must be the same as or earlier than the end year, and neither start nor end year can be missing. But those assumptions are checked in the -assert- command, so if your data aren't suitable, Stata will stop and let you know that and won't go on to produce wrong answers.
Comment
Ella Ki

Join Date: Mar 2017

Posts: 39
#5

09 Apr 2017, 09:20

Originally posted by Clyde Schechter View Post

Yes. In the case of pid 1145567 it produces two separate spells.

And I believe it will work in general. In any case, try it. If the results are not correct, post back and I'll try to fix it. But I believe this will work generally. The only requirements are that the start year in each observation must be the same as or earlier than the end year, and neither start nor end year can be missing. But those assumptions are checked in the -assert- command, so if your data aren't suitable, Stata will stop and let you know that and won't go on to produce wrong answers.

After the first line it comes up with the error "assertion is false"...
Comment
Ella Ki

Join Date: Mar 2017

Posts: 39
#6

09 Apr 2017, 09:21

Can I then perhaps drop the person - so drop by pid - if the assumption doesn't hold?

And how would I do that?

Last edited by Ella Ki; 09 Apr 2017, 09:24.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

09 Apr 2017, 10:54

You can do that but you probably shouldn't. Why do you have any observations where the end year precedes the start year or one of them is missing? That is most likely a data error and you should fix it, not drop it. Go back and run

Code:

browse if missing(syear, eyear) | eyear < syear // OR USE LIST IF YOU PREFER

to find out which observations these are.
Comment
Ella Ki

Join Date: Mar 2017

Posts: 39
#8

09 Apr 2017, 11:25

Originally posted by Clyde Schechter View Post

You can do that but you probably shouldn't. Why do you have any observations where the end year precedes the start year or one of them is missing? That is most likely a data error and you should fix it, not drop it. Go back and run

Code:

browse if missing(syear, eyear) | eyear < syear // OR USE LIST IF YOU PREFER

to find out which observations these are.

Would

bysort pid : keep if year <=eyear & !missing(syear,eyear)

achieve that?

And I have realised why the assumption didn't hold! It's because missing end years were recorded as -9. That explains it!

Last edited by Ella Ki; 09 Apr 2017, 11:52.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#9

09 Apr 2017, 12:19

Code:

bysort pid : keep if syear <=eyear & !missing(syear,eyear)

will eliminate all data on that pid if any of his/her observations with questionable data. Whether that approach is preferable to eliminating just the particular observations with questionable data I leave to your judgment in light of how you will use the results and your research goals.

I'm glad that the invalid observations are all due to missing values coded as -9; that's better than some of the alternatives.

General lesson: unlike some other statistical packages that encourage and support the use of missing value coding with 9's, -9's, etc., Stata doesn't. When you create data sets for analysis, one step in the process should be looking for these situations and correcting them by converting those numeric codes to Stata missing values (either . or the extend missing values .a through .z). The -mvdecode- command makes this quick and easy. Leaving missing values coded with 9's, -9's, etc,. leads to problems sooner or later when working in Stata.
Comment
Ella Ki

Join Date: Mar 2017

Posts: 39
#10

09 Apr 2017, 12:40

Originally posted by Clyde Schechter View Post

Code:

bysort pid : keep if syear <=eyear & !missing(syear,eyear)

will eliminate all data on that pid if any of his/her observations with questionable data. Whether that approach is preferable to eliminating just the particular observations with questionable data I leave to your judgment in light of how you will use the results and your research goals.

I'm glad that the invalid observations are all due to missing values coded as -9; that's better than some of the alternatives.

General lesson: unlike some other statistical packages that encourage and support the use of missing value coding with 9's, -9's, etc., Stata doesn't. When you create data sets for analysis, one step in the process should be looking for these situations and correcting them by converting those numeric codes to Stata missing values (either . or the extend missing values .a through .z). The -mvdecode- command makes this quick and easy. Leaving missing values coded with 9's, -9's, etc,. leads to problems sooner or later when working in Stata.

Thanks a lot Clyde! I really appreciate the help.

I will make sure to look for minus values as missing values in the future.
Comment

Announcement