Creating a time varying variable

William Lisowski

Join Date: Dec 2014

Posts: 10150
#16

10 Feb 2016, 15:45

OK, so along with creating the employment status variable, extra rows have to be added to the table to contain the unemployment episodes. That helps clarify what you are requesting.

I have more questions.

It appears that you ignore the first row in the table you have now, for pid 101 jobseq 1 with the missing end date. Is it the case that you ignore any row with missing end date if there is another row for the same pid and jobseq and start date that does not have a missing end date?

Your sample output does not have employment status for pid 101 after the end date of jobseq 2. Is that what you intended?

Looking at the last row in the table you have now, for pid 102 jobseq 2 with the missing end date. Is it the case that for any row with missing end date and no other row for the same pid and jobseq you assume the individual continued to be employed at that job at the time of the interview?

You show pid 101 as neither employed nor unemployed between August 2006 and the time of the interview. Is that correct?

You show pid 101 as both employed and unemployed in June 2001 and in June 2004, and pid 102 in June 2004 and in May 2005. Is that really what you want? When I've done work like this, usually each month is counted as either employed or unemployed.

You don't show sample data where the subject may have two jobs at the same time. Is that a possibility?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#17

10 Feb 2016, 16:25

In your data, is it possible for somebody to, for example, start a job in Jan 2001 that ends in May 2001, but also start a job in Mar 2001 that ends in Jul 2001, so that at some point the person is actively holding two jobs? In that case, do you want to have a single observation showing employment from Jan 2001 through Jul 2001?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30095

#18

10 Feb 2016, 21:55

It's getting late, and you haven't responded to #16 & #17. I'm going to make some assumptions about your answers to those questions, and provide some code.

1. I assume that any observation with missing values for the jobstart or jobend variables is to be ignored as uninformative, unless it is the last observation for that person. In that case, a missing jobend date is taken to mean that the person remains employed in that job as of the interview date.

2. A person may have jobs that are held for overlapping time periods, leading to sometimes having more than one job. In that case, I count the combined time period as a single time period.

3. I know you want to put in the interview date for the final end date if no end date is specified for that last period. But nothing in what you've posted indicates where that comes from, so I'm just going to leave it missing.

4. To illustrate how the code works with respect to point 2, I have added a third person to your illustrative data, who has two jobs that overlap in time.

The following code, if my assumptions are correct, should give you what you want, or at least come close.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int pid byte jobseq int jobstarty byte jobstartm int jobendy byte jobendm
101 1 1988 10    .  .
101 1 1988 10 2001  6
101 2 2004  6 2006  8
102 1 1989  4 2004  6
102 2 2005  5    .  .
103 1 1991  7 1995  4
103 2 1993  8 1995  6
103 3 1995 11 1996 12
end

list, noobs clean

//    CREATE STATA MONTHLY DATE VARIABLES
foreach x in start end {
    gen month`x' = mofd(mdy(job`x'm, 1, job`x'y))
    format month`x' %tm
}

//    DROP OBSERVATIONS WITH MISSING END DATE, UNLESS
//    IT IS LAST FOR THE PERSON
by pid (monthstart), sort: drop if missing(monthend) & _n < _N
duplicates drop
keep pid jobseq month*
reshape long month, i(pid jobseq) j(event) string

// GENERATE NUMBER OF JOBS ACTIVE
by pid (month), sort: gen njobs = sum(event == "start") - sum(event == "end")
by pid (month): gen spellnum = 1 if (njobs > 0 & njobs[_n-1] == 0) | _n == 1
by pid (month): replace spellnum = sum(spellnum)

//    IDENTIFY FIRST AND LAST DATES IN EACH SPELL
//    AND REDUCE TO ONE OBSERVATION PER SPELL
egen earliest = min(cond(event == "start", month, .)), by(pid spellnum)
egen latest = max(cond(event == "end", month, .)), by(pid spellnum) 
format earliest latest %tm
by pid spellnum, sort: keep if _n == 1

//    NOW CREATE SPELLS OF UNEMPLOYMENT BETWEEN SPELLS OF EMPLOYMENT
expand 2
by pid spellnum, sort: gen employment_status = cond(_n == 1, "employed", ///
    "unemployed")
by pid (spellnum employment_status), sort: replace earliest = ///
    latest[_n-1] + 1 if employment_status == "unemployed"
by pid (spellnum employment_status), sort: replace latest = ///
    earliest[_n+1] - 1 if employment_status == "unemployed"
drop if missing(earliest) & missing(latest)
isid pid spellnum employment_status, sort
keep pid earliest latest employment_status

list, noobs clean ab(18)

Please note the use of -dataex- to generate the example data. In the future, please use it when you wish to show us data. It makes it very easy for someone who wants to try things with your data to get it into Stata. If you do not already have the -dataex- command installed in your Stata setup, run -ssc install dataex- and then read -help dataex- to learn how to use it (it's easy). Thanks.
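As a sketch of how assumption 3 could be relaxed once an interview date is available: the lines below fill in a missing final end date for an open employment spell. The variable intmonth (the monthly date of the last interview) is hypothetical, as are its values; nothing in the thread says what the real variable is called or how it would be merged in.

```stata
* Made-up spell data for pid 102; intmonth is a HYPOTHETICAL variable
* holding the month of the last interview (600 = 2010m1).
clear
input int pid float(earliest latest) str10 employment_status float intmonth
102 351 533 "employed"   600
102 534 543 "unemployed" 600
102 544   . "employed"   600
end
format earliest latest intmonth %tm

* Close the open employment spell at the interview month
replace latest = intmonth if missing(latest) & employment_status == "employed"

list, noobs clean ab(18)
```

Only spells that are both open and "employed" are touched, so an open "unemployed" spell would still end in a missing value unless you decide to treat it the same way.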

Comment

Guest
#19

11 Feb 2016, 04:53

Hi,

Thank you for your replies. It was practically midnight when I wrote my post, so I couldn't check it afterwards.

To begin with William Lisowski's questions:

It appears that you ignore the first row in the table you have now, for pid 101 jobseq 1 with the missing end date. Is it the case that you ignore any row with missing end date if there is another row for the same pid and jobseq and start date that does not have a missing end date?

Yes, that is exactly the case. I presume that pid 101 continued in job 1 if the job has no end date, until the next interview, where an end date is then recorded.

Your sample output does not have employment status for pid 101 after the end date of jobseq 2. Is that what you intended?

I forgot to put another row where there is a start date of jobseq3 of pid 101 and no job end date until the time of the interview.

Looking at the last row in the table you have now, for pid 102 jobseq 2 with the missing end date. Is it the case that for any row with missing end date and no other row for the same pid and jobseq you assume the individual continued to be employed at that job at the time of the interview?

Yes, that is the case and I assume that the individual continued to be employed at that job at the time of the last interview.

You show pid 101 as neither employed nor unemployed between August 2006 and the time of the interview. Is that correct?

It is like that because I haven't taken all the job seq for pid 101 on the table until the time of the last interview.

You show pid 101 as both employed and unemployed in June 2001 and in June 2004, and pid 102 in June 2004 and in May 2005. Is that really what you want? When I've done work like this, usually each month is counted as either employed or unemployed.

Yes, that is what I intended, although I wonder whether it is wise to do that or to start the unemployment spell in the month after jobendm.

You don't show sample data where the subject may have two jobs at the same time. Is that a possibility?

Yes, that is a possibility.
Comment
Guest
#20

11 Feb 2016, 05:02

Dear Clyde,

Thank you so much. It is possible in my dataset for an individual to have two or more jobs at the same time. In that case the total time period of being employed would suffice.

I will run the code above and check the output.

I am learning bit by bit how to ask questions on Statalist so that they are much clearer to anyone who tries to work through them.

I find it very helpful that you and William provide tips and comments on everything related to Stata, data sets and posting.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#21

11 Feb 2016, 09:11

Clyde has done an excellent job of working out a technique for what you need. But I think I see an area with potential to cause a problem.

Code:

//    DROP OBSERVATIONS WITH MISSING END DATE, UNLESS
//    IT IS LAST FOR THE PERSON
by pid (monthstart), sort: drop if missing(monthend) & _n < _N
duplicates drop
keep pid jobseq month*
reshape long month, i(pid jobseq) j(event) string

Given that individuals can have multiple jobs at the same time, I think we need to drop observations with missing end date unless it is the last observation for the given jobseq for the person. Here is what I think is a minimal change:

Code:

//    DROP OBSERVATIONS WITH MISSING END DATE, UNLESS
//    IT IS LAST FOR THE PERSON
by pid jobseq (monthend), sort: drop if missing(monthend) & _n < _N
duplicates drop
keep pid jobseq month*
reshape long month, i(pid jobseq) j(event) string

But the answers in #19 led to more questions. In the first answer in #19, you suggest that the file contains the results of multiple interviews. That throws a different light on the data.

Is it the case that if someone has multiple jobs, then jobseq will indicate the same job in both interviews? If it doesn't, you have a serious problem.

If someone has the same job in two interviews, are we certain that it will show the same jobstarty and jobstartm? If it doesn't, you need to figure out a rule for which one to use.

Do you have a variable that indicates either the interview date or an interview number (1, 2, ...)? I think you should include it in the data and should sort by it, along with pid and jobseq, and keep only the observation from the final interview for each individual and jobseq. The code I proposed above would become something like

Code:

//    DROP OBSERVATIONS WITH MISSING END DATE, UNLESS
//    IT IS LAST FOR THE PERSON
by pid jobseq (interviewdate), sort: drop if _n < _N
duplicates drop // this should not be necessary now, but it cannot hurt
keep pid jobseq month*
reshape long month, i(pid jobseq) j(event) string
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#22

11 Feb 2016, 09:28

I think William's correction to my code is right.
Comment

Guest

#23

11 Feb 2016, 10:06

Hi,

Thank you for your input. When I ran the code, I got an error from the reshape command.

Code:

by pid jobseq (monthend), sort: drop if missing(monthend) & _n < _N
duplicates drop
keep pid jobseq month*
reshape long month, i(pid jobseq) j(event) string

Error:

Code:

reshape long month, i(pid jobseq) j(event) string
(note: j = end start)
variable id does not uniquely identify the observations
    Your data are currently wide.  You are performing a reshape long.  You
    specified i(pid jobseq) and j(event).  In the current wide form, variable
    pid jobseq should uniquely identify the observations.  Remember this
    picture:

         long                                wide
        +---------------+                   +------------------+
        | i   j   a   b |                   | i   a1 a2  b1 b2 |
        |---------------| <--- reshape ---> |------------------|
        | 1   1   1   2 |                   | 1   1   3   2  4 |
        | 1   2   3   4 |                   | 2   5   7   6  8 |
        | 2   1   5   6 |                   +------------------+
        | 2   2   7   8 |
        +---------------+
    Type reshape error for a list of the problem observations.

To answer your questions:

Is it the case that if someone has multiple jobs, then jobseq will indicate the same job in both interviews?

Yes.

If someone has the same job in two interviews, are we certain that it will show the same jobstarty and jobstartm?

Yes, it is done as such according to the codebook.

I have attached a sample data set with this post.

Best,
Bibek

Attached Files

sample.dta (3.2 KB, 1 view)

Last edited by Bibek Sharma; 11 Feb 2016, 10:17.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#24

11 Feb 2016, 10:36

Thank you for the larger sample of data; it is very helpful. And the addition of the jobwave variable makes things more straightforward. This is the code I ran using your sample data (I changed the command that drops observations; all else is the same):

Code:

use sample, clear

//    CREATE STATA MONTHLY DATE VARIABLES
foreach x in start end {
    gen month`x' = mofd(mdy(job`x'm, 1, job`x'y))
    format month`x' %tm
}

//    DROP OBSERVATIONS WITH MISSING END DATE, UNLESS
//    IT IS LAST FOR THE PERSON AND JOB
by pid jobseq (jobwave), sort: drop if _n < _N
duplicates drop // this should not be necessary now, but it cannot hurt
keep pid jobseq month*
reshape long month, i(pid jobseq) j(event) string

// GENERATE NUMBER OF JOBS ACTIVE
by pid (month), sort: gen njobs = sum(event == "start") - sum(event == "end")
by pid (month): gen spellnum = 1 if (njobs > 0 & njobs[_n-1] == 0) | _n == 1
by pid (month): replace spellnum = sum(spellnum)

//    IDENTIFY FIRST AND LAST DATES IN EACH SPELL
//    AND REDUCE TO ONE OBSERVATION PER SPELL
egen earliest = min(cond(event == "start", month, .)), by(pid spellnum)
egen latest = max(cond(event == "end", month, .)), by(pid spellnum) 
format earliest latest %tm
by pid spellnum, sort: keep if _n == 1

//    NOW CREATE SPELLS OF UNEMPLOYMENT BETWEEN SPELLS OF EMPLOYMENT
expand 2
by pid spellnum, sort: gen employment_status = cond(_n == 1, "employed", ///
    "unemployed")
by pid (spellnum employment_status), sort: replace earliest = ///
    latest[_n-1] + 1 if employment_status == "unemployed"
by pid (spellnum employment_status), sort: replace latest = ///
    earliest[_n+1] - 1 if employment_status == "unemployed"
drop if missing(earliest) & missing(latest)
isid pid spellnum employment_status, sort
keep pid earliest latest employment_status

list, noobs clean ab(18)

and these are the results

Code:

    pid   earliest    latest   employment_status  
    101    1988m10    2001m6            employed  
    101     2001m7    2001m5          unemployed  
    101     2001m6    2003m1            employed  
    101     2003m2    2004m9          unemployed  
    101    2004m10    2006m8            employed  
    101     2006m9    2006m7          unemployed  
    101     2006m8         .            employed  
    102     1989m3    2004m6            employed  
    102     2004m7    2006m1          unemployed  
    102     2006m2   2008m12            employed  
    102     2009m1    2010m1          unemployed  
    102     2010m2         .            employed  
    201     1975m3   2008m12            employed  
    201     2009m1         .          unemployed  
    202     2002m3         .            employed  
    203     2001m9    2002m5            employed  
    203     2002m6    2003m4          unemployed  
    203     2003m5   2003m12            employed  
    203     2004m1         .          unemployed  
    301     1985m3    1987m2            employed  
    301     1987m3    1987m2          unemployed  
    301     1987m3    1995m6            employed  
    301     1995m7    1995m5          unemployed  
    301     1995m6         .            employed  
    401     1996m4         .            employed  
    402     1992m6    1997m6            employed  
    402     1997m7    2001m9          unemployed  
    402    2001m10         .            employed  
    501     1995m4    1997m1            employed  
    501     1997m2    1997m1          unemployed  
    501     1997m2    1998m6            employed  
    501     1998m7    1998m6          unemployed  
    501     1998m7         .            employed  
    601    1958m10   1997m12            employed  
    601     1998m1         .          unemployed  
    602     1953m3    1955m3            employed  
    602     1955m4    1955m3          unemployed  
    602     1955m4    1958m9            employed  
    602    1958m10    1958m9          unemployed  
    602    1958m10   1997m12            employed  
    602     1998m1    1999m8          unemployed  
    602     1999m9    2000m4            employed  
    602     2000m5    2008m2          unemployed  
    602     2008m3    2010m2            employed  
    602     2010m3    2010m2          unemployed  
    602     2010m3         .            employed  
    603     1991m3    1992m9            employed  
    603    1992m10    1998m7          unemployed  
    603     1998m8   1999m12            employed  
    603     2000m1         .          unemployed  
    604     1996m6         .            employed  
    701     2000m3         .            employed  
    801     1997m9    1999m1            employed  
    801     1999m2    1999m1          unemployed  
    801     1999m2   1999m11            employed  
    801    1999m12   1999m11          unemployed  
    801    1999m12   2004m12            employed  
    801     2005m1         .          unemployed

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#25

11 Feb 2016, 10:37

The more we delve into your data, the more confusing it seems to get. The -reshape- command has identified that your data can have more than one observation with the same pid and jobseq. And looking at your sample, it seems that's true. You also have a variable jobwave in the sample you attached. Within the sample, the combination of pid jobseq and jobwave does uniquely identify observations. Is that true in your full data set? (Run -isid pid jobseq jobwave- and tell us what Stata responds.)

The organization of your data is still baffling. For example, your first three observations, although they correspond to different values of jobwave, have the same data entered, and all three of them are, as best I can tell, useless: the starting dates they have are also present in the fourth observation, and these three have no end date. So why are they even there? What do jobwave and jobseq denote? Why are the numbers in jobwave not consecutive? What is going on?

As I don't understand what your data mean, I don't want to try to "correct" my code, as I am just likely to make things worse. And, unfortunately, my confusion about your data only grows.

Note: This crossed in cyberspace with William Lisowski's post #24. I'm glad to see that he has managed to grasp your data and has crafted a workable solution!

Last edited by Clyde Schechter; 11 Feb 2016, 10:40. Reason: Add final paragraph
Comment
Guest
#26

11 Feb 2016, 12:14

Thank you so much William. I will run the commands with the original data set.

Clyde,

I ran the code

Code:

isid pid jobseq jobwave

and the output was

variables pid jobseq jobwave do not uniquely identify the observations
r(459);

The first wave of data was collected in 1998, the second wave in 1999, and so on. The data set is thus built up by adding information to that of the original households recorded in the first survey in 1998. Some members of an original household might have moved on to create a new household, and that new household and its members are also added to the data set.
Job sequence denotes whether the person is still in his/her first job or has started a new one since the first collection of the data.

The data set is a bit complicated but also has been a huge learning experience for me.
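Since -isid- reports that pid, jobseq and jobwave do not uniquely identify the observations, one possible next step is to look at which observations collide. The sketch below uses made-up data (not the poster's file) and the official -duplicates- command; the variable name dup is just an illustrative choice.

```stata
* Made-up data in which pid 101, jobseq 1, jobwave 1 appears twice
clear
input int pid byte jobseq byte jobwave int jobstarty
101 1 1 1988
101 1 1 1988
101 2 3 2004
102 1 1 1989
end

* How many surplus observations are there for each key combination?
duplicates report pid jobseq jobwave

* Flag every observation that shares its key with at least one other
duplicates tag pid jobseq jobwave, generate(dup)
list if dup > 0, sepby(pid jobseq jobwave)
```

If the flagged observations turn out to be identical on every variable, -duplicates drop- would remove the surplus copies; if they differ, a rule for choosing between them is needed first.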
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#27

11 Feb 2016, 13:54

Let me perhaps add a little to your learning. Please accept my apologies if you already understand what I am writing.

The type of data you have - observations of individuals repeated at multiple times - is called "longitudinal data" or "panel data" in Stata terminology. Stata has a rich set of commands designed specifically for dealing with this type of data: see the Stata Longitudinal-Data/Panel-Data Reference Manual PDF included in the Stata installation (since version 11) and accessible from within Stata - for example, through Stata's Help menu.

Realizing that you have panel data was an important key to fully understanding your data and your problem. For most or all of the questions I asked back at post #16, I could have guessed the answers had I known this was panel data.

Looking at your earlier questions on Statalist I see that you have used tsset in the past, which is much the same as xtset, but nothing I saw was explicitly about having panel data as we see it in this example. So I thought I should point this out to you since, as you say, this is a learning experience for you.

And it will continue to be, if indeed you are new to panel data. It's the type of data I'm currently working with, and the problems are never-ending, because repeated survey data offers a seemingly infinite number of combinations of possible responses, and many of those combinations are ones you will not have thought of (at best) or are inconsistent (at worst). The only uncomplicated panel data I have seen appears in textbooks and was made up to have no problems.

Good luck!
Comment
Guest
#28

13 Feb 2016, 04:10

Thank you so much for your advice. I knew that I was dealing with "panel data" but as you assumed it is my first time working with this type of data set.
I did not take into account that the data could be this complex. I had in mind something more like the made-up data that appears in textbooks.

I will keep your words in mind as I still have more than a half way to go in regards to my research.

Best,
Bibek
Comment
Guest
#29

10 Nov 2016, 02:33

Hi,

I am writing in an older thread since I have a question regarding #18 and #24. In the code that Clyde and William have written, the missing values generated by Stata vary every time I re-run it on my dataset. The line in question is shown below. Is it due to a sort issue, in that Stata generates random numbers every time?

Code:

by pid (month): gen spellnum = 1 if (njobs > 0 & njobs[_n-1] == 0) | _n == 1

Thank you.

Best,
Bibek
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#30

10 Nov 2016, 02:43

Cross-reference to your previous thread: http://www.statalist.org/forums/foru...des-are-re-run

Yes; the same symptoms have the same first guess at diagnosis from every Stata clinician.

Note that the description is not that Stata generates random numbers; it's just that the sort order pid month does not uniquely fix the previous value of njobs for each observation in blocks of distinct pid month. I see no data here, but I'd guess that multiple entries per month for some individuals are the key here.
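To illustrate the point about ties: if one job ends and another starts in the same month, the two events share the same pid and month, and Stata's sort may place them in either order from run to run, which changes njobs[_n-1]. A minimal sketch of one way to make the result reproducible is to add an explicit tie-breaker to the sort. The data are made up, the variable name order is hypothetical, and the choice to process "start" before "end" within a month is an assumption (it joins back-to-back jobs into one spell; the opposite choice would split them).

```stata
* Made-up events: job 1 runs 2000m1-2001m1, job 2 runs 2001m1-2001m9
clear
input int pid str5 event float month
1 "start" 480
1 "end"   492
1 "start" 492
1 "end"   500
end
format month %tm

* Tie-breaker (assumption): within a month, starts come before ends
gen byte order = (event == "end")

by pid (month order), sort: gen njobs = sum(event == "start") - sum(event == "end")
by pid (month order): gen spellnum = 1 if (njobs > 0 & njobs[_n-1] == 0) | _n == 1
by pid (month order): replace spellnum = sum(spellnum)

list, noobs clean
```

With the tie-breaker in place, the order of observations within each pid and month is fixed for the events that matter, so re-running the code on unchanged data should give the same spellnum each time.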
Comment
