Creating Cohorts in Panel Data Where Populations Enter and Exit the Data

Chris Daigle

Join Date: Sep 2015

Posts: 13
#1

21 Feb 2016, 17:39

Software:
OSX, Stata 13.1

Problem:
To run regressions on my data, I need a unique identifier for each cohort. My data are a panel covering the years 2007 through 2013, grades 3 through 12, and multiple districts. I want to create the identifier from three variables: year, grade, and district. I then hope to run regressions using the variables year and cohort, which are described below.

Data Information and problem elaborated:
In my data, cohorts enter and exit the data set. Cohorts exit because any district's grade 12 in one year is not present in that district the next year (year 1, grade 12, district 1 is not the same group of people as year 2, grade 12, district 1). Cohorts enter because the lowest grade (grade 3) in any district in one year is not the same group of people as grade 3 in the next year: those in grade 3 in one year move to grade 4 in the next, so year+1 needs a new cohort identifier for grade 3.

An example (how I want it to look):
Year: 2007, District: 1, Grade: 3, Cohort: 1
Year: 2007, District: 1, Grade: 4, Cohort: 2
Year: 2007, District: 1, Grade: 5, Cohort: 3
Year: 2007, District: 1, Grade: 6, Cohort: 4
Year: 2007, District: 1, Grade: 7, Cohort: 5
Year: 2007, District: 1, Grade: 8, Cohort: 6
Year: 2007, District: 1, Grade: 9, Cohort: 7
Year: 2007, District: 1, Grade: 10, Cohort: 8
Year: 2007, District: 1, Grade: 11, Cohort: 9
Year: 2007, District: 1, Grade: 12, Cohort: 10
Year: 2008, District: 1, Grade: 3, Cohort: 11
Year: 2008, District: 1, Grade: 4, Cohort: 1
Year: 2008, District: 1, Grade: 5, Cohort: 2
Year: 2008, District: 1, Grade: 6, Cohort: 3
Year: 2008, District: 1, Grade: 7, Cohort: 4
Year: 2008, District: 1, Grade: 8, Cohort: 5
Year: 2008, District: 1, Grade: 9, Cohort: 6
Year: 2008, District: 1, Grade: 10, Cohort: 7
Year: 2008, District: 1, Grade: 11, Cohort: 8
Year: 2008, District: 1, Grade: 12, Cohort: 9

Variables:
Year: the year in which an observation occurs; ranges from 2007 to 2013 (7 unique values, consecutive)
District: a numeric code identifying a specific district (the codes are not sequential, i.e. they don't run 1 2 3 4...; the number of districts differs from year to year)
Grade: signifies the grade number of students (covers grades 3:12, sequentially)
Cohort (desired variable to create): will be some variable that uniquely identifies a population, over time, throughout the dataset's time span

Any help is greatly appreciated and I am open to answering any questions I can.

Thank you!
Tags: cohort, data, loop, panel data, Suggestion

Clyde Schechter

Join Date: Apr 2014
Posts: 30357

21 Feb 2016, 18:51

I believe the following works, provided that all grades 3-12 are represented in each year. It isn't clear if you want District 2's cohorts to restart numbering at 1 or to pick up where District 1 leaves off. The code below begins by generating numbers that restart at 1 with each district. The last few lines will adjust that to continue consecutive numbers if that's what you want. In the data below the district numbers are generated at random to satisfy your description that they are not simply 1, 2, .... The code at the end then generates sequential numbers starting with 1 to correspond, but you can drop that n_district variable later.

Code:

clear*

// GENERATE DATA SET WITH 2 DISTRICTS AND
// 4 YEARS TO ILLUSTRATE
set obs 10
set seed 54321
gen byte grade = _n + 2
expand 4
by grade, sort: gen year = 2006 + _n
expand 2
by grade year, sort: gen n_district = _n
by n_district, sort: gen district = rpoisson(30) if _n == 1
by n_district: replace district = district[1]
drop n_district
list, noobs clean


// GENERATE COHORT NUMBERS, STARTING AT 1
// IN EACH DISTRICT
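sort district year grade
// IN 2007 THIS GIVES 1 (GRADE 3) THROUGH 10 (GRADE 12);
// GRADE - YEAR IS CONSTANT AS A COHORT MOVES UP, SO THE NUMBER STICKS
gen int cohort = (grade-year) + 2005
// COHORTS ENTERING GRADE 3 AFTER 2007 COME OUT AS 0, -1, ...;
// REMAP THEM TO 11, 12, ...
replace cohort = 11-cohort if cohort < 1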

// IF COHORTS NEED TO HAVE DISTINCTIVE NUMBERING
// IN DIFFERENT DISTRICTS THEN ALSO DO THE FOLLOWING:
by district, sort: gen n_district = 1 if _n == 1
replace n_district = sum(n_district)
quietly summ cohort
replace cohort = cohort + (n_district-1)*`r(max)'

list, noobs clean
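
As a possible alternative to the arithmetic above (a sketch only, not a replacement for it): because a cohort moves up one grade per year, year minus grade is constant within a cohort, so egen's group() function can number cohorts directly from the district and that difference. The names entry_year and cohort2 and the district codes 17 and 42 are made up for illustration. The numbers assigned will differ from the ones above (they follow sort order of district and entry year), but they identify the same groups and do not depend on the panel starting in 2007.

```stata
* Toy data: 2 districts x 4 years x grades 3-12
* (district codes 17 and 42 are invented for this example)
clear
set obs 2
gen int district = cond(_n == 1, 17, 42)
expand 4
bysort district: gen int year = 2006 + _n
expand 10
bysort district year: gen byte grade = 2 + _n

* Year in which each cohort was (or would have been) in grade 3
gen int entry_year = year - (grade - 3)

* One consecutive integer per district-by-entry-year combination
egen long cohort2 = group(district entry_year)

* Check: within a cohort, grade rises by exactly 1 from one year to the next
bysort cohort2 (year): assert grade == grade[_n-1] + 1 if _n > 1
```

If a year is missing for some district, entry_year still ties the remaining observations of a cohort together, though the final check would then need to allow for gaps.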

Chris Daigle

Join Date: Sep 2015

Posts: 13
#3

21 Feb 2016, 19:47

Clyde,

Thank you so very much! That did it!

Thank you,
Chris
Dani Vasquez

Join Date: Nov 2017
Posts: 5

25 Dec 2017, 10:09

Dear Clyde,
dear Statalist,

I am having trouble creating cohorts and would appreciate any help. Below is an extract of my household data (a pseudo-panel; not the same individuals each year) covering several years. I want to create cohorts from several variables, e.g. gender (1,2), ethnic background (1,2,3) and locality (1,2,3). In theory, that gives 2 x 3 x 3 = 18 cohorts for each year. Is there a way for Stata to do that for me?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(id h1 h2 h3) int year
1950 1 3 1 2008
1950 2 3 1 2008
1950 2 2 1 2009
1950 1 2 1 2009
1950 2 3 1 2009
1950 1 3 1 2009
5001 1 2 1 2010
5001 1 2 1 2010
5001 2 1 1 2010
5001 2 2 1 2010
5001 1 2 1 2010
5001 2 1 1 2011
5002 2 1 1 2011
5002 1 2 1 2011
5002 2 1 1 2011
5002 1 2 1 2011
5003 2 2 1 2012
5003 1 2 1 2012
5003 2 2 1 2012
5003 1 2 1 2012
end
format %ty year
label variable id "household id"
label variable h1 "gender"
label variable h2 "ethnicity"
label variable h3 "locality"

Last edited by Dani Vasquez; 25 Dec 2017, 10:11.

Chris Daigle

Join Date: Sep 2015
Posts: 13

04 Jan 2018, 06:00

Hi Dani,

Have you looked into the expand command?

Maybe this will help with duplicating specific observations: https://www.stata.com/statalist/arch.../msg01039.html

Code:

sysuse auto, clear
sort make
l in 5/7
*duplicate number 6
expand 2 in 6
sort make
l in 5/8

For your case, though:

Code:

clear
input double(id h1 h2 h3) int year
1950 1 3 1 2008
1950 2 3 1 2008
1950 2 2 1 2009
1950 1 2 1 2009
1950 2 3 1 2009
1950 1 3 1 2009
5001 1 2 1 2010
5001 1 2 1 2010
5001 2 1 1 2010
5001 2 2 1 2010
5001 1 2 1 2010
5001 2 1 1 2011
5002 2 1 1 2011
5002 1 2 1 2011
5002 2 1 1 2011
5002 1 2 1 2011
5003 2 2 1 2012
5003 1 2 1 2012
5003 2 2 1 2012
5003 1 2 1 2012
end

gen gender = 1
expand 2, generate(ethnicity)
replace gender = 0 if ethnicity == 1
expand 2, generate(locality)
drop if gender == 1 & locality == 1
replace ethnicity = 0 if locality == 1
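
If the aim is only to give each gender x ethnicity x locality combination its own identifier, rather than to duplicate observations, egen's group() function may be enough. The sketch below uses a few rows from the extract and assumes h1 is gender, h2 is ethnicity and h3 is locality, as described in the question; the names cohort and cohort_year are illustrative. Note that group() numbers only the combinations that actually occur in the data, so an extract like this one yields fewer than the 18 theoretical cohorts.

```stata
* A few rows taken from the extract in the question
clear
input double(id h1 h2 h3) int year
1950 1 3 1 2008
1950 2 3 1 2008
1950 2 2 1 2009
5001 1 2 1 2010
5001 2 1 1 2010
5003 1 2 1 2012
end

* One number per observed gender x ethnicity x locality combination
egen cohort = group(h1 h2 h3)

* One number per cohort-year cell, in case cells are the unit of analysis
egen cohort_year = group(cohort year)

* Cell counts show which combinations appear in which years
tabulate cohort year
```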

I hope that helps!

Last edited by Chris Daigle; 04 Jan 2018, 06:03.
