  • Creating Cohorts in Panel Data Where Populations Enter and Exit the Data

    Software:
    OSX, Stata 13.1

    Problem:
    To run regressions on my data, I need to generate a unique identifier. My data are a panel covering the years 2007 to 2013, grades 3 through 12, and multiple districts. I want to create a unique identifier based on three variables: year, grade, and district. I then hope to regress on the variables year and cohort; these variables are described below.

    Data Information and problem elaborated:
    In my data, cohorts enter and exit the data set. Cohorts exit because any district's grade 12 in one year is no longer present in that district the next year (year 1, grade 12, district 1 is not the same group of people as year 2, grade 12, district 1). Cohorts enter because the lowest grade (grade 3) in any district is a new group of people each year: those in grade 3 in one year move to grade 4 the next, so year+1 needs a new cohort identifier for grade 3.

    An example (how I want it to look):
    Year: 2007, District: 1, Grade: 3, Cohort: 1
    Year: 2007, District: 1, Grade: 4, Cohort: 2
    Year: 2007, District: 1, Grade: 5, Cohort: 3
    Year: 2007, District: 1, Grade: 6, Cohort: 4
    Year: 2007, District: 1, Grade: 7, Cohort: 5
    Year: 2007, District: 1, Grade: 8, Cohort: 6
    Year: 2007, District: 1, Grade: 9, Cohort: 7
    Year: 2007, District: 1, Grade: 10, Cohort: 8
    Year: 2007, District: 1, Grade: 11, Cohort: 9
    Year: 2007, District: 1, Grade: 12, Cohort: 10
    Year: 2008, District: 1, Grade: 3, Cohort: 11
    Year: 2008, District: 1, Grade: 4, Cohort: 1
    Year: 2008, District: 1, Grade: 5, Cohort: 2
    Year: 2008, District: 1, Grade: 6, Cohort: 3
    Year: 2008, District: 1, Grade: 7, Cohort: 4
    Year: 2008, District: 1, Grade: 8, Cohort: 5
    Year: 2008, District: 1, Grade: 9, Cohort: 6
    Year: 2008, District: 1, Grade: 10, Cohort: 7
    Year: 2008, District: 1, Grade: 11, Cohort: 8
    Year: 2008, District: 1, Grade: 12, Cohort: 9

    Variables:
    Year: the year in which an observation occurs; ranges from 2007 to 2013 (7 unique values, sequential)
    District: a numeric code identifying a specific district (the codes are not sequential, i.e. not 1, 2, 3, 4, ...; the number of districts differs from year to year)
    Grade: the students' grade (grades 3 through 12, sequential)
    Cohort (desired variable to create): a variable that uniquely identifies a population over time, throughout the dataset's time span

    Any help is greatly appreciated and I am open to answering any questions I can.

    Thank you!

  • #2
    I believe the following works, provided that all grades 3-12 are represented in each year. It isn't clear if you want District 2's cohorts to restart numbering at 1 or to pick up where District 1 leaves off. The code below begins by generating numbers that restart at 1 with each district. The last few lines will adjust that to continue consecutive numbers if that's what you want. In the data below the district numbers are generated at random to satisfy your description that they are not simply 1, 2, .... The code at the end then generates sequential numbers starting with 1 to correspond, but you can drop that n_district variable later.

    Code:
    clear*
    
    // GENERATE DATA SET WITH 2 DISTRICTS AND
    // 4 YEARS TO ILLUSTRATE
    set obs 10
    set seed 54321
    gen byte grade = _n + 2
    expand 4
    by grade, sort: gen year = 2006 + _n
    expand 2
    by grade year, sort: gen n_district = _n
    by n_district, sort: gen district = rpoisson(30) if _n == 1
    by n_district: replace district = district[1]
    drop n_district
    list, noobs clean
    
    
    // GENERATE COHORT NUMBERS, STARTING AT 1
    // IN EACH DISTRICT
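    // grade - year is constant as a cohort moves up one grade per year,
    // so (grade-year)+2005 numbers the 2007 cohorts 1-10 (grade 3 = 1)
    sort district year grade
    gen int cohort = (grade-year) + 2005
    // cohorts entering after 2007 come out as 0, -1, ...; renumber them 11, 12, ...
    replace cohort = 11-cohort if cohort < 1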
    
    // IF COHORTS NEED TO HAVE DISTINCTIVE NUMBERING
    // IN DIFFERENT DISTRICTS THEN ALSO DO THE FOLLOWING:
    by district, sort: gen n_district = 1 if _n == 1
    replace n_district = sum(n_district)
    quietly summ cohort
    replace cohort = cohort + (n_district-1)*`r(max)'
    
    list, noobs clean
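
    As a cross-check on the idea behind the formula, here is a minimal sketch, assuming every cohort advances exactly one grade per year. Under that assumption year - grade is constant within a cohort, so it can be grouped together with district using egen's group() function. The toy district code 17 and the variable name entry are illustrative assumptions, and the resulting numbers will not match the numbering in the original example (they follow the sort order of district and year - grade), although they identify the same groups.

```stata
clear
// Toy data: one hypothetical district (code 17), grades 3-12, years 2007-2008
set obs 10
gen byte grade = _n + 2
expand 2
by grade, sort: gen int year = 2006 + _n
gen int district = 17

// year - grade stays fixed as a cohort moves up one grade per year
gen int entry = year - grade

// one cohort number per distinct (district, entry) pair
egen long cohort = group(district entry)

sort district year grade
list, noobs clean
```

    Because this does not depend on grade 3 starting at a particular year, it may also tolerate districts that are missing in some years, though that is untested here.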



    • #3
      Clyde,

      Thank you so very much! That did it!

      Thank you,
      Chris



      • #4
        Dear Clyde,
        dear Statalist,

        I am having some trouble creating cohorts and would appreciate any help. Below is an extract of my household data (a pseudo-panel; not the same individuals each year) covering x years. I want to create cohorts from several variables, e.g. gender (1,2), ethnic background (1,2,3) and locality (1,2,3). In theory, that gives 18 cohorts for each year. Is there a way for Stata to do that for me?

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input double(id h1 h2 h3) int year
        1950 1 3 1 2008
        1950 2 3 1 2008
        1950 2 2 1 2009
        1950 1 2 1 2009
        1950 2 3 1 2009
        1950 1 3 1 2009
        5001 1 2 1 2010
        5001 1 2 1 2010
        5001 2 1 1 2010
        5001 2 2 1 2010
        5001 1 2 1 2010
        5001 2 1 1 2011
        5002 2 1 1 2011
        5002 1 2 1 2011
        5002 2 1 1 2011
        5002 1 2 1 2011
        5003 2 2 1 2012
        5003 1 2 1 2012
        5003 2 2 1 2012
        5003 1 2 1 2012
        end
        format %ty year
        label variable id "household id"
        label variable h1 "gender"
        label variable h2 "ethnicity"
        label variable h3 "locality"
        Last edited by Dani Vasquez; 25 Dec 2017, 10:11.



        • #5
          Hi Dani,

          Have you looked into the expand command?

          Maybe this will help with duplicating specific observations: https://www.stata.com/statalist/arch.../msg01039.html

          Code:
          sysuse auto, clear
          sort make
          l in 5/7
          *duplicate number 6
          expand 2 in 6
          sort make
          l in 5/8
          For your case, though:

          Code:
          clear
          input double(id h1 h2 h3) int year
          1950 1 3 1 2008
          1950 2 3 1 2008
          1950 2 2 1 2009
          1950 1 2 1 2009
          1950 2 3 1 2009
          1950 1 3 1 2009
          5001 1 2 1 2010
          5001 1 2 1 2010
          5001 2 1 1 2010
          5001 2 2 1 2010
          5001 1 2 1 2010
          5001 2 1 1 2011
          5002 2 1 1 2011
          5002 1 2 1 2011
          5002 2 1 1 2011
          5002 1 2 1 2011
          5003 2 2 1 2012
          5003 1 2 1 2012
          5003 2 2 1 2012
          5003 1 2 1 2012
          end
          
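          // start every original observation with gender = 1
          gen gender = 1
          // duplicate each observation; ethnicity = 0 for originals, 1 for copies
          expand 2, generate(ethnicity)
          replace gender = 0 if ethnicity == 1
          // duplicate again; locality = 0 for existing rows, 1 for the new copies
          expand 2, generate(locality)
          drop if gender == 1 & locality == 1
          replace ethnicity = 0 if locality == 1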
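
          As a possible alternative sketch, assuming the aim is simply an identifier for each observed combination of h1, h2, and h3 rather than extra rows, egen's group() function may be closer to what you want. The shortened data and the variable names cohort and cohort_year are illustrative assumptions.

```stata
clear
input byte(h1 h2 h3) int year
1 3 1 2008
2 3 1 2008
2 2 1 2009
1 2 1 2009
2 1 1 2010
1 2 1 2010
end

// one identifier per observed combination of gender (h1),
// ethnicity (h2) and locality (h3), pooled over all years
egen cohort = group(h1 h2 h3)

// one identifier per combination within each survey year
egen cohort_year = group(year h1 h2 h3)

list, noobs sepby(year)
```

          Note that group() only numbers combinations that actually occur in the data, so the count can fall below the theoretical 18.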

          I hope that helps!
          Last edited by Chris Daigle; 04 Jan 2018, 06:03.
