Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Summing up durations over a 5-year-window

    Hello,

    I'm using episode-based data with daily information (see screenshot below, or download sample file here: http://doku.iab.de/fdz/siab/SIAB7510...aten_Stata.zip) . Broadly speaking, the database contains employment and unemployment spells for many different individuals over a period of about 35 years.

    Now, I would like to generate a variable that contains the days of unemployment over a 5-year-window before the corresponding observation. I already created a variable which shows the unemployment duration of a specific unemployment spell ("unempl_dauer" in the screenshot) and a variable which sums up the duration of all unemployment spells up to this point ("tage_unempl" in the screenshot). Unfortunately, I'm struggling to generate a variable that is like "unempl_dauer" but ignores the duration of unemployment spells older than 5 years.

    I would be very grateful for any suggestions how I can solve my problem.

    Best regards,
    Sebastian

  • #2
    Please do read http://www.statalist.org/forums/help#stata and act upon it to give us a data example file using dataex (SSC) . It's asking a lot to expect people to open up a .zip file and to flip back and forth between a screenshot photo and your post. I can't easily read the screenshot any way.

    Comment


    • #3
      Thank you very much for your quick answer. You're right, the output of dataex is much better than the screenshot:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long(persnr betnr) int(begepi endepi) float(unempl_dauer tage_unempl)
      90593000       .n 14077 14172  96  96
      90593000       .n 14173 14847   0  96
      90593024       .n 15212 15255   0   0
      90593024       .n 15300 15333  34  34
      90593030 50284864 14757 14930   0   0
      90593030 50139714 14947 14975   0   0
      90593030 50139714 14976 15070   0   0
      90593030 50139714 16118 16172   0   0
      90593030       .n 16191 16436 246 246
      90593030 50017841 16437 16539   0 246
      90593030       .n 16437 16539   0 246
      90593030 50017841 16540 16801   0 246
      90593030 50017841 16802 17166   0 246
      90593030 50017841 17167 17439   0 246
      90593030 50017841 17440 17531   0 246
      90593030 50017841 17532 17897   0 246
      90593030 50017841 17898 18262   0 246
      90593030 50017841 18263 18370   0 246
      90593030 50435853 18371 18420   0 246
      90593030 50017841 18421 18442   0 246
      90593030 50017841 18443 18572   0 246
      end
      format %d begepi
      format %d endepi
      label values betnr miss_de
      label def miss_de .n ".n  fehlt syst.", modify

      begepi = the start of the spell (in days)
      endepi = the end of the spell (in days)
      persnr = an id for each individuals
      betnr = an id for each firm; if it is not an employment spell the value is missing (.n)
      unempl_dauer = the duration of the unemployment spell (in days), it is 0 if it is not an unemployment spell (but an employment spell)
      tage_unempl = the sum of the days in unemployment before the start of the current observation

      tage_unempl is almost what I want. but I'd like to generate an additional variable which sums up only the unemployment duration within a five-year window before the observation and not for the entire career of each individual.

      Thanks again.

      Best regards,
      Sebastian

      Comment


      • #4
        Thanks for posting with -dataex-. This is a one-liner if you use -rangestat-, a new command by Robert Picard, Roberto Ferrer, and Nick Cox. You can get it with -ssc install rangestat-. Then all you need is:

        Code:
        local five_years_ago = int(-5*365.25)
        
        rangestat (sum) unempl_dauer, interval(begepi `five_years_ago' 0) by(persnr)
        I interpreted "five year window before the observation" to mean starting from the begepi date and working back five years, including the current episode. See the help for rangestat for details of how the -interval()- option works, and you can modify this if you the reference date should be the endepi, or if you want to exclude that observation itself, etc.

        Comment


        • #5
          Clyde, I'm not sure this correctly manages the time windows. More than likely, there will be some observations where the 5 year window starts after the beginning of the employment spell and before the end of it. You can't target this directly with rangestat without filling in the data with daily observations. Here's an alternative that uses rangejoin, rangestat's little cousin. To install it, type in Stata's Command window:

          Code:
          ssc install rangejoin
          I tweaked the example to split the employment spell with 246 days. I also added an observation that require pro-rating the days.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long(persnr betnr) int(begepi endepi) float(unempl_dauer tage_unempl)
          90593000       .n 14077 14172  96  96
          90593000       .n 14173 14847   0  96
          90593024       .n 15212 15255   0   0
          90593024       .n 15300 15333  34  34
          90593030 50284864 14757 14930   0   0
          90593030 50139714 14947 14975   0   0
          90593030 50139714 14976 15070   0   0
          90593030 50139714 16118 16172   0   0
          90593030       .n 16191 16400 210 246
          90593030       .n 16401 16436  36 246
          90593030 50017841 16437 16539   0 246
          90593030       .n 16437 16539   0 246
          90593030 50017841 16540 16801   0 246
          90593030 50017841 16802 17166   0 246
          90593030 50017841 17167 17439   0 246
          90593030 50017841 17440 17531   0 246
          90593030 50017841 17532 17897   0 246
          90593030 50017841 17898 18200   0 246
          90593030 50017841 18201 18370   0 246
          90593030 50017841 18263 18370   0 246
          90593030 50435853 18371 18420   0 246
          90593030 50017841 18421 18442   0 246
          90593030 50017841 18443 18572   0 246
          end
          format %d begepi
          format %d endepi
          label values betnr miss_de
          label def miss_de .n ".n  fehlt syst.", modify
          
          * you should an observation identifier
          gen obsid = _n
          
          * save observations with spells of unemployments
          preserve
          keep obsid persnr endepi unempl_dauer
          keep if unempl_dauer != 0
          save "unempl_dauer.dta", replace
          restore
          
          * pair each observations with employment spells in 5 year window
          gen low = begepi - 365 * 5
          gen high = begepi - 1
          format %td low high
          rangejoin endepi low high using "unempl_dauer.dta", by(persnr)
          
          * the number of days between endepi_U and the target window
          gen days = endepi_U - low
          
          * the number of days to count
          gen days2use = cond(days > unempl_dauer_U, unempl_dauer_U, days)
          
          * add up employment spells within the window and reduce to the
          * original observations
          bysort obsid: egen unemp5yr = total(days2use)
          by obsid: keep if _n == 1

          Comment


          • #6
            I think it's unclear exactly what Sebastian wants. Since he described his problem as involving summing up values of unempl_dauer, I took his goal to be simply restricting that sum to those episodes that began within the five year window. (Even that is somewhat ambiguous, because he doesn't say whether he wants to include the current observation, but that is a refinement detail.) Robert Picard's approach is more fine-grained and counts actual days of unemployment in the five year period, including days from episodes that began prior to 5 years ago, but extend into it. I think for many purposes Robert Picard's approach produces a more useful statistic, but I can also think of situations where the beginning-of-episode based total would be desired.

            Well, ultimately, it's up to Sebastian which he needs.

            Comment


            • #7
              Sorry for being ambiguous in my question. However, both of your replies were very helpful nevertheless. Robert's code gives me exactly what I want. Thank you both!

              Comment


              • #8
                Unfortunately, I have to re-open this thread. After using the code on my test file, everything was fine. However, I tried to run my analysis on the real data now, and since this dataset is large (and by that I mean very large - about 60 million observations) my operating system does not provide enough memory for the calculation. I cannot modify the available memory because I have to run the analysis on a computer/server of the German Federal Employment Agency.

                My observation period ranges from 2000 to 2014. I already dropped every individual that has no observations in this period. I also excluded spells ending before the 1st January of 1995 from the "unempl_dauer.dta" file because earlier information will not be needed for the 5-year window. The -rangejoin- command, though, still needs to much memory.

                I cannot exclude all older observations from the main file at this stage because information from, for example, 1980 will be necessary to extrapolate/impute values for some variables later.

                Maybe someone here knows an approach which solves the problem.

                Any answer will be appreciated. Thanks in advance!

                Best regards,
                Sebastian
                Last edited by Sebastian Geiger; 08 Jun 2016, 04:10.

                Comment


                • #9
                  Assuming you have enouigh memory to hold the final result and that the problem arises due to need for temporary memory during the calculations, what about breaking up the file? For some appropriate value of N, put the first N people into one file, the second N people into the next, etc. Then process each file separately, and at the end, append all the results together.

                  Comment


                  • #10
                    I concur with Clyde, the rangejoin approach can be done in parts.

                    I had a second look at the problem and I think that I came up with a solution that works with rangestat. The approach isolates dates from unemployment spells. The c(minfloat) value is used for observations with employment spells. This way, these observations are never picked-up by rangestat as they are never within the 5 year window. The unemployments spells are summed based on the end date of the spell. The earliest unemployment spell start date is also collected. If an unemployment spell starts before the start of the 5 year window, then the days to remove can be calculated using the offset:

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input long(persnr betnr) int(begepi endepi) float(unempl_dauer tage_unempl)
                    90593000       .n 14077 14172  96  96
                    90593000       .n 14173 14847   0  96
                    90593024       .n 15212 15255   0   0
                    90593024       .n 15300 15333  34  34
                    90593030 50284864 14757 14930   0   0
                    90593030 50139714 14947 14975   0   0
                    90593030 50139714 14976 15070   0   0
                    90593030 50139714 16118 16172   0   0
                    90593030       .n 16191 16400 210 246
                    90593030       .n 16401 16436  36 246
                    90593030 50017841 16437 16539   0 246
                    90593030       .n 16437 16539   0 246
                    90593030 50017841 16540 16801   0 246
                    90593030 50017841 16802 17166   0 246
                    90593030 50017841 17167 17439   0 246
                    90593030 50017841 17440 17531   0 246
                    90593030 50017841 17532 17897   0 246
                    90593030 50017841 17898 18200   0 246
                    90593030 50017841 18201 18370   0 246
                    90593030 50017841 18263 18370   0 246
                    90593030 50435853 18371 18420   0 246
                    90593030 50017841 18421 18442   0 246
                    90593030 50017841 18443 18572   0 246
                    end
                    format %d begepi
                    format %d endepi
                    label values betnr miss_de
                    label def miss_de .n ".n  fehlt syst.", modify
                    
                    * the start and end of unemployment spells
                    gen ustart = cond(unempl_dauer > 0, begepi, c(minfloat))
                    gen uend = cond(unempl_dauer > 0, endepi, c(minfloat))
                    format %td ustart uend
                    
                    * define a 5 year window for each observation
                    gen low = begepi - 365 * 5
                    gen high = begepi - 1
                    format %td low high
                    
                    * find all unemployment spells that end within the 5 year window;
                    * sum the unemployed days and pick-up the start of the earliest unemployed spell
                    rangestat (sum) unemp5yr = unempl_dauer (min) ustart1 = ustart, interval(uend low high) by(persnr)
                    format %td ustart1
                    
                    * the offset in days between the 5 year window start and the earliest unemployment spell start
                    gen offset = low - ustart1
                    replace unemp5yr = unemp5yr - offset - 1 if offset > 0 & !mi(offset)
                    replace unemp5yr = 0 if mi(unemp5yr)

                    Comment


                    • #11
                      Clyde and Robert, thank you both for your answers. I implemented Robert's approach with my test data first. I ran both approaches and compared both variables. They are almost identical. There is, however, a small difference for some observations. To be exact, the -rangejoin- approach yields one day less for some observations. As far as I can see, the difference occurs when the first (i.e. oldest) spell that counts to the 5-year window starts just one calendar day after the spell for which the value is calculated. To give you an example: When I calculate the 5-year window for a spell starting on the 15th of October 2010, the -rangestat- approach includes the duration of a spell starting on the 16th of October 2005 completely (as I would say it is correct). On the other side, the approach using -rangejoin- appears to cut one day off from those spells. Even though, -rangestat- is apparently doing the right thing, I was just wondering why -rangejoin- leads to different results?!

                      If the approach using -rangestat- also works with the original data will be determined in a few hours when the server of the Federal Employment Agency is done .

                      Comment


                      • #12
                        Picky, picky, picky...

                        When I crafted #10, the results replicated the ones in #5 so I stopped there. The one day difference you report has to do with how to adjust the results if the earliest unemployment spell starts before the start of the 5 year window. I think that both #5 and #10 could be off by one day and need tweaking. Here's a short example that incorporates the case you report in #11 and a few others from the original data example:

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input long(persnr) int(begepi endepi) float(unempl_dauer)
                        1        16725 16820  96
                        1        18645 18650   0
                        90593000 16725 16820  96
                        90593000 18550 10000   0
                        90593030 16191 16400 210
                        90593030 18201 18370   0
                        end
                        format %d begepi
                        format %d endepi
                        
                        * you should an observation identifier
                        gen obsid = _n
                        
                        * save observations with spells of unemployments
                        preserve
                        keep obsid persnr endepi unempl_dauer
                        keep if unempl_dauer != 0
                        save "unempl_dauer.dta", replace
                        restore
                        
                        * pair each observations with employment spells in 5 year window
                        gen low = begepi - 365 * 5
                        gen high = begepi - 1
                        format %td low high
                        rangejoin endepi low high using "unempl_dauer.dta", by(persnr)
                        
                        * the number of days between endepi_U and the target window
                        gen days = endepi_U - low + 1
                        
                        * the number of days to count
                        gen days2use = cond(days > unempl_dauer_U, unempl_dauer_U, days)
                        
                        * add up employment spells within the window and reduce to the
                        * original observations
                        bysort obsid: egen unemp5yr1 = total(days2use)
                        by obsid: keep if _n == 1
                        
                        list persnr begepi endepi unempl_dauer low high days endepi_U unempl_dauer_U days2use unemp5yr
                        
                        
                        * repeat, this time using -rangestat-
                        keep persnr begepi endepi unempl_dauer unemp5yr1
                        
                        * the start and end of unemployment spells
                        gen ustart = cond(unempl_dauer > 0, begepi, c(minfloat))
                        gen uend = cond(unempl_dauer > 0, endepi, c(minfloat))
                        format %td ustart uend
                        
                        * define a 5 year window for each observation
                        gen low = begepi - 365 * 5
                        gen high = begepi - 1
                        format %td low high
                        
                        * find all unemployment spells that end within the 5 year window;
                        * sum the unemployed days and pick-up the start of the earliest unemployed spell
                        rangestat (sum) unemp5yr = unempl_dauer (min) ustart1 = ustart, interval(uend low high) by(persnr)
                        format %td ustart1
                        
                        * the offset in days between the 5 year window start and the earliest unemployment spell start
                        gen offset = low - ustart1
                        replace unemp5yr = unemp5yr - offset if offset > 0 & !mi(offset)
                        replace unemp5yr = 0 if mi(unemp5yr)
                        
                        list persnr-unemp5yr, noobs
                        Last edited by Robert Picard; 09 Jun 2016, 16:22.

                        Comment


                        • #13
                          Sorry for my late response. I implemented your suggestion and it appears to work very well. Thank you, Robert.

                          Unfortunately, I'm not done yet. I work with two different types of datasets. One is spell-based and one is a panel with one observation per person per month. The latter dataset is generated from the spell-based dataset by keeping only one spell for each person at every reference date. Since the variables of the panel should refer to the reference date and not to the start of a spell, I need to create the 5-year-windows for the panel separately.

                          I took Robert's approach as an example and wrote the code for the panel. Maybe someone could check whether I did it correctly or if I missed something. Any remarks are appreciated.

                          Here is the code (I use a loop to calculate the 5-year-windows for employment, unemployment, and temporary agency employment):

                          Code:
                          [...]
                          
                          scalar cdate = date("`c'-`y'", "MDY")
                          
                          [...]
                          
                          foreach type in empl unempl temp {
                          
                              * Define for what kind of spells the calculation is done
                              global type `type'                                                            // Calculation must be done for the panel seperately because the 5-year-window shifts to the reference date and does not refer to the start of the spell anymore
                              if "${type}"=="empl"    local ifcond "quelle ==1 & trainee!=1"                // Define if conditions to identify "live" spells
                              if "${type}"=="unempl"    local ifcond "unempl==1"
                              if "${type}"=="temps"    local ifcond "temps==1"
                              
                              * Drop variable from the spell-based dataset
                              drop ${type}5yr
                              
                              * The start and end of the spells (but only for the kind we run the calcuation here, i.e. ${type})
                              sort persnr spell
                              gen start = cond(${type}_dauer > 0, begepi, c(minfloat))
                              gen end = cond(${type}_dauer > 0, endepi, c(minfloat))
                              format %td start end
                          
                              * Define a 5-year window for each observation (note: here it refers to the reference date of the cross-section = cdate)
                              gen low = cdate - 365.25 * 5 + 1                                // Procedure counts day of reference date to the window (see also correction for the right end of the window), but excludes the reference point 5-years ago                
                              gen high = cdate - 1
                              format %td low high
                          
                              * Sum the days within the 5-year window & pick up the start of the earliest spell that is of ${type} kind
                              rangestat (sum) ${type}5yr = ${type}_dauer (min) start1 = start, interval(end low high) by(persnr)
                              format %td start1
                          
                              * Add time of current/live spell until the reference date (right end of the window)
                              replace ${type}5yr = ${type}5yr + ${type}_dauer - (endepi - cdate)    if `ifcond'    & begepi<=cdate & endepi>=cdate            // Current value + duration of current spell - days that the current spell exceeds the reference date; only if live spell (see if condition)
                              
                              * Offset in days between the 5-year window start and the earliest ${type} spell start (left end of the window)
                              gen offset = low - start1
                              replace ${type}5yr = ${type}5yr - offset if offset>0 & !missing(offset)
                              replace ${type}5yr = 0 if missing(${type}5yr)
                              
                              * Replace all spell for which the calculations makes no sense (i.e. those with no intersection with the reference date) with missings
                              replace ${type}5yr = . if begepi>cdate | endepi<cdate                        // Spells that start after the reference date (begepi>cdate) or end before the reference date (endepi<cdate) are dropped later anyway
                              
                              * Drop aux. variables
                              drop start end low high start1 offset
                              }
                          Later, as already mentioned, only those spells with an intersection with the reference date

                          Code:
                          [...]
                          keep if begepi <= cdate & endepi >= cdate
                          [...]

                          Comment


                          • #14
                            I already recognized one mistake. I should set the missings of empl5yr to zero directly after the -rangestat- command. Otherwise the correction procedure on the right end of the window (i.e. adding up the time of the current spell until the reference date) would not work, since it is not possible to add anything to a missing value.

                            The new code would be:
                            Code:
                                foreach type in empl unempl temp {
                                
                                * Define for what kind of spells the calculation is done
                                global type `type'                                                            // Calculation must be done for the panel seperately because the 5-year-window shifts to the reference date and does not refer to the start of the spell anymore
                                if "${type}"=="empl"    local ifcond "quelle ==1 & trainee!=1"                // Define if conditions to identify "live" spells
                                if "${type}"=="unempl"    local ifcond "unempl==1"
                                if "${type}"=="temps"    local ifcond "temps==1"
                                
                                * Drop variable from the spell-based dataset
                                drop ${type}5yr
                                
                                * The start and end of the spells (but only for the kind we run the calcuation here, i.e. ${type})
                                sort persnr spell
                                gen start = cond(${type}_dauer > 0, begepi, c(minfloat))
                                gen end = cond(${type}_dauer > 0, endepi, c(minfloat))
                                format %td start end
                            
                                * Define a 5-year window for each observation (note: here it refers to the reference date of the cross-section = cdate)
                                gen low = cdate - 365.25 * 5 + 1                                // Procedure counts day of reference date to the window (see also correction for the right end of the window), but excludes the reference point 5-years ago                
                                gen high = cdate - 1
                                format %td low high
                            
                                * Sum the days within the 5-year window & pick up the start of the earliest spell that is of ${type} kind
                                rangestat (sum) ${type}5yr = ${type}_dauer (min) start1 = start, interval(end low high) by(persnr)
                                format %td start1
                            
                                * Set to zero if no values have been picked up by rangestat (yet)
                                replace ${type}5yr = 0 if missing(${type}5yr)
                                
                                * Add time of current/live spell until the reference date (right end of the window)
                                replace ${type}5yr = ${type}5yr + ${type}_dauer - (endepi - cdate)    if `ifcond'    & begepi<=cdate & endepi>=cdate            // Current value + duration of current spell - days that the current spell exceeds the reference date; only if live spell (see if condition)
                                
                                * Offset in days between the 5-year window start and the earliest ${type} spell start (left end of the window)
                                gen offset = low - start1
                                replace ${type}5yr = ${type}5yr - offset if offset>0 & !missing(offset)
                                
                                * Replace all spell for which the calculations makes no sense (i.e. those with no intersection with the reference date) with missings
                                replace ${type}5yr = . if begepi>cdate | endepi<cdate                        // Spells that start after the reference date (begepi>cdate) or end before the reference date (endepi<cdate) are dropped later anyway
                                
                                * Drop aux. variables
                                drop start end low high start1 offset
                                }

                            Comment

                            Working...
                            X