Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to keep IDs which are within a range in date format?

    Hi everyone,

    From my dataset, I would like to keep only those IDs for which we have contracts that run between 1 January 2021 and November 2023 without interruption, please.

    I want to keep the IDs that have no interruption during the interlude above (i.e. that remain in my sample from start to finish).

    I don't want to keep the IDs that have some interruption or that enter in my sample later than 1 January 2021.

    Here's a -dataex- as an example:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(id idcontrato) double(date_contract_start date_contract_end)
    1253    1253 19100 19266
    1253    1410 19267 22431
    1253 1055841 22432 22645
    1253 1815013 22646 22676
    1253 1904434 22677 22735
    1253 2413928 22736 23010
    1253 2880202 23011 23100
    1253 3189647 23101     .
    2195    2195 19465 20458
    2195   80657 20459 21258
    2195 1365966 22544 22645
    2195 1726472 22646 22676
    2195 2073891 22677 22735
    2195 2277485 22736 22918
    2195 2613231 22919 23010
    2195 2863016 23011 23100
    2195 3175831 23101 23314
    2195 3343382 23315     .
    2239    2239 19477 21608
    2239  351357 21609 21615
    end
    format %td date_contract_start
    format %td date_contract_end
    • -id- refers to household IDs,
    • -idcontrato- refers to contract ID,
    • -date_contract_start- is the beginning of a given contract,
    • -date_contract_end- is the end of a given contract.
    The missing values in the "date_contract_end" variable mean that the contract in question is still in force and we should keep them.


    Thank you in advance for your help.
    All the best,

    Michael

  • #2
    Code:
    by id (date_contract_start), sort: egen byte gap = ///
        max(date_contract_start[_n+1] > date_contract_end + 1 & _n < _N)
        
    by id (date_contract_start): drop if gap | (date_contract_start[1] > td(1jan2021)) ///
         | (date_contract_end[_N] < td(1nov2023))

    Comment


    • #3
      Hi Clyde Schechter,

      Beautiful! It works perfectly well. Thank you so much for your help!

      All the best,
      Michael

      Comment


      • #4
        I have just a small question, by curiosity and for my understanding in the code provided in #2:
        • Could you explain me the last part of the first chunk please, i.e. the part in bold and green below:
        Code:
          
         by id (date_contract_start), sort: egen byte gap = ///    
             max(date_contract_start[_n+1] > date_contract_end + 1 & _n < _N)
        Thank you in advance again for your help and time.
        All the best,

        Michael

        Comment


        • #5
          There is a gap if the next contract start date is more than 1 day after the current contract's end date. As for the _n < _N part, consider what happens when _n == _N. That is the last observation for the given id, and therefore date_contract_start[_n+1] does not exist, which means that its value is missing. In Stata, a missing value is always greater than any real number. So the date_contract_start[_n+1] > date_contract_end + 1 part of the condition is always true when _n == _N. But that does not constitute a gap because it is the end of the data.

          Comment


          • #6
            Hi Clyde Schechter,

            Beautiful! I couldn't ask for anything clearer than that. Thank you for your time!
            Lovely day,

            Michael

            Comment

            Working...
            X