Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge with if keep condition ?

    Dear all,

    I have two datasets that I would like to merge condition on that the TOM_DATUM fall between startdatum and avslutsdatum. I am not sure how to go about it. I tried with the following code:

    use first_dataset, clear
    sort lopnr_personid

    merge m:1 lopnr_personid using second_dataset, keep if inrange(TOM_DATUM, startdatum, avslutsdatum)

    The first_dataset includes

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long lopnr_personid float(startdatum avslutsdatum)
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    end
    format %tdCCYYNNDD startdatum
    format %tdCCYYNNDD avslutsdatum

    And the second_dataset includes

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(lopnr_personid FROM_DATUM TOM_DATUM)
    1 16375 16635
    1 21915 22645
    1 21754 21914
    1 20353 21753
    1 16801 17166
    1 18649 19456
    1 12053 14783
    1 16017 16374
    1 20151 20352
    1 17532 17624
    end
    format %tdD_m_Y FROM_DATUM
    format %tdD_m_Y TOM_DATUM
    Thanks,
    Tharshini

  • #2
    Why do you have duplicates in the second dataset?

    merge m:1 lopnr_personid using second_dataset, keep if inrange(TOM_DATUM, startdatum, avslutsdatum)
    Should have these as two commands:

    Code:
    clear
    input long(lopnr_personid FROM_DATUM TOM_DATUM)
    1 16375 16635
    1 21915 22645
    1 21754 21914
    1 20353 21753
    1 16801 17166
    1 18649 19456
    1 12053 14783
    1 16017 16374
    1 20151 20352
    1 17532 17624
    end
    format %tdD_m_Y FROM_DATUM
    format %tdD_m_Y TOM_DATUM
    tempfile two
    save `two'
    
    clear
    input long lopnr_personid float(startdatum avslutsdatum)
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    1 20921 21000
    end
    format %tdCCYYNNDD startdatum
    format %tdCCYYNNDD avslutsdatum
    duplicates drop *, force
    
    merge 1:m lopnr_personid using `two',nogen
    keep if inrange(TOM_DATUM, startdatum, avslutsdatum)

    Comment


    • #3
      I get the following error message when using the two codes that you suggested.

      variable lopnr_personid does not uniquely identify observations in the master data

      Code:
       merge 1:m lopnr_personid using `two',nogen
      Code:
      keep if inrange(TOM_DATUM, startdatum, avslutsdatum)
      Why do you have duplicates in the second dataset?
      There is no duplicates in the second dataset? Perhaps you meant the first dataset? But there is no duplicates in either of the datasets. This is the whole first dataset. Perhaps now it makes more sense?

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long lopnr_personid float(startdatum avslutsdatum) double år_månad float(månad_nr år år_start år_avslut år_nr)
      1 20921 21000 639 -49 2013 2017 2017  .
      1 20921 21000 640 -48 2013 2017 2017  .
      1 20921 21000 641 -47 2013 2017 2017  .
      1 20921 21000 642 -46 2013 2017 2017  .
      1 20921 21000 643 -45 2013 2017 2017  .
      1 20921 21000 644 -44 2013 2017 2017  .
      1 20921 21000 645 -43 2013 2017 2017  .
      1 20921 21000 646 -42 2013 2017 2017  .
      1 20921 21000 647 -41 2013 2017 2017  .
      1 20921 21000 648 -40 2014 2017 2017 -4
      end
      format %tdCCYYNNDD startdatum
      format %tdCCYYNNDD avslutsdatum
      format %tmCCYYNN år_månad

      Comment


      • #4
        Tharshini Thangavelu you are trying to merge using lopnr_personid as the key variable, but you have multiple observations in both datasets with identical (duplicated) lopnr_personid. This is also why the merge fails, since the code tries to do a 1:m merge, which requires lopnr_personid to be unique in the master data.

        Also, as far as I can make out, none of your observations satisfy your inrange condition, so the resulting dataset is empty.

        Could you construct a better example, and also show us what you expect in the final dataset so we're clearer on what you're trying to do?

        Comment


        • #5
          Thanks, I'll try to make sense of what outcome that I wish to have.

          It is true that both datasets have same lopnr_personid that is because each dataset includes different individual specific information. The first dataset includes individuals (lopnr_personid) period of rejection (startdatum and avslutdatum) of sick pension.


          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long lopnr_personid float(startdatum avslutsdatum)
          1 20921 21000
          1 20921 21000
          1 20921 21000
          1 20921 21000
          1 20921 21000
          1 20921 21000
          1 20921 21000
          1 20921 21000
          1 20921 21000
          1 20921 21000
          end
          format %tdCCYYNNDD startdatum
          format %tdCCYYNNDD avslutsdatum
          While the second dataset includes information such as municipality, county and partnership. The FROM_DATUM TOM_DATUM are date variables indicating the duration of stay in municipality ( KOMMUN_KOD).


          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long(lopnr_personid FROM_DATUM TOM_DATUM) str4 KOMMUN_KOD
          129051 11687 12052 "0180"
          129051 12053 12417 "0182"
          129051 12418 12767 "0182"
          129051 16658 16672 "0182"
          129051 16673 17216 "0180"
          129051 17217 17531 "0182"
          129051 17532 17595 "0182"
          129051 17596 17790 "0182"
          129051 17791 18079 "0180"
          129051 18080 18730 "0180"
          end
          format %tdD_m_Y FROM_DATUM
          format %tdD_m_Y TOM_DATUM
          With these two dataset, I want to merge so that I have a dataset that tells in which municipality each unique individual belonged to during the period of rejection of sick pension.

          Comment


          • #6
            OK, this cannot be done with -merge-. This is a job for -joinby-. Let's call the first data set rejection_intervals.dta, and the second one residence_intervals.dta. First, note that there is not necessarily a single municipality in which an individual lived during the period of rejection: they can move during that period of rejection, or change their rejection status while living in one KOMMUN_KOD. So the code below will pair each rejection period with all residence periods that overlap with that rejection period.

            Then you can do this:
            Code:
            use rejection_intervals, clear
            joinby lopnr_personid using residence_intervals, unmatched(master)
            keep if min(avslutsdatum, TOM_DATUM) >= max(startdatum, FROM_DATUM)
            Note: This code is not tested because the example data shown has no lopnr_personid that appears in both data sets: therefore no matches can be found in the example data. Presumably this will not be true of the real data. Beware of typos in the code.

            Comment


            • #7
              Thanks for the great explanation and it worked out well. I have a follow up question. I have identified so that the individuals residence period overlaps with rejection period (= denote this as rejection period) ( as in the previous reply). In the second step, I would like to make sure that these individuals are registered citizen during the rejection period, thus I would like to see if these individuals are registered during the rejection period, AVREG_KOD= 00.

              And in the second step, identify if the same individuals are still registered in the following period up to 12, 24 and 36 months after the identified period and if not registered AVREG_KOD ≠ 0
              Further, if they are not registered in the following period upto three years after the rejection period, WHEN did they got unregistered.


              The first data included the rejection data.


              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input long lopnr float(startdatum avslutsdatum) double år_månad
              1 20921 21000 639
              1 20921 21000 640
              1 20921 21000 641
              1 20921 21000 642
              1 20921 21000 643
              1 20921 21000 644
              1 20921 21000 645
              1 20921 21000 646
              1 20921 21000 647
              1 20921 21000 648
              end
              format %tdCCYYNNDD startdatum
              format %tdCCYYNNDD avslutsdatum
              format %tmCCYYNN år_månad

              The second data includes the register data.



              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD
              129051 11687 12052 "00"
              129051 12053 12417 "00"
              129051 12418 12767 "UV"
              129051 16658 16672 "00"
              129051 16673 17216 "00"
              129051 17217 17531 "00"
              129051 17532 17595 "00"
              129051 17596 17790 "00"
              129051 17791 18079 "00"
              129051 18080 18730 "00"
              end
              format %tdD_m_Y FROM_DATUM
              format %tdD_m_Y TOM_DATUM

              Comment


              • #8
                Your example data still does not include any matches between the two data sets. In fact, the two data sets don't even contain any of the same people. This makes it impossible to test the code properly. Here is what I have come up with. It is untested; use at your own risk. If you need assistance troubleshooting this code, be sure to post new example data which contains at least some matchable and overlapping observations in the two data sets.

                Code:
                use rejection_data_set, clear
                gen long obs_no = _n
                
                //  EXTEND THE DATA WITH EXTRA OBSERVATIONS HAVING AVSLUTSDATUM POSTPONED
                //  BY 1, 2, OR 3 YEARS
                expand 4
                by obs_no, sort: gen extension = _n-1
                replace avslutsdatum = mdy(month(avslutsdatum), day(avslutsdatum), ///
                    year(avslutsdatum) + extension)
                replace startdatum = mdy(month(startdatum), day(startdatum), ///
                    year(startdatum) + extension)
                
                //  JOIN WITH THE REGISTRATION DATA
                rename lopnr lopnr_personid
                joinby lopnr_personid using registration_data_set
                
                //  KEEP THE OBSERVATIONS WITH OVERLAPPING PERIODS
                //  BUT RETAIN AT LEAST ONE OBSERVATION PER ORIGINAL OBS_NO
                gen byte overlap = min(avslutsdatum, TOM_DATUM) >= max(startdatum, FROM_DATUM)
                by obs_no (overlap FROM_DATUM), sort: keep if overlap | (_n == _N)
                
                //  IDENTIFY FIRST YEAR OF NON-REGISTRATION
                gen byte unregistered = (AVREG_KOD != "00") | !overlap
                by obs_no (extension), sort: egen first_non_registration = ///
                    min(cond(unregistered, extension, .))
                
                //  GO TO WIDE LAYOUT
                keep obs_no lopnr_personid startdatum avslutsdatum extension unregistered first_non_registration
                by obs_no (extension), sort: replace avslutsdatum = avslutsdatum[1]
                by obs_no (extension): replace startdatum = startdatum[1]
                reshape wide unregistered, i(obs_no) j(extension)
                In the code, I have dealt with 12, 24, and 36 months as 1, 2, and 3 years. At the end of this code, if it runs, the data in memory will contain, for each record in the original rejection data set, 5 new variables. The variable unregistered0 will tell you whether or not the period of rejection overlaps with their period of registration. The variables unregistered1-unregistered3 will tell you whether or not, the first, second, and third years following the end of the rejection period, respectively, overlaps with the period of registration. The variable first_non_registration will range from 0 to 3 or missing and will tell you the earliest of these periods in which there is no overlap with the registration period (missing value if all of the periods do overlap with the registration period.)

                Depending on what you plan to do with this data from this point on, it may be better to keep it in long layout. If that is the case, just skip everything from // GO TO WIDE LAYOUT to the end.
                Last edited by Clyde Schechter; 26 Aug 2022, 09:58.

                Comment


                • #9
                  Thanks! I am not sure if I am following all the way in your syntax or it might be in my explanation.

                  This is the matched dataset where the rejection period overlaps with the latest residence period as explained in #6.


                  ----------------------- copy starting from the next line -----------------------
                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input long lopnr_personid float(startdatum avslutsdatum) byte lev_befolkning long(FROM_DATUM TOM_DATUM avreg_kod)
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  1 20921 21000 3 20353 21753 1
                  end
                  format %tdCCYYNNDD startdatum
                  format %tdCCYYNNDD avslutsdatum
                  format %tdD_m_Y FROM_DATUM
                  format %tdD_m_Y TOM_DATUM
                  label values lev_befolkning __MERGE
                  label def __MERGE 3 "both in master and using data", modify
                  label values avreg_kod avreg_kod_label
                  label def avreg_kod_label 1 "saknas/okänd", modify

                  So what I want in the next step is to create a variable (=registered_during_overlap ) that identifies if avreg_kod ==1 (registered or not) when the rejection and the latest residence data overlaps.
                  Then create another variable that identifies those individuals that becomes unregistered from the time of overlap period, 1, 2, 3 years from the variable registered_during_overlap.


                  The above data is joined by the two following data that is need to identify the explanation given here above.

                  The rejection data
                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input long lopnr float(startdatum avslutsdatum) double år_månad
                  1 20921 21000 639
                  1 20921 21000 640
                  1 20921 21000 641
                  1 20921 21000 642
                  1 20921 21000 643
                  1 20921 21000 644
                  1 20921 21000 645
                  1 20921 21000 646
                  1 20921 21000 647
                  1 20921 21000 648
                  end
                  format %tdCCYYNNDD startdatum
                  format %tdCCYYNNDD avslutsdatum
                  format %tmCCYYNN år_månad


                  The residence data
                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD
                  1 11687 12052 "00"
                  1 12053 14783 "00"
                  1 14784 15460 "00"
                  1 15461 16016 "00"
                  1 16017 16374 "00"
                  1 16375 16635 "00"
                  1 16636 16789 "00"
                  1 16790 16800 "00"
                  1 16801 17166 "00"
                  1 17167 17258 "00"
                  end
                  format %tdD_m_Y FROM_DATUM
                  format %tdD_m_Y TOM_DATUM

                  Your example data still does not include any matches between the two data sets.
                  Perhaps this is resolved now?

                  Comment


                  • #10
                    Perhaps this is resolved now?
                    No, it isn't. While we now have the same personid in both data sets, there are still no observations in the two data sets where the date intervals overlap. All of the date intervals in the residence data set precede all of the intervals (which, by the way, are all the same--what's that about?) in the rejection data set. I'm sorry if I wasn't clear enough in describing what I need. I need some personids where there is an overlap between at least some of his/her rejection intervals and residence intervals, and some other personids where there isn't.

                    Comment


                    • #11
                      Residence dataset: gives the residence dates and AVREG_KOD identifies whether the person is still registrered = "00".

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD
                      1 11687 12052 "00"
                      1 12053 14783 "00"
                      1 14784 15460 "00"
                      1 15461 16016 "00"
                      1 16017 16374 "00"
                      1 16375 16635 "00"
                      1 16636 16789 "00"
                      1 16790 16800 "00"
                      1 16801 17166 "00"
                      1 17167 17258 "00"
                      1 17259 17531 "00"
                      1 17532 17624 "00"
                      1 17625 18079 "00"
                      1 18080 18648 "00"
                      1 18649 19456 "00"
                      1 19457 19844 "00"
                      1 19845 20150 "00"
                      1 20151 20352 "00"
                      1 20353 21753 "00"
                      1 21754 21914 "00"
                      1 21915 22645 "00"
                      end
                      format %tdD_m_Y FROM_DATUM
                      format %tdD_m_Y TOM_DATUM


                      Rejection dataset: This data includes personid = lopnr and rejection start and end dates =(startdatum, avslutdatum) and år_månad tells the number of year and month from the rejection dates upto 3 years. år_månad variable goes to -49 and +48 for each lopnr.

                      . All of the date intervals in the residence data set precede all of the intervals (which, by the way, are all the same--what's that about?)
                      Now that I added år_månad, you might understand why its the same.

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input long lopnr float(startdatum avslutsdatum) double år_månad
                      1 20921 21000 639
                      1 20921 21000 640
                      1 20921 21000 641
                      1 20921 21000 642
                      1 20921 21000 643
                      1 20921 21000 644
                      1 20921 21000 645
                      1 20921 21000 646
                      1 20921 21000 647
                      1 20921 21000 648
                      1 20921 21000 649
                      1 20921 21000 650
                      1 20921 21000 651
                      1 20921 21000 652
                      1 20921 21000 653
                      1 20921 21000 654
                      1 20921 21000 655
                      1 20921 21000 656
                      1 20921 21000 657
                      1 20921 21000 658
                      1 20921 21000 659
                      end
                      format %tdCCYYNNDD startdatum
                      format %tdCCYYNNDD avslutsdatum
                      format %tmCCYYNN år_månad

                      These two datasets are matched according to #6 which resulted into the following dataset

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input long lopnr_personid float(startdatum avslutsdatum) long(FROM_DATUM TOM_DATUM avreg_kod)
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      1 20921 21000 20353 21753 1
                      end
                      format %tdCCYYNNDD startdatum
                      format %tdCCYYNNDD avslutsdatum
                      format %tdD_m_Y FROM_DATUM
                      format %tdD_m_Y TOM_DATUM
                      label values avreg_kod avreg_kod_label
                      label def avreg_kod_label 1 "saknas/okänd", modify

                      Comment


                      • #12
                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear*
                        input long lopnr float(startdatum avslutsdatum) double år_månad
                        1 20921 21000 639
                        1 20921 21000 640
                        1 20921 21000 641
                        1 20921 21000 642
                        1 20921 21000 643
                        1 20921 21000 644
                        1 20921 21000 645
                        1 20921 21000 646
                        1 20921 21000 647
                        1 20921 21000 648
                        1 20921 21000 649
                        1 20921 21000 650
                        1 20921 21000 651
                        1 20921 21000 652
                        1 20921 21000 653
                        1 20921 21000 654
                        1 20921 21000 655
                        1 20921 21000 656
                        1 20921 21000 657
                        1 20921 21000 658
                        1 20921 21000 659
                        end
                        format %tdCCYYNNDD startdatum
                        format %tdCCYYNNDD avslutsdatum
                        format %tmCCYYNN år_månad
                        tempfile rejection
                        save `rejection'
                        
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD
                        1 11687 12052 "00"
                        1 12053 14783 "00"
                        1 14784 15460 "00"
                        1 15461 16016 "00"
                        1 16017 16374 "00"
                        1 16375 16635 "00"
                        1 16636 16789 "00"
                        1 16790 16800 "00"
                        1 16801 17166 "00"
                        1 17167 17258 "00"
                        1 17259 17531 "00"
                        1 17532 17624 "00"
                        1 17625 18079 "00"
                        1 18080 18648 "00"
                        1 18649 19456 "00"
                        1 19457 19844 "00"
                        1 19845 20150 "00"
                        1 20151 20352 "00"
                        1 20353 21753 "00"
                        1 21754 21914 "00"
                        1 21915 22645 "00"
                        end
                        format %tdD_m_Y FROM_DATUM
                        format %tdD_m_Y TOM_DATUM
                        tempfile residence
                        save `residence'
                        
                        browse
                        use `rejection', clear
                        drop år_månad
                        duplicates drop
                        gen long obs_no = _n
                        tempfile holding
                        save `holding'
                        expand 4
                        by obs_no, sort: gen byte years_later = _n-1
                        replace avslutsdatum = mdy(month(avslutsdatum), day(avslutsdatum), ///
                            year(avslutsdatum) + years_later)
                        by obs_no (years_later), sort: replace startdatum = avslutsdatum[_n-1] + 1 ///
                            if _n > 1
                        
                        rename lopnr lopnr_personid
                        joinby lopnr_personid using `residence', unmatched(master)
                        gen byte overlap = min(avslutsdatum, TOM_DATUM) >= max(startdatum, FROM_DATUM)
                        gen byte registered = (AVREG_KOD == "00") & overlap
                        collapse (max) registered, by(obs_no years_later)
                        merge m:1 obs_no using `holding', assert(match) nogenerate
                        
                        //  IDENTIFY EARLIEST YEAR NOT REGISTERED (IF ANY)
                        by obs_no (years_later), sort: egen first_year_unregistered = ///
                            min(cond(!registered, years_later, .))
                        
                        //  RESHAPE WIDE
                        reshape wide registered, i(obs_no) j(years_later)
                        rename registered0 registered_during_rejection
                        rename (registered#) registered_#_years_later

                        Comment


                        • #13
                          Thanks Clyde!

                          The variable first_year_unregistered take 0-3 values, what does these values indicates?
                          I have a follow-up question: I would like to identify, given that avslutsdatum found in the rejection data falls between FROM_DATUM and TOM_DATUM in the residence data, a variable that indicate whenever the difference between TOM_DATUM and avslutsdatum is between 0-12months and the AVREG_KOD == "AV"

                          I tried with the following code:
                          Code:
                          gen avslut = mofd(avslutsdatum)
                          format % tm avslut
                          
                          format %tdCCYYNNDD TOM_DATUM
                          gen tom = mofd(TOM_DATUM)
                          format % tm tom
                          
                          * Here I create a variable that indicates if the difference between avslutsdatum and TOM_DATUM is less than or equal to 12 months and the AVREG_KOD should indicate =="AV"  
                          gen diff_avslut_tom = 0
                          replace diff_avslut_tom = 1 if (tom) - (avslut) <= 365 & AVREG_KOD=="AV"

                          Comment


                          • #14
                            The variable first_year_unregistered take 0-3 values, what does these values indicates?
                            In #7 you said
                            Further, if they are not registered in the following period upto three years after the rejection period, WHEN did they got unregistered.
                            This variable answers that question. If it is 0, then the person was unregistered in the rejection period itself. If it is 1, the person first became unregistered in the year after the rejection period. If 2, the person first became unregistered in the second year after the rejection period. If 3, the person first became unregistered in the third year after the rejection period. If it is missing, the person was never unregistered in any of those periods--that is, they were registered during the rejection period itself and in each of the three following years.

                            Concerning the new variable you speak of creating in #13, the only problem I see with that code is that you have overlooked the fact that in Stata variable names are case sensitive. So, you cannot refer to TOM_DATUM as tom--you must use upper case TOM. I will add as a matter of style, that it is not necessary to use two commands to create the variable diff_avslut_tom. You will get the same result with one line:
                            Code:
                            gen diff_avslut_tom = (TOM_DATUM - avslutsdatum <= 365) & AVREG_KOD == "AV"

                            Comment

                            Working...
                            X