Merge with if keep condition ?

Tharshini Thangavelu

Join Date: Oct 2015
Posts: 78

Merge with if keep condition ?

22 Aug 2022, 07:35

Dear all,

I have two datasets that I would like to merge condition on that the TOM_DATUM fall between startdatum and avslutsdatum. I am not sure how to go about it. I tried with the following code:

use first_dataset, clear
sort lopnr_personid

merge m:1 lopnr_personid using second_dataset, keep if inrange(TOM_DATUM, startdatum, avslutsdatum)

The first_dataset includes

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long lopnr_personid float(startdatum avslutsdatum)
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum

And the second_dataset includes

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(lopnr_personid FROM_DATUM TOM_DATUM)
1 16375 16635
1 21915 22645
1 21754 21914
1 20353 21753
1 16801 17166
1 18649 19456
1 12053 14783
1 16017 16374
1 20151 20352
1 17532 17624
end
format %tdD_m_Y FROM_DATUM
format %tdD_m_Y TOM_DATUM

Thanks,
Tharshini

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10481

22 Aug 2022, 09:42

Why do you have duplicates in the second dataset?

merge m:1 lopnr_personid using second_dataset, keep if inrange(TOM_DATUM, startdatum, avslutsdatum)

Should have these as two commands:

Code:

clear
input long(lopnr_personid FROM_DATUM TOM_DATUM)
1 16375 16635
1 21915 22645
1 21754 21914
1 20353 21753
1 16801 17166
1 18649 19456
1 12053 14783
1 16017 16374
1 20151 20352
1 17532 17624
end
format %tdD_m_Y FROM_DATUM
format %tdD_m_Y TOM_DATUM
tempfile two
save `two'

clear
input long lopnr_personid float(startdatum avslutsdatum)
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
1 20921 21000
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum
duplicates drop *, force

merge 1:m lopnr_personid using `two',nogen
keep if inrange(TOM_DATUM, startdatum, avslutsdatum)

Comment

Tharshini Thangavelu

Join Date: Oct 2015
Posts: 78

22 Aug 2022, 13:58

I get the following error message when using the two codes that you suggested.

variable lopnr_personid does not uniquely identify observations in the master data

Code:

 merge 1:m lopnr_personid using `two',nogen

Code:

keep if inrange(TOM_DATUM, startdatum, avslutsdatum)

Why do you have duplicates in the second dataset?

There is no duplicates in the second dataset? Perhaps you meant the first dataset? But there is no duplicates in either of the datasets. This is the whole first dataset. Perhaps now it makes more sense?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long lopnr_personid float(startdatum avslutsdatum) double år_månad float(månad_nr år år_start år_avslut år_nr)
1 20921 21000 639 -49 2013 2017 2017  .
1 20921 21000 640 -48 2013 2017 2017  .
1 20921 21000 641 -47 2013 2017 2017  .
1 20921 21000 642 -46 2013 2017 2017  .
1 20921 21000 643 -45 2013 2017 2017  .
1 20921 21000 644 -44 2013 2017 2017  .
1 20921 21000 645 -43 2013 2017 2017  .
1 20921 21000 646 -42 2013 2017 2017  .
1 20921 21000 647 -41 2013 2017 2017  .
1 20921 21000 648 -40 2014 2017 2017 -4
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum
format %tmCCYYNN år_månad

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1548
#4

22 Aug 2022, 14:10

Tharshini Thangavelu you are trying to merge using lopnr_personid as the key variable, but you have multiple observations in both datasets with identical (duplicated) lopnr_personid. This is also why the merge fails, since the code tries to do a 1:m merge, which requires lopnr_personid to be unique in the master data.

Also, as far as I can make out, none of your observations satisfy your inrange condition, so the resulting dataset is empty.

Could you construct a better example, and also show us what you expect in the final dataset so we're clearer on what you're trying to do?
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#5

22 Aug 2022, 15:36

Thanks, I'll try to make sense of what outcome that I wish to have.

It is true that both datasets have same lopnr_personid that is because each dataset includes different individual specific information. The first dataset includes individuals (lopnr_personid) period of rejection (startdatum and avslutdatum) of sick pension.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long lopnr_personid float(startdatum avslutsdatum) 1 20921 21000 1 20921 21000 1 20921 21000 1 20921 21000 1 20921 21000 1 20921 21000 1 20921 21000 1 20921 21000 1 20921 21000 1 20921 21000 end format %tdCCYYNNDD startdatum format %tdCCYYNNDD avslutsdatum

While the second dataset includes information such as municipality, county and partnership. The FROM_DATUM TOM_DATUM are date variables indicating the duration of stay in municipality ( KOMMUN_KOD).

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(lopnr_personid FROM_DATUM TOM_DATUM) str4 KOMMUN_KOD 129051 11687 12052 "0180" 129051 12053 12417 "0182" 129051 12418 12767 "0182" 129051 16658 16672 "0182" 129051 16673 17216 "0180" 129051 17217 17531 "0182" 129051 17532 17595 "0182" 129051 17596 17790 "0182" 129051 17791 18079 "0180" 129051 18080 18730 "0180" end format %tdD_m_Y FROM_DATUM format %tdD_m_Y TOM_DATUM

With these two dataset, I want to merge so that I have a dataset that tells in which municipality each unique individual belonged to during the period of rejection of sick pension.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#6

22 Aug 2022, 16:26

OK, this cannot be done with -merge-. This is a job for -joinby-. Let's call the first data set rejection_intervals.dta, and the second one residence_intervals.dta. First, note that there is not necessarily a single municipality in which an individual lived during the period of rejection: they can move during that period of rejection, or change their rejection status while living in one KOMMUN_KOD. So the code below will pair each rejection period with all residence periods that overlap with that rejection period.

Then you can do this:

Code:

use rejection_intervals, clear joinby lopnr_personid using residence_intervals, unmatched(master) keep if min(avslutsdatum, TOM_DATUM) >= max(startdatum, FROM_DATUM)

Note: This code is not tested because the example data shown has no lopnr_personid that appears in both data sets: therefore no matches can be found in the example data. Presumably this will not be true of the real data. Beware of typos in the code.
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#7

26 Aug 2022, 08:59

Thanks for the great explanation and it worked out well. I have a follow up question. I have identified so that the individuals residence period overlaps with rejection period (= denote this as rejection period) ( as in the previous reply). In the second step, I would like to make sure that these individuals are registered citizen during the rejection period, thus I would like to see if these individuals are registered during the rejection period, AVREG_KOD= 00.

And in the second step, identify if the same individuals are still registered in the following period up to 12, 24 and 36 months after the identified period and if not registered AVREG_KOD ≠ 0
Further, if they are not registered in the following period upto three years after the rejection period, WHEN did they got unregistered.

The first data included the rejection data.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long lopnr float(startdatum avslutsdatum) double år_månad 1 20921 21000 639 1 20921 21000 640 1 20921 21000 641 1 20921 21000 642 1 20921 21000 643 1 20921 21000 644 1 20921 21000 645 1 20921 21000 646 1 20921 21000 647 1 20921 21000 648 end format %tdCCYYNNDD startdatum format %tdCCYYNNDD avslutsdatum format %tmCCYYNN år_månad

The second data includes the register data.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD 129051 11687 12052 "00" 129051 12053 12417 "00" 129051 12418 12767 "UV" 129051 16658 16672 "00" 129051 16673 17216 "00" 129051 17217 17531 "00" 129051 17532 17595 "00" 129051 17596 17790 "00" 129051 17791 18079 "00" 129051 18080 18730 "00" end format %tdD_m_Y FROM_DATUM format %tdD_m_Y TOM_DATUM
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#8

26 Aug 2022, 09:51

Your example data still does not include any matches between the two data sets. In fact, the two data sets don't even contain any of the same people. This makes it impossible to test the code properly. Here is what I have come up with. It is untested; use at your own risk. If you need assistance troubleshooting this code, be sure to post new example data which contains at least some matchable and overlapping observations in the two data sets.

Code:

use rejection_data_set, clear gen long obs_no = _n // EXTEND THE DATA WITH EXTRA OBSERVATIONS HAVING AVSLUTSDATUM POSTPONED // BY 1, 2, OR 3 YEARS expand 4 by obs_no, sort: gen extension = _n-1 replace avslutsdatum = mdy(month(avslutsdatum), day(avslutsdatum), /// year(avslutsdatum) + extension) replace startdatum = mdy(month(startdatum), day(startdatum), /// year(startdatum) + extension) // JOIN WITH THE REGISTRATION DATA rename lopnr lopnr_personid joinby lopnr_personid using registration_data_set // KEEP THE OBSERVATIONS WITH OVERLAPPING PERIODS // BUT RETAIN AT LEAST ONE OBSERVATION PER ORIGINAL OBS_NO gen byte overlap = min(avslutsdatum, TOM_DATUM) >= max(startdatum, FROM_DATUM) by obs_no (overlap FROM_DATUM), sort: keep if overlap | (_n == _N) // IDENTIFY FIRST YEAR OF NON-REGISTRATION gen byte unregistered = (AVREG_KOD != "00") | !overlap by obs_no (extension), sort: egen first_non_registration = /// min(cond(unregistered, extension, .)) // GO TO WIDE LAYOUT keep obs_no lopnr_personid startdatum avslutsdatum extension unregistered first_non_registration by obs_no (extension), sort: replace avslutsdatum = avslutsdatum[1] by obs_no (extension): replace startdatum = startdatum[1] reshape wide unregistered, i(obs_no) j(extension)

In the code, I have dealt with 12, 24, and 36 months as 1, 2, and 3 years. At the end of this code, if it runs, the data in memory will contain, for each record in the original rejection data set, 5 new variables. The variable unregistered0 will tell you whether or not the period of rejection overlaps with their period of registration. The variables unregistered1-unregistered3 will tell you whether or not, the first, second, and third years following the end of the rejection period, respectively, overlaps with the period of registration. The variable first_non_registration will range from 0 to 3 or missing and will tell you the earliest of these periods in which there is no overlap with the registration period (missing value if all of the periods do overlap with the registration period.)

Depending on what you plan to do with this data from this point on, it may be better to keep it in long layout. If that is the case, just skip everything from // GO TO WIDE LAYOUT to the end.

Last edited by Clyde Schechter; 26 Aug 2022, 09:58.
Comment

Tharshini Thangavelu

Join Date: Oct 2015
Posts: 78

28 Aug 2022, 04:05

Thanks! I am not sure if I am following all the way in your syntax or it might be in my explanation.

This is the matched dataset where the rejection period overlaps with the latest residence period as explained in #6.

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long lopnr_personid float(startdatum avslutsdatum) byte lev_befolkning long(FROM_DATUM TOM_DATUM avreg_kod)
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
1 20921 21000 3 20353 21753 1
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum
format %tdD_m_Y FROM_DATUM
format %tdD_m_Y TOM_DATUM
label values lev_befolkning __MERGE
label def __MERGE 3 "both in master and using data", modify
label values avreg_kod avreg_kod_label
label def avreg_kod_label 1 "saknas/okänd", modify

So what I want in the next step is to create a variable (=registered_during_overlap ) that identifies if avreg_kod ==1 (registered or not) when the rejection and the latest residence data overlaps.
Then create another variable that identifies those individuals that becomes unregistered from the time of overlap period, 1, 2, 3 years from the variable registered_during_overlap.

The above data is joined by the two following data that is need to identify the explanation given here above.

The rejection data

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long lopnr float(startdatum avslutsdatum) double år_månad
1 20921 21000 639
1 20921 21000 640
1 20921 21000 641
1 20921 21000 642
1 20921 21000 643
1 20921 21000 644
1 20921 21000 645
1 20921 21000 646
1 20921 21000 647
1 20921 21000 648
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum
format %tmCCYYNN år_månad

The residence data

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD
1 11687 12052 "00"
1 12053 14783 "00"
1 14784 15460 "00"
1 15461 16016 "00"
1 16017 16374 "00"
1 16375 16635 "00"
1 16636 16789 "00"
1 16790 16800 "00"
1 16801 17166 "00"
1 17167 17258 "00"
end
format %tdD_m_Y FROM_DATUM
format %tdD_m_Y TOM_DATUM

Your example data still does not include any matches between the two data sets.

Perhaps this is resolved now?

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#10

28 Aug 2022, 10:37

Perhaps this is resolved now?

No, it isn't. While we now have the same personid in both data sets, there are still no observations in the two data sets where the date intervals overlap. All of the date intervals in the residence data set precede all of the intervals (which, by the way, are all the same--what's that about?) in the rejection data set. I'm sorry if I wasn't clear enough in describing what I need. I need some personids where there is an overlap between at least some of his/her rejection intervals and residence intervals, and some other personids where there isn't.
Comment

Tharshini Thangavelu

Join Date: Oct 2015
Posts: 78

#11

29 Aug 2022, 02:38

Residence dataset: gives the residence dates and AVREG_KOD identifies whether the person is still registrered = "00".

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD
1 11687 12052 "00"
1 12053 14783 "00"
1 14784 15460 "00"
1 15461 16016 "00"
1 16017 16374 "00"
1 16375 16635 "00"
1 16636 16789 "00"
1 16790 16800 "00"
1 16801 17166 "00"
1 17167 17258 "00"
1 17259 17531 "00"
1 17532 17624 "00"
1 17625 18079 "00"
1 18080 18648 "00"
1 18649 19456 "00"
1 19457 19844 "00"
1 19845 20150 "00"
1 20151 20352 "00"
1 20353 21753 "00"
1 21754 21914 "00"
1 21915 22645 "00"
end
format %tdD_m_Y FROM_DATUM
format %tdD_m_Y TOM_DATUM

Rejection dataset: This data includes personid = lopnr and rejection start and end dates =(startdatum, avslutdatum) and år_månad tells the number of year and month from the rejection dates upto 3 years. år_månad variable goes to -49 and +48 for each lopnr.

. All of the date intervals in the residence data set precede all of the intervals (which, by the way, are all the same--what's that about?)

Now that I added år_månad, you might understand why its the same.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long lopnr float(startdatum avslutsdatum) double år_månad
1 20921 21000 639
1 20921 21000 640
1 20921 21000 641
1 20921 21000 642
1 20921 21000 643
1 20921 21000 644
1 20921 21000 645
1 20921 21000 646
1 20921 21000 647
1 20921 21000 648
1 20921 21000 649
1 20921 21000 650
1 20921 21000 651
1 20921 21000 652
1 20921 21000 653
1 20921 21000 654
1 20921 21000 655
1 20921 21000 656
1 20921 21000 657
1 20921 21000 658
1 20921 21000 659
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum
format %tmCCYYNN år_månad

These two datasets are matched according to #6 which resulted into the following dataset

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long lopnr_personid float(startdatum avslutsdatum) long(FROM_DATUM TOM_DATUM avreg_kod)
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
1 20921 21000 20353 21753 1
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum
format %tdD_m_Y FROM_DATUM
format %tdD_m_Y TOM_DATUM
label values avreg_kod avreg_kod_label
label def avreg_kod_label 1 "saknas/okänd", modify

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

#12

29 Aug 2022, 11:20

Code:

* Example generated by -dataex-. For more info, type help dataex
clear*
input long lopnr float(startdatum avslutsdatum) double år_månad
1 20921 21000 639
1 20921 21000 640
1 20921 21000 641
1 20921 21000 642
1 20921 21000 643
1 20921 21000 644
1 20921 21000 645
1 20921 21000 646
1 20921 21000 647
1 20921 21000 648
1 20921 21000 649
1 20921 21000 650
1 20921 21000 651
1 20921 21000 652
1 20921 21000 653
1 20921 21000 654
1 20921 21000 655
1 20921 21000 656
1 20921 21000 657
1 20921 21000 658
1 20921 21000 659
end
format %tdCCYYNNDD startdatum
format %tdCCYYNNDD avslutsdatum
format %tmCCYYNN år_månad
tempfile rejection
save `rejection'

* Example generated by -dataex-. For more info, type help dataex
clear
input long(lopnr_personid FROM_DATUM TOM_DATUM) str2 AVREG_KOD
1 11687 12052 "00"
1 12053 14783 "00"
1 14784 15460 "00"
1 15461 16016 "00"
1 16017 16374 "00"
1 16375 16635 "00"
1 16636 16789 "00"
1 16790 16800 "00"
1 16801 17166 "00"
1 17167 17258 "00"
1 17259 17531 "00"
1 17532 17624 "00"
1 17625 18079 "00"
1 18080 18648 "00"
1 18649 19456 "00"
1 19457 19844 "00"
1 19845 20150 "00"
1 20151 20352 "00"
1 20353 21753 "00"
1 21754 21914 "00"
1 21915 22645 "00"
end
format %tdD_m_Y FROM_DATUM
format %tdD_m_Y TOM_DATUM
tempfile residence
save `residence'

browse
use `rejection', clear
drop år_månad
duplicates drop
gen long obs_no = _n
tempfile holding
save `holding'
expand 4
by obs_no, sort: gen byte years_later = _n-1
replace avslutsdatum = mdy(month(avslutsdatum), day(avslutsdatum), ///
    year(avslutsdatum) + years_later)
by obs_no (years_later), sort: replace startdatum = avslutsdatum[_n-1] + 1 ///
    if _n > 1

rename lopnr lopnr_personid
joinby lopnr_personid using `residence', unmatched(master)
gen byte overlap = min(avslutsdatum, TOM_DATUM) >= max(startdatum, FROM_DATUM)
gen byte registered = (AVREG_KOD == "00") & overlap
collapse (max) registered, by(obs_no years_later)
merge m:1 obs_no using `holding', assert(match) nogenerate

//  IDENTIFY EARLIEST YEAR NOT REGISTERED (IF ANY)
by obs_no (years_later), sort: egen first_year_unregistered = ///
    min(cond(!registered, years_later, .))

//  RESHAPE WIDE
reshape wide registered, i(obs_no) j(years_later)
rename registered0 registered_during_rejection
rename (registered#) registered_#_years_later

Comment

Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#13

05 Sep 2022, 14:06

Thanks Clyde!

The variable first_year_unregistered take 0-3 values, what does these values indicates?
I have a follow-up question: I would like to identify, given that avslutsdatum found in the rejection data falls between FROM_DATUM and TOM_DATUM in the residence data, a variable that indicate whenever the difference between TOM_DATUM and avslutsdatum is between 0-12months and the AVREG_KOD == "AV"

I tried with the following code:

Code:

gen avslut = mofd(avslutsdatum) format % tm avslut format %tdCCYYNNDD TOM_DATUM gen tom = mofd(TOM_DATUM) format % tm tom * Here I create a variable that indicates if the difference between avslutsdatum and TOM_DATUM is less than or equal to 12 months and the AVREG_KOD should indicate =="AV" gen diff_avslut_tom = 0 replace diff_avslut_tom = 1 if (tom) - (avslut) <= 365 & AVREG_KOD=="AV"
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#14

05 Sep 2022, 17:48

The variable first_year_unregistered take 0-3 values, what does these values indicates?

In #7 you said

Further, if they are not registered in the following period upto three years after the rejection period, WHEN did they got unregistered.

This variable answers that question. If it is 0, then the person was unregistered in the rejection period itself. If it is 1, the person first became unregistered in the year after the rejection period. If 2, the person first became unregistered in the second year after the rejection period. If 3, the person first became unregistered in the third year after the rejection period. If it is missing, the person was never unregistered in any of those periods--that is, they were registered during the rejection period itself and in each of the three following years.

Concerning the new variable you speak of creating in #13, the only problem I see with that code is that you have overlooked the fact that in Stata variable names are case sensitive. So, you cannot refer to TOM_DATUM as tom--you must use upper case TOM. I will add as a matter of style, that it is not necessary to use two commands to create the variable diff_avslut_tom. You will get the same result with one line:

Code:

gen diff_avslut_tom = (TOM_DATUM - avslutsdatum <= 365) & AVREG_KOD == "AV"
Comment

Announcement

Merge with if keep condition ?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment