Questions on overlapping times in data, and merging in the presence of duplicates

Paichen Li

Join Date: Aug 2023

Posts: 2
#1

Questions on overlapping times in data, and merging in the presence of duplicates

16 Aug 2023, 17:09

Hi Statalisters,

As in the title, I have two separate questions
1. Overlapping times in data: I have a dataset X on household enrollment in electricity rate plans. Some households switch around during the time I study, and certain data entries are not recorded properly. For example, certain households have plan A from 2012/01 to 2013/12, plan B from 2014/01 to 2015/12, and switched back to plan A from 2016/01 to 2017/12. In the dataset, the household is registered in two rows as having plan A from 2012/01 to 2017/12, and plan B from 2014/01 to 2015/12. I would like to find a way to register the household in three rows in the time order going from plans A to B, and to A.

2. Merging duplicates: In the same dataset X, I have two IDs for households, an account ID and a service agreement ID (or SA ID). Each household will have their unique account ID that they keep throughout the sample period, while SA IDs may change for households if their requested service agreement changes. There are usually fewer than five rows for each household because an additional row is only needed when enrollment status changes during the time period. For households that did experience this change, there would be multiple rows: some households have multiple rows with the same SA IDs (differing in other variables), while other households may have two rows with the same SA ID and a third one with a different SA ID. The dataset I am merging this with, dataset Y, is coded by SA ID only and includes monthly electricity billing information for each household (the issue is dataset Y does not have Account ID). The problem of a standard m:m merge on SA ID is that SA ID level observations in Y will be matched z times to observations in X, z being the number of rows that have the same SA ID. Ideally, I would like to find a way to get rid of the "bad matches", and merge not just on SA ID, but on time/billing cycles. The challenge is, however, that dataset X has time variables coded over longer periods (months or years), while dataset Y has monthly date that report the start and end date of each billing cycle. I would greatly appreciate any input on this.

Many thanks,
Paichen
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

16 Aug 2023, 22:14

Both of these are pretty complicated. And I doubt anybody can help you out when you do not show example data that will enable them to try out some code, get it working with the data organized as you have it, and check its performance. Please post back showing example data from both of the data sets you are working with. And use the -dataex- command to do that. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

Regarding your second question, the problem of a "standard m:m merge" is that in almost all circumstances it produces data salad. (The thought that anyone might consider m:m merge to be standard frightens me!) You should never use -merge m:m-. When you think you need to use -merge m:m- either you don't understand your data and how to get what you need with -merge- 1:1, 1:m, or m:1, or what you need is not a -merge- at all, but rather a -joinby-, -cross-, or -rangejoin-. While I don't entirely follow your explanation for your second question, I'm pretty confident that the solution will be with -joinby- or -rangejoin-, not _merge-.
1 like
Comment

Paichen Li

Join Date: Aug 2023
Posts: 2

25 Aug 2023, 12:02

Dear Clyde,

Thank you so much for your patient reply. That was my first time posting, and I very much appreciate your explaining good posting habits.

I am posting three datasets: "Data X", which includes overlapping times across start to end dates (period from one observation fully cover period from another observation of the same account ID); "Data Y" (full monthly data for each account ID), and "mergeXY", which is the desired product of the merge of X and Y.

Data X

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str1(AcctID SAID) int(Start End) str1(Var1 Var2)
"A" "X" 18993 19023 "J" "K"
"A" "Y" 19024 19052 "J" "L"
"A" "Y" 19053 19083 "H" "L"
"B" "P" 18996 19085 "J" "K"
"B" "Q" 19026 19055 "H" "K"
"C" "M" 18994 19084 "H" "L"
"D" "N" 18995 19086 "J" "L"
"D" "N" 19025 19054 "J" "K"
end
format %tddd-Mon-YY Start
format %tddd-Mon-YY End

Data Y

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str1 SAID int(Start End Var3)
"X" 18993 19023 111
"Y" 19024 19052 222
"Y" 19053 19083 333
"P" 18996 19026 121
"Q" 19026 19055 212
"P" 19055 19085 323
"M" 18994 19024 122
"M" 19024 19053 233
"M" 19053 19084 311
"N" 18995 19025 131
"N" 19025 19054 213
"N" 19054 19086 321
end
format %tddd-Mon-YY Start
format %tddd-Mon-YY End

MergeXY

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str1 SAID int(Start End) str1(Var1 Var2) int Var3
"X" 18993 19023 "J" "K" 111
"Y" 19024 19052 "J" "L" 222
"Y" 19053 19083 "H" "L" 333
"P" 18996 19026 "J" "K" 121
"Q" 19026 19055 "H" "K" 212
"P" 19055 19085 "J" "K" 323
"M" 18994 19024 "H" "L" 122
"M" 19024 19053 "H" "L" 233
"M" 19053 19084 "H" "L" 311
"N" 18995 19025 "J" "L" 131
"N" 19025 19054 "J" "K" 213
"N" 19054 19086 "J" "L" 321
end
format %tddd-Mon-YY Start
format %tddd-Mon-YY End

I would greatly appreciate guidance on how to address the time-related issues in data X, so that it could be successfully merged with data Y to produce "mergeXY".

Side note: I masked variable values as required by the NDA.

Attached Files

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#4

25 Aug 2023, 13:44

Well, this is pretty complicated. The code I am going to show gets you part of the way there, but does not completely solve the problem. Among the limitations that I recognize (there may be others that surface in the full data set that have not cropped up in my testing):

1. It handles the case where one interval in X lies inside another, but it will not handle a situation where yet a third interval lies inside the second. This doesn't happen in the example data, but if it does happen in the full data set, the code will break at the -assert depth <= 2- command. I think I know how to write it in greater generality, but if that's not needed, I'd rather not put in the effort to do that. If it is needed, I wouldn't want to do it without example data that includes such a case to check that my more general code handles it correctly.
2. It does not exactly produce the results you show for your desired XY result. My code produces some additional observations that are not in yours. I have not attempted to modify my code to match yours because based on my understanding of what you are trying to do, I believe that these extra observations actually belong in the result and that your example is incorrect for having omitted them. That is, I have matched each observation in Y with every observation in the expanded version of X, where the start to end ranges for the observations in the two data sets overlap. That's what I understand you to be going for. But perhaps I misunderstood that. If you stick to your example XY and need help in modifying the code to get that result, when posting back please point out which of the extra observations are to be dropped and explain why those.

Code:

clear* * Example generated by -dataex-. For more info, type help dataex clear input str1(AcctID SAID) int(Start End) str1(Var1 Var2) "A" "X" 18993 19023 "J" "K" "A" "Y" 19024 19052 "J" "L" "A" "Y" 19053 19083 "H" "L" "B" "P" 18996 19085 "J" "K" "B" "Q" 19026 19055 "H" "K" "C" "M" 18994 19084 "H" "L" "D" "N" 18995 19086 "J" "L" "D" "N" 19025 19054 "J" "K" end format %tddd-Mon-YY Start format %tddd-Mon-YY End tempfile x save `x' * Example generated by -dataex-. For more info, type help dataex clear input str1 SAID int(Start End Var3) "X" 18993 19023 111 "Y" 19024 19052 222 "Y" 19053 19083 333 "P" 18996 19026 121 "Q" 19026 19055 212 "P" 19055 19085 323 "M" 18994 19024 122 "M" 19024 19053 233 "M" 19053 19084 311 "N" 18995 19025 131 "N" 19025 19054 213 "N" 19054 19086 321 end format %tddd-Mon-YY Start format %tddd-Mon-YY End tempfile y save `y' use `x', clear assert !missing(Start, End) gen `c(obs_t)' obs_no = _n rename (Start End) date= reshape long date, i(obs_no) j(event) string isid AcctID date event, sort by AcctID (date event): gen depth = sum((event == "Start")-(event == "End")) assert depth <= 2 gen expander = 1 by AcctID (date event): replace expander = 2 if depth == 2 | depth[_n-1] == 2 expand expander, gen(new) egen grouper = group(AcctID date event) by grouper (new), sort: replace event = "End" if depth == 2 & new by grouper (new): replace date = date-1 if depth == 2 & new by grouper (new): replace event = "Start" if depth == 1 & new by grouper (new): replace date = date + 1 if depth == 1 & new isid AcctID date event, sort replace obs_no = sum(event == "Start") foreach v of varlist SAID Var1 Var2 { by AcctID (date event): replace `v' = `v'[_n-1] if event == "End" & new by AcctID (date event): replace `v' = `v'[_n+1] if event == "Start" & new } drop depth expander new grouper reshape wide order dateStart, before(dateEnd) count joinby SAID using `y' keep if max(Start, dateStart) <= min(End, dateEnd) drop obs_no rename date* *_x order Start End, after(AcctID) sort AcctID Start End

Even if we end up modifying this code for the reasons outlined above, or for other reasons, a couple of key points:
1. Pay attention to the creation and use of the variable depth in the above code. It identifies when one observation's start to end interval is contained in another's. This is an important technique that comes up in many applications.
2. Note the use of -joinby-, not -merge m:m-. NEVER use -merge m:m-. When people are tempted to use -merge m:m-, either they don't understand their data (or -merge-) and are not perceiving the -merge 1:1, merge 1:m, or merge m:1- that will work in their data, or, they really need to use -joinby- (or, less frequently, -cross-).
3. Note the -keep- command immediately following the -joinby-. This is how to select out the observations where the start-end intervals in the two data sets overlap each other. This is also a handy method that has frequent application.

Last edited by Clyde Schechter; 25 Aug 2023, 13:46.
Comment

Announcement