Hi Statalisters,
As in the title, I have two separate questions
1. Overlapping times in data: I have a dataset X on household enrollment in electricity rate plans. Some households switch around during the time I study, and certain data entries are not recorded properly. For example, certain households have plan A from 2012/01 to 2013/12, plan B from 2014/01 to 2015/12, and switched back to plan A from 2016/01 to 2017/12. In the dataset, the household is registered in two rows as having plan A from 2012/01 to 2017/12, and plan B from 2014/01 to 2015/12. I would like to find a way to register the household in three rows in the time order going from plans A to B, and to A.
2. Merging duplicates: In the same dataset X, I have two IDs for households, an account ID and a service agreement ID (or SA ID). Each household will have their unique account ID that they keep throughout the sample period, while SA IDs may change for households if their requested service agreement changes. There are usually fewer than five rows for each household because an additional row is only needed when enrollment status changes during the time period. For households that did experience this change, there would be multiple rows: some households have multiple rows with the same SA IDs (differing in other variables), while other households may have two rows with the same SA ID and a third one with a different SA ID. The dataset I am merging this with, dataset Y, is coded by SA ID only and includes monthly electricity billing information for each household (the issue is dataset Y does not have Account ID). The problem of a standard m:m merge on SA ID is that SA ID level observations in Y will be matched z times to observations in X, z being the number of rows that have the same SA ID. Ideally, I would like to find a way to get rid of the "bad matches", and merge not just on SA ID, but on time/billing cycles. The challenge is, however, that dataset X has time variables coded over longer periods (months or years), while dataset Y has monthly date that report the start and end date of each billing cycle. I would greatly appreciate any input on this.
Many thanks,
Paichen
As in the title, I have two separate questions
1. Overlapping times in data: I have a dataset X on household enrollment in electricity rate plans. Some households switch around during the time I study, and certain data entries are not recorded properly. For example, certain households have plan A from 2012/01 to 2013/12, plan B from 2014/01 to 2015/12, and switched back to plan A from 2016/01 to 2017/12. In the dataset, the household is registered in two rows as having plan A from 2012/01 to 2017/12, and plan B from 2014/01 to 2015/12. I would like to find a way to register the household in three rows in the time order going from plans A to B, and to A.
2. Merging duplicates: In the same dataset X, I have two IDs for households, an account ID and a service agreement ID (or SA ID). Each household will have their unique account ID that they keep throughout the sample period, while SA IDs may change for households if their requested service agreement changes. There are usually fewer than five rows for each household because an additional row is only needed when enrollment status changes during the time period. For households that did experience this change, there would be multiple rows: some households have multiple rows with the same SA IDs (differing in other variables), while other households may have two rows with the same SA ID and a third one with a different SA ID. The dataset I am merging this with, dataset Y, is coded by SA ID only and includes monthly electricity billing information for each household (the issue is dataset Y does not have Account ID). The problem of a standard m:m merge on SA ID is that SA ID level observations in Y will be matched z times to observations in X, z being the number of rows that have the same SA ID. Ideally, I would like to find a way to get rid of the "bad matches", and merge not just on SA ID, but on time/billing cycles. The challenge is, however, that dataset X has time variables coded over longer periods (months or years), while dataset Y has monthly date that report the start and end date of each billing cycle. I would greatly appreciate any input on this.
Many thanks,
Paichen
Comment