  • Merge and sum simultaneously

    Hello,

    I have two datasets that I would like to merge, but I need a summation (from Dataset 2 below) to occur before the actual merge, with that summed value then added on to Dataset 1. I have attached a fuller Excel file that contains two example accounts and the values that I'd like summed and merged.

    Dataset 1 contains monthly bills for a given account, each with a start and end date (the original dataset has about 3,000 accounts over 10 years); each account is generally billed on different date intervals. Dataset 2 is a daily ET factor. The shared variable between the two datasets is microzone (of which there are about 50), and the dates obviously play a role. For a given account and bill, I'd like the ET factor summed over those specific dates (and everything that falls between them) and merged onto that account. Below is an example of what the data look like; please let me know what you think is best, and thank you in advance.

    Dataset 1
    account usage perunitrate startdate enddate microzone
    30 22 1.12 10-Dec-04 11-Jan-05 10860762
    30 14 1.12 11-Jan-05 10-Feb-05 10860762
    30 15 1.12 10-Feb-05 14-Mar-05 10860762
    Dataset 2
    et_factor microzone effectivedate
    0.25 10860762 8-Jan-05
    0.21 10860762 9-Jan-05
    0.19 10860762 10-Jan-05
    0.14 10860762 11-Jan-05
    0.08 10860762 12-Jan-05
    0.07 10860762 13-Jan-05
    0.08 10860762 14-Jan-05
    0.09 10860762 15-Jan-05
    0.1 10860762 16-Jan-05
    0.09 10860762 17-Jan-05
    0.1 10860762 18-Jan-05
    0.1 10860762 19-Jan-05
    0.11 10860762 20-Jan-05
    0.09 10860762 21-Jan-05
    0.1 10860762 22-Jan-05
    0.09 10860762 23-Jan-05
    0.09 10860762 24-Jan-05
    0.09 10860762 25-Jan-05
    0.08 10860762 26-Jan-05
    0.06 10860762 27-Jan-05
    0.07 10860762 28-Jan-05
    0.07 10860762 29-Jan-05
    0.08 10860762 30-Jan-05
    0.08 10860762 31-Jan-05
    0.05 10860762 1-Feb-05
    0.09 10860762 2-Feb-05
    0.1 10860762 3-Feb-05
    0.2 10860762 4-Feb-05
    0.15 10860762 5-Feb-05
    0.18 10860762 6-Feb-05
    0.26 10860762 7-Feb-05
    0.24 10860762 8-Feb-05
    0.22 10860762 9-Feb-05
    0.17 10860762 10-Feb-05
    0.13 10860762 11-Feb-05
    0.08 10860762 12-Feb-05
    0.07 10860762 13-Feb-05
    0.08 10860762 14-Feb-05

  • #2
    The flavor of solution I have in mind involves using the enddate variable together with egen's cut() function to generate a grouping variable, over which you could then collapse (sum) et_factor, by(enddate_groups). This also seems like a case where you might be looking for joinby rather than merge; a rough, untested sketch of that idea follows. I will try to think through an actual solution later.
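
    Something along these lines might work (untested, with made-up file names: "bills.dta" for Dataset 1 and "et_daily.dta" for Dataset 2, and assuming the ET file's date has already been converted to a numeric Stata date named effdate):

    Code:
    * untested sketch of the joinby idea; file and variable names are placeholders
    use "bills.dta", clear
    gen start_d = date(startdate, "DM20Y")
    gen end_d   = date(enddate, "DM20Y")
    format %td start_d end_d
    * pair each bill with every daily row from the same microzone
    joinby microzone using "et_daily.dta"
    * keep only the days that fall within the billing period (endpoints included)
    keep if inrange(effdate, start_d, end_d)
    * sum the daily factors back down to one row per bill
    collapse (sum) etf_sum = et_factor, by(account microzone startdate enddate usage perunitrate)

    One caveat: joinby forms all pairings within a microzone before the keep, so with 10 years of daily data the intermediate dataset can get very large.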



    • #3
      I just wanted to follow up on this and see if anyone had any ideas?



      • #4
        I'm working on a command that merges datasets using a range of values for a key variable, but it's not ready for prime time. In the meantime, here's an approach that may do what you want. It is based on the idea that you can combine two very different datasets, observation by observation.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(account usage) float perunitrate str9(startdate enddate) double microzone
        30 22 1.12 "10-Dec-04" "11-Jan-05" 10860762 
        30 14 1.12 "11-Jan-05" "10-Feb-05" 10860762 
        30 15 1.12 "10-Feb-05" "14-Mar-05" 10860762 
        end
        
        gen date1 = date(startdate,"DM20Y")
        gen date2 = date(enddate,"DM20Y")
        format %td date*
        tempfile data1
        save "`data1'"
        
        
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(et_factor) double microzone str9 effectivedate
        .25 10860762 "8-Jan-05" 
        .20999999 10860762 "9-Jan-05" 
        .19 10860762 "10-Jan-05" 
        .14 10860762 "11-Jan-05" 
        .08 10860762 "12-Jan-05" 
        .07 10860762 "13-Jan-05" 
        .08 10860762 "14-Jan-05" 
        .09 10860762 "15-Jan-05" 
        .1 10860762 "16-Jan-05" 
        .09 10860762 "17-Jan-05" 
        .1 10860762 "18-Jan-05" 
        .1 10860762 "19-Jan-05" 
        .11 10860762 "20-Jan-05" 
        .09 10860762 "21-Jan-05" 
        .1 10860762 "22-Jan-05" 
        .09 10860762 "23-Jan-05" 
        .09 10860762 "24-Jan-05" 
        .09 10860762 "25-Jan-05" 
        .08 10860762 "26-Jan-05" 
        .06 10860762 "27-Jan-05" 
        .07 10860762 "28-Jan-05" 
        .07 10860762 "29-Jan-05" 
        .08 10860762 "30-Jan-05" 
        .08 10860762 "31-Jan-05" 
        .05 10860762 "1-Feb-05" 
        .09 10860762 "2-Feb-05" 
        .1 10860762 "3-Feb-05" 
        .2 10860762 "4-Feb-05" 
        .15000001 10860762 "5-Feb-05" 
        .18000001 10860762 "6-Feb-05" 
        .25999999 10860762 "7-Feb-05" 
        .23999999 10860762 "8-Feb-05" 
        .22 10860762 "9-Feb-05" 
        .17 10860762 "10-Feb-05" 
        .13 10860762 "11-Feb-05" 
        .08 10860762 "12-Feb-05" 
        .07 10860762 "13-Feb-05" 
        .08 10860762 "14-Feb-05" 
        end
        gen edate = date(effectivedate,"DM20Y")
        format %td edate
        
        * combine with data1, I assume that variable names do not overlap
        rename microzone microzone2
        merge 1:1 _n using "`data1'", nogen
        sort account startdate
        
        * loop over each observation that came from data1 (assumes that account is numeric)
        gen obs = _n
        sum obs if !mi(account), meanonly
        local naccounts = r(N)
        
        gen etf_sum = .
        forvalues i = 1/`naccounts' {
            sum et_factor if microzone2 == microzone[`i'] & ///
                inrange(edate,date1[`i'],date2[`i']), meanonly
            replace etf_sum = r(sum) in `i'
        }
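
        Note that after the loop the et_factor rows are still stacked alongside the bills, so you may want a cleanup step along these lines (untested) to get back to one row per bill:

        Code:
        keep if !mi(account)
        drop obs et_factor microzone2 effectivedate edate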



        • #5
          Hi Robert,

          Thanks a lot for your reply. It partially worked, but I'm having a little trouble with the full implementation. If I isolate just one account in the equivalent of your 'data1' and the corresponding microzone in the et_factors file and run your code, only the first two etf_sums are correct; the third is partial/truncated where the daily rates run past the rows that came from 'data1' (missing values beyond the three et_factor file columns), and the remaining sums are zero. When I use more than one account, or the full dataset, either all of the etf_sums are zero, or something stranger happens: only the first three observations for a given account are nonzero. Any thoughts on what I might be doing wrong, or how we could update the code? There are about 50 microzones and 3,000 accounts with about 140 observations each.

          I've attached the do file I ran, along with a partial data file containing two accounts and another containing the ET information for the two corresponding microzones. Thanks again, and please let me know if I can provide additional information.



          • #6
            You made an error converting my example code to match your data. This type of error could have been avoided if you had used dataex (from SSC) to post your initial data example. Here's a revised version; see if you can spot the error:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long(account Microzone_No) float(readdate_prior readdate perunitrate)
            28 10860762 16415 16447 1.12 
            28 10860762 16447 16477 1.12 
            28 10860762 16477 16509 1.12 
            10092 10880752 16425 16456 1.28 
            10092 10880752 16456 16485 1.28 
            10092 10880752 16485 16517 1.28 
            end
            format %td readdate_prior
            format %td readdate
            tempfile accounts
            save "`accounts'"
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long microzone float(effectivedate et_factor)
            10860762 16437 .05 
            10860762 16440 .19 
            10860762 16443 .25999999 
            10860762 16445 .20999999 
            10860762 16451 .09 
            10860762 16452 .1 
            10860762 16456 .11 
            10860762 16468 .05 
            10860762 16478 .13 
            10880752 16441 .15000001 
            10880752 16448 .08 
            10880752 16451 .09 
            10880752 16454 .1 
            10880752 16479 .08 
            10880752 16480 .07 
            end
            format %td effectivedate
            
            merge 1:1 _n using "`accounts'", nogen
            sort account readdate
            gen obs = _n
            sum obs if !mi(account), meanonly
            local naccounts = r(N)
            gen etf_sum = .
            forvalues i = 1/`naccounts' {
                sum et_factor if Microzone_No[`i'] == microzone & ///
                    inrange(effectivedate,readdate_prior[`i'],readdate[`i']), meanonly
                replace etf_sum = r(sum) in `i'
            }
            
            * always a good idea to do a spot check; use the fourth observation
            list Microzone_No readdate_prior readdate etf_sum in 4
            sum et_factor if microzone == 10880752 & ///
                inrange(effectivedate,date("20dec2004","DMY"),date("20jan2005","DMY"))
            dis r(sum)



            • #7
              It's the difference in where the `i' subscript sits relative to microzone. I thought I had sorted that out, but loops are all new to me, so thanks for your patience and time, Robert. Great stuff.

              If I may, just one more question. For about two years in the middle of my dataset, there is only one date (say enddate). Is there a way to pull the enddate from one row to serve as the startdate of the following row for a given account when the startdate is missing?

              The complicating factor, which can be seen below, is that there are (necessary) duplicate dates which represent tiers/different charges within the billamount (and other variables). Let me know when you have a chance, and thanks again.
              account billamount startdate enddate
              30 90.67 16-Apr-09 19-May-09
              30 102.04 19-May-09 16-Jun-09
              30 128.78 16-Jun-09 21-Jul-09
              30 113.74 21-Jul-09 18-Aug-09
              30 150.5 18-Aug-09 15-Sep-09
              30 181.7 20-Oct-09
              30 110.6 17-Nov-09
              30 52.14 22-Dec-09
              30 42.66 19-Jan-10
              30 15.8 16-Feb-10
              30 23.7 16-Mar-10
              30 91.64 19-Apr-10
              30 104.28 17-May-10
              30 181.7 15-Jun-10
              30 105.86 21-Jul-10
              30 61.62 17-Aug-10
              30 46.7 1-Sep-10
              30 65.52 20-Sep-10
              30 2.62 20-Sep-10
              30 4.06 18-Oct-10
              30 101.5 18-Oct-10
              30 3.64 15-Nov-10
              30 91 15-Nov-10



              • #8
                Please use dataex (from SSC) to prepare your data examples; it will increase your chances of getting a response and reduce the probability that you will incorrectly apply the solution to your situation. Here's one way of doing this:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input byte account float billamount int(startdate enddate)
                30 90.669998 18003 18036 
                30 102.04 18036 18064 
                30 128.78 18064 18099 
                30 113.74 18099 18127 
                30 150.5 18127 18155 
                30 181.7 . 18190 
                30 110.6 . 18218 
                30 52.139999 . 18253 
                30 42.66 . 18281 
                30 15.8 . 18309 
                30 23.700001 . 18337 
                30 91.639999 . 18371 
                30 104.28 . 18399 
                30 181.7 . 18428 
                30 105.86 . 18464 
                30 61.619999 . 18491 
                30 46.700001 . 18506 
                30 65.519997 . 18525 
                30 2.6199999 . 18525 
                30 4.0599999 . 18553 
                30 101.5 . 18553 
                30 3.6400001 . 18581 
                30 91 . 18581 
                end
                format %td startdate
                format %td enddate
                
                * identify the first observation within a group
                bysort account enddate (startdate): gen first = _n == 1
                
                * pick up the end date from the previous obs, but only for the first
                gen startdate2 = enddate[_n-1] if first
                
                * replicate the start date for all obs in the group
                by account enddate: replace startdate2 = startdate2[1]
                format %td startdate2
                For more information, see the Stata Journal article
                SJ-2-1 pr0004, Speaking Stata: How to move step by: step, N. J. Cox, Q1/02, SJ 2(1):86--102,
                which explains the use of the by varlist : construct to tackle a variety of problems with
                group structure, ranging from simple calculations for each of several groups to more
                advanced manipulations that use the built-in _n and _N.
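
                As a tiny illustration of that by: idiom using the bill variables above (just an illustration, not part of the solution):

                Code:
                bysort account (startdate): gen bill_seq = _n    // running bill number within each account
                by account: gen n_bills = _N                     // total number of bills for that account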



                • #9
                  Sorry for the delay in thanking you, Richard, but I just got back, and it worked great. I appreciate it. I will certainly use dataex going forward too, and thanks for the references.



                  • #10
                    Hi Robert,

                    If I may be so bold as to revisit this old post, how would one modify the code in #6 to take the mean value (not the sum) in this example? I tried toying with it, but my etf_mean variable resulted in "0 real changes made". I entered the following, which is obviously wrong:

                    Code:
                    merge 1:1 _n using "`accounts'", nogen
                    sort account readdate
                    gen obs = _n
                    sum obs if !mi(account), meanonly
                    local naccounts = r(N)
                    gen etf_mean = .
                    forvalues i = 1/`naccounts' {
                        mean et_factor if Microzone_No[`i'] == microzone & ///
                            inrange(effectivedate,readdate_prior[`i'],readdate[`i'])
                        replace etf_mean = r(mean) in `i'
                    }
                    Any suggestions would be greatly appreciated! I'm sure future users will also benefit from this thread as I have!



                    • #11
                      It should be as simple as replacing r(sum) with r(mean). However, you can now use rangejoin (from SSC) to do this. Using the two datasets from #5, here's how to calculate both the sum and mean:

                      Code:
                      use "account_for_statalist.dta", clear
                      
                      * align variable name and confirm that observations are uniquely identified
                      rename Microzone_No microzone
                      isid microzone account readdate, sort
                      
                      * match each observation to observations in range from the using dataset (same microzone)
                      rangejoin effectivedate readdate_prior readdate using "et_factors_statalist.dta", by(microzone)
                      
                      * sum et_factor from observations in range
                      bysort microzone account readdate: egen res_sum  = total(et_factor)
                      by microzone account readdate: egen res_mean = mean(et_factor)
                      
                      * return to original observations
                      by microzone account readdate: keep if _n == 1
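
                      For completeness, the loop-based fix mentioned in the first sentence applies to the code in #10, not to the rangejoin steps above: keep summarize with the meanonly option and read r(mean) instead of r(sum). The mean command used in #10 is an estimation command that leaves its results in e() rather than r(), which is why etf_mean never changed. A sketch:

                      Code:
                      gen etf_mean = .
                      forvalues i = 1/`naccounts' {
                          sum et_factor if Microzone_No[`i'] == microzone & ///
                              inrange(effectivedate,readdate_prior[`i'],readdate[`i']), meanonly
                          replace etf_mean = r(mean) in `i'
                      }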
