  • Merge and sum simultaneously

    Hello,

    I have two datasets that I would like to merge, but I need a summation (from Dataset 2 below) to occur before the actual merge, with that summed value then added on to Dataset 1. I have attached a fuller Excel file that contains two example accounts and the values that I'd like summed and merged.

    Dataset 1 contains monthly bills for a given account, each with a start and end date (the original dataset has about 3,000 accounts over 10 years); each account is generally billed on different date intervals. Dataset 2 is a daily ET factor. The shared variable between the two datasets is microzone (of which there are about 50), and the dates obviously play a role. For a given account and bill, I'd like the ET factor summed over those specific dates (and everything that falls between them) and merged onto that account. Below is an example of what the data look like; please let me know what you think is best, and thank you in advance.

    Dataset 1
    account usage perunitrate startdate enddate microzone
    30 22 1.12 10-Dec-04 11-Jan-05 10860762
    30 14 1.12 11-Jan-05 10-Feb-05 10860762
    30 15 1.12 10-Feb-05 14-Mar-05 10860762
    Dataset 2
    et_factor microzone effectivedate
    0.25 10860762 8-Jan-05
    0.21 10860762 9-Jan-05
    0.19 10860762 10-Jan-05
    0.14 10860762 11-Jan-05
    0.08 10860762 12-Jan-05
    0.07 10860762 13-Jan-05
    0.08 10860762 14-Jan-05
    0.09 10860762 15-Jan-05
    0.1 10860762 16-Jan-05
    0.09 10860762 17-Jan-05
    0.1 10860762 18-Jan-05
    0.1 10860762 19-Jan-05
    0.11 10860762 20-Jan-05
    0.09 10860762 21-Jan-05
    0.1 10860762 22-Jan-05
    0.09 10860762 23-Jan-05
    0.09 10860762 24-Jan-05
    0.09 10860762 25-Jan-05
    0.08 10860762 26-Jan-05
    0.06 10860762 27-Jan-05
    0.07 10860762 28-Jan-05
    0.07 10860762 29-Jan-05
    0.08 10860762 30-Jan-05
    0.08 10860762 31-Jan-05
    0.05 10860762 1-Feb-05
    0.09 10860762 2-Feb-05
    0.1 10860762 3-Feb-05
    0.2 10860762 4-Feb-05
    0.15 10860762 5-Feb-05
    0.18 10860762 6-Feb-05
    0.26 10860762 7-Feb-05
    0.24 10860762 8-Feb-05
    0.22 10860762 9-Feb-05
    0.17 10860762 10-Feb-05
    0.13 10860762 11-Feb-05
    0.08 10860762 12-Feb-05
    0.07 10860762 13-Feb-05
    0.08 10860762 14-Feb-05

  • #2
    The flavor of solution I have in mind involves using the enddate variable together with egen's cut() function to generate a grouping variable, over which you could then collapse (sum) et_factor, by(enddate_groups). This also seems like a case where you might be looking for joinby rather than merge; a rough, untested sketch of that idea follows. I will try to think through an actual solution later.
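
    Something along these lines might work (untested, with made-up file names: "bills.dta" for Dataset 1 and "et_daily.dta" for Dataset 2, and assuming the ET file's date has already been converted to a numeric Stata date named effdate):

    Code:
    * untested sketch of the joinby idea; file and variable names are placeholders
    use "bills.dta", clear
    gen start_d = date(startdate, "DM20Y")
    gen end_d   = date(enddate, "DM20Y")
    format %td start_d end_d
    * pair each bill with every daily row from the same microzone
    joinby microzone using "et_daily.dta"
    * keep only the days that fall within the billing period (endpoints included)
    keep if inrange(effdate, start_d, end_d)
    * sum the daily factors back down to one row per bill
    collapse (sum) etf_sum = et_factor, by(account microzone startdate enddate usage perunitrate)

    One caveat: joinby forms all pairings within a microzone before the keep, so with 10 years of daily data the intermediate dataset can get very large.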



    • #3
      I just wanted to follow up on this and see if anyone had any ideas?



      • #4
        I'm working on a command that merges datasets using a range of values for a key variable, but it's not ready for prime time. In the meantime, here's an approach that may do what you want. It is based on the idea that you can combine two very different datasets, observation by observation.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(account usage) float perunitrate str9(startdate enddate) double microzone
        30 22 1.12 "10-Dec-04" "11-Jan-05" 10860762 
        30 14 1.12 "11-Jan-05" "10-Feb-05" 10860762 
        30 15 1.12 "10-Feb-05" "14-Mar-05" 10860762 
        end
        
        gen date1 = date(startdate,"DM20Y")
        gen date2 = date(enddate,"DM20Y")
        format %td date*
        tempfile data1
        save "`data1'"
        
        
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(et_factor) double microzone str9 effectivedate
        .25 10860762 "8-Jan-05" 
        .20999999 10860762 "9-Jan-05" 
        .19 10860762 "10-Jan-05" 
        .14 10860762 "11-Jan-05" 
        .08 10860762 "12-Jan-05" 
        .07 10860762 "13-Jan-05" 
        .08 10860762 "14-Jan-05" 
        .09 10860762 "15-Jan-05" 
        .1 10860762 "16-Jan-05" 
        .09 10860762 "17-Jan-05" 
        .1 10860762 "18-Jan-05" 
        .1 10860762 "19-Jan-05" 
        .11 10860762 "20-Jan-05" 
        .09 10860762 "21-Jan-05" 
        .1 10860762 "22-Jan-05" 
        .09 10860762 "23-Jan-05" 
        .09 10860762 "24-Jan-05" 
        .09 10860762 "25-Jan-05" 
        .08 10860762 "26-Jan-05" 
        .06 10860762 "27-Jan-05" 
        .07 10860762 "28-Jan-05" 
        .07 10860762 "29-Jan-05" 
        .08 10860762 "30-Jan-05" 
        .08 10860762 "31-Jan-05" 
        .05 10860762 "1-Feb-05" 
        .09 10860762 "2-Feb-05" 
        .1 10860762 "3-Feb-05" 
        .2 10860762 "4-Feb-05" 
        .15000001 10860762 "5-Feb-05" 
        .18000001 10860762 "6-Feb-05" 
        .25999999 10860762 "7-Feb-05" 
        .23999999 10860762 "8-Feb-05" 
        .22 10860762 "9-Feb-05" 
        .17 10860762 "10-Feb-05" 
        .13 10860762 "11-Feb-05" 
        .08 10860762 "12-Feb-05" 
        .07 10860762 "13-Feb-05" 
        .08 10860762 "14-Feb-05" 
        end
        gen edate = date(effectivedate,"DM20Y")
        format %td edate
        
        * combine with data1, I assume that variable names do not overlap
        rename microzone microzone2
        merge 1:1 _n using "`data1'", nogen
        sort account startdate
        
        * loop over each observation that came from data1 (assumes that account is numeric)
        gen obs = _n
        sum obs if !mi(account), meanonly
        local naccounts = r(N)
        
        gen etf_sum = .
        forvalues i = 1/`naccounts' {
            sum et_factor if microzone2 == microzone[`i'] & ///
                inrange(edate,date1[`i'],date2[`i']), meanonly
            replace etf_sum = r(sum) in `i'
        }
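
        Note that after the loop the et_factor rows are still stacked alongside the bills, so you may want a cleanup step along these lines (untested) to get back to one row per bill:

        Code:
        keep if !mi(account)
        drop obs et_factor microzone2 effectivedate edate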



        • #5
          Hi Robert,

          Thanks a lot for your reply. It partially worked, but I'm having a little trouble with the full implementation. If I isolate just one account in the equivalent of your 'data1' and the corresponding microzone in the et_factors file and run your code, only the first two etf_sums are correct; the third is partial/truncated where the daily rates run past the rows that came from 'data1' (missing values beyond the three et_factor file columns), and the remaining sums are zero. When I use more than one account, or the full dataset, either all of the etf_sums are zero, or something stranger happens: only the first three observations for a given account are nonzero. Any thoughts on what I might be doing wrong, or how we could update the code? There are about 50 microzones and 3,000 accounts with about 140 observations each.

          I've attached the do file I ran, along with a partial data file containing two accounts and another containing the ET information for the two corresponding microzones. Thanks again, and please let me know if I can provide additional information.



          • #6
            You made an error converting my example code to match your data. This type of error could have been avoided if you had used dataex (from SSC) to post your initial data example. Here's a revised version; see if you can spot the error:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long(account Microzone_No) float(readdate_prior readdate perunitrate)
            28 10860762 16415 16447 1.12 
            28 10860762 16447 16477 1.12 
            28 10860762 16477 16509 1.12 
            10092 10880752 16425 16456 1.28 
            10092 10880752 16456 16485 1.28 
            10092 10880752 16485 16517 1.28 
            end
            format %td readdate_prior
            format %td readdate
            tempfile accounts
            save "`accounts'"
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long microzone float(effectivedate et_factor)
            10860762 16437 .05 
            10860762 16440 .19 
            10860762 16443 .25999999 
            10860762 16445 .20999999 
            10860762 16451 .09 
            10860762 16452 .1 
            10860762 16456 .11 
            10860762 16468 .05 
            10860762 16478 .13 
            10880752 16441 .15000001 
            10880752 16448 .08 
            10880752 16451 .09 
            10880752 16454 .1 
            10880752 16479 .08 
            10880752 16480 .07 
            end
            format %td effectivedate
            
            merge 1:1 _n using "`accounts'", nogen
            sort account readdate
            gen obs = _n
            sum obs if !mi(account), meanonly
            local naccounts = r(N)
            gen etf_sum = .
            forvalues i = 1/`naccounts' {
                sum et_factor if Microzone_No[`i'] == microzone & ///
                    inrange(effectivedate,readdate_prior[`i'],readdate[`i']), meanonly
                replace etf_sum = r(sum) in `i'
            }
            
            * always a good idea to do a spot check; use the fourth observation
            list Microzone_No readdate_prior readdate etf_sum in 4
            sum et_factor if microzone == 10880752 & ///
                inrange(effectivedate,date("20dec2004","DMY"),date("20jan2005","DMY"))
            dis r(sum)



            • #7
              It's the difference in where the `i' subscript sits relative to microzone. I thought I had sorted that out, but loops are all new to me, so thanks for your patience and time, Robert. Great stuff.

              If I may, just one more question. For about two years in the middle of my dataset, there is only one date (say enddate). Is there a way to pull the enddate from one row to serve as the startdate of the following row for a given account when the startdate is missing?

              The complicating factor, which can be seen below, is that there are (necessary) duplicate dates which represent tiers/different charges within the billamount (and other variables). Let me know when you have a chance, and thanks again.
              account billamount startdate enddate
              30 90.67 16-Apr-09 19-May-09
              30 102.04 19-May-09 16-Jun-09
              30 128.78 16-Jun-09 21-Jul-09
              30 113.74 21-Jul-09 18-Aug-09
              30 150.5 18-Aug-09 15-Sep-09
              30 181.7 20-Oct-09
              30 110.6 17-Nov-09
              30 52.14 22-Dec-09
              30 42.66 19-Jan-10
              30 15.8 16-Feb-10
              30 23.7 16-Mar-10
              30 91.64 19-Apr-10
              30 104.28 17-May-10
              30 181.7 15-Jun-10
              30 105.86 21-Jul-10
              30 61.62 17-Aug-10
              30 46.7 1-Sep-10
              30 65.52 20-Sep-10
              30 2.62 20-Sep-10
              30 4.06 18-Oct-10
              30 101.5 18-Oct-10
              30 3.64 15-Nov-10
              30 91 15-Nov-10



              • #8
                Please use dataex (from SSC) to prepare your data examples; it will increase your chances of getting a response and reduce the probability that you will incorrectly apply the solution to your situation. Here's one way of doing this:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input byte account float billamount int(startdate enddate)
                30 90.669998 18003 18036 
                30 102.04 18036 18064 
                30 128.78 18064 18099 
                30 113.74 18099 18127 
                30 150.5 18127 18155 
                30 181.7 . 18190 
                30 110.6 . 18218 
                30 52.139999 . 18253 
                30 42.66 . 18281 
                30 15.8 . 18309 
                30 23.700001 . 18337 
                30 91.639999 . 18371 
                30 104.28 . 18399 
                30 181.7 . 18428 
                30 105.86 . 18464 
                30 61.619999 . 18491 
                30 46.700001 . 18506 
                30 65.519997 . 18525 
                30 2.6199999 . 18525 
                30 4.0599999 . 18553 
                30 101.5 . 18553 
                30 3.6400001 . 18581 
                30 91 . 18581 
                end
                format %td startdate
                format %td enddate
                
                * identify the first observation within a group
                bysort account enddate (startdate): gen first = _n == 1
                
                * pick up the end date from the previous obs, but only for the first
                gen startdate2 = enddate[_n-1] if first
                
                * replicate the start date for all obs in the group
                by account enddate: replace startdate2 = startdate2[1]
                format %td startdate2
                For more information, see the Stata Journal article
                SJ-2-1 pr0004, Speaking Stata: How to move step by: step, N. J. Cox, Q1/02, SJ 2(1):86--102,
                which explains the use of the by varlist : construct to tackle a variety of problems with
                group structure, ranging from simple calculations for each of several groups to more
                advanced manipulations that use the built-in _n and _N.
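
                As a tiny illustration of that by: idiom using the bill variables above (just an illustration, not part of the solution):

                Code:
                bysort account (startdate): gen bill_seq = _n    // running bill number within each account
                by account: gen n_bills = _N                     // total number of bills for that account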



                • #9
                  Sorry for the delay in thanking you, Richard, but I just got back, and it worked great. I appreciate it. I will certainly use dataex going forward too, and thanks for the references.



                  • #10
                    Hi Robert,

                    If I may be so bold as to revisit this old post, how would one modify the code in #6 to take the mean value (not the sum) in this example? I tried toying with it, but my etf_mean variable resulted in "0 real changes made". I entered the following, which is obviously wrong:

                    Code:
                    merge 1:1 _n using "`accounts'", nogen
                    sort account readdate
                    gen obs = _n
                    sum obs if !mi(account), meanonly
                    local naccounts = r(N)
                    gen etf_mean = .
                    forvalues i = 1/`naccounts' {
                        mean et_factor if Microzone_No[`i'] == microzone & ///
                            inrange(effectivedate,readdate_prior[`i'],readdate[`i'])
                        replace etf_mean = r(mean) in `i'
                    }
                    Any suggestions would be greatly appreciated! I'm sure future users will also benefit from this thread as I have!



                    • #11
                      It should be as simple as replacing r(sum) with r(mean). However, you can now use rangejoin (from SSC) to do this. Using the two datasets from #5, here's how to calculate both the sum and mean:

                      Code:
                      use "account_for_statalist.dta", clear
                      
                      * align variable name and confirm that observations are uniquely identified
                      rename Microzone_No microzone
                      isid microzone account readdate, sort
                      
                      * match each observation to observations in range from the using dataset (same microzone)
                      rangejoin effectivedate readdate_prior readdate using "et_factors_statalist.dta", by(microzone)
                      
                      * sum et_factor from observations in range
                      bysort microzone account readdate: egen res_sum  = total(et_factor)
                      by microzone account readdate: egen res_mean = mean(et_factor)
                      
                      * return to original observations
                      by microzone account readdate: keep if _n == 1
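
                      For completeness, the loop-based fix mentioned in the first sentence applies to the code in #10, not to the rangejoin steps above: keep summarize with the meanonly option and read r(mean) instead of r(sum). The mean command used in #10 is an estimation command that leaves its results in e() rather than r(), which is why etf_mean never changed. A sketch:

                      Code:
                      gen etf_mean = .
                      forvalues i = 1/`naccounts' {
                          sum et_factor if Microzone_No[`i'] == microzone & ///
                              inrange(effectivedate,readdate_prior[`i'],readdate[`i']), meanonly
                          replace etf_mean = r(mean) in `i'
                      }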
