Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging monthly and yearly data for respective last 12months

    Hello,

    I have two data sets which I want to merge. In the first dataset A ,I do have the following variables: id, year, month (and further variables which are not of interest in this case.)
    In the second dataset B, I have the variables: id, year, month, costs.

    My problem is the following: The costs in dataset B refer to the average costs over the last 12 months and are reported for each id only once a year, but at different months. For one id they are reported yearly in september, for another id they are reported yearly in december etc.

    If I now match the datasets using the following code
    Code:
    merge m:m id year using "B"
    the costs are matched on a yearly basis. Meaning costs reported in September2018 for a given id are merged with each month in 2018. But I want every value for costs from dataset B to be merged with the respective last 12 months in dataset A. In this case that means that costs reported in September 2018 are matched with the months October 2017-September2018.

    I did search in the forum but did not find any solutions.

    Does anyone have an idea how to solve this?

    Thank you for your help

    Tim Wolf

  • #2
    -merge m:m- just produces data salad. Whenever you even think of using it, you are thinking incorrectly about your problem.

    What you need here is to -merge- the two data sets on id month and year, and then spread the cost value over the year.

    Code:
    use datasetA, clear
    merge 1:1 id year month using datasetB
    by id year (costs), sort: assert costs == costs[1] | missing(costs)
    by id year (costs): replace costs = costs[1]

    Comment


    • #3
      Or here's another approach:

      Code:
      use dataset B, clear
      keep if !missing(costs)
      isid id year, sort
      drop month
      merge 1:m id year using datasetA

      Comment


      • #4
        Clyde Schechter provides an answer that I started to write, but rereading post #1 convinced me I had missed a key point.

        But I want every value for costs from dataset B to be merged with the respective last 12 months in dataset A. In this case that means that costs reported in September 2018 are matched with the months October 2017-September2018.
        I think this is a job for rangejoin (user-written, see the output of ssc describe rangejoin, but SSC is not responding as I write this) and I have no experience with that command and without data to test on, I'll leave it to experts to assess this. It will certainly involve creating Stata SIF monthly dates in both dataset A and dataset B following the guidance in help datetime.

        Comment


        • #5
          Thank you, William Lisowski, for pointing out my misreading of the original request. My code in both #2 and #3 will put the value of costs from datasetB into every month of the same calendar year, not the preceding 12 months. He is also correct that this is, with a little extra, a job for -rangejoin-.

          Code:
          //    CREATE TOY DATA TO DEMONSTRATE CODE
          clear*
          set obs 200
          gen firm = ceil(_n/100)
          by firm, sort: gen mdate = tm(1999m12) + _n
          format mdate %tm
          gen year = year(dofm(mdate))
          gen month = month(dofm(mdate))
          set seed 1234
          gen costs = rgamma(5, 10) if (firm == 1 & month == 9) ///
              | (firm == 2 & month == 12)
          drop mdate
          tempfile datasetB
          save `datasetB'
          describe
          drop costs
          tempfile datasetA
          save `datasetA'
          describe
          
          
          //    CODE TO SOLVE PROBLEM BEGINS HERE
          use `datasetB', clear
          keep if !missing(costs)
          gen mdate = ym(year, month)
          format mdate %tm
          tempfile cost_data
          save `cost_data'
          
          use `datasetA', clear
          gen mdate = ym(year, month)
          format mdate %tm
          rangejoin mdate 0 11 using `cost_data', by(firm)
          -rangejoin- is by Robert Picard and is available from SSC. To use it, you must also have -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

          Added: I have interpreted "past 12 months" to mean including the month in which the costs are reported and extending back 11 additional months from there. So if the costs are reported in September 2005, I assume they cover the period from October 2004 through September 2005. If it was intended that they should cover September 2004 through August 2005, change 0 11 in the -rangejoin- command to 1 12.
          Last edited by Clyde Schechter; 26 Mar 2019, 15:33.

          Comment


          • #6
            Thank you very much for your help,Clyde Schechter and William Lisowski, the code works perfect.

            I do have two small follow-up questions, but I do not think they are each worth a new thread. If you think otherwise, let me know and I will open a new thread.

            Question 1:
            Within Dataset A I do sometimes have observations for a certain id lets say till December2018. But the latest costs from Dataset B for this id are lets say reported for October2018. For now, after using rangestat this leaves my observations for November2018 and December2018 without matched costs.

            Is there a way to tell stata to take the latest reported costs (in this case October2018) and roll them forward if there are observations for the same id in Dataset A (in this case November2018 and December2018)?

            Question 2:
            Within Dataset B sometimes I do have the following observations including a duplicate in lines 2 and 3.
            Code:
            id   year      month    costs
            1    2004       10         5
            1    2005       10         6
            1    2005       10         6
            1    2007       11         8
            Is there a way to tell stata to adjust the year in line 3 by +1? So that in this case everything remains the same, except year in line 3 changes from 2005 to 2006.


            Again thank you very much for your help.

            Tim Wolf

            Comment


            • #7
              I feel like we're on a slippery slope here. With Question 2, I notice that in 2007 the month becomes 11 instead of 10. That means that even after adjusting the year to 2006 in the duplicate observation there will be no cost assigned to month 11 of 2006. When you have gaps like that interior to your data, do you want to extend the value from October 2006 forward, as you describe in Question 1 for gaps at the end of the series? And also with Question 2, it is possible you have data with two duplicates of the original observation?

              I wonder if you wouldn't be better served by doing a simple merge 1:1 id year (having first corrected duplicates in Dataset B) and then using ipolate (with the epolate option) to smoothly fill in the missing values.

              Comment


              • #8
                I am sorry William Lisowski, the month 11 in line 4 was a mistake, it should be month 10, too. The duplicates are based on the fact, that for some years there is no observation of costs (in this case in year 2006), which is why the database where I got the data from just wrote year 2005 as duplicates.
                Thank you for suggesting the ipolate function. However, the costs are a past-12-months-average and therefore I would like them to remain the same for each month in a given period. But I do like the idea to smooth the missing observations and I am wondering, if it is possible to tell stata to change the costs in line 3 to be the average of line 2 and 4. So that the data would end up looking the following:
                Code:
                 id   year      month    costs  
                 1    2004       10         5  
                 1    2005       10         6   
                 1    2006       10         7   
                 1    2007       10         8
                Then I could use rangejoin again and my Question 2 would be perfectly solved.
                Last edited by Tim Wolf; 27 Mar 2019, 09:33.

                Comment


                • #9
                  Perhaps this (untested) change to part of Clyde's code from post #5 will correct the problem in Question 2.
                  Code:
                  //    CODE TO SOLVE PROBLEM BEGINS HERE
                  use `datasetB', clear
                  keep if !missing(costs)
                  sort id year, stable
                  // make sure there is at most one duplicate for any given id/year
                  by id (year): assert year!=year[_n-1] | year!=year[_n+1]
                  by id (year): replace costs = (costs[_n-1]+costs[_n+1])/2 if year==year[_n-1] & _n!=_N
                  by id (year): replace costs = costs[_n-1] if year==year[_n-1] & _n==_N
                  by id (year): replace year = year[_n-1]+1 if year==year[_n-1]
                  sort id year
                  gen mdate = ym(year, month)
                  format mdate %tm
                  tempfile cost_data
                  save `cost_data'

                  Comment


                  • #10
                    Thank you very much, it works.

                    Comment

                    Working...
                    X