Merging monthly and yearly data for respective last 12months

Tim Wolf

Join Date: Mar 2019

Posts: 25
#1

Merging monthly and yearly data for respective last 12months

26 Mar 2019, 07:42

Hello,

I have two data sets which I want to merge. In the first dataset A ,I do have the following variables: id, year, month (and further variables which are not of interest in this case.)
In the second dataset B, I have the variables: id, year, month, costs.

My problem is the following: The costs in dataset B refer to the average costs over the last 12 months and are reported for each id only once a year, but at different months. For one id they are reported yearly in september, for another id they are reported yearly in december etc.

If I now match the datasets using the following code

Code:

merge m:m id year using "B"

the costs are matched on a yearly basis. Meaning costs reported in September2018 for a given id are merged with each month in 2018. But I want every value for costs from dataset B to be merged with the respective last 12 months in dataset A. In this case that means that costs reported in September 2018 are matched with the months October 2017-September2018.

I did search in the forum but did not find any solutions.

Does anyone have an idea how to solve this?

Thank you for your help

Tim Wolf
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

26 Mar 2019, 13:15

-merge m:m- just produces data salad. Whenever you even think of using it, you are thinking incorrectly about your problem.

What you need here is to -merge- the two data sets on id month and year, and then spread the cost value over the year.

Code:

use datasetA, clear merge 1:1 id year month using datasetB by id year (costs), sort: assert costs == costs[1] | missing(costs) by id year (costs): replace costs = costs[1]
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

26 Mar 2019, 13:16

Or here's another approach:

Code:

use dataset B, clear keep if !missing(costs) isid id year, sort drop month merge 1:m id year using datasetA
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

26 Mar 2019, 14:59

Clyde Schechter provides an answer that I started to write, but rereading post #1 convinced me I had missed a key point.

But I want every value for costs from dataset B to be merged with the respective last 12 months in dataset A. In this case that means that costs reported in September 2018 are matched with the months October 2017-September2018.

I think this is a job for rangejoin (user-written, see the output of ssc describe rangejoin, but SSC is not responding as I write this) and I have no experience with that command and without data to test on, I'll leave it to experts to assess this. It will certainly involve creating Stata SIF monthly dates in both dataset A and dataset B following the guidance in help datetime.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

26 Mar 2019, 15:24

Thank you, William Lisowski, for pointing out my misreading of the original request. My code in both #2 and #3 will put the value of costs from datasetB into every month of the same calendar year, not the preceding 12 months. He is also correct that this is, with a little extra, a job for -rangejoin-.

Code:

// CREATE TOY DATA TO DEMONSTRATE CODE clear* set obs 200 gen firm = ceil(_n/100) by firm, sort: gen mdate = tm(1999m12) + _n format mdate %tm gen year = year(dofm(mdate)) gen month = month(dofm(mdate)) set seed 1234 gen costs = rgamma(5, 10) if (firm == 1 & month == 9) /// | (firm == 2 & month == 12) drop mdate tempfile datasetB save `datasetB' describe drop costs tempfile datasetA save `datasetA' describe // CODE TO SOLVE PROBLEM BEGINS HERE use `datasetB', clear keep if !missing(costs) gen mdate = ym(year, month) format mdate %tm tempfile cost_data save `cost_data' use `datasetA', clear gen mdate = ym(year, month) format mdate %tm rangejoin mdate 0 11 using `cost_data', by(firm)

-rangejoin- is by Robert Picard and is available from SSC. To use it, you must also have -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Added: I have interpreted "past 12 months" to mean including the month in which the costs are reported and extending back 11 additional months from there. So if the costs are reported in September 2005, I assume they cover the period from October 2004 through September 2005. If it was intended that they should cover September 2004 through August 2005, change 0 11 in the -rangejoin- command to 1 12.

Last edited by Clyde Schechter; 26 Mar 2019, 15:33.
1 like
Comment
Tim Wolf

Join Date: Mar 2019

Posts: 25
#6

27 Mar 2019, 04:46

Thank you very much for your help,Clyde Schechter and William Lisowski, the code works perfect.

I do have two small follow-up questions, but I do not think they are each worth a new thread. If you think otherwise, let me know and I will open a new thread.

Question 1:
Within Dataset A I do sometimes have observations for a certain id lets say till December2018. But the latest costs from Dataset B for this id are lets say reported for October2018. For now, after using rangestat this leaves my observations for November2018 and December2018 without matched costs.

Is there a way to tell stata to take the latest reported costs (in this case October2018) and roll them forward if there are observations for the same id in Dataset A (in this case November2018 and December2018)?

Question 2:
Within Dataset B sometimes I do have the following observations including a duplicate in lines 2 and 3.

Code:

id year month costs 1 2004 10 5 1 2005 10 6 1 2005 10 6 1 2007 11 8

Is there a way to tell stata to adjust the year in line 3 by +1? So that in this case everything remains the same, except year in line 3 changes from 2005 to 2006.

Again thank you very much for your help.

Tim Wolf
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

27 Mar 2019, 06:07

I feel like we're on a slippery slope here. With Question 2, I notice that in 2007 the month becomes 11 instead of 10. That means that even after adjusting the year to 2006 in the duplicate observation there will be no cost assigned to month 11 of 2006. When you have gaps like that interior to your data, do you want to extend the value from October 2006 forward, as you describe in Question 1 for gaps at the end of the series? And also with Question 2, it is possible you have data with two duplicates of the original observation?

I wonder if you wouldn't be better served by doing a simple merge 1:1 id year (having first corrected duplicates in Dataset B) and then using ipolate (with the epolate option) to smoothly fill in the missing values.
Comment
Tim Wolf

Join Date: Mar 2019

Posts: 25
#8

27 Mar 2019, 09:22

I am sorry William Lisowski, the month 11 in line 4 was a mistake, it should be month 10, too. The duplicates are based on the fact, that for some years there is no observation of costs (in this case in year 2006), which is why the database where I got the data from just wrote year 2005 as duplicates.
Thank you for suggesting the ipolate function. However, the costs are a past-12-months-average and therefore I would like them to remain the same for each month in a given period. But I do like the idea to smooth the missing observations and I am wondering, if it is possible to tell stata to change the costs in line 3 to be the average of line 2 and 4. So that the data would end up looking the following:

Code:

id year month costs 1 2004 10 5 1 2005 10 6 1 2006 10 7 1 2007 10 8

Then I could use rangejoin again and my Question 2 would be perfectly solved.

Last edited by Tim Wolf; 27 Mar 2019, 09:33.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

27 Mar 2019, 13:19

Perhaps this (untested) change to part of Clyde's code from post #5 will correct the problem in Question 2.

Code:

//    CODE TO SOLVE PROBLEM BEGINS HERE
use `datasetB', clear
keep if !missing(costs)
sort id year, stable
// make sure there is at most one duplicate for any given id/year
by id (year): assert year!=year[_n-1] | year!=year[_n+1]
by id (year): replace costs = (costs[_n-1]+costs[_n+1])/2 if year==year[_n-1] & _n!=_N
by id (year): replace costs = costs[_n-1] if year==year[_n-1] & _n==_N
by id (year): replace year = year[_n-1]+1 if year==year[_n-1]
sort id year
gen mdate = ym(year, month)
format mdate %tm
tempfile cost_data
save `cost_data'

Comment

Tim Wolf

Join Date: Mar 2019

Posts: 25
#10

28 Mar 2019, 07:12

Thank you very much, it works.
Comment

Announcement

Merging monthly and yearly data for respective last 12months

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment