  • Calculate Consensus Forecast for the current and past quarters

    Dear forum,
    I'm currently working on a project involving equity analyst forecasts from the IBES database. I ran into some issues when computing the consensus forecast, and I hope I can get some help with it.

    The consensus forecast defined in my project is the mean of all forecasts (excluding the focal forecast) issued in the current and past quarter for each firm and forecast period end date. For the very first forecast issued for each firm in each forecast period end date, the consensus forecast should be left missing. Please see the example data below. Each observation is at the forecast level, and only the most recent forecast of each analyst is kept in each quarter for each company (i.e., no duplicate forecasts within a quarter). The variables are defined as follows:
    "cusip" is the firm identifier;
    "analys" is the analyst identifier;
    "value" is the forecast estimate issued by each analyst;
    "fpedats" is the forecast period end date (i.e., the fiscal period end date for which the forecast is issued);
    "anndats" is the forecast announcement date, used to sort forecasts by issuing time;
    "yq" is short for year-quarter and is used as the time variable;
    "firm_fpe" is a unique identifier for each cusip-fpedats group, generated using
    Code:
    egen firm_fpe=group(cusip fpedats)
    "rolling_avg" is the consensus forecast computed using
    Code:
    rangestat (mean) rolling_avg=value, interval(yq -1 0) by(firm_fpe) excludeself
    The issue I encountered is that rolling_avg produces incorrect results, even though I don't see anything wrong in my use of the rangestat command.

    Basically, for the example below, the consensus forecast for the first observation should be missing, because it is the very first forecast for the 66541 firm_fpe group. This is the first incorrect observation.
    Then, for the second observation, the consensus forecast should be -3.7, which is the only forecast issued before it.
    Then, for the third observation, the consensus forecast should be the mean of -3.7 and -5.3, which is -4.5, and this one is correct.
    Lastly, for the fourth observation, the consensus forecast should be -4.76 instead of -5.61. Honestly, I have no idea how that number was calculated.

    For the observation whose -yq- is 182 (the last observation), the consensus forecast should be the mean of the forecasts from -yq- 181 (because we only consider the current and past quarter and exclude the focal forecast), which is -6.25 instead of the -5.19 shown.
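To spell the intended rule out unambiguously, here is a minimal sketch in Python (an illustration of the logic only, not of rangestat's internals; the `consensus` helper is a made-up name): for each forecast, it averages all strictly preceding forecasts of the same firm_fpe group whose yq is the current or previous quarter.

```python
# Illustration only (not Stata/rangestat): the intended consensus rule,
# computed by brute force over the example rows, which are already
# sorted by announcement date within the firm_fpe group.
rows = [  # (firm_fpe, yq, value)
    (66541, 180, -3.7), (66541, 180, -5.3), (66541, 180, -5.3),
    (66541, 181, -5.6), (66541, 181, -5.3), (66541, 181, -7.0),
    (66541, 181, -7.1), (66541, 182, -4.3),
]

def consensus(rows):
    out = []
    for i, (g, yq, _) in enumerate(rows):
        # all strictly preceding forecasts of the same firm_fpe group
        # issued in the current or immediately preceding quarter
        prev = [v for (gg, q, v) in rows[:i] if gg == g and yq - 1 <= q <= yq]
        out.append(sum(prev) / len(prev) if prev else None)  # None = missing
    return out

# consensus(rows) starts None, -3.7, -4.5, ... and ends with -6.25
```

This reproduces the expected values described above: missing for the first observation, -3.7 for the second, -4.5 for the third, and -6.25 for the last.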

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 cusip long analys float value str10 fpedats float(anndats yq firm_fpe) double rolling_avg
    "92047K10" 25170 -3.7 "2006-01-31" 16524 180 66541  -5.300000190734863
    "92047K10" 71346 -5.3 "2006-01-31" 16524 180 66541 -4.5000001192092896
    "92047K10" 52780 -5.3 "2006-01-31" 16525 180 66541 -4.5000001192092896
    "92047K10" 78488 -5.6 "2006-01-31" 16561 181 66541  -5.616666754086812
    "92047K10" 25170 -5.3 "2006-01-31" 16561 181 66541  -5.666666706403096
    "92047K10" 52780   -7 "2006-01-31" 16576 181 66541  -5.383333404858907
    "92047K10" 71346 -7.1 "2006-01-31" 16576 181 66541  -5.366666754086812
    "92047K10" 71346 -4.3 "2006-01-31" 16658 182 66541  -5.187499940395355
    end
    format %d anndats
    I couldn't figure out whether the incorrect calculation is due to my coding or to using the wrong command. Hopefully, I could get some help.

    I hope the question is clear, and I would very much appreciate your help. Please let me know if you need any clarification. Thanks in advance!

  • #2
    Something like this maybe.

    Code:
    asrol value , stat(mean) window(anndats 2) min(2) g(valuex) by(cusip)
    Are you averaging by cusip or across cusip (as in the dataex)? And all the dates are the same too, making it a bit tricky to use the dataex.



    • #3
      -rangestat- might work better with by(cusip) rather than by(firm_fpe) (which includes the date).



      • #4
        Hi George,
        Thank you for your reply. I will give -asrol- a try.
        I am averaging within cusip and forecast period end dates (fpedats) because I am trying to compute the consensus forecast for each firm and each fpedats.
        I thought creating firm_fpe would make things a lot easier, since I am computing the consensus forecast for each firm and each fpedats.
        In my dataset, I have many different "cusip" values and dates. My dataex only illustrates several incorrect calculations from rangestat.



        • #5
          egenmore might work, but I'm not sure you can combine by with its ma function.
          HTML Code:
          https://www.stata.com/support/faqs/statistics/moving-averages-and-panel-data/



          • #6
            Originally posted by George Ford View Post
            egenmore might work, but not sure you can by ma.
            HTML Code:
            https://www.stata.com/support/faqs/statistics/moving-averages-and-panel-data/
            Hi George,
            Thank you. Strictly speaking, my data is not panel data, since there are repeated time values (multiple analysts may issue forecasts in the same quarter).



            • #7
              Would it be easier to collapse those to a single number? Or are you interested in the variation? (You could collapse the sd too.)



              • #8
                Originally posted by George Ford View Post
                be easier to collapse those to a single number? or are you interested in the variation? (could collapse the sd too).
                Hi George,
                The data is at the forecast level, and different analysts may issue forecasts on the same day. Collapsing the data would distort it.
                Also, I tried the -asrol- command. It produces the same errors as -rangestat-, even though -asrol- doesn't require the data to be declared as a panel first.



                • #9
                  I am not surprised that -asrol- gave you the same undesired results as -rangestat-. I think you have not clearly understood your problem. While you have described it as a rolling mean, excluding the self, if we examine the details of how you would calculate the first four observations, it becomes clear that this is not a rolling mean and it cannot be computed with either -asrol- or -rangestat-. In addition to the rolling time window for the yq variable, your step-by-step also invokes a second condition: you do not want to include forecasts that are preceded by the focal forecast. This introduces a second interval restriction into the mix, and neither -rangestat- nor -asrol- can accommodate that.

                  I spent some time trying to develop a solution, but was not completely able to do so, because while I can infer that you only want to include forecasts that precede the focal forecast, you are not consistent in how you handle forecasts (of the same firm and of the present or preceding quarter) that have the same chronological order (anndats) as the focal forecast. It seems that you include some, but not all, of those. It may be that there is information in the full data set that disambiguates which ones are to be included and which are not, but I do not see that information anywhere in the example data shown. I have tried one solution that includes all same-day forecasts and another that excludes them all. Each gives the answer you are seeking for most of the observations, but both give "wrong" answers for a proper subset of the forecasts that have same-day forecasts in the data.

                  If you can clarify this, I think I can solve your problem.
                  Last edited by Clyde Schechter; 13 Aug 2024, 16:23.



                  • #10
                    Originally posted by Clyde Schechter View Post
                    I am not surprised that -asrol- gave you the same undesired results as -rangestat-. [...] If you can clarify this, I think I can solve your problem.

                    Hi Clyde,
                    Thank you so much for your thoughtful reply. Please allow me to rephrase the calculation of the consensus forecast in my question:
                    1. The premise of the following steps is that all observations are already sorted by cusip, fpedats, and the corresponding announcement dates (anndats is not included in the example below).
                    2. Only preceding forecasts within the same cusip-fpedats (firm_fpe) group as the focal forecast are considered.
                    3. Only preceding forecasts in the current and previous "yq" of the focal forecast are considered.
                    4. The consensus forecast should be missing if
                      1. it is the very first forecast of the "yq" and "firm_fpe" pair (consider this observation from below: {"03837J10" -1.3 "2015-12-31" 223 5198 . -.87 -.87}); or
                      2. all preceding forecasts of the focal forecast are out of range (i.e., before the current and past yq), so there is nothing left to average (consider this observation from below: {"03837J10" -.49 "2016-12-31" 225 5199 . -.6 -.6}).
                    5. The focal forecast itself is excluded.

                    I probably could've used a better data example in my question. Here is a more comprehensive one:
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input str9 cusip float value str10 fpedats float(yq firm_fpe for_loop_mean) double(rangestat_mean asrol_mean)
                    "03837C10"  -.85 "2020-12-31" 240 5197         .               -.91               -.91
                    "03837C10"  -.73 "2020-12-31" 240 5197      -.85             -.95                -.95
                    "03837C10" -1.14 "2020-12-31" 240 5197      -.79               -.81               -.81
                    "03837C10"  -.85 "2020-12-31" 240 5197 -.9066667               -.91               -.91
                    "03837C10"  -.67 "2020-12-31" 241 5197    -.8925               -.79               -.66
                    "03837C10"  -.72 "2020-12-31" 241 5197     -.848               -.79               -.64
                    "03837C10"  -.63 "2020-12-31" 241 5197 -.8266667                -.8               -.67
                    "03837C10"  -.63 "2020-12-31" 241 5197 -.7985715                -.8               -.67
                    "03837C10"  -.55 "2020-12-31" 242 5197    -.6625               -.59                -.5
                    "03837C10" -.479 "2020-12-31" 242 5197      -.64                -.6               -.52
                    "03837C10"  -.55 "2020-12-31" 242 5197 -.6131667               -.59                -.5
                    "03837C10"  -.47 "2020-12-31" 242 5197 -.6041428                -.6               -.53
                    "03837C10"  -.42 "2020-12-31" 243 5197   -.51225               -.48               -.43
                    "03837C10"  -.43 "2020-12-31" 243 5197    -.4938               -.48               -.43
                    "03837C10"  -.43 "2020-12-31" 243 5197 -.4831667               -.48               -.43
                    "03837C10"  -.44 "2020-12-31" 243 5197 -.4755714               -.48               -.43
                    "03837J10"  -1.3 "2015-12-31" 223 5198         .               -.87               -.87
                    "03837J10"  -.87 "2015-12-31" 223 5198      -1.3               -1.3               -1.3
                    "03837J10"  -.49 "2016-12-31" 225 5199         .                -.6                -.6
                    "03837J10"   -.6 "2016-12-31" 225 5199      -.49               -.49               -.49
                    "03837J10"  -.73 "2016-12-31" 226 5199     -.545               -.64               -.74
                    "03837J10"  -.78 "2016-12-31" 226 5199 -.6066667               -.63               -.72
                    "03837J10"  -.71 "2016-12-31" 226 5199      -.65               -.65               -.75
                    "03837J10"   -.8 "2016-12-31" 227 5199      -.74               -.76               -.78
                    "03837J10"  -.78 "2016-12-31" 227 5199     -.755               -.76                -.8
                    "03837J10"  -.79 "2016-12-31" 227 5199      -.76               -.76               -.79
                    end
                    In this data example, all observations have been sorted by cusip, fpedats, and the corresponding announcement dates (anndats not included). This example includes two different firms and different fpedats. To better illustrate my calculation needs, for observation {"03837C10" -.43 "2020-12-31" 243 5197 -.4938 -.48 -.43}, its consensus forecast should be computed using preceding forecasts in yq 243 and 242 and in the same fpedats "2020-12-31". Thus, the consensus forecast should be the mean of (-.55, -.479, -.55, -.47, -.42), which is -.4938.
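Applying the rules listed above group by group to this richer example can be sketched in a few lines of Python (again purely an illustration I am using to pin down the logic; `rolling_consensus` is a made-up helper, and the rows are the example data in its existing sort order):

```python
# Illustration of the clarified rules: for each forecast, average all
# strictly preceding forecasts in the same firm_fpe group whose yq is
# the current or previous quarter; missing (None) when none qualify.
rows = [  # (firm_fpe, yq, value), already in announcement order
    (5197, 240, -0.85), (5197, 240, -0.73), (5197, 240, -1.14), (5197, 240, -0.85),
    (5197, 241, -0.67), (5197, 241, -0.72), (5197, 241, -0.63), (5197, 241, -0.63),
    (5197, 242, -0.55), (5197, 242, -0.479), (5197, 242, -0.55), (5197, 242, -0.47),
    (5197, 243, -0.42), (5197, 243, -0.43), (5197, 243, -0.43), (5197, 243, -0.44),
    (5198, 223, -1.3), (5198, 223, -0.87),
    (5199, 225, -0.49), (5199, 225, -0.6),
    (5199, 226, -0.73), (5199, 226, -0.78), (5199, 226, -0.71),
    (5199, 227, -0.8), (5199, 227, -0.78), (5199, 227, -0.79),
]

def rolling_consensus(rows):
    out = []
    for i, (g, yq, _) in enumerate(rows):
        prev = [v for (gg, q, v) in rows[:i] if gg == g and yq - 1 <= q <= yq]
        out.append(sum(prev) / len(prev) if prev else None)
    return out

means = rolling_consensus(rows)
# e.g. means[13] reproduces the -.4938 discussed above, and the first
# observation of each firm_fpe group comes out missing (None)
```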

                    for_loop_mean is generated using a for loop, which takes significantly longer to run but gives much better results:
                    Code:
                    gen prevailing_consensus_1QTR = .
                    forvalues i = 1/`=_N' {
                        * Get the current yq and firm_fpe
                        local cur_yq = yq[`i']
                        local cur_firm_fpe = firm_fpe[`i']
                    
                        * If the current yq is the first quarter in the data, skip the rolling average
                        if `i' > 1 & `cur_yq' > yq[1] {
                            * Get the previous yq (current yq - 1)
                            local prev_yq = `cur_yq' - 1
                    
                            * Compute the rolling average for the current observation:
                            * all previous observations within the same firm_fpe and the same or previous yq
                            quietly summarize value if firm_fpe == `cur_firm_fpe' & (yq == `prev_yq' | (yq == `cur_yq' & _n < `i'))
                            quietly replace prevailing_consensus_1QTR = r(mean) in `i'
                        }
                    }
                    rangestat_mean follows
                    Code:
                    rangestat (mean) rangestat_mean=value, interval(yq -1 0) by(firm_fpe) excludeself
                    and asrol_mean follows
                    Code:
                    bys firm_fpe: asrol value, win(yq -1 0) stat(mean) xf(focal) g(asrol_mean)
                    Please let me know if you need more clarifications! Much appreciated!



                    • #11
                      OK, so the precedence relationship was implicit in the existing sort order of the data. Hiding in plain sight, as it were. Knowing that:
                      Code:
                      //    MARK THE EXISTING SORT ORDER
                      gen `c(obs_t)' obs_no = _n
                      
                      //    COPY THE ESSENTIAL VARIABLES TO A SEPARATE FILE
                      preserve
                      keep obs_no cusip yq value
                      tempfile holding
                      save `holding'
                      
                      //    JOIN THE NEW FILE TO THE ORIGINAL ENFORCING THE CONSTRAINTS NEEDED
                      restore, preserve
                      rangejoin obs_no . -1 using `holding', by(cusip) // ENFORCE PRECEDENCE & SELF-EXCLUSION
                      keep if inlist(yq-yq_U, 0, 1) // ENFORCE 2 QUARTER WINDOW
                      
                      //    CALCULATE THE MEANS OF THE SURVIVING PAIRINGS FOR EACH OBS_NO
                      collapse (mean) wanted = value_U, by(obs_no)
                      save `"`holding'"', replace
                      
                      //    NOW MERGE THIS BACK TO THE ORIGINAL DATA
                      restore
                      merge 1:1 obs_no using `holding', assert(match master) nogenerate
                      -rangejoin- is written by Robert Picard and is available from SSC. Its use requires also having installed -rangestat-, which, based on the earlier posts in the thread, it appears you already have.
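The logic of this pipeline can be emulated step by step outside Stata. The following Python sketch (an illustration of what the code computes, not of how -rangejoin- is implemented) pairs each observation with every earlier observation of the same cusip, keeps pairs within the two-quarter window, and collapses to a mean per focal observation:

```python
# Illustration of the join-then-collapse logic above:
# 1) pair each obs with all earlier obs of the same cusip (obs_no . -1);
# 2) keep pairs with yq - yq_U in {0, 1} (the two-quarter window);
# 3) collapse to the mean of value_U by focal obs_no.
rows = [  # (obs_no, cusip, yq, value) in the original sort order
    (1, "92047K10", 180, -3.7), (2, "92047K10", 180, -5.3),
    (3, "92047K10", 180, -5.3), (4, "92047K10", 181, -5.6),
    (5, "92047K10", 181, -5.3), (6, "92047K10", 181, -7.0),
    (7, "92047K10", 181, -7.1), (8, "92047K10", 182, -4.3),
]

pairs = [(i, vu) for (i, c, q, _) in rows          # focal obs
         for (ju, cu, qu, vu) in rows              # candidate "using" obs
         if cu == c and ju < i and q - qu in (0, 1)]

wanted = {}
for i, vu in pairs:                                # collapse (mean) by obs_no
    wanted.setdefault(i, []).append(vu)
wanted = {i: sum(vs) / len(vs) for i, vs in wanted.items()}
# obs_no 1 never appears in `wanted`, i.e. it comes back missing
# after the final merge, exactly as the rule requires
```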



                      • #12
                        Originally posted by Clyde Schechter View Post
                        OK, so the precedence relationship was implicit in the existing sort order of the data. Hiding in plain sight, as it were. [...]
                        Thank you so much, Clyde. I will give it a try. But before that, I have a question regarding your code:
                        Code:
                        rangejoin obs_no . -1 using `holding', by(cusip) // ENFORCE PRECEDENCE & SELF-EXCLUSION

                        In my data, each cusip corresponds to multiple fpedats, and my calculation of the consensus forecast should be carried out within the same fpedats. With that in mind, should I swap -by(cusip)- for -by(firm_fpe)-?

                        Thank you in advance!



                        • #13
                          Originally posted by Clyde Schechter View Post
                          OK, so the precedence relationship was implicit in the existing sort order of the data. Hiding in plain sight, as it were. [...]
                          Hi Clyde,
                          I tried your code this morning. I realized that my laptop cannot handle -rangejoin- very well: my data has over 1.5 million observations, and my laptop freezes when running the -rangejoin- line. Do you think there could be a less computation-heavy approach, given the size of my data? Thank you so much in advance!



                          • #14
                            Re #12: Yes, if the included values should match on fpedats as well as cusip, then change -by(cusip)- to -by(cusip fpedats)-.

                            Do try again with this. If -rangejoin- is truly freezing up your computer, it is due to memory issues. Doing this -by(cusip fpedats)- will somewhat decrease the memory needed for this, especially if many cusips have many different fpedats associated with them.

                            But don't be so quick to assume that your computer is freezing up. With a data set this size, the computational task for -rangejoin- is very large and it may just be taking a long time to do it. There is no interim output from -rangejoin- to show that it is still running. And the way it works, it pretty much locks out attempts to do other things in Stata. My experience on this Forum is that people often seriously underestimate how long some calculations can take. If you are accustomed to calculations that complete in seconds or small numbers of minutes, a calculation that needs to run for days or weeks may be hard to imagine. But such calculations exist, and calculations that require pairing up observations can fall into that range. If you are running on a Windows machine, here's a way to tell if your computer is truly frozen, as opposed to just grinding out a very long calculation: open the Task Manager. Find the information about Stata. If the memory or CPU usage numbers are changing several times a minute, then Stata is still running. If those numbers are not changing, then there is a good chance Stata is frozen and you need to kill the process.

                            I cannot think of a less computation-intensive way to do this. You might be able to speed up the -collapse- part (although I'm guessing you haven't even gotten to that yet, because the really slow part will be -rangejoin-) by replacing -collapse- with user written -gcollapse-, part of the -gtools- suite available from SSC. But I doubt that will make an enormous difference. So I think your options are either to be very patient, or find a faster machine to run on.

                            But I do think you may be up against a really difficult memory problem. That is, you may find yourself waiting a long time for this to run, only for it to end with a "op. sys. refuses to provide memory" error. So one thing that would solve that problem is to break up your data set into a single data set per cusip, or even per cusip fpedats combination. Then run the code separately on each of those smaller data sets, and at the end, -append- all the results together. This will overcome the memory problem. It might also save time: the calculations for 100 cusips in a single file take much more than 100 times the calculations for a single cusip in a single file (ditto for cusip-fpedats combination). So calculation time will be substantially reduced this way. But, you have to do a lot more disk thrashing, which will slow you down. I don't have a good feel for which of these effects will win out overall. In part it depends on your hardware.



                            • #15
                              Thank you, Clyde! I was also thinking about dividing my data into several chunks and running on them separately. I will give it a try and update later. Thanks again!

