  • Calculate Consensus Forecast for the current and past quarters

    Dear forum,
    I'm currently working on a project involving equity analyst forecasts from the IBES database. I ran into some issues when computing the consensus forecast, and I hope I can get some help with it.

    The consensus forecast defined in my project is the mean of all forecasts (excluding the focal forecast) issued in the current and past quarter for each firm and forecast period end date. For the very first forecast issued for each firm in each forecast period end date, the consensus forecast should be left missing. Please see the example data below. Each observation is at the forecast level, and only the most recent forecast of each analyst is kept in each quarter for each company (i.e., no duplicate forecasts within a quarter). The variables are defined as follows:
    "cusip" is the firm identifier;
    "analys" is the analyst identifier;
    "value" is the forecast estimate issued by each analyst;
    "fpedats" is the forecast period end date (i.e., the fiscal period end date for which the forecast is issued);
    "anndats" is the forecast announcement date, used to sort forecasts by issuing time;
    "yq" is short for year-quarter and is used as the time variable;
    "firm_fpe" is a unique identifier for each cusip-fpedats group, generated using
    Code:
    egen firm_fpe=group(cusip fpedats)
    "rolling_avg" is the consensus forecast computed using
    Code:
    rangestat (mean) rolling_avg=value, interval(yq -1 0) by(firm_fpe) excludeself
    The issue I encountered is that rolling_avg produces incorrect results, even though I don't see anything wrong in my use of the rangestat command.

    Basically, for the example below, the consensus forecast for the first observation should be missing, because it is the very first forecast for the 66541 firm_fpe group. This is the first incorrect observation.
    Then, for the second observation, the consensus forecast should be -3.7, which is the only forecast issued before it.
    Then, for the third observation, the consensus forecast should be the mean of -3.7 and -5.3, which is -4.5, and this one is correct.
    Lastly, for the fourth observation, the consensus forecast should be -4.76 instead of -5.61. Honestly, I have no idea how that number was calculated.

    For the observation whose -yq- is 182 (the last observation), the consensus forecast should be the mean of the forecasts from -yq- 181 (because we only consider the current and past quarter and exclude the focal forecast), which is -6.25 instead of the -5.19 shown.
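To spell the intended rule out unambiguously, here is a minimal sketch in Python (an illustration of the logic only, not of rangestat's internals; the `consensus` helper is a made-up name): for each forecast, it averages all strictly preceding forecasts of the same firm_fpe group whose yq is the current or previous quarter.

```python
# Illustration only (not Stata/rangestat): the intended consensus rule,
# computed by brute force over the example rows, which are already
# sorted by announcement date within the firm_fpe group.
rows = [  # (firm_fpe, yq, value)
    (66541, 180, -3.7), (66541, 180, -5.3), (66541, 180, -5.3),
    (66541, 181, -5.6), (66541, 181, -5.3), (66541, 181, -7.0),
    (66541, 181, -7.1), (66541, 182, -4.3),
]

def consensus(rows):
    out = []
    for i, (g, yq, _) in enumerate(rows):
        # all strictly preceding forecasts of the same firm_fpe group
        # issued in the current or immediately preceding quarter
        prev = [v for (gg, q, v) in rows[:i] if gg == g and yq - 1 <= q <= yq]
        out.append(sum(prev) / len(prev) if prev else None)  # None = missing
    return out

# consensus(rows) starts None, -3.7, -4.5, ... and ends with -6.25
```

This reproduces the expected values described above: missing for the first observation, -3.7 for the second, -4.5 for the third, and -6.25 for the last.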

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 cusip long analys float value str10 fpedats float(anndats yq firm_fpe) double rolling_avg
    "92047K10" 25170 -3.7 "2006-01-31" 16524 180 66541  -5.300000190734863
    "92047K10" 71346 -5.3 "2006-01-31" 16524 180 66541 -4.5000001192092896
    "92047K10" 52780 -5.3 "2006-01-31" 16525 180 66541 -4.5000001192092896
    "92047K10" 78488 -5.6 "2006-01-31" 16561 181 66541  -5.616666754086812
    "92047K10" 25170 -5.3 "2006-01-31" 16561 181 66541  -5.666666706403096
    "92047K10" 52780   -7 "2006-01-31" 16576 181 66541  -5.383333404858907
    "92047K10" 71346 -7.1 "2006-01-31" 16576 181 66541  -5.366666754086812
    "92047K10" 71346 -4.3 "2006-01-31" 16658 182 66541  -5.187499940395355
    end
    format %d anndats
    I couldn't figure out whether the incorrect calculation is due to my coding or to using the wrong command. Hopefully, I could get some help.

    I hope the question is clear, and I would very much appreciate your help. Please let me know if you need any clarification. Thanks in advance!

  • #2
    Something like this maybe.

    Code:
    asrol value , stat(mean) window(anndats 2) min(2) g(valuex) by(cusip)
    Are you averaging by cusip or across cusip (as in the dataex)? And all the dates are the same too, making it a bit tricky to use the dataex.



    • #3
      -rangestat- might work better with by(cusip) rather than by(firm_fpe) (which includes the date).



      • #4
        Hi George,
        Thank you for your reply. I will give -asrol- a try.
        I am averaging within cusip and forecast period end dates (fpedats) because I am trying to compute the consensus forecast for each firm and each fpedats.
        I thought creating firm_fpe would make things a lot easier, since I am computing the consensus forecast for each firm and each fpedats.
        In my dataset, I have many different "cusip" values and dates. My dataex only illustrates several incorrect calculations from rangestat.



        • #5
          egenmore might work, but I'm not sure you can combine by with its ma function.
          HTML Code:
          https://www.stata.com/support/faqs/statistics/moving-averages-and-panel-data/



          • #6
            Originally posted by George Ford View Post
            egenmore might work, but not sure you can by ma.
            HTML Code:
            https://www.stata.com/support/faqs/statistics/moving-averages-and-panel-data/
            Hi George,
            Thank you. Strictly speaking, my data is not panel data, since there are repeated time values (multiple analysts may issue forecasts in the same quarter).



            • #7
              Would it be easier to collapse those to a single number? Or are you interested in the variation? (You could collapse the sd too.)



              • #8
                Originally posted by George Ford View Post
                be easier to collapse those to a single number? or are you interested in the variation? (could collapse the sd too).
                Hi George,
                The data is at the forecast level, and different analysts may issue forecasts on the same day. Collapsing the data would distort it.
                Also, I tried the -asrol- command. It produces the same errors as -rangestat-, even though -asrol- doesn't require the data to be declared as a panel first.



                • #9
                  I am not surprised that -asrol- gave you the same undesired results as -rangestat-. I think you have not clearly understood your problem. While you have described it as a rolling mean, excluding the self, if we examine the details of how you would calculate the first four observations, it becomes clear that this is not a rolling mean and it cannot be computed with either -asrol- or -rangestat-. In addition to the rolling time window for the yq variable, your step-by-step also invokes a second condition: you do not want to include forecasts that are preceded by the focal forecast. This introduces a second interval restriction into the mix, and neither -rangestat- nor -asrol- can accommodate that.

                  I spent some time trying to develop a solution, but was not completely able to do so, because while I can infer that you only want to include forecasts that precede the focal forecast, you are not consistent in how you handle forecasts (of the same firm and of the present or preceding quarter) that have the same chronological order (anndats) as the focal forecast. It seems that you include some, but not all, of those. It may be that there is information in the full data set that disambiguates which ones are to be included and which are not, but I do not see that information anywhere in the example data shown. I have tried one solution that includes all same-day forecasts and another that excludes them all. Each gives the answer you are seeking for most of the observations, but both give "wrong" answers for a proper subset of the forecasts that have same-day forecasts in the data.

                  If you can clarify this, I think I can solve your problem.
                  Last edited by Clyde Schechter; 13 Aug 2024, 16:23.



                  • #10
                    Originally posted by Clyde Schechter View Post
                    I am not surprised that -asrol- gave you the same undesired results as -rangestat-. [...] If you can clarify this, I think I can solve your problem.

                    Hi Clyde,
                    Thank you so much for your thoughtful reply. Please allow me to rephrase the calculation of the consensus forecast in my question:
                    1. The premise of the following steps is that all observations are already sorted by cusip, fpedats, and the corresponding announcement dates (anndats is not included in the example below).
                    2. Only preceding forecasts within the same cusip-fpedats (firm_fpe) group as the focal forecast are considered.
                    3. Only preceding forecasts in the current and previous "yq" of the focal forecast are considered.
                    4. The consensus forecast should be missing if
                      1. it is the very first forecast of the "yq" and "firm_fpe" pair (consider this observation from below: {"03837J10" -1.3 "2015-12-31" 223 5198 . -.87 -.87}); or
                      2. all preceding forecasts of the focal forecast are out of range (i.e., before the current and past yq), so there is nothing left to average (consider this observation from below: {"03837J10" -.49 "2016-12-31" 225 5199 . -.6 -.6}).
                    5. The focal forecast itself is excluded.

                    I probably could've used a better data example in my question. Here is a more comprehensive one:
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input str9 cusip float value str10 fpedats float(yq firm_fpe for_loop_mean) double(rangestat_mean asrol_mean)
                    "03837C10"  -.85 "2020-12-31" 240 5197         .               -.91               -.91
                    "03837C10"  -.73 "2020-12-31" 240 5197      -.85             -.95                -.95
                    "03837C10" -1.14 "2020-12-31" 240 5197      -.79               -.81               -.81
                    "03837C10"  -.85 "2020-12-31" 240 5197 -.9066667               -.91               -.91
                    "03837C10"  -.67 "2020-12-31" 241 5197    -.8925               -.79               -.66
                    "03837C10"  -.72 "2020-12-31" 241 5197     -.848               -.79               -.64
                    "03837C10"  -.63 "2020-12-31" 241 5197 -.8266667                -.8               -.67
                    "03837C10"  -.63 "2020-12-31" 241 5197 -.7985715                -.8               -.67
                    "03837C10"  -.55 "2020-12-31" 242 5197    -.6625               -.59                -.5
                    "03837C10" -.479 "2020-12-31" 242 5197      -.64                -.6               -.52
                    "03837C10"  -.55 "2020-12-31" 242 5197 -.6131667               -.59                -.5
                    "03837C10"  -.47 "2020-12-31" 242 5197 -.6041428                -.6               -.53
                    "03837C10"  -.42 "2020-12-31" 243 5197   -.51225               -.48               -.43
                    "03837C10"  -.43 "2020-12-31" 243 5197    -.4938               -.48               -.43
                    "03837C10"  -.43 "2020-12-31" 243 5197 -.4831667               -.48               -.43
                    "03837C10"  -.44 "2020-12-31" 243 5197 -.4755714               -.48               -.43
                    "03837J10"  -1.3 "2015-12-31" 223 5198         .               -.87               -.87
                    "03837J10"  -.87 "2015-12-31" 223 5198      -1.3               -1.3               -1.3
                    "03837J10"  -.49 "2016-12-31" 225 5199         .                -.6                -.6
                    "03837J10"   -.6 "2016-12-31" 225 5199      -.49               -.49               -.49
                    "03837J10"  -.73 "2016-12-31" 226 5199     -.545               -.64               -.74
                    "03837J10"  -.78 "2016-12-31" 226 5199 -.6066667               -.63               -.72
                    "03837J10"  -.71 "2016-12-31" 226 5199      -.65               -.65               -.75
                    "03837J10"   -.8 "2016-12-31" 227 5199      -.74               -.76               -.78
                    "03837J10"  -.78 "2016-12-31" 227 5199     -.755               -.76                -.8
                    "03837J10"  -.79 "2016-12-31" 227 5199      -.76               -.76               -.79
                    end
                    In this data example, all observations have been sorted by cusip, fpedats, and the corresponding announcement dates (anndats not included). This example includes two different firms and different fpedats. To better illustrate my calculation needs, for observation {"03837C10" -.43 "2020-12-31" 243 5197 -.4938 -.48 -.43}, its consensus forecast should be computed using preceding forecasts in yq 243 and 242 and in the same fpedats "2020-12-31". Thus, the consensus forecast should be the mean of (-.55, -.479, -.55, -.47, -.42), which is -.4938.
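Applying the rules listed above group by group to this richer example can be sketched in a few lines of Python (again purely an illustration I am using to pin down the logic; `rolling_consensus` is a made-up helper, and the rows are the example data in its existing sort order):

```python
# Illustration of the clarified rules: for each forecast, average all
# strictly preceding forecasts in the same firm_fpe group whose yq is
# the current or previous quarter; missing (None) when none qualify.
rows = [  # (firm_fpe, yq, value), already in announcement order
    (5197, 240, -0.85), (5197, 240, -0.73), (5197, 240, -1.14), (5197, 240, -0.85),
    (5197, 241, -0.67), (5197, 241, -0.72), (5197, 241, -0.63), (5197, 241, -0.63),
    (5197, 242, -0.55), (5197, 242, -0.479), (5197, 242, -0.55), (5197, 242, -0.47),
    (5197, 243, -0.42), (5197, 243, -0.43), (5197, 243, -0.43), (5197, 243, -0.44),
    (5198, 223, -1.3), (5198, 223, -0.87),
    (5199, 225, -0.49), (5199, 225, -0.6),
    (5199, 226, -0.73), (5199, 226, -0.78), (5199, 226, -0.71),
    (5199, 227, -0.8), (5199, 227, -0.78), (5199, 227, -0.79),
]

def rolling_consensus(rows):
    out = []
    for i, (g, yq, _) in enumerate(rows):
        prev = [v for (gg, q, v) in rows[:i] if gg == g and yq - 1 <= q <= yq]
        out.append(sum(prev) / len(prev) if prev else None)
    return out

means = rolling_consensus(rows)
# e.g. means[13] reproduces the -.4938 discussed above, and the first
# observation of each firm_fpe group comes out missing (None)
```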

                    for_loop_mean is generated using a for loop, which takes significantly longer to run but gives much better results:
                    Code:
                    gen prevailing_consensus_1QTR = .
                    forvalues i = 1/`=_N' {
                        * Get the current yq and firm_fpe
                        local cur_yq = yq[`i']
                        local cur_firm_fpe = firm_fpe[`i']
                    
                        * If the current yq is the first quarter in the data, skip the rolling average
                        if `i' > 1 & `cur_yq' > yq[1] {
                            * Get the previous yq (current yq - 1)
                            local prev_yq = `cur_yq' - 1
                    
                            * Compute the rolling average for the current observation:
                            * all previous observations within the same firm_fpe and the same or previous yq
                            quietly summarize value if firm_fpe == `cur_firm_fpe' & (yq == `prev_yq' | (yq == `cur_yq' & _n < `i'))
                            quietly replace prevailing_consensus_1QTR = r(mean) in `i'
                        }
                    }
                    rangestat_mean follows
                    Code:
                    rangestat (mean) rangestat_mean=value, interval(yq -1 0) by(firm_fpe) excludeself
                    and asrol_mean follows
                    Code:
                    bys firm_fpe: asrol value, win(yq -1 0) stat(mean) xf(focal) g(asrol_mean)
                    Please let me know if you need more clarifications! Much appreciated!



                    • #11
                      OK, so the precedence relationship was implicit in the existing sort order of the data. Hiding in plain sight, as it were. Knowing that:
                      Code:
                      //    MARK THE EXISTING SORT ORDER
                      gen `c(obs_t)' obs_no = _n
                      
                      //    COPY THE ESSENTIAL VARIABLES TO A SEPARATE FILE
                      preserve
                      keep obs_no cusip yq value
                      tempfile holding
                      save `holding'
                      
                      //    JOIN THE NEW FILE TO THE ORIGINAL ENFORCING THE CONSTRAINTS NEEDED
                      restore, preserve
                      rangejoin obs_no . -1 using `holding', by(cusip) // ENFORCE PRECEDENCE & SELF-EXCLUSION
                      keep if inlist(yq-yq_U, 0, 1) // ENFORCE 2 QUARTER WINDOW
                      
                      //    CALCULATE THE MEANS OF THE SURVIVING PAIRINGS FOR EACH OBS_NO
                      collapse (mean) wanted = value_U, by(obs_no)
                      save `"`holding'"', replace
                      
                      //    NOW MERGE THIS BACK TO THE ORIGINAL DATA
                      restore
                      merge 1:1 obs_no using `holding', assert(match master) nogenerate
                      -rangejoin- is written by Robert Picard and is available from SSC. Its use requires also having installed -rangestat-, which, based on the earlier posts in the thread, it appears you already have.
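The logic of this pipeline can be emulated step by step outside Stata. The following Python sketch (an illustration of what the code computes, not of how -rangejoin- is implemented) pairs each observation with every earlier observation of the same cusip, keeps pairs within the two-quarter window, and collapses to a mean per focal observation:

```python
# Illustration of the join-then-collapse logic above:
# 1) pair each obs with all earlier obs of the same cusip (obs_no . -1);
# 2) keep pairs with yq - yq_U in {0, 1} (the two-quarter window);
# 3) collapse to the mean of value_U by focal obs_no.
rows = [  # (obs_no, cusip, yq, value) in the original sort order
    (1, "92047K10", 180, -3.7), (2, "92047K10", 180, -5.3),
    (3, "92047K10", 180, -5.3), (4, "92047K10", 181, -5.6),
    (5, "92047K10", 181, -5.3), (6, "92047K10", 181, -7.0),
    (7, "92047K10", 181, -7.1), (8, "92047K10", 182, -4.3),
]

pairs = [(i, vu) for (i, c, q, _) in rows          # focal obs
         for (ju, cu, qu, vu) in rows              # candidate "using" obs
         if cu == c and ju < i and q - qu in (0, 1)]

wanted = {}
for i, vu in pairs:                                # collapse (mean) by obs_no
    wanted.setdefault(i, []).append(vu)
wanted = {i: sum(vs) / len(vs) for i, vs in wanted.items()}
# obs_no 1 never appears in `wanted`, i.e. it comes back missing
# after the final merge, exactly as the rule requires
```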



                      • #12
                        Originally posted by Clyde Schechter View Post
                        OK, so the precedence relationship was implicit in the existing sort order of the data. Hiding in plain sight, as it were. [...]
                        Thank you so much, Clyde. I will give it a try. But before that, I have a question regarding your code:
                        Code:
                        rangejoin obs_no . -1 using `holding', by(cusip) // ENFORCE PRECEDENCE & SELF-EXCLUSION

                        In my data, each cusip corresponds to multiple fpedats, and my calculation of the consensus forecast should be carried out within the same fpedats. With that in mind, should I swap -by(cusip)- for -by(firm_fpe)-?

                        Thank you in advance!



                        • #13
                          Originally posted by Clyde Schechter View Post
                          OK, so the precedence relationship was implicit in the existing sort order of the data. Hiding in plain sight, as it were. [...]
                          Hi Clyde,
                          I tried your code this morning. I realized that my laptop cannot handle -rangejoin- very well: my data has over 1.5 million observations, and my laptop freezes when running the -rangejoin- line. Do you think there could be a less computation-heavy approach, given the size of my data? Thank you so much in advance!



                          • #14
                            Re #12: Yes, if the included values should match on fpedats as well as cusip, then change -by(cusip)- to -by(cusip fpedats)-.

                            Do try again with this. If -rangejoin- is truly freezing up your computer, it is due to memory issues. Doing this -by(cusip fpedats)- will somewhat decrease the memory needed for this, especially if many cusips have many different fpedats associated with them.

                            But don't be so quick to assume that your computer is freezing up. With a data set this size, the computational task for -rangejoin- is very large and it may just be taking a long time to do it. There is no interim output from -rangejoin- to show that it is still running. And the way it works, it pretty much locks out attempts to do other things in Stata. My experience on this Forum is that people often seriously underestimate how long some calculations can take. If you are accustomed to calculations that complete in seconds or small numbers of minutes, a calculation that needs to run for days or weeks may be hard to imagine. But such calculations exist, and calculations that require pairing up observations can fall into that range. If you are running on a Windows machine, here's a way to tell if your computer is truly frozen, as opposed to just grinding out a very long calculation: open the Task Manager. Find the information about Stata. If the memory or CPU usage numbers are changing several times a minute, then Stata is still running. If those numbers are not changing, then there is a good chance Stata is frozen and you need to kill the process.

                            I cannot think of a less computation-intensive way to do this. You might be able to speed up the -collapse- part (although I'm guessing you haven't even gotten to that yet, because the really slow part will be -rangejoin-) by replacing -collapse- with user written -gcollapse-, part of the -gtools- suite available from SSC. But I doubt that will make an enormous difference. So I think your options are either to be very patient, or find a faster machine to run on.

                            But I do think you may be up against a really difficult memory problem. That is, you may find yourself waiting a long time for this to run, only for it to end with a "op. sys. refuses to provide memory" error. So one thing that would solve that problem is to break up your data set into a single data set per cusip, or even per cusip fpedats combination. Then run the code separately on each of those smaller data sets, and at the end, -append- all the results together. This will overcome the memory problem. It might also save time: the calculations for 100 cusips in a single file take much more than 100 times the calculations for a single cusip in a single file (ditto for cusip-fpedats combination). So calculation time will be substantially reduced this way. But, you have to do a lot more disk thrashing, which will slow you down. I don't have a good feel for which of these effects will win out overall. In part it depends on your hardware.



                            • #15
                              Thank you, Clyde! I was also thinking about dividing my data into several chunks and running on them separately. I will give it a try and update later. Thanks again!

