Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a reference price measure as a probability weighted sum of past stock prices

    Dear Statalist members,

    I am trying to create a reference price measure that is an average of past 250 days stock prices, weighted by the probabilities of a stock not being traded since the date of the corresponding price:
    reference_pricet=(pricet-250*turnover_ratiot-250*(1-turnover_ratiot-249)*(1-turnover_ratiot-248)*...*(1-turnover_ratiot-1)+pricet-249*turnover_ratiot-249*(1-turnover_ratiot-248)*(1-turnover_ratiot-247)*...*(1-turnover_ratiot-1)+...+pricet-1*turnover_ratiot-1)/k, where k is a constant that ensure price weights add up to 1 (see the formula below).
    Click image for larger version

Name:	reference price formula.png
Views:	1
Size:	3.2 KB
ID:	1329147

    I have written the following code, although as I am just a beginner in stata, it is far from optimized. Thus I would appreciate if you could give some tips on speeding up the code or even avoiding the loops, since with my fairly big data sample it takes quite some time to generate the variable. Attached I have enclosed a data sample for example.

    Code:
    qui{
    nois _dots 0, title(Loop running) reps(100)
    bys id (date): gen float volr=vol/freefloat            // generating turnover ratio
    bys id (date): gen float rev_volr=1-volr            
    bys id (date): gen total_volr=sum(log(rev_volr))
    
    forvalues i=250(-1)1 {
        bys id (date): gen ref_`i'=price[_n-`i']*vol[_n-`i']*exp(total_volr-total_volr[_n-`i'+1])
        nois _dots `i' 0
    }
    egen ref=rowtotal(ref_250-ref_1)
    
    forvalues i=1/250 {
        drop ref_`i'
    }
    
    forvalues i=250(-1)1 {
        bys id (date): gen k_`i'=vol[_n-`i']*exp(total_volr-total_volr[_n-`i'+1])
        nois _dots `i' 0
    }
    egen k=rowtotal(k_250-k_1)
    
    forvalues i=1/250 {
        drop k_`i'
    }
    
    gen float refprice=ref/k                                       // generating reference price measure
    drop ref k volr rev_volr total_volr
    bys id (date) gen float cgo=(price-refprice)/price   // generating capital gains overhang proxy
    }
    Thank you for your help in advance.

    Best regards,
    Artem Rusanov
    Attached Files

  • #2
    I can think of several tings to speed this up, but first I would like to clarify a few things that seem to me like possible errors.

    The commands that generate volr and rev_volr don't appear to involve the id grouping, so I don't know why you have them done with -by-. Either the -by- is unnecessary or you have left something out of those formulas. (The formula you pasted as a photo is too small to read on my computer, so I can't figure it out from there. Posting photos on this forum often works poorly.)

    Next, in -gen total_volr....- you are generating a sum of logarithms of rev_volr. So either you have used a somewhat confusing name for this variable or perhaps you meant to evaluate it some other way. Is this right hand side really what you wanted and you were just being very telegraphic with the variable name?

    Comment


    • #3
      Thank you for a quick reply, Clyde!

      You are right regarding generating the variables volr and rev_volr, grouping function is completely unnecessary. Moreover, I appologize for not naming the variable that generates sum of logarithms of rev_volr correctly, indeed it should have been "total_log_rev_volr". Please see below the bigger version of the formula picture and the corrected code.
      Click image for larger version

Name:	reference price formula.png
Views:	1
Size:	11.9 KB
ID:	1329171



      Code:
      qui{
      nois _dots 0, title(Loop running) reps(100)
      gen float volr=vol/freefloat
      gen float rev_volr=1-volr            
      bys id (date): gen total_log_rev_volr=sum(log(rev_volr))
      
      forvalues i=250(-1)1 {
          bys id (date): gen ref_`i'=price[_n-`i']*vol[_n-`i']*exp(total_log_rev_volr-total_log_rev_volr[_n-`i'+1])
          nois _dots `i' 0
      }
      egen ref=rowtotal(ref_250-ref_1)
      
      forvalues i=1/250 {
          drop ref_`i'
      }
      
      forvalues i=250(-1)1 {
          bys id (date): gen k_`i'=vol[_n-`i']*exp(total_log_rev_volr-total_log_rev_volr[_n-`i'+1])
          nois _dots `i' 0
      }
      egen k=rowtotal(k_250-k_1)
      
      forvalues i=1/250 {
          drop k_`i'
      }
      
      gen float refprice=ref/k
      bys id (date): gen float cgo=(price-refprice)/price
      drop ref k volr rev_volr total_volr refprice
      lab var cgo "Capital Gains Overhang Measure"
      }
      Last edited by Artem Rusanov; 03 Mar 2016, 12:38.

      Comment


      • #4
        The main inefficiency is that you calculate the same weight expression twice, once in the -gen ref_`i'price- loop and one in the -gen k_`i'- loop. You only need to do that once, to calculate k_`i', and then use k_`i' to calculate ref_`i'price. Since the expression you evaluate for k involves taking an exponential, it is fairly computationally intensive, so a noticeable savings should result from doing it only once instead of twice.

        The -_dots- commands are a slight time-waster as well. I understand that perhaps you need ongoing reassurance that Stata has not simply gotten hung up, but writing things to the display, while not quite as slow as writing them to a disk file, is still slow compared to computation. In each case where it appears, it may well be the most time-consuming command in its loop.

        Although it probably doesn't affect efficiency, I've rewritten the loop indices as i = 1/250. It's easier to understand the code that way, and I see nothing in the calculations that requires them to be done in reverse order.

        Using a single -drop- command probably doesn't save much time other than the time to parse the command repeatedly (which probably isn't noticeable in only 250 iterations), but, again I think it makes the code more understandable. So what I've come up with is:

        Code:
        quietly {
            gen float volr=vol/freefloat            // generating turnover ratio
            gen float rev_volr=1-volr            
            bys id (date): gen total_log_rev_volr=sum(log(rev_volr))
        }
        
        forvalues i = 1/250 {
            quietly {
                bys id (date): gen k_`i'=vol[_n-`i']*exp(total_log_rev_volr-total_log_rev_volr[_n-`i'+1])
                by id (date): gen ref_`i' = price[_n-`i']*k_`i'
            }
        }
        egen ref = rowtotal(ref_*)
        egen k = rowtotal(k_*)
        gen refprice = ref/k
        drop ref_* k_*
        There is one other possible area of improvement. Taking logs and exponentials is computationally expensive. Then again, addition is faster than multiplication. So it might be faster to calculate a running product directly, rather than summing logs and then exponentiating.

        Code:
        use "sample (2).dta", clear
        
        quietly {
            gen float volr=vol/freefloat            // generating turnover ratio
            gen float rev_volr=1-volr  
            bys id (date): gen product_rev_volr = 1 if _n == 1
            by id (date): replace product_rev_volr= rev_volr*product_rev_volr[_n-1] if _n > 1
        
        forvalues i = 1/250 {
            quietly {
                bys id (date): gen k_`i'=vol[_n-`i']*product_rev_volr[_n-`i'+1]
                by id (date): gen ref_`i' = price[_n-`i']*k_`i'
            }
        }
        egen ref = rowtotal(ref_*)
        egen k = rowtotal(k_*)
        gen refprice = ref/k
        drop ref_* k_*
        My instinct is that this should be appreciably faster in a very large data set. In the sample data it runs about 10% faster with running products, but with that small data set the difference is in hundredths of a second.
        Last edited by Clyde Schechter; 03 Mar 2016, 13:06. Reason: Correct error in explanation.

        Comment


        • #5
          Many thanks for the tips, Clyde!

          Comment


          • #6
            Hello. is there any way that we can further compress this code. It is taking a lot of time. I have a huge dataset of around 9 million observations. the system is running since 2 days but i have not yet got the results from above code.

            Comment


            • #7
              I have a question to Clyde Schechter . Why one bracket "}" is missing in the last code in #4 and if its a mistake then what is the appropriate position where we should put the missing bracket.

              Comment


              • #8
                Good pickup! It should be:

                Code:
                use "sample (2).dta", clear
                
                quietly {
                    gen float volr=vol/freefloat            // generating turnover ratio
                    gen float rev_volr=1-volr  
                    bys id (date): gen product_rev_volr = 1 if _n == 1
                    by id (date): replace product_rev_volr= rev_volr*product_rev_volr[_n-1] if _n > 1
                }
                
                forvalues i = 1/250 {
                    quietly {
                        bys id (date): gen k_`i'=vol[_n-`i']*product_rev_volr[_n-`i'+1]
                        by id (date): gen ref_`i' = price[_n-`i']*k_`i'
                    }
                }
                egen ref = rowtotal(ref_*)
                egen k = rowtotal(k_*)
                gen refprice = ref/k
                drop ref_* k_
                When there is no example data, I usually don't test the code, so little typos and omissions like that happen. That's one reason it's best to post example data with almost any question--you get better answers!

                Comment


                • #9
                  Where products are concerned I would use double types.

                  The egen calls are just going to entail loops that you have already been through. And you don't need the 500 new variables you are going to drop any way.

                  Hence, I would suggest (but this is also completely untested):

                  Code:
                  use "sample (2).dta", clear
                  
                  quietly {
                      gen float volr = vol/freefloat   
                      bys id (date): gen double product_rev_volr = 1 if _n == 1
                      by id (date): replace product_rev_volr= (1 - volr) * product_rev_volr[_n-1] if _n > 1
                  }
                  
                  gen double ref = 0 
                  gen double k = 0 
                  
                  quietly forvalues i = 1/250 {
                      by id (date): replace ref = ref + price[_n-`i'] * k_`i'
                      by id (date): replace k = k + vol[_n-`i'] * product_rev_volr[_n-`i'+1]
                  }
                  
                  gen double refprice = ref/k

                  Comment


                  • #10
                    Incomplete rewriting!

                    Code:
                    use "sample (2).dta", clear
                    
                    quietly {
                        gen float volr = vol/freefloat   
                        bys id (date): gen double product_rev_volr = 1 if _n == 1
                        by id (date): replace product_rev_volr= (1 - volr) * product_rev_volr[_n-1] if _n > 1
                    }
                    
                    gen double ref = 0 
                    gen double k = 0 
                    gen double work = . 
                    
                    quietly forvalues i = 1/250 {
                        by id (date): replace work = vol[_n-`i'] * product_rev_volr[_n-`i'+1]
                        by id (date): replace ref = ref + price[_n-`i'] * work 
                        by id (date): replace k = k + work                                    
                    }
                    
                    gen double refprice = ref/k

                    Comment


                    • #11
                      Thanks to both Nick Cox & Clyde Schechter. Just a minor addition of "*" in the end of Clyde Schechter's code in #8. The last line of the code should be this:
                      drop ref_* k_*

                      Comment


                      • #12
                        In the "sample.dta" dataset attached in #1, you see that there are lots of missing values for vol. Once a missing value is encountered, the rest of the running product becomes missing. I don't know what the implications are when a running sum of logs is used instead.

                        Comment


                        • #13
                          Without wanting to play here with data (the OP is no longer driving the questions any way) I note three possibilities:

                          Products going missing because a term is missing might be desired behaviour.

                          sum() ignores missings or equivalently treats them as zero.

                          If you wanted a homegrown product to ignore missings then each new term could be cond(missing(whatever), 1, whatever)

                          Comment


                          • #14
                            Hi everyone,

                            I have a similar issue with generating a variable that includes a sigma sign, as I did not find this as mathematical operator or function in STATA.
                            The formula I want to code for my data is :


                            Click image for larger version

Name:	Popp.PNG
Views:	2
Size:	32.2 KB
ID:	1438517





                            I hope the picture is sufficiently readable. Fundamentally, the question is wheter there is a way to tell STATA the sigma sign that should be run when generating the variable from the data.

                            I have a Panel dataset for yearly country data that I want to generate the variable for, to have a variable "Patent stock".


                            Patents year Country
                            80 2002 "AT"
                            97 2003 "AT"
                            101 2004 "AT"
                            119 2005 "AT"
                            115 2006 "AT"
                            152 2007 "AT"
                            135 2008 "AT"
                            177 2009 "AT"
                            217 2010 "AT"
                            226 2011 "AT"
                            11 2009 "BG"
                            2 2010 "BG"
                            9 2011 "BG"
                            13 2012 "BG"
                            7 2013 "BG"
                            3 2014 "BG"
                            1 2015 "BG"

                            For any helpful hints on how to efficiently generate this, I would be very grateful.

                            Thanks and best
                            Tobi

                            Comment

                            Working...
                            X