Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate median value of past 5 years observations

    I need to calculate the following. Generate variable “High” = 1 if median “inflation” over the past 5 years is higher than the median over the whole sample. For example, in the example below the median inflation for the sample is 15.5. For the year 2005 “High”=0 since median for the past 5 years (2000-2004) is less than the sample median (15.5). Basically, I don’t know how to calculate median of past 5 years inflation level for each year. Please, help me with this issue.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(gvkey fyear infl)
    1 2000 10
    1 2001 11
    1 2002 12
    1 2003 13
    1 2004 14
    1 2005 15
    1 2006 16
    1 2007 17
    1 2008 18
    1 2009 19
    1 2010 20
    1 2011 21
    end

  • #2
    Look for the -mvsumm- package on SSC (written by Nick Cox and Chris Baum).

    Here is how you could do that:

    Code:
    ssc install mvsumm
    
    * Some random stuff
    clear
    set obs 30
    gen yr=1979+_n
    tsset yr, yearly
    gen x=rexponential(1)
    
    * Compute the moving median
    mvsumm x, gen(y) s(med) w(5)
    gen z=L3.y
    
    * Compare with overall median
    sum x, detail
    gen high=z>r(p50)
    replace high=. in 1/5
    Notice the lag in -gen z=L3.y-, so that the median is aligned with the observation following the window.

    HTH

    Jean-Claude Arbaut
    Last edited by Jean-Claude Arbaut; 07 May 2018, 08:44.

    Comment


    • #3
      You can do this with rangestat (from SSC). I'll assume that you have panel data so you want high defined using observations within the same gvkey group.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(gvkey fyear infl)
      1 2000 10
      1 2001 11
      1 2002 12
      1 2003 13
      1 2004 14
      1 2005 15
      1 2006 16
      1 2007 17
      1 2008 18
      1 2009 19
      1 2010 20
      1 2011 21
      end
      
      egen whole_median = median(infl)
      rangestat (median) infl (count) infl, interval(fyear -4 0) by(gvkey)
      gen high = infl_median > whole_median if infl_count == 5
      and the results:
      Code:
      . list, sepby(gvkey)
      
           +--------------------------------------------------------------+
           | gvkey   fyear   infl   whole_~n   infl_m~n   infl_c~t   high |
           |--------------------------------------------------------------|
        1. |     1    2000     10       15.5         10          1      . |
        2. |     1    2001     11       15.5       10.5          2      . |
        3. |     1    2002     12       15.5         11          3      . |
        4. |     1    2003     13       15.5       11.5          4      . |
        5. |     1    2004     14       15.5         12          5      0 |
        6. |     1    2005     15       15.5         13          5      0 |
        7. |     1    2006     16       15.5         14          5      0 |
        8. |     1    2007     17       15.5         15          5      0 |
        9. |     1    2008     18       15.5         16          5      1 |
       10. |     1    2009     19       15.5         17          5      1 |
       11. |     1    2010     20       15.5         18          5      1 |
       12. |     1    2011     21       15.5         19          5      1 |
           +--------------------------------------------------------------+
      
      .

      Comment


      • #4
        For rolling window statistics, you can also use asrol (can be downloaded from SSC)

        Code:
        ssc install asrol
        bys gvkey : asrol infl, stat(median) window(fyear 5)
        . list gvkey fyear infl median5_infl
        
             +---------------------------------+
             | gvkey   fyear   infl   median~l |
             |---------------------------------|
          1. |     1    2000     10         10 |
          2. |     1    2001     11       10.5 |
          3. |     1    2002     12         11 |
          4. |     1    2003     13       11.5 |
          5. |     1    2004     14         12 |
             |---------------------------------|
          6. |     1    2005     15         13 |
          7. |     1    2006     16         14 |
          8. |     1    2007     17         15 |
          9. |     1    2008     18         16 |
         10. |     1    2009     19         17 |
             |---------------------------------|
         11. |     1    2010     20         18 |
         12. |     1    2011     21         19 |
             +---------------------------------+
        Regards
        --------------------------------------------------
        Attaullah Shah, PhD.
        Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
        FinTechProfessor.com
        https://asdocx.com
        Check out my asdoc program, which sends outputs to MS Word.
        For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

        Comment


        • #5
          Well if we are going for alternative solutions, you can also use tsegen (from SSC) to do the same:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(gvkey fyear infl)
          1 2000 10
          1 2001 11
          1 2002 12
          1 2003 13
          1 2004 14
          1 2005 15
          1 2006 16
          1 2007 17
          1 2008 18
          1 2009 19
          1 2010 20
          1 2011 21
          end
          
          tsset gvkey fyear
          tsegen median5 = rowmedian(L(0/4).infl)
          list
          and the results
          Code:
          . list
          
               +--------------------------------+
               | gvkey   fyear   infl   median5 |
               |--------------------------------|
            1. |     1    2000     10        10 |
            2. |     1    2001     11      10.5 |
            3. |     1    2002     12        11 |
            4. |     1    2003     13      11.5 |
            5. |     1    2004     14        12 |
               |--------------------------------|
            6. |     1    2005     15        13 |
            7. |     1    2006     16        14 |
            8. |     1    2007     17        15 |
            9. |     1    2008     18        16 |
           10. |     1    2009     19        17 |
               |--------------------------------|
           11. |     1    2010     20        18 |
           12. |     1    2011     21        19 |
               +--------------------------------+
          
          .
          As to which one to use, tsegen is going to be faster for shorter windows but I generally recommend rangestat (from SSC) as it provides a general solution to all problems of statistics over any data interval (of which a fixed window of time is a subset thereof).

          With respect to asrol as an alternative, let me point out that the complete source code for both tsegen and rangestat is fully viewable by any user, including Attaullah Shah. For a while, asrol was simply an adaptation without attribution of tsegen's code under a different interface and I called him on that here. About a year after rangestat came out, fully coded in Mata, asrol moved to Mata code. Since Attaullah Shah has decided not to make the source code for asrol available, no one can know for sure how much of it is original but, as far as I know, asrol still does nothing more than copy some of the functionality of rangestat.

          Attaullah Shah pulled the same stunt with asreg (from SSC), released after I showed many times on this list how to perform an OLS on a rolling window using and external Mata function and after I made it a built-in function in rangestat.

          Please use rangestat or tsegen for these types of problems.

          Comment


          • #6
            First of all, the tone you have used and the blatant allegations you have levelled against me are not the netiquettes I have observed in this forum. My reason for not showing the code is the same as that of the Stata Corporation's, hiding the trade secret. But that never means that I have plagiarized your code. There are indirect ways to see how both the programs work and how they are different. asrol uses several routines for rolling window calculations. Each routine is intelligently picked by asrol according to the data structure, while rangestat uses a fixed method. Therefore, rangestat falls behind in speed when compared on the basis of different data structures. Consider the following example when we use a balanced panel data:

            Code:
            clear
            set obs 1000
            gen industry=_n
            gen year=_n+1917
            expand 5
            bys industry: gen country=_n
            expand 1000
            bys ind: gen company=_n
            gen profit=uniform()
            timer clear
            
            . timer on 1
            
            . bys company: asrol profit, s(mean) w( year 100)
            
            . timer off 1
            
            .
            . timer on 2
            
            . rangestat (mean) profit, interval(year -99 0) by(company)
            
            .
            . timer off 2
            
            . timer list
               1:     37.31 /        1 =      37.3110 // asrol time
               2:     60.45 /        1 =      60.4530 // rangestat time
            This issue was also highlighted in this post, and then you people quickly took action to improve the rangestat speed for duplicate observations in its next update. Did I copy the idea before you could materialize it?

            Even though you worked to improve the speed of rangestat for handling duplicate observations, it still falls behind asrol for duplicate data in the majority of cases. I present some tests and their results below.

            Code:
             mat S = J(100,3,.)
            loc r = 1
            forv i = 100 (100) 10000 {
            clear
            set obs 2000
            gen company = _n
            expand `i'
            bys company: gen year = _n+1900
            gen profit = uniform()
            
            timer clear
            timer on 1
            bys company: asrol profit, s(mean) w( year 100)
            timer off 1
            
            timer on 2
            rangestat (mean) profit, interval(year -99 0) by(company)
            
            timer off 2
            timer list
            mat S[`r',1] = `i'
            mat S[`r',2] = `r(t1)'
            mat S[`r',3] = `r(t2)'
            loc `++r'
            }
            
            After 19, runs, I stopped the loop. Following are the results
            
                                   Time in Seconds          
            time_duplicated asrol rangestat diff_percentage
            100 1.208 2.023 -40.29%
            200 2.741 4.387 -37.52%
            300 4.247 6.818 -37.71%
            400 5.87 9.254 -36.57%
            500 7.332 11.693 -37.30%
            600 8.892 15.53 -42.74%
            700 10.664 17.033 -37.39%
            800 12.395 19.63 -36.86%
            900 14.012 21.898 -36.01%
            1000 15.034 23.967 -37.27%
            1100 16.764 27.073 -38.08%
            1200 18.899 29.464 -35.86%
            1300 20.151 31.938 -36.91%
            1400 21.782 34.654 -37.14%
            1500 23.232 36.856 -36.97%
            1600 24.404 38.995 -37.42%
            1700 26.065 41.685 -37.47%
            1800 28.354 44.27 -35.95%
            1900 29.468 46.428 -36.53%
            Off course, more variations are possible, which I have not tested. However, the point is that asrol is not a copy.
            Now coming to your allegation against asreg, there is nothing special that you have for me to copy from. You adopted the mata code from this blog post (you have mentioned yourself this several times) and so did I. There is no rocket science in making a mata code for OLS components. What is special about asreg is again its speed of rolling window, which is based on the asrol algorithm. Yes, asreg performs Fama and MacBeth procedure, newey standard errors, fitted values and residuals, which off course are not part of rangestat.

            Please note that my programs and I will stay here, no matter how much bullying you have in mind for me. I would humbly request to stop all blatant allegations and let the users choose for themselves from the given alternatives. Let us maintain the pleasant atmosphere here. Live and let live. Kind regards
            Last edited by Attaullah Shah; 08 May 2018, 14:39.
            Regards
            --------------------------------------------------
            Attaullah Shah, PhD.
            Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
            FinTechProfessor.com
            https://asdocx.com
            Check out my asdoc program, which sends outputs to MS Word.
            For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

            Comment


            • #7
              I said my piece and I have no interest in debating you on this.

              Comment


              • #8
                For completeness: you can get the moving median using egen mmed, found at net from http://www.stata.com/users/kcrow.

                Using the same dataset as in my answer above, do

                Code:
                egen y2=mmed(x), lag(4)
                gen z2=L.y2
                gen high2=z2>r(p50)
                Note: egen mmed had a little bug until yesterday, making it fail on Stata 15. You will need to reinstall the package if you have an older version.

                Hope this helps

                Jean-Claude Arbaut

                Comment

                Working...
                X