Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Group and label variables (perhaps using -rangejoin-)

    Hello,

    my following problem seems to me as being pretty complicated. However, I will try to explain it as goood and structured as possible. I hope you can help me.

    My dataset consists of the following variables: fund (identifier for each fund) mdate (yearmonth) number (number of stocks the fund holds) stock_id (identifier for each stock) markcap (market capitalization of each fund. Telling me basically how much the stock is worth at the stock market).
    So if fund A holds lets say 5 stocks at January2005, there will be 5 observations for fund A in January2005, given that each stock is considered seperately within the stock_id column.

    My goal: Each year in June (overall period is January2005 till December2018), use the median of markcap to split the stocks in two groups, named "small" and "big". If markcap > median = "big". If markcap < median = "small".

    If markcap > median in June(t), the stock should then be labelled as "big" from July(t) till June(t+1). Then in June (t+1) stocks are divided again in "small"/"big" and then labelled from July(t+1) till June (t+2) and so on.

    However, since one stock can be hold by multiple funds at each date, there are duplicates stock_id mdate markcap in my dataset. Therefore, before dividing the stocks into "small" and "big", I do need to exclude these duplicates so that every stock is considered only once per mdate. Therefore, I would either need to mark the duplicates and exclude them when performing the median calculation (but I do not know how to do this), or I copy the dataset, delete all duplicates, do the median calculation and labelling and then merge the datasets back together.

    I am sorry that I can not provide one single code so far. But my stata knowledge for now is simply not good enough to know how to solve this problem. I guess for the labelling from July(t) till June(t+1) I could use rangejoin or a code that looks something like this (assuming that I did already define a small_big variable, indicating to which group each stock_id belongs)?

    Code:
    by stock_id(mdate), sort: replace small_big = small_big [_n-1] if missing(small_big)

    And the creation of the small_big variable could perhaps look something like this (assuming that I did already create a markcap decile for each stock_id per year in June). Eventhough I am not sure if using deciles is the smartest way.
    Code:
    label define small_big 0 "small" 1 "big"
    gen byte small_big= 1 if inlist(markcap_decile_june, 6,7,8,9, 10)
    replace small_big= 0 if inlist(marpcap_decile_june, 1, 2,3,4,5)
    label values small_big small_big
    I would reall appreciate any help a lot.
    Thank you very much.

    Tim Wolf

  • #2
    markcap (market capitalization of each fund. Telling me basically how much the stock is worth at the stock market).
    Surely you mean that markcap is the market capitalization of each stock, not of each fund?

    I am sorry that I can not provide one single code so far
    You can, however, provide an example of your data, as suggested in the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. (Section 12.2) Even the best descriptions of data are no substitute for an actual example of the data.

    Be sure to use the dataex command to do this. If you are running version 15.1 or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use dataex.

    Comment


    • #3
      Thank you for your reply, William Lisowski. You are right, marketcap is the market capitalization of each stock.

      Here are 20 observations of my data:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str10 fund float mdate double number str8 stock_id double markcap
      "746916105" 662  133 "00030710"  666.0591269199999
      "746916105" 660  133 "00030710"  666.0591269199999
      "746916105" 661  133 "00030710"  666.0591269199999
      "746916105" 663  127 "00030710"       951.33236616
      "746916105" 665  126 "00030710"       951.33236616
      "746916105" 664  127 "00030710"       951.33236616
      "233203298" 584 1548 "00036020" 314.34953912000003
      "233203280" 584 1974 "00036020" 314.34953912000003
      "233203298" 585 1543 "00036020"   359.049744719999
      "233203280" 585 1966 "00036020"   359.049744719999
      "233203280" 586 1957 "00036020"   359.049744719999
      "233203298" 586 1538 "00036020"   359.049744719999
      "233203298" 587 1538 "00036020"   359.049744719999
      "233203280" 587 1956 "00036020"   359.049744719999
      "746935741" 590  173 "00036020"        311.1690522
      "40171W470" 589  318 "00036020"        311.1690522
      "233203298" 589 1525 "00036020"        311.1690522
      "746935741" 588  175 "00036020"        311.1690522
      "233203280" 588 1947 "00036020"        311.1690522
      "40171W470" 590  316 "00036020"        311.1690522
      end
      format %tm mdate
      I hope this helps.

      Comment


      • #4
        Thank you. Thinking about writing the code leads to two more questions.

        1) How do you want to treat January-June 2005?

        2) How do you want to treat stocks that are not in the data on June but appear in the data in a subsequent month? (For example, from an IPO. Or just from one fund's decision to add a stock to its portfolio that no other fund has.)

        Comment


        • #5
          Thank you very much for your help and your questions.

          1) I would like the "small" or "big" categorization from June2005 to also apply for the period January-June 2005.

          2) If there are no data for a stock in June, these stocks are excluded and will be taken into account in the next year, in case there is now an observation for June.

          Comment


          • #6
            copy the dataset, delete all duplicates, do the median calculation and labelling and then merge the datasets back together.
            That is the approach I chose to take. Here is some example code run on invented data similar to your example that may start you in a useful direction.
            Code:
            use `data'
            
            // create yearly big/small
            keep if month(dofm(mdate))==6
            egen tokeep = tag(stock_id mdate)
            keep if tokeep
            keep mdate stock_id markcap
            bysort mdate (stock_id): egen medcap = median(markcap)
            generate big = markcap>=medcap
            label define BIG 1 "big" 0 "small"
            label values big BIG
            list, clean noobs // June values
            
            // expand to monthly big/small - July through June
            drop markcap medcap
            expand 12
            bysort stock_id mdate: replace mdate = mdate+_n
            
            // add January-June 2005
            expand 6 if mdate==tm(2005m7), generate(new)
            replace mdate = . if new
            bysort stock_id new: replace mdate=tm(2005m1)-1+_n if new
            drop new
            sort mdate stock_id
            list if inlist(month(dofm(mdate)),1,7), clean noobs
            
            merge 1:m stock_id mdate using `data'
            Code:
            . list, clean noobs // June values
            
                 mdate   stock_id   markcap   medcap     big  
                2005m6         S1       150      400   small  
                2005m6         S2       650      400     big  
                2006m6         S1       250      500   small  
                2006m6         S2       750      500     big  
                2007m6         S1       350      600   small  
                2007m6         S2       850      600     big
            Code:
            . list if inlist(month(dofm(mdate)),1,7), clean noobs
            
                 mdate   stock_id     big  
                2005m1         S1   small  
                2005m1         S2     big  
                2005m7         S1   small  
                2005m7         S2     big  
                2006m1         S1   small  
                2006m1         S2     big  
                2006m7         S1   small  
                2006m7         S2     big  
                2007m1         S1   small  
                2007m1         S2     big  
                2007m7         S1   small  
                2007m7         S2     big  
                2008m1         S1   small  
                2008m1         S2     big

            Comment


            • #7
              Thank you very very much. Your provided code works fine. This really helped me a lot and gave me an idea how to handle similar problems.

              Comment


              • #8
                I do have one small follow up question. Your code works fine and seems to do exactly what I want.

                However, after having done this:
                Code:
                keep if month(dofm(mdate))==6
                egen tokeep = tag(stock_id mdate)
                keep if tokeep
                keep mdate stock_id markcap
                bysort mdate (stock_id): egen medcap = median(markcap)
                generate big = markcap>=medcap
                label define BIG 1 "big" 0 "small"
                label values big BIG
                Out of the 33,930 observations 17,617 are labelled as "big" and only 16,313 labelled as "small". I thought the difference might be due to the fact that I did include all values which are =medcap to be labelled as "big".

                However, after changing the last code line to include only bigger values as "big"
                Code:
                generate big = markcap>medcap
                the amount of "big" and "small" observations still remains unchanged. This shows that the mean is build as an average of the two values being in the middle, which is the case if the size of the sample is an equal number.

                However, I do not understand why the amount of stocks being labelled as "big" and "small" is not exactly the same?

                Comment


                • #9
                  So I again have a question. At what point do you have 33,930 observations? In your full dataset, or in the reduced dataset used to calculate the medians?
                  Code:
                  * 33,930 HERE?
                  keep if month(dofm(mdate))==6
                  * 33,930 HERE?
                  egen tokeep = tag(stock_id mdate)
                  keep if tokeep
                  * 33,930 HERE?
                  keep mdate stock_id markcap
                  bysort mdate (stock_id): egen medcap = median(markcap)
                  generate big = markcap>=medcap
                  label define BIG 1 "big" 0 "small"
                  label values big BIG

                  Comment


                  • #10
                    Sorry for not making it clear
                    Code:
                    keep if month(dofm(mdate))==6
                    egen tokeep = tag(stock_cusip mdate)
                    keep if to keep
                    * 33,390 HERE

                    Comment


                    • #11
                      Do you perhaps have observations for which markcap is missing? In that case, you will want
                      Code:
                      generate big = markcap>=medcap
                      replace big = markcap if missing(markcap) // copy missing value

                      Comment


                      • #12
                        Thank you very much. I do have observations for which markcap is missing.

                        Now it is nearly even. My "big" group consists of 16,315 and my "small" group of 16,313 values.

                        It does not make a difference if I use
                        Code:
                        generate big = markcap>=medcap
                        or
                        Code:
                        generate big = markcap>medcap
                        the size in "big" remains unchanged at 16,315.

                        Do you have any idea how the difference can be explained?

                        Comment


                        • #13
                          (Edited since original posting.)

                          Perhaps
                          Code:
                          bysort mdate (stock_id): egen double medcap = median(markcap}
                          will take care of this. But ties can also cause this.

                          Also you have 14 different values of medcap, one for each year from 2005 to 2018. You should be looking at the results separately.
                          Code:
                          tab mdate big, missing
                          Last edited by William Lisowski; 05 Apr 2019, 05:36.

                          Comment


                          • #14
                            Thank you very much, especially the last code did help me a lot.

                            Comment

                            Working...
                            X