Group and label variables (perhaps using -rangejoin-)

Tim Wolf

Join Date: Mar 2019

Posts: 25
#1

Group and label variables (perhaps using -rangejoin-)

30 Mar 2019, 05:24

Hello,

my following problem seems to me as being pretty complicated. However, I will try to explain it as goood and structured as possible. I hope you can help me.

My dataset consists of the following variables: fund (identifier for each fund) mdate (yearmonth) number (number of stocks the fund holds) stock_id (identifier for each stock) markcap (market capitalization of each fund. Telling me basically how much the stock is worth at the stock market).
So if fund A holds lets say 5 stocks at January2005, there will be 5 observations for fund A in January2005, given that each stock is considered seperately within the stock_id column.

My goal: Each year in June (overall period is January2005 till December2018), use the median of markcap to split the stocks in two groups, named "small" and "big". If markcap > median = "big". If markcap < median = "small".

If markcap > median in June(t), the stock should then be labelled as "big" from July(t) till June(t+1). Then in June (t+1) stocks are divided again in "small"/"big" and then labelled from July(t+1) till June (t+2) and so on.

However, since one stock can be hold by multiple funds at each date, there are duplicates stock_id mdate markcap in my dataset. Therefore, before dividing the stocks into "small" and "big", I do need to exclude these duplicates so that every stock is considered only once per mdate. Therefore, I would either need to mark the duplicates and exclude them when performing the median calculation (but I do not know how to do this), or I copy the dataset, delete all duplicates, do the median calculation and labelling and then merge the datasets back together.

I am sorry that I can not provide one single code so far. But my stata knowledge for now is simply not good enough to know how to solve this problem. I guess for the labelling from July(t) till June(t+1) I could use rangejoin or a code that looks something like this (assuming that I did already define a small_big variable, indicating to which group each stock_id belongs)?

Code:

by stock_id(mdate), sort: replace small_big = small_big [_n-1] if missing(small_big)

And the creation of the small_big variable could perhaps look something like this (assuming that I did already create a markcap decile for each stock_id per year in June). Eventhough I am not sure if using deciles is the smartest way.

Code:

label define small_big 0 "small" 1 "big" gen byte small_big= 1 if inlist(markcap_decile_june, 6,7,8,9, 10) replace small_big= 0 if inlist(marpcap_decile_june, 1, 2,3,4,5) label values small_big small_big

I would reall appreciate any help a lot.
Thank you very much.

Tim Wolf
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

30 Mar 2019, 08:44

markcap (market capitalization of each fund. Telling me basically how much the stock is worth at the stock market).

Surely you mean that markcap is the market capitalization of each stock, not of each fund?

I am sorry that I can not provide one single code so far

You can, however, provide an example of your data, as suggested in the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. (Section 12.2) Even the best descriptions of data are no substitute for an actual example of the data.

Be sure to use the dataex command to do this. If you are running version 15.1 or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use dataex.
Comment

Tim Wolf

Join Date: Mar 2019
Posts: 25

01 Apr 2019, 04:21

Thank you for your reply, William Lisowski. You are right, marketcap is the market capitalization of each stock.

Here are 20 observations of my data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 fund float mdate double number str8 stock_id double markcap
"746916105" 662  133 "00030710"  666.0591269199999
"746916105" 660  133 "00030710"  666.0591269199999
"746916105" 661  133 "00030710"  666.0591269199999
"746916105" 663  127 "00030710"       951.33236616
"746916105" 665  126 "00030710"       951.33236616
"746916105" 664  127 "00030710"       951.33236616
"233203298" 584 1548 "00036020" 314.34953912000003
"233203280" 584 1974 "00036020" 314.34953912000003
"233203298" 585 1543 "00036020"   359.049744719999
"233203280" 585 1966 "00036020"   359.049744719999
"233203280" 586 1957 "00036020"   359.049744719999
"233203298" 586 1538 "00036020"   359.049744719999
"233203298" 587 1538 "00036020"   359.049744719999
"233203280" 587 1956 "00036020"   359.049744719999
"746935741" 590  173 "00036020"        311.1690522
"40171W470" 589  318 "00036020"        311.1690522
"233203298" 589 1525 "00036020"        311.1690522
"746935741" 588  175 "00036020"        311.1690522
"233203280" 588 1947 "00036020"        311.1690522
"40171W470" 590  316 "00036020"        311.1690522
end
format %tm mdate

I hope this helps.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

01 Apr 2019, 06:19

Thank you. Thinking about writing the code leads to two more questions.

1) How do you want to treat January-June 2005?

2) How do you want to treat stocks that are not in the data on June but appear in the data in a subsequent month? (For example, from an IPO. Or just from one fund's decision to add a stock to its portfolio that no other fund has.)
Comment
Tim Wolf

Join Date: Mar 2019

Posts: 25
#5

01 Apr 2019, 08:23

Thank you very much for your help and your questions.

1) I would like the "small" or "big" categorization from June2005 to also apply for the period January-June 2005.

2) If there are no data for a stock in June, these stocks are excluded and will be taken into account in the next year, in case there is now an observation for June.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

01 Apr 2019, 16:02

copy the dataset, delete all duplicates, do the median calculation and labelling and then merge the datasets back together.

That is the approach I chose to take. Here is some example code run on invented data similar to your example that may start you in a useful direction.

Code:

use `data'

// create yearly big/small
keep if month(dofm(mdate))==6
egen tokeep = tag(stock_id mdate)
keep if tokeep
keep mdate stock_id markcap
bysort mdate (stock_id): egen medcap = median(markcap)
generate big = markcap>=medcap
label define BIG 1 "big" 0 "small"
label values big BIG
list, clean noobs // June values

// expand to monthly big/small - July through June
drop markcap medcap
expand 12
bysort stock_id mdate: replace mdate = mdate+_n

// add January-June 2005
expand 6 if mdate==tm(2005m7), generate(new)
replace mdate = . if new
bysort stock_id new: replace mdate=tm(2005m1)-1+_n if new
drop new
sort mdate stock_id
list if inlist(month(dofm(mdate)),1,7), clean noobs

merge 1:m stock_id mdate using `data'

Code:

. list, clean noobs // June values

     mdate   stock_id   markcap   medcap     big  
    2005m6         S1       150      400   small  
    2005m6         S2       650      400     big  
    2006m6         S1       250      500   small  
    2006m6         S2       750      500     big  
    2007m6         S1       350      600   small  
    2007m6         S2       850      600     big

Code:

. list if inlist(month(dofm(mdate)),1,7), clean noobs

     mdate   stock_id     big  
    2005m1         S1   small  
    2005m1         S2     big  
    2005m7         S1   small  
    2005m7         S2     big  
    2006m1         S1   small  
    2006m1         S2     big  
    2006m7         S1   small  
    2006m7         S2     big  
    2007m1         S1   small  
    2007m1         S2     big  
    2007m7         S1   small  
    2007m7         S2     big  
    2008m1         S1   small  
    2008m1         S2     big

Comment

Tim Wolf

Join Date: Mar 2019

Posts: 25
#7

03 Apr 2019, 11:48

Thank you very very much. Your provided code works fine. This really helped me a lot and gave me an idea how to handle similar problems.
Comment
Tim Wolf

Join Date: Mar 2019

Posts: 25
#8

04 Apr 2019, 11:09

I do have one small follow up question. Your code works fine and seems to do exactly what I want.

However, after having done this:

Code:

keep if month(dofm(mdate))==6 egen tokeep = tag(stock_id mdate) keep if tokeep keep mdate stock_id markcap bysort mdate (stock_id): egen medcap = median(markcap) generate big = markcap>=medcap label define BIG 1 "big" 0 "small" label values big BIG

Out of the 33,930 observations 17,617 are labelled as "big" and only 16,313 labelled as "small". I thought the difference might be due to the fact that I did include all values which are =medcap to be labelled as "big".

However, after changing the last code line to include only bigger values as "big"

Code:

generate big = markcap>medcap

the amount of "big" and "small" observations still remains unchanged. This shows that the mean is build as an average of the two values being in the middle, which is the case if the size of the sample is an equal number.

However, I do not understand why the amount of stocks being labelled as "big" and "small" is not exactly the same?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

04 Apr 2019, 12:12

So I again have a question. At what point do you have 33,930 observations? In your full dataset, or in the reduced dataset used to calculate the medians?

Code:

* 33,930 HERE?
keep if month(dofm(mdate))==6
* 33,930 HERE?
egen tokeep = tag(stock_id mdate)
keep if tokeep
* 33,930 HERE?
keep mdate stock_id markcap
bysort mdate (stock_id): egen medcap = median(markcap)
generate big = markcap>=medcap
label define BIG 1 "big" 0 "small"
label values big BIG

Comment

Tim Wolf

Join Date: Mar 2019

Posts: 25
#10

04 Apr 2019, 12:58

Sorry for not making it clear

Code:

keep if month(dofm(mdate))==6 egen tokeep = tag(stock_cusip mdate) keep if to keep * 33,390 HERE
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

04 Apr 2019, 13:13

Do you perhaps have observations for which markcap is missing? In that case, you will want

Code:

generate big = markcap>=medcap replace big = markcap if missing(markcap) // copy missing value
Comment
Tim Wolf

Join Date: Mar 2019

Posts: 25
#12

05 Apr 2019, 02:52

Thank you very much. I do have observations for which markcap is missing.

Now it is nearly even. My "big" group consists of 16,315 and my "small" group of 16,313 values.

It does not make a difference if I use

Code:

generate big = markcap>=medcap

or

Code:

generate big = markcap>medcap

the size in "big" remains unchanged at 16,315.

Do you have any idea how the difference can be explained?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

05 Apr 2019, 04:57

(Edited since original posting.)

Perhaps

Code:

bysort mdate (stock_id): egen double medcap = median(markcap}

will take care of this. But ties can also cause this.

Also you have 14 different values of medcap, one for each year from 2005 to 2018. You should be looking at the results separately.

Code:

tab mdate big, missing

Last edited by William Lisowski; 05 Apr 2019, 05:36.
Comment
Tim Wolf

Join Date: Mar 2019

Posts: 25
#14

05 Apr 2019, 08:31

Thank you very much, especially the last code did help me a lot.
Comment

Announcement