Splitting sample into quantiles for every year-industry*

Rasa Augusta

Join Date: May 2018

Posts: 3
#1

15 May 2018, 10:42

Hi everyone,

I am having a hard time generating deciles (m_share_dec) of a specific variable, in this case market share (m_share), for each industry-year. I have tried several commands:

1)
xtile2 m_share_dec = m_share, nq(10) by(year industry)
-> this gives me an error 'no observations'

2)
bysort year industry: egen m_share_dec = xtile(m_share), nq(10)
-> this one gives me an error 'too many observations'

Would anyone be able to help with this? Maybe it is also possible to use a loop or another command that would create the correct deciles.

Thanks!
Nick Cox

Join Date: Mar 2014

Posts: 35734
#2

15 May 2018, 10:57

Here deciles means bins, classes or intervals based on deciles.

xtile2 is from SSC. xtile() for egen is from egenmore (SSC). Please note FAQ Advice #12 whereby you're asked to explain where community-contributed commands you're referring to come from.

We can't check either of those problems easily. Could you show us the results of these commands?

Code:

describe m_share year industry
summarize m_share year industry
Nick Cox

Join Date: Mar 2014

Posts: 35734
#3

15 May 2018, 11:00

Cross-posted at https://stackoverflow.com/questions/...on-cateogories

Our policy on cross-posting is explicit in the FAQ Advice. You are asked to tell us about it.

At least one of your names is fake. If it's the name you're using here, then your sins have found you out. Again, we ask for full real names here.

Last edited by Nick Cox; 15 May 2018, 11:02.
Rasa Augusta

Join Date: May 2018

Posts: 3
#4

16 May 2018, 14:12

My apologies, I was not aware that it is not possible to ask the same question in multiple places. The question has been removed from Stack Overflow. Ideally, I would like to create only 3 bins for each industry (named SIC)-year for several variables. In this case, I have presented m_share. The summary statistics are as follows (in the attachment).

From this point, unfortunately, the xtile() function does not work. Would anyone be able to suggest another way to classify observations into 3 bins based on 2 categories (year and industry)?

I appreciate the help!

Last edited by Rasa Augusta; 16 May 2018, 14:18.
Rasa Augusta

Join Date: May 2018

Posts: 3
#5

16 May 2018, 14:21

. describe m_share year SIC

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------
m_share         float   %9.0g                 m_share
year            double  %6.0g                 Data Year - Fiscal
SIC             long    %8.0g      SIC        Standard Industry Classification Code

. summarize m_share year SIC

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     m_share |    135,799    .0655358    .1358734          0   .8063346
        year |    135,799    1996.138    9.857962       1979       2017
         SIC |    135,799    213.2867    131.7573          1        448
Nick Cox

Join Date: Mar 2014

Posts: 35734
#6

16 May 2018, 17:17

On cross-posting: It's certainly possible to post in different places, and no one here is even asking you not to do it. As explained in the FAQ Advice, all we ask is that you tell us about it. That is all explained in https://www.statalist.org/forums/help#crossposting as part of what you're reminded to read each time you post.

Also, the post on Stack Overflow remains there. What's happened is that you have removed yourself from SO, which is not at all the same thing. So, you won't be able to delete it now.

None of that matters very much so long as it's clear.

Now to your question. You have 135,799 non-missing values of m_share. I guess at 39 distinct years (1979 to 2017) and possibly 448 distinct SIC codes. If that's right, then we're talking about, say, 448 * 39 = 17,472 cross-combinations of industry and year, and that would mean fewer than 8 observations per industry-year combination on average (135,799 / 17,472 is about 7.8). That sounds too few to make sense for decile bins.
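The guess about cell sizes can be checked directly. What follows is only a minimal sketch: the variable names m_share, year and SIC are taken from #5, but the toy data and the helper names n_cell and first are hypothetical, so on the real dataset the first block would be skipped.

```stata
* Toy data standing in for the real dataset (hypothetical values only)
clear
set seed 2803
set obs 6000
egen SIC = seq(), from(1) to(20) block(300)
egen year = seq(), from(1979) to(1988) block(30)
gen m_share = runiform()

* Number of non-missing m_share values in each year-SIC cell
egen n_cell = count(m_share), by(year SIC)

* Flag one observation per cell so that each cell is counted once
egen byte first = tag(year SIC)

* Distribution of cell sizes, and how many cells have fewer than 10 values
summarize n_cell if first, detail
count if first & n_cell < 10
```

If most cells turn out to hold fewer than 10 values, decile bins within industry-year cannot all be populated, whichever command is used.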

As an experiment I tried this in Stata 15.1:

Code:

clear
set obs `=40 * 500 * 10'
set seed 2803
egen industry = seq(), block(400)
egen year = seq(), block(40)
gen share = runiform()
bysort year industry: egen decile = xtile(share), nq(10)

which is for a dataset bigger than yours and xtile() is slow but does not fall over.

I also tried this, which is immensely faster and produces the same result. However, your real data may have a problem with ties.

Code:

bysort year industry (share) : gen DECILE = ceil(10 * _n/_N)
assert decile == DECILE

I didn't try anything with xtile2.
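On the ties caveat and the 3 bins asked for in #4, here is one possible sketch, not tried on the real data. It swaps _n and _N for the official egen functions rank() and count(): rank() gives tied values the same (average) rank, so tied values always land in the same bin, at the cost of bins that may be unequal in size. The toy data and the names r, n and tercile are hypothetical; only m_share, year and SIC come from #5.

```stata
* Toy data with deliberate ties (hypothetical values only)
clear
set seed 2803
set obs 6000
egen SIC = seq(), from(1) to(20) block(300)
egen year = seq(), from(1979) to(1988) block(30)
gen m_share = round(runiform(), 0.05)

* Average rank within each year-SIC cell; tied values share one rank
egen double r = rank(m_share), by(year SIC)

* Number of non-missing values in each cell
egen n = count(m_share), by(year SIC)

* Map ranks 1..n onto bins 1..3; missing m_share stays missing
gen byte tercile = ceil(3 * r / n)

* Check that equal values within a cell never fall in different bins
bysort year SIC m_share (tercile): assert tercile[1] == tercile[_N]

tabulate tercile
```

Replacing 3 by 10 gives decile bins in the same way, subject to the earlier point about how few observations each industry-year may hold.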
