  • Splitting sample into quantiles for every year-industry*

    Hi everyone,

    I am having a hard time generating deciles (m_share_dec) for a specific variable, in this case a market share (m_share), for each industry-year. I have tried several commands:

    1)
    xtile2 m_share_dec = m_share, nq(10) by(year industry)
    -> this gives me an error 'no observations'

    2)
    bysort year industry: egen m_share_dec = xtile(m_share), nq(10)
    -> this one gives me an error 'too many observations'


    Would anyone be able to help with this? Maybe it is also possible to use a loop or another command that would create the correct deciles.

    Thanks!

  • #2
    Here deciles means bins, classes or intervals based on deciles.

    xtile2 is from SSC. xtile() for egen is from egenmore (SSC). Please note FAQ Advice #12 whereby you're asked to explain where community-contributed commands you're referring to come from.

    We can't check either of those problems easily. Could you show us the results of these commands?

    Code:
    describe m_share year industry 
    summarize m_share year industry



    • #3
      Cross-posted at https://stackoverflow.com/questions/...on-cateogories

      Our policy on cross-posting is explicit in the FAQ Advice. You are asked to tell us about it.

      At least one of your names is fake. If it's the name you're using here, then your sins have found you out. Again, we ask for full real names here.
      Last edited by Nick Cox; 15 May 2018, 11:02.



      • #4
        My apologies, I was not aware that it is not possible to ask the same question in multiple places. The question has been removed from Stack Overflow. Ideally, I would like to create only 3 bins for each industry (named SIC)-year for several variables. In this case, I have presented m_share. The summary statistics are as follows (in the attachment).


        From this point, unfortunately, the xtile() function does not work. Could anyone suggest another way to classify observations into 3 bins based on 2 categories (year & industry)?

        I appreciate the help!
        Last edited by Rasa Augusta; 16 May 2018, 14:18.



        • #5
          . describe m_share year SIC

                          storage   display    value
          variable name   type      format     label      variable label

          m_share         float     %9.0g                 m_share
          year            double    %6.0g                 Data Year - Fiscal
          SIC             long      %8.0g      SIC        Standard Industry Classification Code

          . summarize m_share year SIC

          Variable        Obs       Mean   Std. Dev.      Min        Max

          m_share     135,799   .0655358    .1358734        0   .8063346
          year        135,799   1996.138    9.857962     1979       2017
          SIC         135,799   213.2867    131.7573        1        448



          • #6
            On cross-posting: It's certainly possible to post in different places, and no one here is even asking you not to do it. As explained in the FAQ Advice, all we ask is that you tell us about it. That is all explained in https://www.statalist.org/forums/help#crossposting as part of what you're reminded to read each time you post.

            Also, the post on Stack Overflow remains there. What's happened is that you have removed yourself from SO, which is not at all the same thing. So, you won't be able to delete it now.

            None of that matters very much so long as it's clear.

            Now to your question. You have 135799 non-missing values of m_share. I guess at 39 distinct years (1979 to 2017) and possibly 448 distinct SIC codes. If that's right, then we're talking about, say, 448 * 39 = 17472 cross-combinations of industry and year, and that would mean fewer than 8 observations per industry-year combination on average. That sounds too few to make sense for decile bins.
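
            The guess about cell sizes can be checked directly by counting non-missing values in each industry-year cell. What follows is only a minimal sketch on made-up data: the variable names m_share, year and SIC are taken from #5, while cell_n and cell_tag are names invented here for illustration. On the real dataset only the last four commands would be needed.

            ```stata
            * made-up data standing in for the real dataset (an assumption, not the poster's data)
            clear
            set seed 2803
            set obs 1000
            gen year = 1979 + floor(39 * runiform())
            gen SIC = 1 + floor(448 * runiform())
            gen m_share = runiform()

            * number of non-missing m_share values in each industry-year cell
            bysort year SIC : egen cell_n = count(m_share)

            * tag one observation per cell so that each cell is summarized once
            egen cell_tag = tag(year SIC)
            summarize cell_n if cell_tag, detail

            * how many cells have fewer than 10 values, i.e. too few for 10 non-empty bins
            count if cell_tag & cell_n < 10
            ```

            If most cells turn out to hold only a handful of values, even 3 bins per cell will be coarse.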

            As an experiment I tried this in Stata 15.1:

            Code:
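            * 200,000 observations of simulated data
            clear
            set obs `=40 * 500 * 10'
            set seed 2803
            * industry: 500 consecutive blocks of 400 observations
            egen industry = seq(), block(400)
            * year: consecutive blocks of 40 observations, 10 blocks within each industry
            egen year = seq(), block(40)
            gen share = runiform()
            * decile bins within each year-industry cell; xtile() for egen is from egenmore (SSC)
            bysort year industry: egen decile = xtile(share), nq(10)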
            This is for a dataset bigger than yours; xtile() is slow, but does not fall over.

            I also tried this, which is immensely faster and produces the same result. However, your real data may have a problem with ties.

            Code:
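            * sort on share within each cell, then scale the position _n to bins 1..10
            bysort year industry (share) : gen DECILE = ceil(10 * _n/_N)

            * check that the two methods agree
            assert decile == DECILE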
            I didn't try anything with xtile2.
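
            On ties: with the _n/_N approach, observations with identical values can be split across two bins, which is likely to matter here because m_share has a minimum of 0. One possible convention, and it is only one of several, is to give every tied observation the highest position within its run of ties before scaling. A minimal sketch on toy data for the 3 bins asked for in #4; pos, n_ok and m_share_bin are names invented here, and missing values are excluded from the cell count.

            ```stata
            * toy data: two industry-year cells of six observations, half of them tied at zero
            clear
            set obs 12
            gen year = 2000
            gen SIC = 1 + (_n > 6)
            gen m_share = cond(mod(_n, 2) == 0, 0, _n / 100)

            local nq 3    // number of bins; 10 would give decile bins

            * position within each cell after sorting on m_share (missing values sort last)
            bysort year SIC (m_share) : gen long pos = _n

            * give every observation in a run of ties the highest position in that run
            by year SIC m_share : replace pos = pos[_N]

            * number of non-missing values in each cell
            by year SIC : egen n_ok = count(m_share)

            * scale position to bins 1..nq, leaving missing values unclassified
            gen byte m_share_bin = ceil(`nq' * pos / n_ok) if !missing(m_share)

            tabulate m_share m_share_bin if SIC == 1
            ```

            The price of keeping ties together is unequal bin sizes, and a bin can even be empty when ties are numerous, as the table for the toy data shows.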
