
  • Divide data into groups according to percentile rank AND another given variable (if tie exists)

    Hi experts!

    I have a four-year repeated cross-sectional data set. I want to divide my observations (within each year) into 400 equal-sized groups according to their income ranks. In addition, whenever there are ties, I want to assign ranks according to the value of another given variable.

    I only know that we can do the following:

    xtile newvar = income, n(400)

    But xtile does not seem to allow me to introduce an additional variable to assign ranks when there are ties. Is there any simple method or user-written command? Thank you!

  • #2
    What I show below is an old approach that is not endorsed as of now, so you might want to wait for somebody to suggest a more "proper" solution. If nobody shows up with a better proposal, this is what I would do:

    Code:
    . webuse nlswork, clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . keep year ln_wage hours
    
    . sort year hours ln_wage
    
    . by year: gen fourth = group(400)
    The code above sorts the data by year, then hours, then ln_wage. Then, within each year, it splits the already sorted observations into 400 roughly equal groups.
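
    A similar split can be sketched with documented features only, using the observation counters _n and _N. This is a minimal sketch, not a drop-in replacement for group(): the names ok and grp are made up for illustration, ln_wage stands in for income and hours for the tie-breaking variable, and observations that tie on both variables are still ordered arbitrarily.

    ```stata
    webuse nlswork, clear
    keep year ln_wage hours

    * flag observations where both sort keys are nonmissing (ok is a hypothetical name)
    gen byte ok = !missing(ln_wage, hours)

    * within each year, order by ln_wage and break ties by hours,
    * then cut the ordered positions into 400 roughly equal groups
    bysort year ok (ln_wage hours): gen int grp = ceil(400 * _n / _N) if ok
    ```

    Observations with a missing value in either variable get a missing grp instead of being pushed into the top groups.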



    • #3
      Thank you very much Joro! This is very helpful!!



      • #4
        Joro Kolev

        What part of your code in #2 is no longer "endorsed"?



        • #5
          The
          Code:
          gen newvar = group(varlist)
          function is undocumented as of now.

          Once upon a time I sent in a Stata Tip submission, the main point being that the construct above (that is, -gen- with the group function) is lightning fast compared with the user-contributed "egen, xtile".
          Nick Cox shot it down with some arguments that did not sway my opinion much (of the sort: StataCorp discontinued it, therefore we should not use it). But Nick also had some real objections, which I might not have completely understood. I remember he said something like "The group function does not map likes to likes."

          I will try to dig out his email in a bit.



          • #6
            First, I did not explain this very well above: I should have warned that -gen newvar = group(NUMBER)- is different from -egen newvar = group(varlist)-.

            Second, the function -gen newvar = group(NUMBER)- is undocumented as of now. Unfortunately, I do not know of any command that accomplishes the same task in a modern way.

            I am attaching the Stata Tip I wrote (the basic point is that in finance there are many tasks where you need to sort by some set of variables and then split the data into "roughly equal groups"). Keep in mind that this material is ancient: I wrote the tip and sent it to the Stata Journal in 2007, and even back then Nick told me that it was outdated.

            Below I also quote Nick Cox's critique of this approach. The first part of the critique is easy: if you have missing values, the sort will send them to the end, and -gen newvar = group(NUMBER)- will then send these missing values to groups of their own. (My reaction to that: if you form groups, think first about which variables you are sorting on to form them.)

            I did not understand the second critique, but Nick did see a problem with this approach.

            This is what Nick says:

            "-group()- for example does not map like to like, nor does it handle missing values properly."


            A simple example shows what I mean about like to like.

            . sysuse auto
            (1978 Automobile Data)

            . sort rep78, stable

            . gen foo = group(rep78)

            . tab foo rep78

                       |            Repair Record 1978
                   foo |         1          2          3          4          5 |     Total
            -----------+-------------------------------------------------------+----------
                     1 |         2          8         15          0          0 |        25
                     2 |         0          0         15          0          0 |        15
                     3 |         0          0          0         15          0 |        15
                     4 |         0          0          0          3          1 |         4
                     5 |         0          0          0          0         10 |        10
            -----------+-------------------------------------------------------+----------
                 Total |         2          8         30         18         11 |        69
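
            In the table, cars with rep78 equal to 3 are spread over foo groups 1 and 2, and cars with rep78 equal to 4 or 5 are each spread over two groups, so equal values do not always land in the same group. A minimal sketch of how one might check for this after any split by position is given here; the 5-group split by ceil(5 * _n / _N), the use of mpg, and the names g and split are assumptions for illustration, not part of Nick's example.

            ```stata
            sysuse auto, clear
            sort mpg, stable

            * split into 5 roughly equal groups purely by position in the sort order
            gen byte g = ceil(5 * _n / _N)

            * flag values of mpg whose observations fall into more than one group
            bysort mpg (g): gen byte split = g[1] != g[_N]

            * list the affected values and the groups they straddle
            tab mpg g if split
            ```

            By contrast, xtile assigns categories from the values themselves, so tied values receive the same category.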







            • #7
              Thank you Joro! Honestly I do not understand the "like by like" part. I guess it refers to the situation when the group size is not completely identical because the sample size cannot be neatly divided by the number of groups. In such a case, maybe xtile has a built-in algorithm that can help it determine whether the observations on the "boundaries" should be put in an upper or lower group based on the similarity between the boundary cases' values and the cases in the adjacent groups?



              • #8
                Shem Shen, sir, I have a question. I also want to rank my dependent variable from zero to 100, that is, as percentiles. I am using this command:
                xtile newvarz = EQ , nquantiles(100)
                It is the same as your command, written as xtile newvarz = EQ , n(100).
                But I want to rank my variable by year, and I think this ranks percentiles over the whole dataset, not by year.
                So how can I compute percentile ranks by year?
                Looking forward to your kind reply.



                • #9
                  Hi Ayub,

                  Suppose your year variable is "y"

                  foreach yr of numlist year1 year2 year3 ... {
                  xtile newvarz`yr'=EQ if y==`yr',n(100)
                  }
                  egen newvarz=rowmean(newvarz*)
                  drop newvarzyear1 newvarzyear2 ...




                  • #10
                    Code:
                    ssc inst egenmore 
                    egen wanted = xtile(EQ), by(year) nq(100)
                    Shem Shen's syntax in #9 will fail unless you substitute numeric values in place of year1 year2 year3.
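
                    For anyone who cannot install egenmore, the loop idea in #9 can be sketched with official commands only. This is a sketch under assumptions: the year variable is named year, as in the egen call above; wanted2 is a made-up name; and each year needs enough nonmissing values of EQ for xtile to run.

                    ```stata
                    gen wanted2 = .

                    * collect the distinct years actually present in the data
                    levelsof year, local(years)

                    foreach yr of local years {
                        * the local macro yr holds one year at a time
                        tempvar q
                        xtile `q' = EQ if year == `yr', nquantiles(100)
                        replace wanted2 = `q' if year == `yr'
                        drop `q'
                    }
                    ```

                    Because the loop reads the years from the data with levelsof, no year values need to be typed in by hand.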



                    • #11
                      @ Nick Cox
                      Thank you so much for your kind reply.
                      I tried to install egenmore but did not succeed. It takes a long time and in the end gives me this message:
                      ". ssc inst egenmore
                      checking egenmore consistency and verifying not already installed...
                      connection timed out -- see help r(2) for troubleshooting
                      could not copy http://fmwww.bc.edu/repec/bocode/_/_gmsub.ado
                      (no action taken)
                      r(2);
                      "
                      Sir, is there any solution, please?



                      • #12
                        I'd check that your Stata can see SSC. The time required is trivial and the message usually arises for other reasons.

                        Code:
                        help netio
                        gives advice.



                        • #13
                          @ Shem Shen
                          I am also thankful to you for your helpful response. Sir, could you please explain in more detail? My years run from 2008 to 2016. I am a new user, which is why I am asking for more detail. Also, y is for year, but what does yr stand for? And what does newvarz* stand for?
                          Best regards, sir



                          • #14
                            @ Nick Cox Thank you, sir. I will try again tomorrow and then let you know.



                            • #15
                              Nick Cox, thank you, sir, for your guidance. I have installed egenmore. I found that sometimes a command cannot be installed right away, but after waiting a while, maybe 4 or 5 hours or maybe one or two days, it can be installed. Thank you once again.
