Divide data into groups according to percentile rank AND another given variable (if tie exists)

shem shen

#1

01 Feb 2019, 13:43

Hi experts!

I have a four-year repeated cross-sectional data set. I want to divide my observations (within each year) into 400 equal-sized groups according to their income ranks. In addition, whenever there are ties, I want to assign ranks according to the value of another given variable.

I only know that we can do the following:

xtile newvar = income, n(400)

But xtile does not seem to allow me to introduce an additional variable to assign ranks when there are ties. Is there any simple method or user-written command? Thank you!

Joro Kolev
#2

01 Feb 2019, 14:22

What I show below is an ancient approach that is no longer endorsed, so you might want to wait for somebody to suggest a more "proper" solution. If nobody shows up with a better proposal, this is what I would do:

Code:

. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)

. keep year ln_wage hours

. sort year hours ln_wage

. by year: gen fourth = group(400)

The code above sorts by the three variables listed and then, within each year, splits the data (already sorted by the other two variables) into 400 roughly equal groups.
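
For readers who would rather avoid the undocumented group() function, the same sort-then-split idea can be sketched with documented features only. This is a minimal sketch, not a drop-in replacement: the formula ceil(400 * _n / _N) is one assumed way of cutting positions into near-equal bins, and dropping observations with missing sort keys is an assumption about what is wanted.

```stata
* Sketch: sort within year, break ties on a second variable, then bin by position
webuse nlswork, clear
keep year ln_wage hours

* Assumption: observations with a missing sort key should not be grouped at all
drop if missing(hours, ln_wage)

* Within each year, sort by hours and break ties on ln_wage;
* _n is the position within the year and _N is the year's sample size
bysort year (hours ln_wage): generate fourth = ceil(400 * _n / _N)
```

For the original question, income would take the place of hours and the tie-breaking variable the place of ln_wage.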

shem shen
#3

01 Feb 2019, 14:26

Thank you very much Joro! This is very helpful!!

David Benson
#4

01 Feb 2019, 16:10

Joro Kolev

What part of your code in #2 is no longer "endorsed"?

Joro Kolev
#5

01 Feb 2019, 16:17

The

Code:

gen newvar = group(varlist)

function is undocumented as of now.

Once upon a time I sent in a Stata Tip submission, the main point being that the construct above (that is, -gen- with the group() function) is lightning fast compared with the user-contributed "egen, xtile".
Nick Cox shot it down with some arguments that did not sway me much (of the sort: StataCorp discontinued it, therefore we should not use it). But Nick also had some real objections, which I might not have completely understood. I remember he said something like "The group function does not map likes to likes."

I will try to dig out his email in a bit.

Joro Kolev
#6

01 Feb 2019, 18:44

First, I did not explain this very well above: I should have warned that -gen newvar = group(NUMBER)- is different from -egen newvar = group(varlist)-.

Second, the function -gen newvar = group(NUMBER)- is undocumented as of now. Unfortunately, I do not know of any command that accomplishes the same task in a modern way.

I am attaching the Stata Tip I wrote (the basic point is that in finance you have a lot of tasks where you need to sort by some set of variables and then split the data into "roughly equal groups"). Keep in mind that this is ancient: I wrote the tip and sent it to the Stata Journal in 2007, and even back then Nick told me that it was outdated.

Below I am also quoting Nick Cox's critique of this approach. The first part of the critique is easy: if you have missing values, the sort will send them to the end, and then -gen newvar = group(NUMBER)- will put these missing values into groups of their own. (To which my reaction is: if you form groups, first think about which variables you are sorting on to form them.)

I did not get the second part of the critique, but Nick did see a problem with this approach.

This is what Nick says:

"-group()- for example does not map like to like, nor does it handle missing values properly."

A simple example shows what I mean about like to like.

. sysuse auto
(1978 Automobile Data)

. sort rep78, stable

. gen foo = group(rep78)

. tab foo rep78

           |                   Repair Record 1978
       foo |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |         2          8         15          0          0 |        25
         2 |         0          0         15          0          0 |        15
         3 |         0          0          0         15          0 |        15
         4 |         0          0          0          3          1 |         4
         5 |         0          0          0          0         10 |        10
-----------+-------------------------------------------------------+----------
     Total |         2          8         30         18         11 |        69
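
In the table, cars with rep78 == 3 are spread over two values of foo, and so are cars with rep78 == 4: identical values do not all end up in the same group. The contrast with value-based binning can be sketched as follows. This is an illustration under assumptions, not Nick's code: pos5 and q5 are hypothetical variable names, and ceil(5 * _n / _N) stands in for any purely position-based split.

```stata
sysuse auto, clear
drop if missing(rep78)

* Position-based split: five bins of nearly equal size, cut purely by sort order
sort rep78, stable
generate pos5 = ceil(5 * _n / _N)

* Value-based split: xtile assigns bins from the values themselves
xtile q5 = rep78, nquantiles(5)

* Cars with the same rep78 can land in different pos5 bins ...
tabulate pos5 rep78

* ... but never in different q5 bins
tabulate q5 rep78
bysort rep78 (q5): assert q5[1] == q5[_N]
```

With 30 of the 69 cars at rep78 == 3 and only about 14 cars per bin, any equal-sized split by position has to break that tie, which is the price of insisting on equal group sizes.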

Attached Files

group.pdf (91.0 KB, 2 views)

shem shen
#7

09 Feb 2019, 15:56

Thank you Joro! Honestly I do not understand the "like to like" part. I guess it refers to the situation when the group size is not completely identical because the sample size cannot be neatly divided by the number of groups. In such a case, maybe xtile has a built-in algorithm that can help it determine whether the observations on the "boundaries" should be put in an upper or lower group based on the similarity between the boundary cases' values and the cases in the adjacent groups?

Ayub UOM
#8

04 Aug 2019, 02:49

shem shen Sir, I have a question. I also want to rank my dependent variables from zero to 100, that is, as percentiles. I am using this command:
xtile newvarz = EQ , nquantiles(100)
It is the same as your command, xtile newvarz = EQ , n(100).
But I want to rank my variables by year, and I think this ranks percentiles over the whole dataset rather than within each year.
So how can I compute percentile ranks by year?
Looking forward to your kind reply.

shem shen
#9

04 Aug 2019, 10:04

Hi Ayub,

Suppose your year variable is "y"

foreach yr of numlist year1 year2 year3 ... {
xtile newvarz`yr'=EQ if y==`yr',n(100)
}
egen newvarz=rowmean(newvarz*)
drop newvarzyear1 newvarzyear2 ...
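
A version of this loop that runs as written might look like the sketch below. It is only an illustration: the nlswork data stand in for the real data set, with year playing the role of y and ln_wage the role of EQ, and tmpq is a hypothetical scratch variable. levelsof collects the distinct year values so they do not have to be typed out.

```stata
* Stand-in data: year plays the role of y, ln_wage the role of EQ
webuse nlswork, clear

* Store the distinct values of year in the local macro years
levelsof year, local(years)

generate newvarz = .
* yr is the loop's local macro; it holds one year value on each pass
foreach yr of local years {
    * percentile groups computed within this year only
    xtile tmpq = ln_wage if year == `yr', nquantiles(100)
    replace newvarz = tmpq if year == `yr'
    drop tmpq
}
```

In the loop above, `yr' is replaced by the current year value on each pass, and newvarz* is a wildcard that matches every variable whose name begins with newvarz.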

Nick Cox
#10

05 Aug 2019, 00:50

Code:

ssc inst egenmore
egen wanted = xtile(EQ), by(year) nq(100)

Shem Shen's syntax in #9 will fail unless you substitute numeric values in place of year1 year2 year3.

Ayub UOM
#11

06 Aug 2019, 06:11

@ Nick Cox
Thank you so much for your kind reply.
I tried to install egenmore but did not succeed. It takes too much time and in the end gives me this message:
". ssc inst egenmore
checking egenmore consistency and verifying not already installed...
connection timed out -- see help r(2) for troubleshooting
could not copy http://fmwww.bc.edu/repec/bocode/_/_gmsub.ado
(no action taken)
r(2);
"
Sir, is there any solution, please?

Nick Cox
#12

06 Aug 2019, 06:15

I'd check that your Stata can see SSC. The time required is trivial and the message usually arises for other reasons.

Code:

help netio

gives advice.

Ayub UOM
#13

06 Aug 2019, 06:43

@ shem shen
I am also thankful to you for your positive response. Sir, could you please explain in detail? My years run from 2008 to 2016. I am a new user, which is why I am asking for more details. Also, y is for year, but what does yr stand for? And what does newvarz* stand for?
Best regards, sir.

Ayub UOM
#14

06 Aug 2019, 07:04

@ Nick Cox Thank you, sir. I will try again tomorrow and then let you know.

Ayub UOM
#15

06 Aug 2019, 20:17

Nick Cox Thank you, sir, for your guidance. I have installed egenmore. I gather that sometimes a command cannot be installed right away, but after waiting a while (maybe 4 or 5 hours, or maybe one or two days) it can be installed. Thank you once again.