Finding nearest neighbor of a percentile...

Kyle Billings

Join Date: May 2014

Posts: 12
#1


26 May 2014, 16:20

Suppose that I am using the system data set auto and I only want to look at two variables, let's say weight as the independent variable and mpg as the dependent variable. Additionally, let's say I want to compress the data set down to just 5 observations, and those 5 are either elements of the IQR or the next closest thing. I know how to get the IQR and set local macros for the min, max, r(p25), r(p50), and r(p75), but there are times when those percentiles are not actually elements of the vector (i.e., the median of weight is 3190, and that is not an element of weight). Is there a way I can tell Stata that, if that percentile is not in the vector, it should use the next closest observation above it? I need to get down to 5 observations and 5 responses for this project that I am doing.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

26 May 2014, 16:52

Cross-posted at http://stackoverflow.com/questions/2...-of-percentile

Posters are encouraged to read the FAQ before posting. Among other details, it explains our policy about cross-posting, which is that you should tell us about it.
Kyle Billings

Join Date: May 2014

Posts: 12
#3

26 May 2014, 17:16

My apologies for not referencing the link at stack exchange that I cross posted. Although I glanced over the FAQ, I clearly missed the section on cross posting.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

26 May 2014, 17:22

Noted. Also, full real names are requested here, such as Kyle Billings.
Kyle Billings

Join Date: May 2014

Posts: 12
#5

26 May 2014, 17:37

I have already contacted the admin to change my name. Hopefully changed soon.
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#6

26 May 2014, 17:42

Here's code that might get you started (but which shows up a problem: there may be more than one nearest neighbour):

Code:

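sysuse auto
* quartiles of mpg are left in r(c_1), r(c_2), r(c_3)
centile mpg, c(25 50 75)
* distance of each observation from the lower quartile
gen diff1 = abs(mpg - r(c_1))
egen d1min = min(diff1)
* every observation at the minimum distance is listed, so ties show up
list make if diff1 == d1min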
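One possible way to settle such ties, offered only as a sketch: sort on the distance and then on a second variable that identifies observations uniquely, and take the first observation. Using make as the tie-breaker is an assumption made for illustration (it happens to be unique in the auto data); any rule that fixes the order would serve.

```stata
sysuse auto, clear
centile mpg, centile(25)
* distance from the lower quartile, held in double precision
generate double diff1 = abs(mpg - r(c_1))
* order by distance, then by make so that ties are broken the same way every run
sort diff1 make
* the first observation is now a single, reproducible nearest neighbour
list make mpg diff1 in 1
```

Whether the alphabetically first make is a sensible choice among tied cars is a separate question; the sketch only makes the choice deterministic.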
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

26 May 2014, 17:42

This sounds a fairly odd thing to have to do. Also, what you ask for implies biased estimation, or selection.

But as you are aware,

1. The minimum and maximum are always values in the data.

2. The median and quartiles may be interpolated between values and so may not exist as data values.

These facts may be put together to get alternative quartiles (wide sense).

Code:

* copy the quartiles out of r() before anything overwrites them
summarize weight, detail
scalar p75 = r(p75)
scalar p50 = r(p50)
scalar p25 = r(p25)
* replace each quartile by the smallest data value at or above it
su weight if weight >= scalar(p75), meanonly
scalar p75 = r(min)
su weight if weight >= scalar(p50), meanonly
scalar p50 = r(min)
su weight if weight >= scalar(p25), meanonly
scalar p25 = r(min)

That can be made more concise:

Code:

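summarize weight, detail
* first loop: save all three quartiles from r()
foreach p in 25 50 75 {
    scalar p`p' = r(p`p')
}
* second loop: smallest data value at or above each quartile
foreach p in 25 50 75 {
    su weight if weight >= scalar(p`p'), meanonly
    scalar p`p' = r(min)
}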

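Note that those two loops must be two loops: each summarize call inside the second loop replaces the r() results, so all three quartiles have to be copied into scalars before that loop starts. Scalars hold more precision than locals, although that is perhaps unlikely to bite. Note also that ties on one variable may mean indeterminacy on the other variable.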
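A short sketch of what goes wrong if the saved results are not copied first. It assumes nothing beyond the auto data and summarize; the scalar name p50 is just illustrative.

```stata
sysuse auto, clear
summarize weight, detail
scalar p50 = r(p50)
* r(p75) is still available at this point
display r(p75)
summarize weight if weight >= scalar(p50), meanonly
* the second summarize has replaced r(), so r(p75) is now missing
display r(p75)
```

The scalar p50 survives the second call, which is why the quartiles are stored in scalars before any further summarize is run.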

An alternative is to devise your own definitions of quartiles that always are single values and select them after sorting the data. But watch out for missings.
Kyle Billings

Join Date: May 2014

Posts: 12
#8

26 May 2014, 17:49

I'm working on a little Mata script to do a cubic spline. The idea was to build the spline with nodes from the IQR. I know that mkspline and spline3 already exist; this is more to understand the language of Stata/Mata.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

26 May 2014, 18:00

A small comment about terminology: I think "iqr" or "IQR" is most commonly understood to mean the difference between the quartiles. At a big stretch it's the lower and upper quartiles themselves as a pair of values, but it's not ever, in my experience, interpreted as the triple (lower quartile, median, upper quartile).
Kyle Billings

Join Date: May 2014

Posts: 12
#10

26 May 2014, 18:07

True, I'm playing a little fast and loose with terminology and not using my words correctly. I'm not using the iqr, I'm using the min, lower quartile, median, upper quartile and max.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#11

26 May 2014, 18:18

Here is my second suggestion implemented. I ignore what you might want to do with weights, meaning frequency weights or some other kind.

Code:

sysuse auto, clear
* sort puts any missing values last
sort weight
* r(N) is the number of nonmissing values of weight
count if weight < .
l weight if inlist(_n, 1, ceil(r(N)/4), ceil(r(N)/2), ceil(3 * r(N)/4), r(N))

Kyle Billings

Join Date: May 2014
Posts: 12

#12

27 May 2014, 03:19

Thank you for all of your help today. I finally came to something that I think works nicely (for what I need) and can be generalized to a .ado file for this project.

Code:

sysuse auto, clear
preserve
sort weight
count if weight<.
keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4)  | _n==_N
gen X = weight
gen Y = mpg
list X Y 
/* at this point I will send X and Y to mata for the cubic spline 
routine that I am in the process of writing. It was this little step that 
was bugging me. */

restore


Nick Cox

Join Date: Mar 2014

Posts: 35698
#13

27 May 2014, 04:02

Your code assumes that there aren't missings on mpg either. That's true for the auto dataset, but if you want to do this generally for two variables, you will need more robust code to cope with the problem that missings may be present on just one variable in some observations. You might also need to worry about reproducibility.

Code:

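* keep only observations complete on both variables
drop if missing(x1, x2)
* sort on both variables so ties on x1 are ordered by x2
sort x1 x2
keep if inlist(_n, 1, ceil(_N/4), ceil(_N/2), ceil(3 * _N/4), _N)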
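As a sketch of the same idea on the auto data, with weight and mpg standing in for x1 and x2: the stable option of sort is added here as one possible guard on reproducibility, so that observations tied on both variables keep their original relative order. That addition is an assumption about what reproducibility requires in a given project, not part of the code above.

```stata
sysuse auto, clear
* drop observations missing on either variable
drop if missing(weight, mpg)
* stable keeps the existing order among observations tied on weight and mpg
sort weight mpg, stable
* keep the minimum, the three quartile positions and the maximum
keep if inlist(_n, 1, ceil(_N/4), ceil(_N/2), ceil(3 * _N/4), _N)
list weight mpg
```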

Last edited by Nick Cox; 27 May 2014, 04:36.
