pick 5th from the lowest and 5th from the largest value

Nishan Lamichhane

Join Date: Nov 2021
Posts: 36


30 Mar 2023, 09:23

Hello statisticians,

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float pred_val
.00057271216
.00057271216
.00057730306
 .0005819307
 .0005819307
 .0005819307
 .0005819307
 .0005819307
 .0005819307
 .0005819307
  .000585738
  .000585738
  .000585738
  .000585738
  .000585738
  .000585738
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005865954
 .0005904333
 .0005904333
 .0005904333
 .0005904333
 .0005904333
 .0005904333
 .0005904333
 .0005904333
 .0005904333
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
 .0005912975
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
  .000595166
 .0005960373
 .0005960373
 .0005960373
 .0005960373
 .0005960373
 .0005960373
end

sort pred_val
levelsof pred_val in 1/800, matrow(low5)
mat li low5
local low5=low5[5,1]
di `low5'

***How to do this for the highest one?

I want to create two macros: one holding the 5th lowest value of pred_val and one holding the 5th highest.
I tried to rank the values, but rank wouldn't help because some values in pred_val are tied (not unique).

I tried levelsof, matrow(x), but that only helps with the 5th lowest, not the 5th highest.

Can you please help?


Joseph Coveney

Join Date: Apr 2014
Posts: 4398

30 Mar 2023, 09:46

Try something along these lines.

Code:

sort pred_val
local low5 = pred_val[5]
local hi5 = pred_val[_N-5]
display in smcl as text `low5', `hi5'
// cf.
summarize pred_val, detail


Andrew Musau

Join Date: Oct 2014
Posts: 10187

30 Mar 2023, 09:52

Here is a way using frames.

Code:

frame put pred_val, into(rank)
frame rank{
    contract pred_val, nomiss
    sort pred_val
    di "5th lowest is" pred_val[5]
    di "5th highest is" pred_val[`=_N'-5]
}

Res.:

Code:

.     di "5th lowest is" pred_val[5]
5th lowest is.0005866
. 
.     di "5th highest is" pred_val[`=_N'-5]
5th highest is.00058574


Nishan Lamichhane

Join Date: Nov 2021

Posts: 36
#4

30 Mar 2023, 09:54

Thank you, Joseph, but I have thousands of observations, and pred_val[5] will only pick the 5th observation of pred_val. The first two observations are repeated, so we have to rank them somehow, right?
Nishan Lamichhane

Join Date: Nov 2021

Posts: 36
#5

30 Mar 2023, 09:57

Thank you, Andrew (#3). But I am picking these numbers for many samples (bootstrapping), and in many of them the lowest three values are repeated.
Andrew Musau

Join Date: Oct 2014

Posts: 10187
#6

30 Mar 2023, 10:01

If you are looping, you will need to create the frame each time, so drop the frame once you have extracted the value. contract ensures that each distinct value appears only once.

Code:

frame put pred_val, into(rank)
frame rank{
    contract pred_val, nomiss
    sort pred_val
    di "5th lowest is" pred_val[5]
    di "5th highest is" pred_val[`=_N'-5]
}
frame drop rank
Nick Cox

Join Date: Mar 2014

Posts: 35645
#7

30 Mar 2023, 10:29

Suppose you have 100 observations. Then counting upwards 100 = 1 from bottom, 99 = 2, 98 = 3, 97 = 4, 96 = 5.

So, subtract 4 from _N, not 5.

Generic name: fencepost error.
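The counting rule above generalizes: the k-th value from the top sits at position _N - (k - 1). A minimal sketch of that arithmetic follows, using toy values (hypothetical, not the data from #1) and contract so that only distinct values are counted. The macro names k, lowk, and hik are made up for illustration.

```stata
* Toy data: hypothetical values, not the thread's dataset
clear
input float pred_val
1
1
2
3
3
3
4
5
6
6
7
end

local k = 3                            // position to pick from each end

preserve
contract pred_val, nomiss              // one row per distinct value
sort pred_val
local lowk = pred_val[`k']             // k-th lowest distinct value
local hik  = pred_val[_N - `k' + 1]    // k-th highest: _N - (k - 1), not _N - k
restore

display "k-th lowest: `lowk'   k-th highest: `hik'"
```

With k = 5 the upper subscript becomes _N - 4, which matches the correction above.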
Andrew Musau

Join Date: Oct 2014

Posts: 10187
#8

30 Mar 2023, 11:13

Good point Nick Cox!
Nishan Lamichhane

Join Date: Nov 2021

Posts: 36
#9

30 Mar 2023, 11:51

Thank you very much Nick and Andrew

Last edited by Nishan Lamichhane; 30 Mar 2023, 12:20.
Joseph Coveney

Join Date: Apr 2014

Posts: 4398
#10

30 Mar 2023, 18:32

Originally posted by Nick Cox

. . .subtract 4 from _N, not 5.

Yep. Good catch.
Nick Cox

Join Date: Mar 2014

Posts: 35645
#11

31 Mar 2023, 00:46

Revisiting #1

I tried to rank it but rank wouldn't help as some values in pred_val are equal or not unique.

This is why the unique option is provided for egen, rank().

In principle, rank() therefore remains useful. Here is some typical code.

Code:

egen rank1 = rank(pred_val), unique
egen rank2 = rank(-pred_val), unique
su pred_val if rank1==5
scalar low5th = r(max)
su pred_val if rank2==5
scalar high5th = r(max)
Nishan Lamichhane

Join Date: Nov 2021

Posts: 36
#12

31 Mar 2023, 01:47

Thank you, Nick. But if I use egen rank1 = rank(pred_val), unique, then the 5th smallest number would be the pred_val corresponding to rank 17. The unique option seems to work only if all the observations in pred_val are unique. I am a newbie in Stata, so please correct me if I misunderstood your point.

Nick Cox

Join Date: Mar 2014
Posts: 35645

#13

31 Mar 2023, 03:14

The unique option to the rank() function insists that the results are to be unique (each rank will occur once only). It's not a claim about the uniqueness of original values. Otherwise you will never assign rank 5 or -5 (equivalent to sample size - 4) in your case.

Code:

. egen rank = rank(pred_val)

. egen rank1 = rank(pred_val), unique

.
. list if inrange(_n, 1, 10) | inrange(_n, _N - 9, _N)

     +-------------------------+
     | pred_val   rank   rank1 |
     |-------------------------|
  1. | .0005727    1.5       1 |
  2. | .0005727    1.5       2 |
  3. | .0005773      3       3 |
  4. | .0005819      7       4 |
  5. | .0005819      7       5 |
     |-------------------------|
  6. | .0005819      7       6 |
  7. | .0005819      7       7 |
  8. | .0005819      7       8 |
  9. | .0005819      7       9 |
 10. | .0005819      7      10 |
     |-------------------------|
 91. | .0005952     83      91 |
 92. | .0005952     83      92 |
 93. | .0005952     83      93 |
 94. | .0005952     83      94 |
 95. |  .000596   97.5      95 |
     |-------------------------|
 96. |  .000596   97.5      96 |
 97. |  .000596   97.5      97 |
 98. |  .000596   97.5      98 |
 99. |  .000596   97.5      99 |
100. |  .000596   97.5     100 |
     +-------------------------+

.
See dm51 in https://www.stata.com/products/stb/journals/stb51.pdf, for the original formulation. This functionality was folded later into official Stata.

Naturally in your problem with so many ties the idea of rank is moot in any case, but my suggestion is that so-called unique ranks are the best you can do.
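The thread uses two different readings of "5th from the lowest": the 5th observation in sorted order (what subscripting and unique ranks give) and the 5th distinct value (what contract or levelsof give). A rough sketch contrasting the two on toy values (hypothetical, not the data from #1) is below. It assumes a Stata version in which levelsof accepts matrow(), as in #1, and the macro and matrix names are invented for the example.

```stata
* Toy data: hypothetical values, not the thread's dataset
clear
input float pred_val
1
1
2
3
3
3
4
5
6
6
7
end

local k = 3

* Reading 1: k-th observation from each end, tied values counted separately
sort pred_val
local low_obs = pred_val[`k']
local hi_obs  = pred_val[_N - `k' + 1]

* Reading 2: k-th distinct value from each end
quietly levelsof pred_val, matrow(vals)    // column vector of distinct values
local top = rowsof(vals) - `k' + 1         // row of the k-th highest level
local low_val = vals[`k', 1]
local hi_val  = vals[`top', 1]

display "by observation:    `low_obs' and `hi_obs'"
display "by distinct value: `low_val' and `hi_val'"
```

With many ties the two readings can differ a lot, so it is worth deciding which one is wanted before looping over bootstrap samples.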
