  • pick 5th from the lowest and 5th from the largest value

    Hello statisticians,


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float pred_val
    .00057271216
    .00057271216
    .00057730306
     .0005819307
     .0005819307
     .0005819307
     .0005819307
     .0005819307
     .0005819307
     .0005819307
      .000585738
      .000585738
      .000585738
      .000585738
      .000585738
      .000585738
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005865954
     .0005904333
     .0005904333
     .0005904333
     .0005904333
     .0005904333
     .0005904333
     .0005904333
     .0005904333
     .0005904333
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
     .0005912975
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
      .000595166
     .0005960373
     .0005960373
     .0005960373
     .0005960373
     .0005960373
     .0005960373
    end

    sort pred_val
    levelsof pred_val in 1/800, matrow(low5)
    mat li low5
    local low5=low5[5,1]
    di `low5'
    
    ***How to do this for the highest one?

    I want to create two macros: one holding the 5th lowest value of pred_val and one holding the 5th highest value of pred_val.
    I tried to rank it but rank wouldn't help as some values in pred_val are equal or not unique.

    I tried levelsof, matrow(x), but it only helps with the 5th lowest value, not the 5th highest.



    Can you please help?

  • #2
    Try something along these lines.
    Code:
    sort pred_val
    local low5 = pred_val[5]
    local hi5 = pred_val[_N-5]
    display in smcl as text `low5', `hi5'
    // cf.
    summarize pred_val, detail



    • #3
      Here is a way using frames.

      Code:
      frame put pred_val, into(rank)
      frame rank{
          contract pred_val, nomiss
          sort pred_val
          di "5th lowest is" pred_val[5]
          di "5th highest is" pred_val[`=_N'-5]
      }
      Res.:

      Code:
      .     di "5th lowest is" pred_val[5]
      5th lowest is.0005866
      . 
      .     di "5th highest is" pred_val[`=_N'-5]
      5th highest is.00058574



      • #4
        Thank you Joseph, but I have thousands of observations, and pred_val[5] will only pick the 5th observation of pred_val. The first 2 observations are repeated. So we have to somehow rank them, right?



        • #5
          Thank you Andrew (#3). But I have been picking these numbers for various samples (bootstrapping), and in many instances the lowest 3 values are repeated.



          • #6
            If you are looping, you will need to create the frame each time, so drop the frame once you have extracted the value. contract ensures that each distinct value appears only once.

            Code:
            frame put pred_val, into(rank)
            frame rank{
                contract pred_val, nomiss
                sort pred_val
                di "5th lowest is" pred_val[5]
                di "5th highest is" pred_val[`=_N'-5]
            }
            frame drop rank



            • #7
              Suppose you have 100 sorted observations. Then, counting from the highest value (the bottom of the sorted list), observation 100 is 1st, 99 is 2nd, 98 is 3rd, 97 is 4th, and 96 is 5th.

              So, subtract 4 from _N, not 5.

              Generic name: fencepost error.
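
              One way to see the arithmetic: in a sorted list of N values, the k-th highest sits at position N - k + 1, so k = 5 gives N - 4. Below is a minimal sketch that applies this to the distinct values of pred_val, assuming the example data from #1 are in memory. The local names k, low5 and hi5 are illustrative only.

              ```stata
              * Sketch: k-th lowest and k-th highest distinct values of pred_val
              preserve
              contract pred_val, nomiss        // keep one observation per distinct value
              sort pred_val
              local k = 5
              local low5 = pred_val[`k']            // k-th from the bottom
              local hi5  = pred_val[_N - `k' + 1]   // k-th from the top, i.e. _N - 4 when k = 5
              restore
              display "5th lowest distinct value:  " `low5'
              display "5th highest distinct value: " `hi5'
              ```

              The example data in #1 have 9 distinct values, so both macros end up pointing at position 5 of the contracted data.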



              • #8
                Good point Nick Cox!



                • #9
                  Thank you very much Nick and Andrew
                  Last edited by Nishan Lamichhane; 30 Mar 2023, 12:20.



                  • #10
                    Originally posted by Nick Cox
                    . . .subtract 4 from _N, not 5.
                    Yep. Good catch.



                    • #11
                      Revisiting #1

                      I tried to rank it but rank wouldn't help as some values in pred_val are equal or not unique.
                      This is why the unique option is provided for egen, rank().

                      In principle, rank() therefore remains useful. Here is some typical code.

                      Code:
                      egen rank1 = rank(pred_val), unique 
                      egen rank2 = rank(-pred_val), unique 
                      
                      su pred_val if rank1==5 
                      scalar low5th = r(max)
                      su pred_val if rank2==5 
                      scalar high5th = r(max)



                      • #12

                        Thank you Nick. But if I use egen rank1 = rank(pred_val), unique then the 5th smallest number would be the pred_val corresponding to rank 17. The unique option seems to work if all the observations in pred_val are unique. I am a newbie in Stata, so please correct me if I have misunderstood your point.



                        • #13
                          The unique option to the rank() function insists that the results are to be unique (each rank will occur once only). It's not a claim about the uniqueness of original values. Otherwise you will never assign rank 5 or -5 (equivalent to sample size - 4) in your case.

                          Code:
                          . egen rank = rank(pred_val)
                          
                          . egen rank1 = rank(pred_val), unique
                          
                          .
                          . list if inrange(_n, 1, 10) | inrange(_n, _N - 9, _N)
                          
                               +-------------------------+
                               | pred_val   rank   rank1 |
                               |-------------------------|
                            1. | .0005727    1.5       1 |
                            2. | .0005727    1.5       2 |
                            3. | .0005773      3       3 |
                            4. | .0005819      7       4 |
                            5. | .0005819      7       5 |
                               |-------------------------|
                            6. | .0005819      7       6 |
                            7. | .0005819      7       7 |
                            8. | .0005819      7       8 |
                            9. | .0005819      7       9 |
                           10. | .0005819      7      10 |
                               |-------------------------|
                           91. | .0005952     83      91 |
                           92. | .0005952     83      92 |
                           93. | .0005952     83      93 |
                           94. | .0005952     83      94 |
                           95. |  .000596   97.5      95 |
                               |-------------------------|
                           96. |  .000596   97.5      96 |
                           97. |  .000596   97.5      97 |
                           98. |  .000596   97.5      98 |
                           99. |  .000596   97.5      99 |
                          100. |  .000596   97.5     100 |
                               +-------------------------+
                          .
                          See dm51 in https://www.stata.com/products/stb/journals/stb51.pdf, for the original formulation. This functionality was folded later into official Stata.

                          Naturally in your problem with so many ties the idea of rank is moot in any case, but my suggestion is that so-called unique ranks are the best you can do.
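
                          As a footnote to the levelsof attempt in #1: the matrix saved by matrow() holds every distinct value in ascending order, so the same N - 4 indexing reaches the 5th highest distinct value. This is a minimal sketch, assuming the number of distinct values fits within Stata's matrix size limit. The matrix name lev and the local names are illustrative only.

                          ```stata
                          * Sketch: 5th lowest and 5th highest distinct values via levelsof
                          quietly levelsof pred_val, matrow(lev)   // lev: r x 1 matrix of distinct values
                          local r = rowsof(lev)                    // number of distinct values
                          local low5 = lev[5, 1]
                          local hi5  = lev[`r' - 4, 1]             // subtract 4, not 5
                          display "5th lowest: " `low5' "   5th highest: " `hi5'
                          ```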

