  • Finding nearest neighbor of a percentile...

    Suppose that I am using the system data set auto and I only want to look at two variables, let's say weight as the independent variable and mpg as the dependent variable. Additionally, let's say I want to compress the data set down to just 5 observations, and those 5 are the minimum, the quartiles, the median and the maximum, or the next closest thing. I know how to get these and set local macros for the min, max, r(p25), r(p50), r(p75), but there are times when those percentiles are not actually elements of the vector (i.e., the median of weight is 3190, and that is not an element of weight). Is there a way I can tell Stata that, if a percentile is not in the vector, it should use the closest observation above it? I need to get down to 5 observations and 5 responses for this project that I am doing....

  • #2
    Cross-posted at http://stackoverflow.com/questions/2...-of-percentile

    Posters are encouraged to read the FAQ before posting. Among other details, it explains our policy about cross-posting, which is that you should tell us about it.


    • #3
      My apologies for not referencing the link at Stack Overflow where I cross-posted. Although I glanced over the FAQ, I clearly missed the section on cross-posting.


      • #4
        Noted. Also, full real names are requested here, such as Kyle Billings.


        • #5
          I have already contacted the admin to change my name. Hopefully changed soon.


          • #6
            Here's code that might get you started (but which shows up a problem: there may be more than one nearest neighbour):
            Code:
            sysuse auto
            centile mpg, c(25 50 75)
            gen diff1 = abs(mpg - r(c_1))
            egen d1min = min(diff1)
            list make if diff1 == d1min
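
To see how often that problem arises, here is a minimal sketch that extends the same idea to all three centiles and counts the nearest neighbours of each. The scalar names c1 to c3 and dmin and the variable names diff1 to diff3 are illustrative choices, not anything the commands require.

```stata
sysuse auto, clear
centile mpg, centile(25 50 75)

* save the centiles before any other r-class command overwrites r()
forvalues i = 1/3 {
    scalar c`i' = r(c_`i')
}

forvalues i = 1/3 {
    * absolute distance of each observation from centile i
    generate double diff`i' = abs(mpg - scalar(c`i'))
    summarize diff`i', meanonly
    scalar dmin = r(min)
    * number of observations tied at the smallest distance
    quietly count if diff`i' == scalar(dmin)
    display "centile `i': " r(N) " nearest neighbour(s)"
}
```

Any count above 1 means that some further rule is needed to pick a single observation.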


            • #7
              This sounds a fairly odd thing to have to do. Also, what you ask for implies biased estimation, or selection.

              But as you are aware,

              1. The minimum and maximum are always values in the data.

              2. The median and quartiles may be interpolated between data values and so need not exist as data values.

              These facts may be put together to get alternative quartiles (wide sense).

              Code:
               
              summarize weight, detail 
              scalar p75 = r(p75) 
              scalar p50 = r(p50) 
              scalar p25 = r(p25) 
              su weight if weight >= scalar(p75), meanonly  
              scalar p75 = r(min) 
              su weight if weight >= scalar(p50), meanonly  
              scalar p50 = r(min) 
              su weight if weight >= scalar(p25), meanonly
              scalar p25 = r(min)

              That can be made more concise:

              Code:
              summarize weight, detail 
              foreach p in 25 50 75 { 
                  scalar p`p' = r(p`p') 
              } 
              foreach p in 25 50 75 {
                  su weight if weight >= scalar(p`p'), meanonly 
                  scalar p`p' = r(min) 
              }
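
              Note that those two loops must remain two loops: the first summarize with meanonly overwrites the r(p25), r(p50) and r(p75) results left behind by summarize, detail, so all three must be saved before any of them is replaced. Scalars hold more precision than locals, although that is perhaps unlikely to bite. Note also that ties on one variable may mean indeterminacy on the other variable.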
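
A hedged sketch of how one might check for such ties after the nearest-above values have been found. The first two loops repeat the code already given; the third loop and its display text are illustrative additions.

```stata
sysuse auto, clear
summarize weight, detail
foreach p in 25 50 75 {
    scalar p`p' = r(p`p')
}
foreach p in 25 50 75 {
    summarize weight if weight >= scalar(p`p'), meanonly
    scalar p`p' = r(min)
}

* illustrative addition: how many observations share each selected value?
foreach p in 25 50 75 {
    quietly count if weight == scalar(p`p')
    display "p`p' = " scalar(p`p') ": " r(N) " observation(s)"
    list make weight mpg if weight == scalar(p`p'), noobs
}
```

If more than one observation is listed for a value of weight, the corresponding value of mpg is not uniquely determined without a further rule.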

              An alternative is to devise your own definitions of quartiles that always are single values and select them after sorting the data. But watch out for missings.



              • #8
                I'm working on a little Mata script to do a cubic spline. The idea was to build the spline with nodes from the iqr. I know that mkspline and spline3 already exist; this is more to understand the language of Stata/Mata.


                • #9
                  A small comment about terminology: I think "iqr" or "IQR" is most commonly understood to mean the difference between the quartiles. At a big stretch it's the lower and upper quartiles themselves as a pair of values, but it's not ever, in my experience, interpreted as the triple (lower quartile, median, upper quartile).


                  • #10
                    True, I'm playing a little fast and loose with terminology and not using my words correctly. I'm not using the iqr, I'm using the min, lower quartile, median, upper quartile and max.


                    • #11
                      Here is my second suggestion implemented. I ignore what you might want to do with weights, meaning frequency weights or some other kind.

                      Code:
                      sysuse auto, clear 
                      sort weight
                      count if weight < .
                      l weight if inlist(_n, 1, ceil(r(N)/4), ceil(r(N)/2), ceil(3 * r(N)/4), r(N))


                      • #12
                        Thank you for all of your help today. I finally came to something that I think works nicely (for what I need) and can be generalized to a .ado file for this project.
                        Code:
                        sysuse auto, clear
                        preserve
                        sort weight
                        count if weight<.
                        keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4)  | _n==r(N)
                        gen X = weight
                        gen Y = mpg
                        list X Y 
                        /* at this point I will send X and Y to mata for the cubic spline 
                        routine that I am in the process of writing. It was this little step that 
                        was bugging me. */
                        
                        restore
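
For the hand-off to Mata, one possible sketch is to do the selection inside Mata itself, which avoids preserve and restore. The matrix name XY and the index vector idx are illustrative, and dropping any row with a missing value in either column is an assumption made here, not something the code above does.

```stata
sysuse auto, clear
mata:
// copy the two variables into a Mata matrix: column 1 = weight, column 2 = mpg
XY = st_data(., ("weight", "mpg"))
// assumption: discard rows with a missing value in either column
XY = select(XY, rowmissing(XY) :== 0)
// sort by weight, breaking ties by mpg
XY = sort(XY, (1, 2))
n = rows(XY)
// positions of the minimum, the three "quartile" observations and the maximum
idx = (1 \ ceil(n/4) \ ceil(n/2) \ ceil(3*n/4) \ n)
XY = XY[idx, .]
XY
end
```

The resulting 5 x 2 matrix holds X in its first column and Y in its second, ready for whatever spline routine follows.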


                        • #13
                          Your code assumes that there aren't missings on mpg either. That's true for the auto dataset, but if you want to do this generally for two variables, you will need more robust code to cope with the problem that missings may be present on just one variable in some observations. You might also need to worry about reproducibility.

                          Code:
                          drop if missing(x1, x2)
                          sort x1 x2
                          keep if inlist(_n, 1, ceil(_N/4), ceil(_N/2), ceil(3 * _N/4), _N)
                          Last edited by Nick Cox; 27 May 2014, 04:36.
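
On reproducibility, a minimal sketch under stated assumptions: the locals x1 and x2 stand in for any two numeric variables (weight and mpg from the auto data are used purely as an illustration), and the message text is made up. The stable option of sort keeps tied observations in their existing order, so the same input order gives the same selection on every run.

```stata
sysuse auto, clear
local x1 weight
local x2 mpg

drop if missing(`x1', `x2')

* do x1 and x2 jointly identify the observations?
capture isid `x1' `x2'
if _rc {
    display "ties on `x1' and `x2': order within ties depends on the incoming order"
}

* stable sort: ties stay in the order they already had
sort `x1' `x2', stable
keep if inlist(_n, 1, ceil(_N/4), ceil(_N/2), ceil(3 * _N/4), _N)
list `x1' `x2'
```

Ties on both variables do not change the values of x1 and x2 that are selected, but they can change which observation supplies any other variable that is kept.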
