  • efficiently calculating leave-out median of a variable by groups with weights

    I would like to calculate the leave-out weighted median of a variable within groups. The following code illustrates my problem, calculating the median sale price for all other foreign or domestic cars, for each make of car. It is built on this FAQ answer and this answer to a previous question on Statalist:

    Code:
    sysuse auto, clear
    gen sales = floor(uniform()*100)  // create artificial weight variable
    
    capture drop leave_out_med_sales
    gen leave_out_med_sales = .
    
    capture drop temp
    tempvar temp
    
    
    forvalues i = 1/`=_N' {
        
        qui gen temp = price
        qui replace temp = . if _n == `i'
        
        qui su price [w = sales] if foreign == foreign[`i'], detail
        qui replace leave_out_med_sales = r(p50) if _n == `i'
        drop temp
    
    }
    The problem is that this code is much too slow. I have a dataset of around 20 million observations, so this calculation needs to be done much more quickly. Is there a way I can vectorise this operation, or at least speed it up dramatically?

  • #2
    Your code doesn't do what you say. The temp variable doesn't do anything useful: you modify it repeatedly but never use it in any calculation. In your results you just end up with two distinct medians, one for each distinct value of foreign.

    I think the code below is more along the right lines, and much simpler than what you had. It doesn't address your concern about efficiency, for which I think you need to write custom code for, e.g., rangerun or runby (both from SSC).

    Incidentally, you are confused about temporary variables.

    Code:
    tempvar temp
    creates a variable whose name you refer to as `temp'. It has precisely nothing to do with a variable called temp.
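    A minimal sketch of the distinction (run on any dataset with a price variable):

    Code:
    sysuse auto, clear
    tempvar temp
    gen `temp' = price      // the temporary variable: note the quotes around temp
    gen temp = price        // an ordinary variable that merely happens to be named temp
    describe `temp' temp    // two different variables; the first has a machine-generated name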


    Code:
    sysuse auto, clear
    gen sales = floor(uniform()*100)  // create artificial weight variable
    
    gen leave_out_med_sales = .
    
    quietly forvalues i = 1/`=_N' {
        su price [w = sales] if foreign == foreign[`i'] & _n != `i', detail
        replace leave_out_med_sales = r(p50) in `i'
    }



    • #3
      Hi Sebastien,
      May I ask what it is you are trying to do? When I read your post about the leave-one-out median, two ideas came to mind. First, that you are trying to do a kind of cross-validation, but that uses the LOO error or predicted mean, not the median.
      The other option was something like influence functions. But those involve adding weight to a single observation rather than leaving one out, although I think the same process can be run in reverse.
      Keep in mind that the manual LOO process may leave the median little changed, or unchanged, relative to the overall median if there is a sufficient number of values at the median.
      For example, with the values 1 2 3 3 3 4 5, the leave-one-out median is 3 for every observation. And something tells me this is not what you want.
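      That claim can be checked directly; a quick sketch on those seven values:

      Code:
      clear
      input x
      1
      2
      3
      3
      3
      4
      5
      end
      gen loo_med = .
      quietly forvalues i = 1/`=_N' {
          su x if _n != `i', detail
          replace loo_med = r(p50) in `i'
      }
      list x loo_med  // loo_med is 3 in every row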
      Fernando




      • #4
        This sounds to me like a computationally intensive, potentially pointless exercise. The median is a robust measure of location, and would not change much, if at all, if you have enough members with distinct values in the group within which you are calculating the leave-one-out median. (Or as Fernando puts it, "the median having little to no change compared to the overall median if there are sufficient number of values.")

        As for speed, you might be able to gain a bit by using -_pctile- instead of -summarize, detail-. I have not tested this, and you should test it before making the change, but logically -_pctile- should be faster than -summarize- because the former computes only the one thing you need, while -summarize, detail- calculates plenty of things.

        So you can try replacing the section in Nick's code:

        Code:
        quietly forvalues i = 1/`=_N' {
            su price [w = sales] if foreign == foreign[`i'] & _n != `i', detail
            replace leave_out_med_sales = r(p50) in `i'
        }
        by

        Code:
        quietly forvalues i = 1/`=_N' {
            _pctile price [w = sales] if foreign == foreign[`i'] & _n != `i', p(50)
            replace leave_out_med_sales = r(r1) in `i'
        }



        • #5
          Thanks everyone for your replies.

          To answer Fernando first, what I really want is the following. I have data on median wages and total employment at the firm level, and I want to calculate the leave-out median wage in a district. I want to do this to be able to relate changes in the wages a firm pays to changes in wages at the local (district) level.

          As Fernando and Joro point out, it is possible that the leave-out median wage and the median wage will be pretty similar, so I might just go ahead and use the median wage rather than the leave-out median wage.

          Thanks Nick for clearing up my confusion about how -tempvar- works, and thanks to Nick and Joro for your suggestions on how to improve my existing code. For interest, it seems -_pctile- is indeed much faster than -summarize-: 0.004 seconds versus 0.102 in my example.
          Last edited by Sebastien Willis; 21 Jan 2019, 07:21. Reason: added timings for _pctile



          • #6
            By all means just go with the overall median. In a district there are enough distinct observations that leaving one out would make virtually no difference. You can verify this for yourself: pick a couple of districts, save a reduced data file, and run the code above on that reduced file to check that the leave-one-out and standard medians are close.
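            If the plain within-group weighted median is good enough, it can be computed in a single pass with -collapse- and -merge- rather than an observation-by-observation loop. A sketch on the auto data (variable names are illustrative):

            Code:
            sysuse auto, clear
            gen sales = floor(runiform()*100)   // artificial weight variable
            preserve
            collapse (p50) med_price = price [aw = sales], by(foreign)
            tempfile medians
            save `medians'
            restore
            merge m:1 foreign using `medians', nogenerate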

