  • Random sorting before 1:1 matching?

    Dear all, This question is mostly about the concept of propensity score matching (PSM). The following passage is taken from the last paragraph of page 210 of the interesting book "Propensity score analysis: Statistical methods and applications (https://www.stata.com/bookstore/prop...core-analysis/)" (2nd Edition, by Shenyang Y. Guo and Mark W. Fraser):
    (Note: please run ssc install psmatch2 first.)
    HTML Code:
    A few cautionary statements about running psmatch2 are worth mentioning.
    
    When one treated case is found, several nontreated cases—each of which has the same value of propensity score—may be tied.
    
    In a 1-to-1 match, identifying which of the tied cases was the matched case depends on the order of the data. 
    I agree so far. However, they go on:
    HTML Code:
    Thus, it is important to first create a random variable and then sort data using this variable.
    with which I do not agree (I really don't think that this last step is necessary). In particular, they suggest doing something like this:
    Code:
    set seed 10101
    gen ranorder=runiform()
    sort ranorder
    before one does the following
    Code:
    psmatch2 treated x1 x2 ..., common n(1)
    My argument is this: Suppose that one treated observation T with propensity score PS = 0.8 is tied with three control observations (C1, C2, and C3, in that raw ordering, all with PS = 0.8 as well). If we do not randomly sort the data, C1 will be picked in the 1:1 matching. Suppose that, following their suggestion, we generate a random number and sort the raw data on it, and that C3 is now picked. Since C1, C2, and C3 are equally good matches (all have PS = 0.8), and the raw order is itself just one realization from the population, I don't see why C3 would be a better choice than C1, the match obtained without sorting.

    I really appreciate any comments from all of you. Thanks!

    BTW, does anyone know whether Stata's teffects psmatch command does this kind of random sorting before matching (with option nn(1))?
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    I am not familiar with the reference you are citing, so I can't be sure that what I say here is what they intend. But here's my guess:

    I suspect that if you do not create a random number and then sort on it, then, to use your example, when you run -psmatch2- you will sometimes get C1, sometimes get C2, and sometimes get C3. That is, the results will be indeterminate, and irreproducible. So I think it is not so much that any one of C1, C2, or C3 is better than the others. I think it's a matter of assuring that each time you run the code with the same data you get the same result. By setting the random number seed and creating a specific sort order based on it, you assure that.

    That said, their way of doing it is inadequate for large data sets. When you use -gen ranorder = runiform()-, ranorder is stored as a float. If your data set contains more than, say, half a million observations, there is an appreciable chance that two values of ranorder will be the same (because of the limited precision of a float), so even this sort order could produce indeterminate and irreproducible results. When creating a sort order to avoid indeterminacy, unless the data set is small, it is safer to use -gen double ranorder = runiform()-. In fact, if the data set contains several tens of millions of observations, even that could leave things indeterminate, and one would need to generate two double-precision random numbers and sort on the pair of them.
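
    A minimal sketch of that safer version, assuming the seed 10101 from the book's example; the variable names ranorder1 and ranorder2 are hypothetical. It stores the random keys in double precision, uses -isid- to confirm that the pair uniquely identifies every observation, and then sorts on the pair:

    ```stata
    * Sketch only: ranorder1 and ranorder2 are illustrative names
    set seed 10101
    generate double ranorder1 = runiform()
    generate double ranorder2 = runiform()

    * isid stops with an error if the two keys together
    * do not uniquely identify the observations
    isid ranorder1 ranorder2

    * With a unique key, the resulting sort order has no ties to break
    sort ranorder1 ranorder2
    ```

    If -isid- reports an error, the keys contain ties and the sort order is still not fully determined.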

    • #3
      Dear Clyde, Thanks for your reply.
      1. Your explanation makes sense to me. (However, shouldn't I get the same results each time if I do not sort the data at all, i.e., if I use the raw data in its raw order?)
      2. I see your point (I was not aware of this). Fortunately, I don't have that many observations (matching on so many would take a very long time, I think).
      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)

      • #4
        Well, I think that if the code up to that point leaves a uniquely determined sort order on the data, then the random sorting would not be necessary. The problem is that often the code shuffles data around in various ways that we don't necessarily think about, so I think they are just being cautious about it. I think that if you know for sure that the sort order of the data is uniquely determined at that point, then you can skip the random sorting.
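
        One way to check this, sketched under the assumption that the data contain an identifier variable (called id here, a hypothetical name), is to confirm that the identifier is unique and then sort on it before matching:

        ```stata
        * Sketch only: id is a hypothetical identifier variable
        * isid stops with an error if id does not uniquely identify observations
        isid id

        * Sorting on a unique key leaves no ties, so the order is fully determined
        sort id
        ```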

        • #5
          Dear Clyde, Got it and thanks again.
          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)
