  • strange behaviour of -merge- with option -keep- when option -assert- fails

    Consider this toy situation:
    Code:
    clear
    input byte(id num)
    1 10
    2 20
    3 30
    4 40
    end
    
    tempfile using
    save `using'
    
    clear
    input byte(id numnum)
    1 1
    2 2
    5 5
    end
    Now if I run

    Code:
    merge 1:1 id using `using', assert(1 3) keep(3)
    where the assertion of course fails, the resulting dataset looks like this:
    Code:
    . list
    
         +-------------------------------------+
         | id   numnum   num            _merge |
         |-------------------------------------|
      1. |  1        1    10       matched (3) |
      2. |  2        2    20       matched (3) |
      3. |  5        5     .   master only (1) |
         +-------------------------------------+
    That is, it has not obeyed my desire to keep only the _merge == 3 observations, but it has chosen to drop the _merge == 2 observations while retaining the _merge == 1 observation.

    On the other hand, if I had done just
    Code:
    merge 1:1 id using `using', assert(1 3)
    then we get
    Code:
    . list
    
         +-------------------------------------+
         | id   numnum   num            _merge |
         |-------------------------------------|
      1. |  1        1    10       matched (3) |
      2. |  2        2    20       matched (3) |
      3. |  5        5     .   master only (1) |
      4. |  3        .    30    using only (2) |
      5. |  4        .    40    using only (2) |
         +-------------------------------------+
    i.e. Stata keeps all observations, matched or unmatched in any direction.

    This is not making sense to me. Either the presence of the -keep- option should have made Stata only keep the desired result, or it should keep all observations (as it does when -keep- is not specified).

    I am on Stata 18/MP, but I have tested this on the most updated version of Stata 16/MP, and I see the same behaviour. I don't recall ever encountering this before, so I don't know if this is the result of some relatively recent update to -merge-.

  • #2
    I don't have a good answer. From the manual, I would agree that assert is evaluated first, and then any filtering by keep is applied.

    assert() and keep() are convenience options whose functionality can be duplicated using _merge directly.

    . merge ..., assert(match master) keep(match)

    is identical to

    . merge ...
    . assert _merge==1 | _merge==3
    . keep if _merge==3
    I suppose one way to look at this is that Stata's error message is taken to mean that, while it leaves the merge result in memory, it is not guaranteed to conform to any request of the command.



    • #3
      I would argue that Stata's current behaviour is both unexpected and unhelpful.

      Code:
      . merge 1:1 id using `using', assert(1 3) keep(3)
      after merge, not all observations from master or matched
      (merged result left in memory)
      r(9);
      Notice that the assertion failed because there are some "using only" (_merge == 2) observations. Stata's behaviour is unexpected, because the "merged result" has not actually been left in memory -- a part of it has disappeared. Indeed, this is the part you are likeliest to want to inspect whilst troubleshooting -- the errant _merge == 2 observations. Ergo, unhelpful. Stata should simply retain all observations.
      Last edited by Hemanshu Kumar; 26 Jun 2023, 20:42.



      • #4
        assert() and keep() are convenience options whose functionality can be duplicated using _merge directly.

        . merge ..., assert(match master) keep(match)

        is identical to

        . merge ...
        . assert _merge==1 | _merge==3
        . keep if _merge==3
        Thanks for this quote, Leonardo. Indeed, this raises the further point that the behaviour is inconsistent with the manual. The two sets of commands are not identical. If I follow the second path, the code will break on the assert step, at which point I will have a dataset with all the observations, which will help me troubleshoot the failed assertion.
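
        A minimal sketch of that second path, run on the toy data from the first post. It assumes only standard Stata commands (capture, assert, inlist()); the if/else branching is an illustrative choice, not something prescribed by the manual. Because merge is run without assert() or keep(), every observation is still in memory when the hand-written assertion fails.

```stata
* Rebuild the toy using and master datasets from the first post
clear
input byte(id num)
1 10
2 20
3 30
4 40
end
tempfile using
save `using'

clear
input byte(id numnum)
1 1
2 2
5 5
end

* Merge with neither assert() nor keep(), so nothing is dropped
merge 1:1 id using `using'

* Check the assertion by hand; capture stores the return code in _rc
* instead of stopping the do-file
capture assert inlist(_merge, 1, 3)
if _rc {
    * Assertion failed: the using-only observations are still available
    list if _merge == 2
}
else {
    * Assertion held: now apply the equivalent of keep(3)
    keep if _merge == 3
}
```

        With this toy data the assertion fails, so the listing shows the two using-only observations (id 3 and 4) and the full five-observation merged dataset remains in memory.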



        • #5
          To add some context, Stata used to evaluate the keep() before the assert().

          Code:
          . version
          version 14.2
          
          . list
          
               +---------------------------------+
               | id   numnum   num        _merge |
               |---------------------------------|
            1. |  1        1    10   matched (3) |
            2. |  2        2    20   matched (3) |
               +---------------------------------+
          This was fixed with 15.1 (help whatsnew15).

          Code:
              26. merge with options keep() and assert() did not always verify the required match results before keeping the
                  requested observations.  This could result in merge not reporting an error when it should have.  This has
                  been fixed.
          That said, I agree that the current behavior is unexpected and unhelpful, and would escalate this to tech support. They might claim that this is "undefined behavior", but I think you have a good argument that at the very least the error message and the manual are inconsistent with the behavior.



          • #6
            Thanks for that info, Nils. I have sent off an email to Tech Support, pointing them to this thread.



            • #7
              To close the loop on this, Stata 17 update 29aug2023 contains:

              3. merge with options keep() and assert(), when the results of the merge failed to match option assert(), would not leave unmatched data from the using dataset in memory for inspection after the failed merge. This has been fixed.
