
  • unintended behavior of -ereplace- when combined with an if statement

    I was using -ereplace- today with if statements. (See also its introduction on Statalist here.)

    I noticed that -ereplace- was not only replacing values for observations that meet the conditions of the if statement (what I wanted); it was also changing the variable to missing for all observations that did not meet the conditions of the if statement (not what I wanted!). This seems like unintended behavior, as -replace- doesn't do this and -ereplace- is intended to be identical to -replace- except that it works with egen functions.

    Here is a minimum working example loosely based on my dataset:

    Code:
    clear
    input float var1 str1 var2 float var3
    1 "A" 0
    2 "A" 0
    3 "A" 0
    4 "B" 0
    5 "B" 0
    6 "B" 0
    7 "C" 1
    8 "C" 1
    9 "C" 1
    10 "D" 1
    11 "D" 1
    12 "D" 1
    end
    Suppose I want to generate a new variable that follows one rule for those observations which take the value 0 on var3 and a different rule for those observations that take the value 1.

    Code:
    egen var4=concat(var1 var2) if var3==0, punct("-")
    (6 missing values generated)
    My data now look like:
    var1 var2 var3 var4
    1 A 0 1-A
    2 A 0 2-A
    3 A 0 3-A
    4 B 0 4-B
    5 B 0 5-B
    6 B 0 6-B
    7 C 1
    8 C 1
    9 C 1
    10 D 1
    11 D 1
    12 D 1
    Now to replace the missing values of var4 with what I want for those observations which take on the value 1 for var3:

    Code:
    ssc install ereplace
    ereplace var4=concat(var1 var2 var3) if var3==1, punct("-")
    (6 missing values generated)
    (12 real changes made)
    My data now look like:
    var1 var2 var3 var4
    1 A 0
    2 A 0
    3 A 0
    4 B 0
    5 B 0
    6 B 0
    7 C 1 7-C-1
    8 C 1 8-C-1
    9 C 1 9-C-1
    10 D 1 10-D-1
    11 D 1 11-D-1
    12 D 1 12-D-1
    Why did -ereplace- modify var4 for obs. 1 through 6? Is this intended behavior of -ereplace- or a bug?

    I am assuming only the authors will be able to answer this question, but if anyone knows how I might be using -ereplace- wrong, your advice is appreciated.

    (I have already found an alternate approach that solves my problem, so no need to suggest alternate solutions.)

  • #2
    Welcome to Statalist.

    You've done nothing wrong. I installed ereplace and opened it in the do-file editor:
    Code:
    doedit "`c(sysdir_plus)'/e/ereplace.ado"
    The next-to-last line is
    Code:
            replace `name' = `dummy'
    which I edited to
    Code:
            replace `name' = `dummy' `if' `in'
    This solves the problem. Using your example data (thank you for providing it, and for using dataex to do so!)
    Code:
    . egen var4=concat(var1 var2) if var3==0, punct("-")
    (6 missing values generated)
    
    . ereplace var4=concat(var1 var2 var3) if var3==1, punct("-")
    (6 missing values generated)
    variable var4 was str4 now str6
    (6 real changes made)
    
    . list, clean
    
           var1   var2   var3     var4  
      1.      1      A      0      1-A  
      2.      2      A      0      2-A  
      3.      3      A      0      3-A  
      4.      4      B      0      4-B  
      5.      5      B      0      5-B  
      6.      6      B      0      6-B  
      7.      7      C      1    7-C-1  
      8.      8      C      1    8-C-1  
      9.      9      C      1    9-C-1  
     10.     10      D      1   10-D-1  
     11.     11      D      1   11-D-1  
     12.     12      D      1   12-D-1



    • #3
      Okay, great, and thanks a bunch, Will! This edit to -ereplace- will be a lot more convenient for me in the future.

      If Nick or Chris doesn't see this thread perhaps I will go leave a note in the original thread to let them know of your debugging.

      (thank you for providing it, and for using dataex to do so!)
      I may be a newly registered user, but I've been reading Statalist long enough to learn the value of providing an example.
      Last edited by Andrew Benson; 08 Oct 2019, 20:51.



      • #4
        I'm a little worried about the solution in #2. Clearly the original code is wrong and fails to account for the -if- and -in- conditions. But I worry about implementing the support for -if- and -in- in the way shown in #2. My worry is that there may be functions in -egen- or -egenmore- that alter the sort order of the observations. If that is the case, the application of `in' at the very end can cause the wrong observations to be used.

        Generally, if one wants a program to support [if] and [in], at the very beginning of the program you identify the included observations, typically with a command like -marksample touse-, which creates a tempvar named `touse' that indicates the included observations. Then, at the end, you would -replace `name' = `dummy' if `touse'-. That approach will target the correct subset of observations for replacement even if the sort order of the data has changed, because `touse' is a variable and moves with its observations.

        Unlike William Lisowski, I'm not ambitious enough to install -ereplace- and upgrade the code to include this tonight. So I don't know if even more extensive changes would be needed to ereplace.ado to properly implement this approach. At a minimum, every place the code does show -`if' `in'-, that would be replaced by -if `touse'-.
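
        A minimal sketch of that pattern, assuming a simplified parser rather than the actual contents of ereplace.ado. The program name ereplace_sketch is hypothetical, by-group support is left out, and the target variable is assumed to be type-compatible with the egen result:

        ```stata
        * Hypothetical sketch only; NOT the actual ereplace.ado
        program define ereplace_sketch
            version 12
            // expected form: name = fcn(args) [if] [in] [, options]
            gettoken name 0 : 0, parse(" =")
            confirm variable `name'
            gettoken eqs 0 : 0, parse(" =")
            if `"`eqs'"' != "=" {
                error 198
            }
            gettoken fcn 0 : 0, parse(" (")
            gettoken args 0 : 0, parse(" ,") match(par)
            syntax [if] [in] [, *]

            // mark the included observations once, up front
            marksample touse, novarlist

            // compute into a temporary variable, then copy only marked rows
            tempvar dummy
            egen `dummy' = `fcn'(`args') if `touse', `options'
            replace `name' = `dummy' if `touse'
        end
        ```

        With the example data from #1, -ereplace_sketch var4=concat(var1 var2 var3) if var3==1, punct("-")- should then leave observations 1 through 6 untouched, since the final -replace- is restricted to `touse'.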

        All of that said, I don't know if there actually are any -egen- functions that do change the data's sort order and fail to restore it, so my concerns here may not really bite.



        • #5
          Thanks very much for the problem report and suggested fixes. Oops. Chris Larkin is first author, so if he's active I imagine that he will take this forward.



          • #6
            Clyde raises an interesting possibility in post #4 that I had overlooked.

            There are certainly egen functions that change the data's sort order:
            Code:
            viewsource _ggroup.ado
            shows one of them. But egen is implemented as an ado program and
            Code:
            viewsource egen.ado
            shows us that egen takes care of returning the data to its original order
            Code:
            *! version 3.4.1  05jun2013
            program define egen, byable(onecall) sortpreserve
            With that said, I don't know what would happen if a function were to drop observations or otherwise alter values in the dataset. That sort of side-effect seems antithetical to the mathematician's idea of a function, but I don't know if or how Stata defends against it. Unlike sorting the dataset, though, this seems an unlikely possibility.
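
            As a small illustration of what sortpreserve does, here is a sketch using a hypothetical program, shuffle_demo, that deliberately re-sorts the data. The variable name obs_before is also just for illustration:

            ```stata
            * Hypothetical demo program; the name shuffle_demo is made up
            program define shuffle_demo, sortpreserve
                tempvar u
                generate double `u' = runiform()
                sort `u'        // scramble the order inside the program
            end

            sysuse auto, clear
            generate long obs_before = _n
            shuffle_demo
            * the original order is restored when the program exits
            assert obs_before == _n
            ```

            Dropping the sortpreserve option from the program definition would leave the data in the shuffled order, and the -assert- would then be expected to fail.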



            • #7
              Hi all, it's taken a global pandemic for me to finally get around to doing this. My apologies. Thank you for pointing the issue out, Andrew Benson; it's definitely a bug and I've been able to replicate it. I've taken Clyde's advice and added -marksample touse- and then -if `touse'- wherever the programme calls `if' `in'. I'm going to send this on to Kit Baum now. Thanks again everyone, and I hope you're keeping well.
