  • Replace missing between equal values in a set of variables

    Hey!

    This is my first post on this forum. I have been battling with this problem for the last two days. I have 17 scoring variables in a panel of i patients observed over t months. The observed months can vary by patient.
    Due to anonymization, the scores for all 17 variables are missing in some months. I know that these variables hardly change. Therefore, if the last observed score before a gap equals the first observed score after it, the missing values in between should be set to that same score. How can I do this?

    I believe an example will help. This is my data, where F1-F6 are my scoring variables and S marks the observations that are "suppressed" due to anonymization.
    ID T F1 F2 F3 F4 F5 F6 S
    1 1 . . . . . . 1
    1 2 2 3 2 3 2 4 0
    1 3 . . . . . . 1
    1 4 . . . . . . 1
    1 5 . . . . . . 1
    1 6 2 3 2 3 2 4 0
    1 7 2 3 2 3 2 4 0
    1 8 . . . . . . 1
    1 9 . . . . . . 1
    1 10 2 3 3 4 2 4 0
    2 1 1 2 1 1 2 3 0
    2 2 1 2 1 1 2 3 0
    2 3 . . . . . . 1
    2 4 . . . . . . 1
    2 5 . . . . . . 1
    2 6 . . . . . . 1
    2 7 1 2 1 1 4 4 1
    2 8 . . . . . . 1
    2 9 1 2 1 1 4 4 1
    I only want to impute if none of the scoring variables has changed from the last observed scores.
    If one or more have changed, the missing values should not be replaced. I would like the result to look like this:
    ID T F1 F2 F3 F4 F5 F6 S
    1 1 . . . . . . 1
    1 2 2 3 2 3 2 4 0
    1 3 2 3 2 3 2 4 1
    1 4 2 3 2 3 2 4 1
    1 5 2 3 2 3 2 4 1
    1 6 2 3 2 3 2 4 0
    1 7 2 3 2 3 2 4 0
    1 8 . . . . . . 1
    1 9 . . . . . . 1
    1 10 2 3 3 4 2 4 0
    2 1 1 2 1 1 2 3 0
    2 2 1 2 1 1 2 3 0
    2 3 . . . . . . 1
    2 4 . . . . . . 1
    2 5 . . . . . . 1
    2 6 . . . . . . 1
    2 7 1 2 1 1 4 4 1
    2 8 1 2 1 1 4 4 1
    2 9 1 2 1 1 4 4 1

    Any ideas? I can post some of my failed attempts if needed.

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(id t f1 f2 f3 f4 f5 f6 s)
    1  1 . . . . . . 1
    1  2 2 3 2 3 2 4 0
    1  3 . . . . . . 1
    1  4 . . . . . . 1
    1  5 . . . . . . 1
    1  6 2 3 2 3 2 4 0
    1  7 2 3 2 3 2 4 0
    1  8 . . . . . . 1
    1  9 . . . . . . 1
    1 10 2 3 3 4 2 4 0
    2  1 1 2 1 1 2 3 0
    2  2 1 2 1 1 2 3 0
    2  3 . . . . . . 1
    2  4 . . . . . . 1
    2  5 . . . . . . 1
    2  6 . . . . . . 1
    2  7 1 2 1 1 4 4 1
    2  8 . . . . . . 1
    2  9 1 2 1 1 4 4 1
    end
    
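    gen int matches = 0   // number of scores whose neighbours agree
    forvalues i = 1/6 {
        // carry the last observed value forward within id
        gen down`i' = f`i'
        by id (t), sort: replace down`i' = down`i'[_n-1] if missing(down`i')
        // carry the next observed value backward within id
        gen up`i' = f`i'
        gsort id -t
        by id: replace up`i' = up`i'[_n-1] if missing(up`i')
        // count the score as matching when both neighbours agree
        replace matches = matches + 1 if up`i' == down`i'
    }
    
    // fill only where all six scores match on both sides
    forvalues i = 1/6 {
        replace f`i' = up`i' if matches == 6
    }
    sort id t
    
    drop up* down* matches // OPTIONAL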
    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Added:
    "Therefore, if the last observed score before a gap equals the first observed score after it, the missing values in between should be set to that same score."
    That sounds like a questionable assumption to me, but it's your call.
    Last edited by Clyde Schechter; 14 Nov 2018, 10:53.


    • #3
      I don't entirely understand your description of all your conditions, and I suspect that your example does not illustrate all the odd things that might happen. (For example, what if an individual starts out with missing F values?) And, you'd help us help you if you could post your example data using -dataex-, as described in the FAQ.

      All that being said: One thing that might make your work easier would be to start by representing your list of F scores as a single string, which, if these variables are integers as you show, can be done with:
      Code:
      egen Fstring = concat(F*)
      This would facilitate comparing whether all the scores have stayed the same.
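
      A minimal sketch of how the single-string idea could be carried through to the fill itself. It uses toy data with only two scores (hypothetical values, not the poster's data), and the helper variables fstring, nmiss, prev, next, negt and fillable are names chosen here for illustration. The punct() option is added on the assumption that scores might have more than one digit, so that patterns such as 1 12 and 11 2 stay distinct.

      ```stata
      * Toy data (hypothetical): two scores instead of the thread's six
      clear
      input byte(id t f1 f2)
      1 1 . .
      1 2 2 3
      1 3 . .
      1 4 . .
      1 5 2 3
      1 6 . .
      1 7 2 4
      end

      * One string per row; blank it out when any score is missing
      egen fstring = concat(f1 f2), punct(" ")
      egen nmiss = rowmiss(f1 f2)
      replace fstring = "" if nmiss > 0

      * Last observed pattern before, and first observed pattern after, each row
      gen prev = fstring
      by id (t), sort: replace prev = prev[_n-1] if prev == ""
      gen negt = -t
      gen next = fstring
      by id (negt), sort: replace next = next[_n-1] if next == ""

      * Fill only rows whose surrounding patterns agree
      sort id t
      gen byte fillable = fstring == "" & prev != "" & prev == next
      foreach v of varlist f1 f2 {
          by id (t): replace `v' = `v'[_n-1] if fillable
      }
      drop fstring nmiss prev next negt fillable
      ```

      In this toy example, months 3 and 4 sit between two identical patterns and are filled, while months 1 and 6 are left missing. Like the code in #2, the sketch compares neighbouring observations and does not check whether the months themselves are consecutive.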


      • #4
        Thanks to both of you!
        Your code worked, Clyde Schechter! That is just brilliant!

        Interestingly, when I typed it myself, I got error r(198), invalid ‘1’, in Stata. When I simply copied and pasted your code and edited it for my variable names, it worked fine.
        Comparing the two versions, I cannot find any difference. Strange.
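
        One common cause of this kind of error when retyping Stata code, offered here as a guess rather than a diagnosis, is the quoting of local macros. A local macro such as i is referenced with a left single quote (the backtick) before the name and a plain apostrophe after it. Typing two apostrophes, or letting an editor substitute curly quotes, changes what Stata sees even though the two versions look alike on screen. A minimal illustration:

        ```stata
        forvalues i = 1/3 {
            display "f`i'"    // backtick before i, apostrophe after: shows f1, f2, f3
        }
        ```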

        I will try to use -dataex- next time, but that is not always easy, since I work on a research server with strict rules for exporting data.

        When it comes to the questionable assumption, I totally understand your skepticism. I had the same doubts. After studying the dynamics of these scores for all the observations that were not suppressed (more than 4 million observations), this is actually the pattern, regardless of age, gender, etc.
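
        The thread does not show how this check was done. One way such a check could look, sketched on toy data with a single score (hypothetical values; sandwiched and middle_same are names chosen here), is to take every fully observed month whose two neighbours agree and ask how often the month in between agrees as well. It covers gaps of one month only.

        ```stata
        * Toy, fully observed data (hypothetical)
        clear
        input byte(id t f1)
        1 1 2
        1 2 2
        1 3 2
        1 4 3
        1 5 3
        2 1 1
        2 2 4
        2 3 1
        2 4 1
        end

        sort id t
        * Interior months whose previous and next scores are equal
        by id: gen byte sandwiched = f1[_n-1] == f1[_n+1] if _n > 1 & _n < _N
        * Among those, does the middle month carry the same score?
        by id: gen byte middle_same = f1 == f1[_n-1] if sandwiched == 1
        tabulate middle_same
        ```

        A share of middle_same close to 1 in the unsuppressed data would be consistent with the fill rule; it says nothing by itself about whether suppressed months behave the same way.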


        • #5
          Hi!

          Originally posted by Tore Bersvendsen
          [...]
          [...]When it comes to the questionable assumption, I totally understand your skepticism. I had the same doubts. After studying the dynamics of these scores for all the observations that were not suppressed (more than 4 million observations), this is actually the pattern, regardless of age, gender, etc.
          Even without knowing anything about your data source, it occurs to me that these data have probably been suppressed for a reason. You state that the reason is "anonymization"; to me, this sounds like they have been suppressed by the data provider because the patterns of these observations differ from all the others, and the data had to be obfuscated to keep the identity of the patients confidential.

          That said, I strongly support Clyde's skepticism. Assuming that these observations' data patterns are identical to all the others ignores the fact that there were reasons to suppress the data in the first place. If I were the one working with these data, I would contact the data provider and ask for details on why these data were suppressed. But these are just my two cents.

          Regards
          Bela


          • #6
            Thanks for your thoughts, Daniel Bela!

            That was also my original concern, so we have been in touch with our data provider. I even know how the anonymization algorithm is designed and implemented. As it turns out, the data are mainly suppressed for one reason only, which does not influence the scores. I have no reason to believe that the data provider is lying to me.

            Best regards

            Tore
