Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacement condition based in multiple (and large number) of variables

    How can I condition some replacement in the values of other variables in that same observation (not doing it one by one).
    For example I have variables v1, v2, v3 and v4, with v4 just missing values. I need to replace v4=x, conditional in that v1, v2 and v3 are not equal to x in that observation.
    The original problem is the following: I reshaped and then merged 2 datasets leading to a dataset that looks something like this:
    ID v1_1 v1_2 v2_3 non-missing1 v2_1 v2_2 v2_3 non-missing2
    1 1 2 . 2 8 . . 1
    2 3 . . 1 9 . . 1
    3 4 5 . 2 10 11 . 2
    4 6 7 . 2 12 13 14 3
    And I would need to transform into something like this:
    ID v1 v2 v3 v4 v5
    1 1 2 8 . .
    2 3 9 . . .
    3 4 5 10 11 .
    4 6 7 12 13 14
    If there is a simple way to do this, then my original question is not pertinent anymore.
    What I am doing to do the previous transformation is:

    local i=1
    local j=1
    while `i' < 15 {
    gen v`j'=.
    if _merge==1 replace v`j'=v2_`j'
    if _merge==2 replace v`j'=v1_`j'
    local i=`i'+1
    local j=`j'+1
    }

    local i=2
    while `i' < 15 {
    local j=1
    while `j' < 10 {
    local z = `i' - 1
    replace v`i'=v2_`j' if _merge==3 & v`i'==. & v1==v1_1 & v`z'!=v2_`j' & `i'<=nm
    replace v`i'=v1_`j' if _merge==3 & v`i'==. & v1==v2_1 & v`z'!=v1_`j' & `i'<=nm
    local j=`j'+1
    }
    local i=`i'+1
    }

    The condition that is in bold checks that the new value is not equal to the one just before, for example, that the potential replacemente for V5 is not equal to V4. What I need is to change that for a condition that checks that the potential replacement for V5 (v*_j in the code) is not equal no all the rest of the V* variables for that observation. With that, the transformation should work without problems. Is this possible?

    Thank you!

  • #2
    Managed to solve this with the following code:

    gen nm=nm1+nm2
    gen check1=.
    gen check2=.

    local i=1
    local j=1
    while `i' < 15 {
    gen v`j'=.
    if _merge==1 replace v`j'=v2_`j'
    if _merge==2 replace v`j'=v1_`j'
    local i=`i'+1
    local j=`j'+1
    }

    local i=2
    while `i' < 15 {
    local j=1
    while `j' < 10 {
    local z = `i' - 1
    forval w = 1(1)14 {
    replace check1=1 if v1_`j'==v`w'
    replace check2=1 if v2_`j'==v`w'
    }
    replace v`i'=v2_`j' if _merge==3 & v`i'==. & v1==v1_1 & v`z'!=v2_`j' & `i'<=nm & check2==.
    replace v`i'=v1_`j' if _merge==3 & v`i'==. & v1==v2_1 & v`z'!=v1_`j' & `i'<=nm & check1==.
    replace check1=.
    replace check2=.
    local j=`j'+1
    }
    local i=`i'+1
    }


    Not very elegant or efficient. If anyone has a better way of doing it, would be great to know.
    Thanks!

    Comment


    • #3
      I haven't tried to understand your code in #1 or #2. The logic seems very complicated. But it doesn't look as if you are using the if command appropriately. See http://www.stata.com/support/faqs/pr...-if-qualifier/

      I am not sure that I understand the original problem. You want

      Code:
      clear
      input ID    v1    v2    v3    v4    v5
      1    1    2    8    .    .
      2    3    9    .    .    .
      3    4    5    10    11    .
      4    6    7    12    13    14
      end
      but what rules lead to this? This works, but it may not be following your rules. The problem is that I don't understand your rules. You refer to x, but I don't see what it is.

      First I correct a typo in your first display The variable name v2_3 cannot occur twice.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte(id v1_1 v1_2 v1_3 nonmissing1 v2_1 v2_2 var2_3 nonmissing2)
      1 1 2 . 2  8  .  . 1
      2 3 . . 1  9  .  . 1
      3 4 5 . 2 10 11  . 2
      4 6 7 . 2 12 13 14 3
      end
      
      drop non* 
      
      qui foreach v of var * { 
          count if missing(`v') 
          if r(N) == _N drop `v' 
      }
      
      rename (v*) (v#), addnumber 
      
      list 
      
          +-----------------------------+
           | id   v1   v2   v3   v4   v5 |
           |-----------------------------|
        1. |  1    1    2    8    .    . |
        2. |  2    3    .    9    .    . |
        3. |  3    4    5   10   11    . |
        4. |  4    6    7   12   13   14 |
           +-----------------------------+


      Comment


      • #4
        I have a different guess about what Ignacio Parot wants. I think his rule is to count through the v1_'s until a missing value is encountered. Then from that point on, replace that v1_ and subsequent v1_'s with the values of v2_ (starting from v2_1). If that's what is wanted, then I believe this does it:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3)
        1 1 2 .  8  .  .
        2 3 . .  9  .  .
        3 4 5 . 10 11  .
        4 6 7 . 12 13 14
        end
        
        // MAKE ROOM FOR EXTRA v1'S THAT MAY BE NEEDED
        gen v1_4 = .
        gen v1_5 = .
        
        // NEARLY EVERYTHING IS EASIER IN LONG LAYOUG
        reshape long v1_ v2_, i(id)
        // GET RUNNING COUNT OF MISSING VALUES IN V1 PER ID
        by id (_j), sort: gen counter = sum(missing(v1_))
        
        // FILL IN n'TH OBSERVATION OF v1_ FROM THE FIRST
        // MISSING ONE with n'TH OBSERVATION OF v2_
        by id (_j): replace v1_ = v2_[counter] if counter > 0
        
        // CLEAN UP
        drop counter v2_
        
        // RETURN TO WIDE LAYOUT
        reshape wide v1_, i(id)
        // CHANGE TO DESIRED VARIABLE NAMES
        rename v1_* v*
        Note: I don't know where you're going with this. But, as noted, most things in Stata are easier to do in long layout than in wide. So whatever your next steps are, my guess is that they'll be simpler if you stop just before the -reshape wide- at the end and just keep the data in long layout and move on from there.

        Added: In the future, please post example data using the -dataex- command, as I have done here. The HTML table layouts you showed were very difficult to import into Stata: ultimately I had to do that by just manually entering the data. Run -ssc install dataex- to install it, and then run -help datex- to read the simple instructions for using it. When you use -dataex-, those who want to help you can use a simple copy/paste operation to import your example into Stata and know that what they are working with matches exactly the situation you are facing.
        Last edited by Clyde Schechter; 24 Apr 2017, 21:45.

        Comment


        • #5
          Just thought of another way to do this (assuming I have guessed correctly what is wanted). It relies more heavily than the previous one on the specific names of the variables and might fail under other circumstances. But here it is:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3)
          1 1 2 . 8 . .
          2 3 . . 9 . .
          3 4 5 . 10 11 .
          4 6 7 . 12 13 14
          end
          
          // TRICKY RENAMING OF v1_ v2_ TO A STRAIGHT SEQUENCE v1-v6
          rename *_* **
          rename v# v#, renumber
          
          // GO LONG
          reshape long v, i(id)
          // DROP MISSING VALUES OF v
          drop if missing(v)
          // RENUMBER THE SEQUENCE
          by id (_j), sort: replace _j = _n
          // BACK TO WIDE
          reshape wide

          Comment


          • #6
            Originally posted by Nick Cox View Post
            I haven't tried to understand your code in #1 or #2. The logic seems very complicated. But it doesn't look as if you are using the if command appropriately. See http://www.stata.com/support/faqs/pr...-if-qualifier/

            I am not sure that I understand the original problem. You want

            Code:
            clear
            input ID v1 v2 v3 v4 v5
            1 1 2 8 . .
            2 3 9 . . .
            3 4 5 10 11 .
            4 6 7 12 13 14
            end
            but what rules lead to this? This works, but it may not be following your rules. The problem is that I don't understand your rules. You refer to x, but I don't see what it is.

            First I correct a typo in your first display The variable name v2_3 cannot occur twice.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte(id v1_1 v1_2 v1_3 nonmissing1 v2_1 v2_2 var2_3 nonmissing2)
            1 1 2 . 2 8 . . 1
            2 3 . . 1 9 . . 1
            3 4 5 . 2 10 11 . 2
            4 6 7 . 2 12 13 14 3
            end
            
            drop non*
            
            qui foreach v of var * {
            count if missing(`v')
            if r(N) == _N drop `v'
            }
            
            rename (v*) (v#), addnumber
            
            list
            
            +-----------------------------+
            | id v1 v2 v3 v4 v5 |
            |-----------------------------|
            1. | 1 1 2 8 . . |
            2. | 2 3 . 9 . . |
            3. | 3 4 5 10 11 . |
            4. | 4 6 7 12 13 14 |
            +-----------------------------+

            Thank you very much Nick for the references. The problem with that solution is that there are missing values between non-missing in the final table. Im sorry for not explaining the problem properly.

            Comment


            • #7
              Originally posted by Clyde Schechter View Post
              Just thought of another way to do this (assuming I have guessed correctly what is wanted). It relies more heavily than the previous one on the specific names of the variables and might fail under other circumstances. But here it is:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3)
              1 1 2 . 8 . .
              2 3 . . 9 . .
              3 4 5 . 10 11 .
              4 6 7 . 12 13 14
              end
              
              // TRICKY RENAMING OF v1_ v2_ TO A STRAIGHT SEQUENCE v1-v6
              rename *_* **
              rename v# v#, renumber
              
              // GO LONG
              reshape long v, i(id)
              // DROP MISSING VALUES OF v
              drop if missing(v)
              // RENUMBER THE SEQUENCE
              by id (_j), sort: replace _j = _n
              // BACK TO WIDE
              reshape wide
              This worked perfectly Clyde, thank you very much.
              And thanks for the forum recomendations!
              Cheers,

              Ignacio

              Comment


              • #8
                Still don't understand it, as my code produces what you ask for. Oh well.

                Comment


                • #9
                  Nick, the problem with that code is that it does not eliminate missing values in the "series" of variables. In your example, there is a missing value for v2 in id=2 and a non-missing value for v3 in that same row. What I needed is the same, but for that "9" to be in the v2 column instead of the v3 column.

                  Comment


                  • #10
                    Still lost in your words here. What are you saying? That

                    1. The code doesn't work as desired for your full dataset.

                    2. The code doesn't work for your example.

                    #1 I can readily believe as I don't understand your principles, but #2 seems contrary to fact.

                    Comment


                    • #11
                      Thanks for your comments Nick. I mean“t number 2).
                      As I said above, for id=2 the code gives a missing value for v2 and a non-missing value for v3. I need all the values to be concentrated in the first variables for each row. Reshaping, dropping, renumbering and reshaping back works perfectly, as was suggested by Clyde.

                      Comment


                      • #12


                        I see the point now. Sorry about that. To be explicit: The problem is shown in observation 2.

                        Comment


                        • #13
                          If the idea is to shift non-missing values to the left, you can do this without the double reshape using rangestat (from SSC). In this example, each observation is uniquely identified by the variable id. You define a degenerate interval based on id and the Mata function will pick and return non-missing values of X.

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3)
                          1 1 2 . 8 . .
                          2 3 . . 9 . .
                          3 4 5 . 10 11 .
                          4 6 7 . 12 13 14
                          end
                          
                          * check assumption about the identifier
                          isid id
                          
                          * define a Mata function that select columns of X with non-missing values
                          mata:  
                              mata clear
                              real rowvector nomiss(real rowvector X) {
                                  return(select(X, X[1,.]:< .))
                              }
                          end 
                          
                          rangestat (nomiss) v1_1-v2_3, interval(id 0 0)
                          list, noobs compress

                          Comment


                          • #14
                            Thank you very much Robert. That looks like a very nice solution as well!

                            Comment

                            Working...
                            X