Replacement condition based in multiple (and large number) of variables

Ignacio Parot

Join Date: Apr 2017

Posts: 9
#1

Replacement condition based in multiple (and large number) of variables

24 Apr 2017, 10:07

How can I condition some replacement in the values of other variables in that same observation (not doing it one by one).
For example I have variables v1, v2, v3 and v4, with v4 just missing values. I need to replace v4=x, conditional in that v1, v2 and v3 are not equal to x in that observation.
The original problem is the following: I reshaped and then merged 2 datasets leading to a dataset that looks something like this:

ID v1_1 v1_2 v2_3 non-missing1 v2_1 v2_2 v2_3 non-missing2

1 1 2 . 2 8 . . 1

2 3 . . 1 9 . . 1

3 4 5 . 2 10 11 . 2

4 6 7 . 2 12 13 14 3

And I would need to transform into something like this:
ID v1 v2 v3 v4 v5

1 1 2 8 . .

2 3 9 . . .

3 4 5 10 11 .

4 6 7 12 13 14

If there is a simple way to do this, then my original question is not pertinent anymore.
What I am doing to do the previous transformation is:

local i=1
local j=1
while `i' < 15 {
gen v`j'=.
if _merge==1 replace v`j'=v2_`j'
if _merge==2 replace v`j'=v1_`j'
local i=`i'+1
local j=`j'+1
}

local i=2
while `i' < 15 {
local j=1
while `j' < 10 {
local z = `i' - 1
replace v`i'=v2_`j' if _merge==3 & v`i'==. & v1==v1_1 & v`z'!=v2_`j' & `i'<=nm
replace v`i'=v1_`j' if _merge==3 & v`i'==. & v1==v2_1 & v`z'!=v1_`j' & `i'<=nm
local j=`j'+1
}
local i=`i'+1
}

The condition that is in bold checks that the new value is not equal to the one just before, for example, that the potential replacemente for V5 is not equal to V4. What I need is to change that for a condition that checks that the potential replacement for V5 (v*_j in the code) is not equal no all the rest of the V* variables for that observation. With that, the transformation should work without problems. Is this possible?

Thank you!
Tags: None
Ignacio Parot

Join Date: Apr 2017

Posts: 9
#2

24 Apr 2017, 10:44

Managed to solve this with the following code:

gen nm=nm1+nm2
gen check1=.
gen check2=.

local i=1
local j=1
while `i' < 15 {
gen v`j'=.
if _merge==1 replace v`j'=v2_`j'
if _merge==2 replace v`j'=v1_`j'
local i=`i'+1
local j=`j'+1
}

local i=2
while `i' < 15 {
local j=1
while `j' < 10 {
local z = `i' - 1
forval w = 1(1)14 {
replace check1=1 if v1_`j'==v`w'
replace check2=1 if v2_`j'==v`w'
}
replace v`i'=v2_`j' if _merge==3 & v`i'==. & v1==v1_1 & v`z'!=v2_`j' & `i'<=nm & check2==.
replace v`i'=v1_`j' if _merge==3 & v`i'==. & v1==v2_1 & v`z'!=v1_`j' & `i'<=nm & check1==.
replace check1=.
replace check2=.
local j=`j'+1
}
local i=`i'+1
}

Not very elegant or efficient. If anyone has a better way of doing it, would be great to know.
Thanks!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35696

24 Apr 2017, 12:09

I haven't tried to understand your code in #1 or #2. The logic seems very complicated. But it doesn't look as if you are using the if command appropriately. See http://www.stata.com/support/faqs/pr...-if-qualifier/

I am not sure that I understand the original problem. You want

Code:

clear
input ID    v1    v2    v3    v4    v5
1    1    2    8    .    .
2    3    9    .    .    .
3    4    5    10    11    .
4    6    7    12    13    14
end

but what rules lead to this? This works, but it may not be following your rules. The problem is that I don't understand your rules. You refer to x, but I don't see what it is.

First I correct a typo in your first display The variable name v2_3 cannot occur twice.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(id v1_1 v1_2 v1_3 nonmissing1 v2_1 v2_2 var2_3 nonmissing2)
1 1 2 . 2  8  .  . 1
2 3 . . 1  9  .  . 1
3 4 5 . 2 10 11  . 2
4 6 7 . 2 12 13 14 3
end

drop non* 

qui foreach v of var * { 
    count if missing(`v') 
    if r(N) == _N drop `v' 
}

rename (v*) (v#), addnumber 

list 

    +-----------------------------+
     | id   v1   v2   v3   v4   v5 |
     |-----------------------------|
  1. |  1    1    2    8    .    . |
  2. |  2    3    .    9    .    . |
  3. |  3    4    5   10   11    . |
  4. |  4    6    7   12   13   14 |
     +-----------------------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#4

24 Apr 2017, 21:43

I have a different guess about what Ignacio Parot wants. I think his rule is to count through the v1_'s until a missing value is encountered. Then from that point on, replace that v1_ and subsequent v1_'s with the values of v2_ (starting from v2_1). If that's what is wanted, then I believe this does it:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3) 1 1 2 . 8 . . 2 3 . . 9 . . 3 4 5 . 10 11 . 4 6 7 . 12 13 14 end // MAKE ROOM FOR EXTRA v1'S THAT MAY BE NEEDED gen v1_4 = . gen v1_5 = . // NEARLY EVERYTHING IS EASIER IN LONG LAYOUG reshape long v1_ v2_, i(id) // GET RUNNING COUNT OF MISSING VALUES IN V1 PER ID by id (_j), sort: gen counter = sum(missing(v1_)) // FILL IN n'TH OBSERVATION OF v1_ FROM THE FIRST // MISSING ONE with n'TH OBSERVATION OF v2_ by id (_j): replace v1_ = v2_[counter] if counter > 0 // CLEAN UP drop counter v2_ // RETURN TO WIDE LAYOUT reshape wide v1_, i(id) // CHANGE TO DESIRED VARIABLE NAMES rename v1_* v*

Note: I don't know where you're going with this. But, as noted, most things in Stata are easier to do in long layout than in wide. So whatever your next steps are, my guess is that they'll be simpler if you stop just before the -reshape wide- at the end and just keep the data in long layout and move on from there.

Added: In the future, please post example data using the -dataex- command, as I have done here. The HTML table layouts you showed were very difficult to import into Stata: ultimately I had to do that by just manually entering the data. Run -ssc install dataex- to install it, and then run -help datex- to read the simple instructions for using it. When you use -dataex-, those who want to help you can use a simple copy/paste operation to import your example into Stata and know that what they are working with matches exactly the situation you are facing.

Last edited by Clyde Schechter; 24 Apr 2017, 21:45.
1 like
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30095

24 Apr 2017, 21:54

Just thought of another way to do this (assuming I have guessed correctly what is wanted). It relies more heavily than the previous one on the specific names of the variables and might fail under other circumstances. But here it is:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3)
1 1 2 . 8 . .
2 3 . . 9 . .
3 4 5 . 10 11 .
4 6 7 . 12 13 14
end

// TRICKY RENAMING OF v1_ v2_ TO A STRAIGHT SEQUENCE v1-v6
rename *_* **
rename v# v#, renumber

// GO LONG
reshape long v, i(id)
// DROP MISSING VALUES OF v
drop if missing(v)
// RENUMBER THE SEQUENCE
by id (_j), sort: replace _j = _n
// BACK TO WIDE
reshape wide

Comment

Ignacio Parot

Join Date: Apr 2017

Posts: 9
#6

25 Apr 2017, 03:15

Originally posted by Nick Cox View Post

I haven't tried to understand your code in #1 or #2. The logic seems very complicated. But it doesn't look as if you are using the if command appropriately. See http://www.stata.com/support/faqs/pr...-if-qualifier/

I am not sure that I understand the original problem. You want

Code:

clear input ID v1 v2 v3 v4 v5 1 1 2 8 . . 2 3 9 . . . 3 4 5 10 11 . 4 6 7 12 13 14 end

but what rules lead to this? This works, but it may not be following your rules. The problem is that I don't understand your rules. You refer to x, but I don't see what it is.

First I correct a typo in your first display The variable name v2_3 cannot occur twice.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte(id v1_1 v1_2 v1_3 nonmissing1 v2_1 v2_2 var2_3 nonmissing2) 1 1 2 . 2 8 . . 1 2 3 . . 1 9 . . 1 3 4 5 . 2 10 11 . 2 4 6 7 . 2 12 13 14 3 end drop non* qui foreach v of var * { count if missing(`v') if r(N) == _N drop `v' } rename (v*) (v#), addnumber list +-----------------------------+ | id v1 v2 v3 v4 v5 | |-----------------------------| 1. | 1 1 2 8 . . | 2. | 2 3 . 9 . . | 3. | 3 4 5 10 11 . | 4. | 4 6 7 12 13 14 | +-----------------------------+

Thank you very much Nick for the references. The problem with that solution is that there are missing values between non-missing in the final table. Im sorry for not explaining the problem properly.
Comment
Ignacio Parot

Join Date: Apr 2017

Posts: 9
#7

25 Apr 2017, 03:16

Originally posted by Clyde Schechter View Post

Just thought of another way to do this (assuming I have guessed correctly what is wanted). It relies more heavily than the previous one on the specific names of the variables and might fail under other circumstances. But here it is:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3) 1 1 2 . 8 . . 2 3 . . 9 . . 3 4 5 . 10 11 . 4 6 7 . 12 13 14 end // TRICKY RENAMING OF v1_ v2_ TO A STRAIGHT SEQUENCE v1-v6 rename *_* ** rename v# v#, renumber // GO LONG reshape long v, i(id) // DROP MISSING VALUES OF v drop if missing(v) // RENUMBER THE SEQUENCE by id (_j), sort: replace _j = _n // BACK TO WIDE reshape wide

This worked perfectly Clyde, thank you very much.
And thanks for the forum recomendations!
Cheers,

Ignacio
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#8

25 Apr 2017, 03:22

Still don't understand it, as my code produces what you ask for. Oh well.
Comment
Ignacio Parot

Join Date: Apr 2017

Posts: 9
#9

25 Apr 2017, 05:58

Nick, the problem with that code is that it does not eliminate missing values in the "series" of variables. In your example, there is a missing value for v2 in id=2 and a non-missing value for v3 in that same row. What I needed is the same, but for that "9" to be in the v2 column instead of the v3 column.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#10

25 Apr 2017, 06:12

Still lost in your words here. What are you saying? That

1. The code doesn't work as desired for your full dataset.

2. The code doesn't work for your example.

#1 I can readily believe as I don't understand your principles, but #2 seems contrary to fact.
Comment
Ignacio Parot

Join Date: Apr 2017

Posts: 9
#11

25 Apr 2017, 06:18

Thanks for your comments Nick. I mean´t number 2).
As I said above, for id=2 the code gives a missing value for v2 and a non-missing value for v3. I need all the values to be concentrated in the first variables for each row. Reshaping, dropping, renumbering and reshaping back works perfectly, as was suggested by Clyde.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#12

25 Apr 2017, 06:38

I see the point now. Sorry about that. To be explicit: The problem is shown in observation 2.
1 like
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#13

25 Apr 2017, 08:53

If the idea is to shift non-missing values to the left, you can do this without the double reshape using rangestat (from SSC). In this example, each observation is uniquely identified by the variable id. You define a degenerate interval based on id and the Mata function will pick and return non-missing values of X.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3)
1 1 2 . 8 . .
2 3 . . 9 . .
3 4 5 . 10 11 .
4 6 7 . 12 13 14
end

* check assumption about the identifier
isid id

* define a Mata function that select columns of X with non-missing values
mata:  
    mata clear
    real rowvector nomiss(real rowvector X) {
        return(select(X, X[1,.]:< .))
    }
end 

rangestat (nomiss) v1_1-v2_3, interval(id 0 0)
list, noobs compress

Comment

Ignacio Parot

Join Date: Apr 2017

Posts: 9
#14

26 Apr 2017, 02:37

Thank you very much Robert. That looks like a very nice solution as well!
Comment

Announcement