egen problem

shem shen

Join Date: Mar 2016
Posts: 136

31 Oct 2021, 21:37

Hi list,

I want to generate a variable that equals the mean of the values of the first two nonmissing variables in a three-variable list (var1, var2, var3). At the same time, I also want to record which two of the three variables in my list have been used to generate the new variable (I do not need the variable name; the order of the variable is enough). How can I do that efficiently? Thank you very much!

Code:

webuse auto,clear

replace headroom=. if mpg==17
replace weight=. if mpg==22|mpg==17
//I did the above manipulation so that some observations have only one nonmissing value in the varlist to be considered below
//so that you cannot simply apply egen rowmean to fulfil my task

keep rep78 headroom weight
rename rep78 var1
rename headroom var2
rename weight var3
//how to generate a variable that equals the mean of the first two nonmissing variables in the varlist (var1, var2, var3)
//and record which two of the three variables in my list have been used to generate the new variable?

Last edited by shem shen; 31 Oct 2021, 21:41.

Tags: None

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#2

31 Oct 2021, 22:09

I've made a solution assuming you have a wide layout, and 3 variables, because it both answers the direct question asked and because 3 variables are the minimally sufficient set needed to demonstrate a solution. It can be extended, however it's cumbersome to do with more than just a few variables. Nevertheless, the logic is pretty easy to grasp, because it proceeds by checking each variable in the sequence, and returning a value when a non-missing value is encountered, or else continuing to the next variable. Begin at the start here.

Code:

clear input byte(a b c) 1 0 . 0 1 0 . . 1 end fillin a b c drop _fillin // Start here * first and second non-missing variables, assuming the variable order of a, b, c gen which1 = cond(!mi(a), "a", cond(!mi(b), "b", cond(!mi(c), "c", ""))) gen which2 = cond(!mi(b), "b", cond(!mi(c), "c", "")) * ... and their values gen nm1 = cond(!mi(a), a, cond(!mi(b), b, cond(!mi(c), c, .))) gen nm2 = cond(!mi(b), b, cond(!mi(c), c, .)) * finally, compute the average, conditional on having 2 values egen avg = rowmean(nm1 nm2) if !mi(nm1, nm2)

That said, if you anticipate needing to use a large(r) varlist, you would certainly be better off (and more efficient) to reframe the question by first reshaping long.
1 like
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2403

31 Oct 2021, 22:31

Here's a stab at a different solution using a long-format intermediate.

Code:

clear
frame create Work
frame change Work
input byte(v1 v2 v3)
1 0 .
0 1 0
. . 1
end
fillin v?
drop _fillin
sort v?
gen int row = _n , before(v1)

frame put row v? , into(Avg)
frame change Avg
reshape long v, i(row) j(posn)
drop if mi(v)

bys row (posn) : keep if _n <= 2 & _N >= 2
bys row : gen byte newposn = _n

bys row : egen avg = mean(v)
bys row : replace avg = . if _n > 1

drop v
reshape wide posn avg , i(row) j(newposn)
keep row posn? avg1
rename avg1 avg

frame change Work
frlink 1:1 row , frame(Avg) gen(lavg)
frget posn? avg, from(lavg)
drop lavg
frame drop Avg

local myvarlist v1 v2 v3
forval i = 1/2 {
  gen which`i' = word("`myvarlist'", posn`i')
}

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35709

01 Nov 2021, 04:02

#2 is not quite right. Suppose a is missing and b is not and c is missing, Then

Code:

gen nm1 = cond(!mi(a), a, cond(!mi(b), b, cond(!mi(c), c, .)))  gen nm2 = cond(!mi(b), b, cond(!mi(c), c, .))

assigns the value of b to nm1 and also to nm2. The mean will be correct but what is returned is not quite right, because it is implied that there are two non-missing values. The big picture here -- and Leonardo Guizzetti is saying this too -- is that the data structure makes this awkward. What's more, other problems could be as bad or worse.

I don't say this is easy either way, but here's another approach.

Code:

clear
input byte(a b c)
1 0 .
0 1 0
. . 1
end

gen long id = _n 
rename (a b c) (data#), addnumber 

reshape long data, i(id) j(which) 

gen ismissing = missing(data)

bysort id (ismissing which) : gen wanted = cond(missing(data[2]), data[1], (data[1] + data[2]) / 2) 

by id : gen indexes = cond(!missing(data[1]), strofreal(which[1]), "")
by id : replace indexes = indexes + cond(!missing(data[2]), strofreal(which[2]), "") 

list, sepby(id)

     +-------------------------------------------------+
     | id   which   data   ismiss~g   wanted   indexes |
     |-------------------------------------------------|
  1. |  1       1      1          0       .5        12 |
  2. |  1       2      0          0       .5        12 |
  3. |  1       3      .          1       .5        12 |
     |-------------------------------------------------|
  4. |  2       1      0          0       .5        12 |
  5. |  2       2      1          0       .5        12 |
  6. |  2       3      0          0       .5        12 |
     |-------------------------------------------------|
  7. |  3       3      1          0        1         3 |
  8. |  3       1      .          1        1         3 |
  9. |  3       2      .          1        1         3 |
     +-------------------------------------------------+

Comment

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#5

01 Nov 2021, 06:27

Thanks for catching my error, Nick Cox . I shouldn't code when tired.
Comment
shem shen

Join Date: Mar 2016

Posts: 136
#6

01 Nov 2021, 19:59

@Nick Cox @Leonardo Guizzetti Thank you both for your time and code! They are extremely helpful. I eventually decided to go with the long format, as it seems much easier.
Comment

Announcement

egen problem

Comment

Comment

Comment

Comment

Comment