Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • egen problem

    Hi list,

    I want to generate a variable that equals the mean of the values of the first two nonmissing variables in a three-variable list (var1, var2, var3). At the same time, I also want to record which two of the three variables in my list have been used to generate the new variable (I do not need the variable name; the order of the variable is enough). How can I do that efficiently? Thank you very much!

    Code:
    webuse auto,clear
    
    replace headroom=. if mpg==17
    replace weight=. if mpg==22|mpg==17
    //I did the above manipulation so that some observations have only one nonmissing value in the varlist to be considered below
    //so that you cannot simply apply egen rowmean to fulfil my task
    
    keep rep78 headroom weight
    rename rep78 var1
    rename headroom var2
    rename weight var3
    //how to generate a variable that equals the mean of the first two nonmissing variables in the varlist (var1, var2, var3)
    //and record which two of the three variables in my list have been used to generate the new variable?
    Last edited by shem shen; 31 Oct 2021, 21:41.

  • #2
    I've made a solution assuming you have a wide layout, and 3 variables, because it both answers the direct question asked and because 3 variables are the minimally sufficient set needed to demonstrate a solution. It can be extended, however it's cumbersome to do with more than just a few variables. Nevertheless, the logic is pretty easy to grasp, because it proceeds by checking each variable in the sequence, and returning a value when a non-missing value is encountered, or else continuing to the next variable. Begin at the start here.

    Code:
    clear
    input byte(a b c)
    1 0 .
    0 1 0
    . . 1
    end
    fillin a b c
    drop _fillin
    
    // Start here
    
    * first and second non-missing variables, assuming the variable order of a, b, c
    gen which1 = cond(!mi(a), "a", cond(!mi(b), "b", cond(!mi(c), "c", "")))
    gen which2 = cond(!mi(b), "b", cond(!mi(c), "c", ""))
    
    * ... and their values
    gen nm1 = cond(!mi(a), a, cond(!mi(b), b, cond(!mi(c), c, .)))
    gen nm2 = cond(!mi(b), b, cond(!mi(c), c, .))
    
    * finally, compute the average, conditional on having 2 values
    egen avg = rowmean(nm1 nm2) if !mi(nm1, nm2)
    That said, if you anticipate needing to use a large(r) varlist, you would certainly be better off (and more efficient) to reframe the question by first reshaping long.

    Comment


    • #3
      Here's a stab at a different solution using a long-format intermediate.

      Code:
      clear
      frame create Work
      frame change Work
      input byte(v1 v2 v3)
      1 0 .
      0 1 0
      . . 1
      end
      fillin v?
      drop _fillin
      sort v?
      gen int row = _n , before(v1)
      
      frame put row v? , into(Avg)
      frame change Avg
      reshape long v, i(row) j(posn)
      drop if mi(v)
      
      bys row (posn) : keep if _n <= 2 & _N >= 2
      bys row : gen byte newposn = _n
      
      bys row : egen avg = mean(v)
      bys row : replace avg = . if _n > 1
      
      drop v
      reshape wide posn avg , i(row) j(newposn)
      keep row posn? avg1
      rename avg1 avg
      
      frame change Work
      frlink 1:1 row , frame(Avg) gen(lavg)
      frget posn? avg, from(lavg)
      drop lavg
      frame drop Avg
      
      local myvarlist v1 v2 v3
      forval i = 1/2 {
        gen which`i' = word("`myvarlist'", posn`i')
      }

      Comment


      • #4
        #2 is not quite right. Suppose a is missing and b is not and c is missing, Then

        Code:
        gen nm1 = cond(!mi(a), a, cond(!mi(b), b, cond(!mi(c), c, .)))  gen nm2 = cond(!mi(b), b, cond(!mi(c), c, .))
        assigns the value of b to nm1 and also to nm2. The mean will be correct but what is returned is not quite right, because it is implied that there are two non-missing values. The big picture here -- and Leonardo Guizzetti is saying this too -- is that the data structure makes this awkward. What's more, other problems could be as bad or worse.

        I don't say this is easy either way, but here's another approach.


        Code:
        clear
        input byte(a b c)
        1 0 .
        0 1 0
        . . 1
        end
        
        gen long id = _n 
        rename (a b c) (data#), addnumber 
        
        reshape long data, i(id) j(which) 
        
        gen ismissing = missing(data)
        
        bysort id (ismissing which) : gen wanted = cond(missing(data[2]), data[1], (data[1] + data[2]) / 2) 
        
        by id : gen indexes = cond(!missing(data[1]), strofreal(which[1]), "")
        by id : replace indexes = indexes + cond(!missing(data[2]), strofreal(which[2]), "") 
        
        list, sepby(id)
        
             +-------------------------------------------------+
             | id   which   data   ismiss~g   wanted   indexes |
             |-------------------------------------------------|
          1. |  1       1      1          0       .5        12 |
          2. |  1       2      0          0       .5        12 |
          3. |  1       3      .          1       .5        12 |
             |-------------------------------------------------|
          4. |  2       1      0          0       .5        12 |
          5. |  2       2      1          0       .5        12 |
          6. |  2       3      0          0       .5        12 |
             |-------------------------------------------------|
          7. |  3       3      1          0        1         3 |
          8. |  3       1      .          1        1         3 |
          9. |  3       2      .          1        1         3 |
             +-------------------------------------------------+

        Comment


        • #5
          Thanks for catching my error, Nick Cox . I shouldn't code when tired.

          Comment


          • #6
            @Nick Cox @Leonardo Guizzetti Thank you both for your time and code! They are extremely helpful. I eventually decided to go with the long format, as it seems much easier.

            Comment

            Working...
            X