Replace variables within group and set some to missing.

Jack Reimer

Join Date: Sep 2018

Posts: 52
#1

Replace variables within group and set some to missing.

23 Oct 2018, 10:10

I've got a variable with that is sometimes present, sometimes not. For example,

Code:

group_id target_var 1 NA 1 C 1 NA 2 B 2 B 3 A 3 A 3 NA 3 B 3 NA

I would like to have a new variable that replaces all missing (i.e. target_var=="NA") within a particular group with the target var of the other observations in that group but if there are multiple values within that group, I would like them all to remain missing. So ideally the final dataset would look something like

Code:

group_id target_var final_var 1 NA C 1 C C 1 NA C 2 B B 2 B B 3 A A 3 A A 3 NA NA 3 B B 3 NA NA

sort group_id
So far, I've got a very inefficient method

Code:

gen final_var = target_var if target_var!="NA" replace final_var = final_var[_n-1] if final_var[_n] == "" & group_id[_n-1] == group_id[_n] replace final_var = final_var[_n+1] if final_var[_n]=="" & group_id[_n+1] == group_id[_n]

but this does not help me get some of the observations back to missing as would be the case for group_id==3 above. Another way of phrasing this question would be how can I set final_var from missing to an existing value in target_var if there is only one non-missing value? Once I have this, I can replace the remainder with a more simple command.

Last edited by Jack Reimer; 23 Oct 2018, 10:15.
Tags: group, replace, weighting

Nick Cox

Join Date: Mar 2014
Posts: 35698

23 Oct 2018, 11:01

One thing that is immediate is that working in groups usually benefits from exploiting by:.Then there is no need to check whether observations are in the same group or not.

This solution takes your example quite literally and hinges on noticing that "A" "B" "C" all sort before "NA". Then the idea is that a group is OK for change if (and only if) all values in that group are equal to the first value after sorting or are "NA". We note that all values being "NA" suits both criteria, which is not a problem: it doesn't matter if all "NA" are replaced by "NA" in that circumstance.

Code:

clear 
input group_id  str2 target_var
1                 NA
1                 C
1                 NA
2                 B
2                 B
3                 A
3                 A
3                 NA
3                 B
3                 NA
end 

gen long obs = _n 
bysort group_id (target_var) : gen candidate = target_var[1] 
gen OK = inlist(target, candidate, "NA") 
bysort group_id (OK) : gen wanted = cond(OK[1], candidate, target) 
sort obs 
drop obs 
list, sepby(group_id) 

     +----------------------------------------------+
     | group_id   target~r   candid~e   OK   wanted |
     |----------------------------------------------|
  1. |        1         NA          C    1        C |
  2. |        1          C          C    1        C |
  3. |        1         NA          C    1        C |
     |----------------------------------------------|
  4. |        2          B          B    1        B |
  5. |        2          B          B    1        B |
     |----------------------------------------------|
  6. |        3          A          A    1        A |
  7. |        3          A          A    1        A |
  8. |        3         NA          A    1       NA |
  9. |        3          B          A    0        B |
 10. |        3         NA          A    1       NA |
     +----------------------------------------------+

If your real example doesn't match that, then sing out, as a fix is an easy twist on the code above.

Announcement

Replace variables within group and set some to missing.

Comment