Listing observations in a group that are the same on non-missing values of a given string variable

tig som

Join Date: Sep 2022

Posts: 58
#1


19 Dec 2022, 11:17

I am working on a command similar to the one in this FAQ:
Stata | FAQ: Listing observations in a group that differ on a variable
However, what if egenotype is missing for some observations, and I don't want Stata to treat those cases as the same?

How can I adapt such commands?
To make my question clear, I have changed the observations in the Stata example above as follows:

. dataex

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte eid str2 egenotype
0 "vv"
0 ""
0 ""
1 "ww"
1 "ww"
1 ""
1 ""
2 "vv"
2 "ww"
2 ""
3 "ww"
3 "ww"
end

------------------ copy up to and including the previous line ------------------

Listed 12 out of 12 observations

I want Stata to list only those observations whose non-missing values of egenotype are the same within each individual.

If I use the command from the Stata FAQ linked above, that is:

. by eid (egenotype), sort: gen same = egenotype[1] == egenotype[_N]
. list eid egenotype if same

+----------------+
| eid egenot~e |
|----------------|
11. | 3 ww |
12. | 3 ww |
+----------------+

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte eid str2 egenotype float same
0 ""   0
0 ""   0
0 "vv" 0
1 ""   0
1 ""   0
1 "ww" 0
1 "ww" 0
2 ""   0
2 "vv" 0
2 "ww" 0
3 "ww" 1
3 "ww" 1
end

------------------ copy up to and including the previous line ------------------

Listed 12 out of 12 observations

As the result above shows, Stata reports only eid 3 as having the same genotype within the individual. However, I also want eid 1 to count, because its non-missing egenotype values are identical once the missing values are ignored.

How do I modify the above command to list observations whose non-missing values of egenotype are the same within each group?

It would be great to have your tips.
Thank you.

Ken Chui

Join Date: Aug 2014
Posts: 1058

19 Dec 2022, 11:48

I'd probably go for collapse. It would also work if you have more than one duplicated pair within a single eid:

Code:

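* Count non-missing observations per eid/egenotype combination
preserve
gen case = 1
collapse (sum) case if !missing(egenotype), by(eid egenotype)
* Flag combinations that occur at least twice
gen duplicated = case >= 2
drop case
tempfile casecount
save `casecount', replace
restore

* Bring the flag back; observations with missing egenotype stay unmatched
merge m:1 eid egenotype using `casecount', nogen
list, sepby(eid)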

Results:

Code:

     +----------------------------------+
     | eid   egenot~e   same   duplic~d |
     |----------------------------------|
  1. |   0                 0          . |
  2. |   0                 0          . |
  3. |   0         vv      0          0 |
     |----------------------------------|
  4. |   1                 0          . |
  5. |   1                 0          . |
  6. |   1         ww      0          1 |
  7. |   1         ww      0          1 |
     |----------------------------------|
  8. |   2                 0          . |
  9. |   2         vv      0          0 |
 10. |   2         ww      0          0 |
     |----------------------------------|
 11. |   3         ww      1          1 |
 12. |   3         ww      1          1 |
     +----------------------------------+


Hemanshu Kumar

Join Date: Mar 2015
Posts: 1379

19 Dec 2022, 20:11

Here is alternative code which should give you the same output as #2:

Code:

bysort eid egenotype: gen count = _N
gen similar = (count > 1) if !missing(egenotype)
drop count

This produces:

Code:

. list, sepby(eid) noobs

  +--------------------------+
  | eid   egenot~e   similar |
  |--------------------------|
  |   0                    . |
  |   0                    . |
  |   0         vv         0 |
  |--------------------------|
  |   1                    . |
  |   1                    . |
  |   1         ww         1 |
  |   1         ww         1 |
  |--------------------------|
  |   2                    . |
  |   2         vv         0 |
  |   2         ww         0 |
  |--------------------------|
  |   3         ww         1 |
  |   3         ww         1 |
  +--------------------------+
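
Both answers above flag individual values that occur more than once within an eid. If the aim is instead to flag an eid only when all of its non-missing egenotype values agree, a minimal sketch along the lines of the FAQ command might look like the following. The variable names nonmiss and allsame are hypothetical, and requiring at least two non-missing values is an assumption about what should count as "the same":

```stata
* Sketch only: nonmiss and allsame are illustrative names, not from the thread
clear
input byte eid str2 egenotype
0 "vv"
0 ""
0 ""
1 "ww"
1 "ww"
1 ""
1 ""
2 "vv"
2 "ww"
2 ""
3 "ww"
3 "ww"
end

* 1 for observations with a non-missing genotype, 0 otherwise
gen byte nonmiss = !missing(egenotype)

* Within each eid, sort the non-missing values and compare first with last.
* Assumption: a single non-missing value does not count, hence _N > 1.
bysort eid nonmiss (egenotype): gen byte allsame = ///
    nonmiss & (_N > 1) & (egenotype[1] == egenotype[_N])

list eid egenotype if allsame, sepby(eid)
```

On the example data this lists the two ww observations of eid 1 and the two of eid 3. Unlike the count-based flags, an eid containing vv, vv and ww would not be listed here, because its non-missing values are not all identical.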


tig som

Join Date: Sep 2022

Posts: 58
#4

20 Dec 2022, 07:42

Thank you very much. I have a lot of data, and I need to run this similarity check regularly. In that case, collapsing and merging would need more care, so I will go with the latter suggestion. Thank you very much, Ken Chui and Hemanshu Kumar!
