  • Listing observations in a group that are the same on non-missing values of a given string variable

    I am working on a command similar to the one in this FAQ:
    Stata | FAQ: Listing observations in a group that differ on a variable
    However, what if egenotype is missing for some observations, and I don't want Stata to treat those cases as the same?

    How can I adapt such commands?
    To make my question clear, I have changed the observations in the Stata FAQ example as follows:



    . dataex

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte eid str2 egenotype
    0 "vv"
    0 ""  
    0 ""  
    1 "ww"
    1 "ww"
    1 ""  
    1 ""  
    2 "vv"
    2 "ww"
    2 ""  
    3 "ww"
    3 "ww"
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 12 out of 12 observations



    I want Stata to list only those observations whose non-missing values of egenotype are the same within each individual (eid).


    If I use the command from the Stata FAQ linked above, that is:


    . by eid (egenotype), sort: gen same = egenotype[1] == egenotype[_N]
    . list eid egenotype if same



         +----------------+
         | eid   egenot~e |
         |----------------|
     11. |   3         ww |
     12. |   3         ww |
         +----------------+


    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte eid str2 egenotype float same
    0 ""   0
    0 ""   0
    0 "vv" 0
    1 ""   0
    1 ""   0
    1 "ww" 0
    1 "ww" 0
    2 ""   0
    2 "vv" 0
    2 "ww" 0
    3 "ww" 1
    3 "ww" 1
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 12 out of 12 observations




    As the result above shows, Stata reports only eid 3 as having the same genotype within an individual. However, I also want eid 1 to count, because its non-missing egenotype values are all the same (ww) once the missing values in the group are ignored.

    How do I modify the above command to list observations whose non-missing values of egenotype are the same within a group?


    It would be great to have your tips.
    Thank you.


  • #2
    I'd probably go for collapse. It also works if a single eid contains more than one pair of matching values:

    Code:
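    preserve
    * Count observations per eid-egenotype pair, ignoring missing egenotype
    gen case = 1
    collapse (sum) case if !missing(egenotype), by(eid egenotype)
    * Flag pairs that occur at least twice
    gen duplicated = case >= 2
    drop case
    tempfile casecount
    save `casecount', replace
    restore

    * Attach the flag to the original data; rows with missing egenotype stay unmatched
    merge m:1 eid egenotype using `casecount', nogen
    list, sepby(eid)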
    Results:
    Code:
         +----------------------------------+
         | eid   egenot~e   same   duplic~d |
         |----------------------------------|
      1. |   0                 0          . |
      2. |   0                 0          . |
      3. |   0         vv      0          0 |
         |----------------------------------|
      4. |   1                 0          . |
      5. |   1                 0          . |
      6. |   1         ww      0          1 |
      7. |   1         ww      0          1 |
         |----------------------------------|
      8. |   2                 0          . |
      9. |   2         vv      0          0 |
     10. |   2         ww      0          0 |
         |----------------------------------|
     11. |   3         ww      1          1 |
     12. |   3         ww      1          1 |
         +----------------------------------+
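
    To list only the flagged observations, one option is the sketch below. It assumes the variable duplicated created by the code above. Stata treats missing (.) as nonzero, so a bare "if duplicated" would also pick up the observations where egenotype is missing; comparing with 1 avoids that.

    ```stata
    * Assumes duplicated exists from the collapse/merge step above.
    * Missing (.) counts as true in an if qualifier, so test for 1 explicitly.
    list eid egenotype if duplicated == 1, sepby(eid) noobs
    ```

    The same caution applies to any 0/1 flag that is left missing for some observations.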


    • #3
      Here is alternative code which should give you the same output as #2:

      Code:
      bysort eid egenotype: gen count = _N
      gen similar = (count > 1) if !missing(egenotype)
      drop count
      This produces:
      Code:
      . list, sepby(eid) noobs
      
        +--------------------------+
        | eid   egenot~e   similar |
        |--------------------------|
        |   0                    . |
        |   0                    . |
        |   0         vv         0 |
        |--------------------------|
        |   1                    . |
        |   1                    . |
        |   1         ww         1 |
        |   1         ww         1 |
        |--------------------------|
        |   2                    . |
        |   2         vv         0 |
        |   2         ww         0 |
        |--------------------------|
        |   3         ww         1 |
        |   3         ww         1 |
        +--------------------------+
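
      Note that the code in this post flags values that occur more than once within an eid. If the criterion should instead be that all non-missing values within an eid agree, a variant in the style of the FAQ command might look like the sketch below. The names nonmiss and allsame are hypothetical and introduced only for illustration.

      ```stata
      * Assumes eid and egenotype as in the example data of this thread.
      * nonmiss and allsame are hypothetical names used only in this sketch.
      gen byte nonmiss = !missing(egenotype)

      * Within each eid, sort the non-missing values and compare the first
      * with the last; observations with missing egenotype are left as missing.
      bysort eid nonmiss (egenotype): gen byte allsame = egenotype[1] == egenotype[_N] if nonmiss
      drop nonmiss

      * Compare with 1 so that missing values of allsame are not listed.
      list eid egenotype if allsame == 1, sepby(eid) noobs
      ```

      Under this criterion a group with a single non-missing value (eid 0 in the example) counts as trivially the same; adding "& _N > 1" to the comparison would exclude such groups. A group containing vv, ww, ww would be flagged 0 throughout here, whereas the count-based code above would flag its two ww observations.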


      • #4
        Thank you very much. I have a lot of data, and I need this similarity check regularly, so collapsing and merging would need more care. I will therefore go with the latter suggestion. Thank you very much, Ken Chui and Hamanshu Kumar!
