Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Find overlap of values across columns

    Hi,

    I have a data set of the following structure:

    a1 a2 a3 b1 b2 b3
    11 13 17 9 13 22
    6 22 24 12 19 22
    12 14 29 7 12 29

    with variables of type A and variables of type B.
    I want to find the overlap in the values of the A and B type variables so that the table looks like:

    a1 a2 a3 b1 b2 b3 ol1 ol2 ol3
    11 13 17 9 13 22 13 . .
    6 22 24 12 19 22 . . .
    12 14 29 7 12 29 12 29 .

    where the OL variables present the overlaps.
    Is there an easy way to do this?

    Many thanks!
    Bobby
    Last edited by Bobby Wood; 21 Jun 2018, 14:26.

  • #2
    You could probably loop over all combinations of a? and b? to fill o? where a? matches b? (so many of these types of statements in loops: replace ol1 = a1 if a1==b1 & mi(ol1) & !mi(a1)....) , but I think using merge to match each a? with b? might be faster.

    Here's a toy example of how this would work:

    Code:
    clear
    input a1    a2    a3    b1    b2    b3    ol1    ol2    ol3
    11    13    17    9    13    22    13    .    .
    6    22    24    12    19    22    .    .    .
    12    14    29    7    12    29    12    29    .
    end
    
    list
    drop ol* //we'll remake this
    g obs = _n
     
     
    loc i = 1 //for ol
    forval n = 1/3 { //number of a's
        forval m = 1/3 { //number of b's
    preserve
    keep a`n' obs
    rename a`n' b`m'
    sa lookup.dta, replace
    restore
    *drop b*
    
        
        merge 1:1 obs b`m' using lookup.dta, update //  keep(1 3)
        drop if _m==2
        sort obs
        g ol`i' = b`m' if _m==3
        drop _merge
        loc `++i'
        
        }
        }
        
    **shift left for ol's (I bet there's a faster way to cascade right to left)
    preserve
    keep ol*
    g i = _n
    reshape long ol, i(i) j(j)
    drop if mi(ol)
    bys i (j): g j2 = _n
    drop j
    reshape wide ol , i(i) j(j2)
    drop i
    sa shifted.dta, replace
    restore
    
    drop ol*
    merge 1:1 _n using shifted
    drop _m
    Last edited by eric_a_booth; 21 Jun 2018, 16:16.
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

    Comment


    • #3
      Because I did some heavy reshaping today, I approached this as an attempt to answer the question "how many times can I reshape the same data in one program?"

      Here is the result, with an identifier added and three extra observations to test a subtlety I stumbled across that led to the complicated generate command.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int id byte(a1 a2 a3 b1 b2 b3)
      101 11 13 17  9 13 22
      102  6 22 24 12 19 22
      103 12 14 29  7 12 29
      201 12 12 29  7 12 29
      202 12 12 29  7 11 29
      203 12 12 29  7 11 33
      end
      reshape long a b, i(id) j(seq)
      rename (a b) (v_=)
      reshape long v_, i(id seq) j(var) string
      bysort id v_ (var) : generate o_ = v_ if _N>=2 & var[1]!=var[_N] & inlist(_n,1,_N)
      reshape wide v_ o_, i(id seq) j(var) string
      rename (v_*) (*)
      drop o_b
      bysort id (o_a): generate o = o_a[seq]
      drop o_a
      reshape wide a b o, i(id) j(seq)
      order id a* b* o*
      list, noobs sep(0)
      Code:
      . list, noobs sep(0)
      
        +--------------------------------------------------+
        |  id   a1   a2   a3   b1   b2   b3   o1   o2   o3 |
        |--------------------------------------------------|
        | 101   11   13   17    9   13   22   13    .    . |
        | 102    6   22   24   12   19   22   22    .    . |
        | 103   12   14   29    7   12   29   12   29    . |
        | 201   12   12   29    7   12   29   12   29    . |
        | 202   12   12   29    7   11   29   29    .    . |
        | 203   12   12   29    7   11   33    .    .    . |
        +--------------------------------------------------+
      I have to say, that is fairly opaque code. You pretty much have to add list commands after every reshape and run it to get any idea of what's going on.

      Comment


      • #4
        The logic of William is brilliant, but there are still some steps in the code to be cut off.

        Code:
        reshape long b, i(id)
        gen _o =b if inlist(b, a1,a2,a3)
        bys id (_o): gen o=_o[_j]
        drop _o
        reshape wide b o, i(id)
        order id a* b* o*

        Comment


        • #5
          Romalpa's elegant code is a great improvement over the clutter of my code.

          I concerned myself with the possibility that a single value might occur more than once among the a's or among the b's, although the sample data in post #1 did not suggest that as a possibility. My objective was that no value appear more than once among the o's. If indeed repeated a's or b's are possible, then Romalpa's code handles repeated values among the a's, and adding a single line to the code (in red, below) takes care of repeated values among the b's as well.
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input int id byte(a1 a2 a3 b1 b2 b3)
          101 11 13 17  9 13 22
          102  6 22 24 12 19 22
          103 12 14 29  7 12 29
          201 12 12 29  7 12 29
          202 12 12 29  7 11 29
          203 12 12 29  7 11 33
          301 12 14 29  7 12 12
          end
          reshape long b, i(id)
          gen _o =b if inlist(b, a1,a2,a3)
          bys id (_o): replace _o=. if _n>1
          bys id (_o): gen o=_o[_j]
          drop _o
          reshape wide b o, i(id)
          order id a* b* o*

          Comment


          • #6
            Hi, thanks very much for the help and the great suggestions! In my case there are no repeated values in the A's and B's, forgot to mention this.
            I very much appreciate the amazing stata community! Thanks, Bobby!
            Last edited by Bobby Wood; 22 Jun 2018, 06:09.

            Comment


            • #7
              I do applaud the idea of William on erasing repeated values (if any) on the list of overlap values. Such suggestion would have its effect to make the solution more generally applicable.

              However, it seems to me that the suggested red-line of William would keep only the smallest "overlap" value, while Bobby might want to keep them all. Thus, I would suggest below modification, which only deletes duplications in the overlap-contained variables.
              Code:
              bys id (_o): replace _o=. if _o==_o[_n-1]

              Comment


              • #8
                I just experiences that the inlist command doesn't work in my case as my b's go from b1 to b50 ("expression too long"-error).
                As I see the inlist command only accepts up to ten variables.
                Is there an alternative to inlist where this works?

                Comment


                • #9
                  The limits of inlist are 10 elements if the elements are string and 250 elements if numeric.
                  So one option is to make sure you are evaluating numeric elements (you dont provide the code you're running so I"m not sure how to help with this approach).
                  Other options that I commonly use include:

                  1. creating an indicator for the condition in a loop:

                  Code:
                      gen var = .
                      forval n = 1/50 {
                          replace var = 1 if b`n'=="1"
                          }
                          **now do something with your flag 'var'
                  or
                  2. breaking up inlist into several statements using an or(|) logical operator. I"m not sure how many of these subexpressions you can link with logical operators before you hit a limit.

                  Example:
                  Code:
                      **hits the 10 element limit:
                      g z = 1 if inlist("1", x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 , x9, x10,x11,x12,x101)
                      **does not hit the limit:
                      g z2 = 1 if inlist("1", x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 , x9) | inlist("1", x10, x11, x12, x101)
                  Last edited by eric_a_booth; 22 Jun 2018, 07:22.
                  Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

                  Comment


                  • #10
                    Another alternative to the code using the inlist function (not command) is using the code that I supplied in post #3.

                    The objectives of my approach in that post were to create code that did not assume you had only the three a and b variables shown in your sample data, and to not assume that in each observation, all the values of a were distinct and all the values of b were distinct. It succeeds at both of those.

                    Switching topics now, in post #7 Romalpa points out an error in my suggested modification in post #5 to her code from post #4. She is correct, and the code below corrects my code from post #5, again on the line in red. I will add that Romalpa's alternative from post #7 does not completely remove duplicate values of o for the newly-added id 402.
                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input int id byte(a1 a2 a3 b1 b2 b3)
                    101 11 13 17  9 13 22
                    102  6 22 24 12 19 22
                    103 12 14 29  7 12 29
                    201 12 12 29  7 12 29
                    202 12 12 29  7 11 29
                    203 12 12 29  7 11 33
                    301 12 14 29  7 12 12
                    401 12 14 29 12 12 29
                    402 12 14 29 12 12 12
                    end
                    reshape long b, i(id)
                    gen _o =b if inlist(b, a1,a2,a3)
                    bys id _o: replace _o=. if _n>1
                    bys id (_o): gen o=_o[_j]
                    drop _o
                    reshape wide b o, i(id)
                    order id a* b* o*

                    Comment


                    • #11
                      Here's another approach that can identify overlaps in a long layout with a single reshape and no loops or inlist(). It works even if there are duplicates within group a or b variables. If desired, the results can be converted to a wide layout and merged back with the initial data:

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input int id byte(a1 a2 a3 b1 b2 b3)
                      101 11 13 17  9 13 22
                      102  6 22 24 12 19 22
                      103 12 14 29  7 12 29
                      201 12 12 29  7 12 29
                      202 12 12 29  7 11 29
                      203 12 12 29  7 11 33
                      301 12 14 29  7 12 12
                      401 12 14 29 12 12 29
                      402 12 14 29 12 12 12
                      403 12 14 29 29 14 12
                      end
                      save "dataex.dta", replace
                      
                      * verify assumptions about the data
                      isid id, sort
                      
                      * convert all variables from group a and b to long
                      rename (a* b*) ol=
                      reshape long ol, i(id) j(source) string
                      
                      * identify overlaps between group a and b
                      gen group_a = substr(source,1,1) == "a"
                      bysort id ol (group_a): drop if group_a[1] == group_a[_N]
                      by id ol: keep if _n == 1
                      list, sepby(id)
                      
                      * if desired, reshape to wide and combine with original data
                      drop group_a source
                      by id: gen j = _n
                      reshape wide ol, i(id) j(j)
                      merge 1:1 id using "dataex.dta", assert(match using) nogen
                      
                      isid id, sort
                      list
                      Here are the results from the first list statement:
                      Code:
                      . list, sepby(id)
                      
                           +-----------------------------+
                           |  id   source   ol   group_a |
                           |-----------------------------|
                        1. | 101       b2   13         0 |
                           |-----------------------------|
                        2. | 102       b3   22         0 |
                           |-----------------------------|
                        3. | 103       b2   12         0 |
                        4. | 103       b3   29         0 |
                           |-----------------------------|
                        5. | 201       b2   12         0 |
                        6. | 201       b3   29         0 |
                           |-----------------------------|
                        7. | 202       b3   29         0 |
                           |-----------------------------|
                        8. | 301       b2   12         0 |
                           |-----------------------------|
                        9. | 401       b2   12         0 |
                       10. | 401       b3   29         0 |
                           |-----------------------------|
                       11. | 402       b2   12         0 |
                           |-----------------------------|
                       12. | 403       b3   12         0 |
                       13. | 403       b2   14         0 |
                       14. | 403       b1   29         0 |
                           +-----------------------------+
                      and the results from the second list statement:
                      Code:
                      . list
                      
                           +-----------------------------------------------------+
                           |  id   ol1   ol2   ol3   a1   a2   a3   b1   b2   b3 |
                           |-----------------------------------------------------|
                        1. | 101    13     .     .   11   13   17    9   13   22 |
                        2. | 102    22     .     .    6   22   24   12   19   22 |
                        3. | 103    12    29     .   12   14   29    7   12   29 |
                        4. | 201    12    29     .   12   12   29    7   12   29 |
                        5. | 202    29     .     .   12   12   29    7   11   29 |
                           |-----------------------------------------------------|
                        6. | 203     .     .     .   12   12   29    7   11   33 |
                        7. | 301    12     .     .   12   14   29    7   12   12 |
                        8. | 401    12    29     .   12   14   29   12   12   29 |
                        9. | 402    12     .     .   12   14   29   12   12   12 |
                       10. | 403    12    14    29   12   14   29   29   14   12 |
                           +-----------------------------------------------------+
                      
                      .

                      Comment


                      • #12
                        On large datasets, to avoid reshaping, an alternative solution is to work through the data row-by-row, storing the values of variables a* and b* in 2 macro lists and manipulating them with the extended macro function list.

                        Without generalizing to handle all kinds of values, this code works for the given example of the data.

                        Code:
                        clear
                        input int id byte(a1 a2 a3 b1 b2 b3)
                        101 11 13 17  9 13 22
                        102  6 22 24 12 19 22
                        103 12 14 29  7 12 29
                        201 12 12 29  7 12 29
                        202 12 12 29  7 11 29
                        203 12 12 29  7 11 33
                        301 12 14 29  7 12 12
                        401 12 14 29 12 12 29
                        402 12 14 29 12 12 12
                        403 12 14 29 29 14 12
                        end
                        
                        qui tostring a* b*, format("%03.0f") gen(sa1-sa3 sb1-sb3)
                        
                        egen sa=concat(sa*), punct(" ")
                        egen sb=concat(sb*), punct(" ")
                        
                        g ol=""
                        
                        forval i=1/`=_N' {
                            local sa = sa[`i']
                            local sb = sb[`i']
                            local o: list sa & sb
                            local o: list uniq o
                            local o: list sort o
                            qui replace ol="`o'" in `i'
                        }
                        
                        qui split ol, destring
                        
                        drop sa* sb* ol
                        qui compress
                        
                        list

                        Comment


                        • #13
                          1. Kevin's solution is an interesting one, which shows the proper effect of using list and split for string array. Notice that your line with tostring seems not necessary, since concat can cover this work.

                          2. Since reshape seems not favored for large dataset, the below code, using pure loops, is believed to be faster for large dataset.
                          Code:
                          local Nvars=floor(`c(k)'/2)    //Nvars =3 in the example
                          
                          gen o0=-99999
                          forval k=1/`Nvars'{
                          gen o`k'=.
                          forval i=1/`Nvars'{
                          forval j=1/`Nvars'{
                          replace o`k'= b`j' if b`j'==a`i' & o`k' > b`j' & b`j' > o`=`k'-1'
                          }
                          }
                          }
                          drop o0

                          Comment

                          Working...
                          X