Find overlap of values across columns

Bobby Wood

Join Date: Jul 2017

Posts: 39
#1

Find overlap of values across columns

21 Jun 2018, 14:19

Hi,

I have a data set of the following structure:

a1 a2 a3 b1 b2 b3

11 13 17 9 13 22

6 22 24 12 19 22

12 14 29 7 12 29

with variables of type A and variables of type B.
I want to find the overlap in the values of the A and B type variables so that the table looks like:

a1 a2 a3 b1 b2 b3 ol1 ol2 ol3

11 13 17 9 13 22 13 . .

6 22 24 12 19 22 . . .

12 14 29 7 12 29 12 29 .

where the OL variables present the overlaps.
Is there an easy way to do this?

Many thanks!
Bobby

Last edited by Bobby Wood; 21 Jun 2018, 14:26.
Tags: None

eric_a_booth

Join Date: Apr 2014
Posts: 292

21 Jun 2018, 16:04

You could probably loop over all combinations of a? and b? to fill o? where a? matches b? (so many of these types of statements in loops: replace ol1 = a1 if a1==b1 & mi(ol1) & !mi(a1)....) , but I think using merge to match each a? with b? might be faster.

Here's a toy example of how this would work:

Code:

clear
input a1    a2    a3    b1    b2    b3    ol1    ol2    ol3
11    13    17    9    13    22    13    .    .
6    22    24    12    19    22    .    .    .
12    14    29    7    12    29    12    29    .
end

list
drop ol* //we'll remake this
g obs = _n
 
 
loc i = 1 //for ol
forval n = 1/3 { //number of a's
    forval m = 1/3 { //number of b's
preserve
keep a`n' obs
rename a`n' b`m'
sa lookup.dta, replace
restore
*drop b*

    
    merge 1:1 obs b`m' using lookup.dta, update //  keep(1 3)
    drop if _m==2
    sort obs
    g ol`i' = b`m' if _m==3
    drop _merge
    loc `++i'
    
    }
    }
    
**shift left for ol's (I bet there's a faster way to cascade right to left)
preserve
keep ol*
g i = _n
reshape long ol, i(i) j(j)
drop if mi(ol)
bys i (j): g j2 = _n
drop j
reshape wide ol , i(i) j(j2)
drop i
sa shifted.dta, replace
restore

drop ol*
merge 1:1 _n using shifted
drop _m

Last edited by eric_a_booth; 21 Jun 2018, 16:16.

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

21 Jun 2018, 18:09

Because I did some heavy reshaping today, I approached this as an attempt to answer the question "how many times can I reshape the same data in one program?"

Here is the result, with an identifier added and three extra observations to test a subtlety I stumbled across that led to the complicated generate command.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte(a1 a2 a3 b1 b2 b3)
101 11 13 17  9 13 22
102  6 22 24 12 19 22
103 12 14 29  7 12 29
201 12 12 29  7 12 29
202 12 12 29  7 11 29
203 12 12 29  7 11 33
end
reshape long a b, i(id) j(seq)
rename (a b) (v_=)
reshape long v_, i(id seq) j(var) string
bysort id v_ (var) : generate o_ = v_ if _N>=2 & var[1]!=var[_N] & inlist(_n,1,_N)
reshape wide v_ o_, i(id seq) j(var) string
rename (v_*) (*)
drop o_b
bysort id (o_a): generate o = o_a[seq]
drop o_a
reshape wide a b o, i(id) j(seq)
order id a* b* o*
list, noobs sep(0)

Code:

. list, noobs sep(0)

  +--------------------------------------------------+
  |  id   a1   a2   a3   b1   b2   b3   o1   o2   o3 |
  |--------------------------------------------------|
  | 101   11   13   17    9   13   22   13    .    . |
  | 102    6   22   24   12   19   22   22    .    . |
  | 103   12   14   29    7   12   29   12   29    . |
  | 201   12   12   29    7   12   29   12   29    . |
  | 202   12   12   29    7   11   29   29    .    . |
  | 203   12   12   29    7   11   33    .    .    . |
  +--------------------------------------------------+

I have to say, that is fairly opaque code. You pretty much have to add list commands after every reshape and run it to get any idea of what's going on.

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#4

21 Jun 2018, 22:05

The logic of William is brilliant, but there are still some steps in the code to be cut off.

Code:

reshape long b, i(id) gen _o =b if inlist(b, a1,a2,a3) bys id (_o): gen o=_o[_j] drop _o reshape wide b o, i(id) order id a* b* o*
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

22 Jun 2018, 04:58

Romalpa's elegant code is a great improvement over the clutter of my code.

I concerned myself with the possibility that a single value might occur more than once among the a's or among the b's, although the sample data in post #1 did not suggest that as a possibility. My objective was that no value appear more than once among the o's. If indeed repeated a's or b's are possible, then Romalpa's code handles repeated values among the a's, and adding a single line to the code (in red, below) takes care of repeated values among the b's as well.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int id byte(a1 a2 a3 b1 b2 b3) 101 11 13 17 9 13 22 102 6 22 24 12 19 22 103 12 14 29 7 12 29 201 12 12 29 7 12 29 202 12 12 29 7 11 29 203 12 12 29 7 11 33 301 12 14 29 7 12 12 end reshape long b, i(id) gen _o =b if inlist(b, a1,a2,a3) bys id (_o): replace _o=. if _n>1 bys id (_o): gen o=_o[_j] drop _o reshape wide b o, i(id) order id a* b* o*
3 likes
Comment
Bobby Wood

Join Date: Jul 2017

Posts: 39
#6

22 Jun 2018, 06:01

Hi, thanks very much for the help and the great suggestions! In my case there are no repeated values in the A's and B's, forgot to mention this.
I very much appreciate the amazing stata community! Thanks, Bobby!

Last edited by Bobby Wood; 22 Jun 2018, 06:09.
Comment
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#7

22 Jun 2018, 06:12

I do applaud the idea of William on erasing repeated values (if any) on the list of overlap values. Such suggestion would have its effect to make the solution more generally applicable.

However, it seems to me that the suggested red-line of William would keep only the smallest "overlap" value, while Bobby might want to keep them all. Thus, I would suggest below modification, which only deletes duplications in the overlap-contained variables.

Code:

bys id (_o): replace _o=. if _o==_o[_n-1]
Comment
Bobby Wood

Join Date: Jul 2017

Posts: 39
#8

22 Jun 2018, 06:46

I just experiences that the inlist command doesn't work in my case as my b's go from b1 to b50 ("expression too long"-error).
As I see the inlist command only accepts up to ten variables.
Is there an alternative to inlist where this works?
Comment
eric_a_booth

Join Date: Apr 2014

Posts: 292
#9

22 Jun 2018, 07:15

The limits of inlist are 10 elements if the elements are string and 250 elements if numeric.
So one option is to make sure you are evaluating numeric elements (you dont provide the code you're running so I"m not sure how to help with this approach).
Other options that I commonly use include:

1. creating an indicator for the condition in a loop:

Code:

gen var = . forval n = 1/50 { replace var = 1 if b`n'=="1" } **now do something with your flag 'var'

or
2. breaking up inlist into several statements using an or(|) logical operator. I"m not sure how many of these subexpressions you can link with logical operators before you hit a limit.

Example:

Code:

**hits the 10 element limit: g z = 1 if inlist("1", x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 , x9, x10,x11,x12,x101) **does not hit the limit: g z2 = 1 if inlist("1", x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 , x9) | inlist("1", x10, x11, x12, x101)

Last edited by eric_a_booth; 22 Jun 2018, 07:22.

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

22 Jun 2018, 08:12

Another alternative to the code using the inlist function (not command) is using the code that I supplied in post #3.

The objectives of my approach in that post were to create code that did not assume you had only the three a and b variables shown in your sample data, and to not assume that in each observation, all the values of a were distinct and all the values of b were distinct. It succeeds at both of those.

Switching topics now, in post #7 Romalpa points out an error in my suggested modification in post #5 to her code from post #4. She is correct, and the code below corrects my code from post #5, again on the line in red. I will add that Romalpa's alternative from post #7 does not completely remove duplicate values of o for the newly-added id 402.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int id byte(a1 a2 a3 b1 b2 b3) 101 11 13 17 9 13 22 102 6 22 24 12 19 22 103 12 14 29 7 12 29 201 12 12 29 7 12 29 202 12 12 29 7 11 29 203 12 12 29 7 11 33 301 12 14 29 7 12 12 401 12 14 29 12 12 29 402 12 14 29 12 12 12 end reshape long b, i(id) gen _o =b if inlist(b, a1,a2,a3) bys id _o: replace _o=. if _n>1 bys id (_o): gen o=_o[_j] drop _o reshape wide b o, i(id) order id a* b* o*
1 like
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#11

22 Jun 2018, 09:21

Here's another approach that can identify overlaps in a long layout with a single reshape and no loops or inlist(). It works even if there are duplicates within group a or b variables. If desired, the results can be converted to a wide layout and merged back with the initial data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte(a1 a2 a3 b1 b2 b3)
101 11 13 17  9 13 22
102  6 22 24 12 19 22
103 12 14 29  7 12 29
201 12 12 29  7 12 29
202 12 12 29  7 11 29
203 12 12 29  7 11 33
301 12 14 29  7 12 12
401 12 14 29 12 12 29
402 12 14 29 12 12 12
403 12 14 29 29 14 12
end
save "dataex.dta", replace

* verify assumptions about the data
isid id, sort

* convert all variables from group a and b to long
rename (a* b*) ol=
reshape long ol, i(id) j(source) string

* identify overlaps between group a and b
gen group_a = substr(source,1,1) == "a"
bysort id ol (group_a): drop if group_a[1] == group_a[_N]
by id ol: keep if _n == 1
list, sepby(id)

* if desired, reshape to wide and combine with original data
drop group_a source
by id: gen j = _n
reshape wide ol, i(id) j(j)
merge 1:1 id using "dataex.dta", assert(match using) nogen

isid id, sort
list

Here are the results from the first list statement:

Code:

. list, sepby(id)

     +-----------------------------+
     |  id   source   ol   group_a |
     |-----------------------------|
  1. | 101       b2   13         0 |
     |-----------------------------|
  2. | 102       b3   22         0 |
     |-----------------------------|
  3. | 103       b2   12         0 |
  4. | 103       b3   29         0 |
     |-----------------------------|
  5. | 201       b2   12         0 |
  6. | 201       b3   29         0 |
     |-----------------------------|
  7. | 202       b3   29         0 |
     |-----------------------------|
  8. | 301       b2   12         0 |
     |-----------------------------|
  9. | 401       b2   12         0 |
 10. | 401       b3   29         0 |
     |-----------------------------|
 11. | 402       b2   12         0 |
     |-----------------------------|
 12. | 403       b3   12         0 |
 13. | 403       b2   14         0 |
 14. | 403       b1   29         0 |
     +-----------------------------+

and the results from the second list statement:

Code:

. list

     +-----------------------------------------------------+
     |  id   ol1   ol2   ol3   a1   a2   a3   b1   b2   b3 |
     |-----------------------------------------------------|
  1. | 101    13     .     .   11   13   17    9   13   22 |
  2. | 102    22     .     .    6   22   24   12   19   22 |
  3. | 103    12    29     .   12   14   29    7   12   29 |
  4. | 201    12    29     .   12   12   29    7   12   29 |
  5. | 202    29     .     .   12   12   29    7   11   29 |
     |-----------------------------------------------------|
  6. | 203     .     .     .   12   12   29    7   11   33 |
  7. | 301    12     .     .   12   14   29    7   12   12 |
  8. | 401    12    29     .   12   14   29   12   12   29 |
  9. | 402    12     .     .   12   14   29   12   12   12 |
 10. | 403    12    14    29   12   14   29   29   14   12 |
     +-----------------------------------------------------+

.

Comment

Kelvin Foo

Join Date: May 2017
Posts: 6

#12

24 Jun 2018, 06:56

On large datasets, to avoid reshaping, an alternative solution is to work through the data row-by-row, storing the values of variables a* and b* in 2 macro lists and manipulating them with the extended macro function list.

Without generalizing to handle all kinds of values, this code works for the given example of the data.

Code:

clear
input int id byte(a1 a2 a3 b1 b2 b3)
101 11 13 17  9 13 22
102  6 22 24 12 19 22
103 12 14 29  7 12 29
201 12 12 29  7 12 29
202 12 12 29  7 11 29
203 12 12 29  7 11 33
301 12 14 29  7 12 12
401 12 14 29 12 12 29
402 12 14 29 12 12 12
403 12 14 29 29 14 12
end

qui tostring a* b*, format("%03.0f") gen(sa1-sa3 sb1-sb3)

egen sa=concat(sa*), punct(" ")
egen sb=concat(sb*), punct(" ")

g ol=""

forval i=1/`=_N' {
    local sa = sa[`i']
    local sb = sb[`i']
    local o: list sa & sb
    local o: list uniq o
    local o: list sort o
    qui replace ol="`o'" in `i'
}

qui split ol, destring

drop sa* sb* ol
qui compress

list

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#13

24 Jun 2018, 09:30

1. Kevin's solution is an interesting one, which shows the proper effect of using list and split for string array. Notice that your line with tostring seems not necessary, since concat can cover this work.

2. Since reshape seems not favored for large dataset, the below code, using pure loops, is believed to be faster for large dataset.

Code:

local Nvars=floor(`c(k)'/2) //Nvars =3 in the example gen o0=-99999 forval k=1/`Nvars'{ gen o`k'=. forval i=1/`Nvars'{ forval j=1/`Nvars'{ replace o`k'= b`j' if b`j'==a`i' & o`k' > b`j' & b`j' > o`=`k'-1' } } } drop o0
Comment

Announcement