  • Seeking alternative for merge m:m or vlookup multiple files - identify values from triad of relationships

    Dear Statalisters,

    I’m seeking help to resolve an annoying problem – it may be trivial for some of you, but the dyadic nature of the data is making my head spin. I have two unbalanced panel datasets:

    Dataset 1 – dyadic relationships recorded between two partners for multiple years.
    Dataset 2 – relationships of partners in dataset 1 with others (excluding their dyadic relationships) over multiple years.

    Both entities in dataset 1 (id and id_partner) may have multiple other relationships with other entities in dataset 2. I’m interested in identifying the following:

    Step 1. Check whether the focal entity in dataset 1 (id) has any other relationships with other entities in dataset 2 (id1-id2) for a given year.
    • The focal entity (id) can appear multiple times in both datasets because it can have dyadic relations with multiple other entities, e.g. A is related to B, A is related to E, and so on. The same holds for the partner entity (id_partner), which can also be related to multiple focal entities. In short, I’m dealing with an m:m case.
    • Similarly, Dataset 2 can have multiple records of focal and partner entities over multiple years.
    • As my main focus is the focal entity, I can remove duplicate id-year records from dataset 1 and use merge 1:m id year using dataset 2 to find matches for each focal entity-year record, which is fairly straightforward. The focal entity (id) from dataset 1 can appear in either id1 or id2 in dataset 2, but that is not a challenge as I can merge twice – once matching id-year against id1-year and then against id2-year.
    Step 2. If step 1 is True, check whether any of the identified partners of the focal entity also have any relationships with focal entity’s partner (id_partner - as identified in dataset 1).
    • This is where it gets tricky because the partner (id_partner) in dataset 1 is also related with multiple focal entities (id) – think of the worst case of infidelity where everyone in the society is having an affair with (almost) everyone else!
    • As in step 1, I can use merge to find matches of id_partner from dataset 1 in dataset 2. However, I’m then back to having two separate datasets – one with the focal entity (id in dataset 1) and its matched relationships from dataset 2, and another with the focal entity’s partner (id_partner in dataset 1) and its matched relationships from dataset 2.
    • I tried the community-contributed vlookup command, but my understanding is that it only finds matches within the current dataset and cannot look up matches in another file.
    • Eventually, I need to get the data in a format where I have focal entity (from dataset 1) + its identified matches (from dataset 2) + a tag for whether the particular match is also related to the focal entity’s partner entity (id_partner in dataset 1).
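    For concreteness, Step 1 as I currently do it looks roughly like this (the file names and tempfile names are placeholders, not my actual files). Stacking dataset 2 so the focal side always sits in one variable lets a single 1:m merge replace the two separate merges:

    ```stata
    * Reshape dataset 2 to "long" on the focal side (file names assumed).
    use dataset2, clear
    rename (id1 id2) (id other)        // focal entity appeared in id1
    tempfile side1
    save `side1'

    use dataset2, clear
    rename (id1 id2) (other id)        // focal entity appeared in id2
    append using `side1'
    tempfile d2long
    save `d2long'

    * Step 1: one merge finds all dataset-2 contacts of each focal entity-year.
    use dataset1, clear
    keep id year
    duplicates drop                    // one record per focal entity-year
    merge 1:m id year using `d2long', keep(master match)
    ```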
    In brief, the relationships in datasets 1 and 2 look like this:
    • Adam is the focal entity (dataset 1 - id).
    • Adam is dating Bella in year 1 and 2, Chelsea in year 2, and Dakota in year 1 and 3 (dataset 1 - id_partner).
    • Adam attends college with Brad in year 1 and 3, and with Chandler in year 2. Chandler is also related to Bella in year 2.
    Dataset 1
    year  id       id_partner
    1     Adam     Bella
    1     Adam     Dakota
    2     Adam     Bella
    2     Chelsea  Adam
    3     Adam     Dakota
    Dataset 2 (the last column is the tag I want to construct: is this contact also related to the focal entity's id_partner?)
    year  id1       id2   tag
    1     Adam      Brad  No
    2     Chandler  Adam  Yes
    3     Adam      Brad  No



    *Example generated by -dataex-. (dataset 1)
    clear
    input float year str10(id id_partner)
    2016 "194535257" "10001"
    2017 "194535257" "10001"
    2012 "10001" "3236"
    2013 "10001" "3236"
    2014 "10001" "3236"
    2015 "10001" "3236"
    2016 "10001" "3236"
    2017 "10001" "3236"
    2018 "10001" "3236"
    2019 "10001" "3236"
    2020 "10001" "3236"
    2014 "10001" "4598"
    2015 "10001" "4598"
    2016 "10001" "4598"
    2017 "10001" "4598"
    end

    *Example generated by -dataex- (dataset 2)
    clear
    input float year str10(id1 id2)
    2010 "1871295" "1000"
    2011 "1871295" "1000"
    2012 "1871295" "1000"
    2013 "1871295" "1000"
    2019 "194528198" "100000012"
    2020 "194528198" "100000012"
    2021 "194528198" "100000012"
    2022 "194528198" "100000012"
    2012 "10001" "11122"
    2013 "10001" "11122"
    2014 "10001" "11122"
    2015 "10001" "11122"
    2016 "10001" "11122"
    2006 "10001" "1508418"
    2007 "10001" "1508418"
    end


    Thanks for your help!

  • #2
    Unfortunately, in the example you show, while there are some secondary relationships found in dataset2, none of them link back to anybody in dataset1. This makes it very difficult to develop and test a solution to your problem. Please post example data that includes all of the kinds of linkages you are looking for.



    • #3
      Hi Clyde,

      Thanks for your quick response. I understand your frustration - unfortunately, both datasets contain almost half a million observations each, making it difficult to capture the variety of linkages in a short dataex excerpt. I'm posting 100 rows of each so as not to make the post look like spam. Please let me know if there's a more efficient way of sharing the nature of the data here.

      Dataset 1:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str10(id id_partner) float year
      "194535257" "10001" 2016
      "194535257" "10001" 2017
      "218119"    "10002" 2003
      "244142"    "10002" 2003
      "2283"      "10002" 2003
      "1679"      "10002" 2003
      "1679"      "10002" 2004
      "244142"    "10002" 2004
      "218119"    "10002" 2004
      "704"       "10002" 2005
      "244142"    "10002" 2005
      "215725"    "10002" 2006
      "1818294"   "10002" 2006
      "704"       "10002" 2006
      "234064"    "10002" 2006
      "244142"    "10002" 2006
      "1818294"   "10002" 2007
      "234064"    "10002" 2007
      "215725"    "10002" 2007
      "234009"    "10002" 2007
      "1818294"   "10002" 2008
      "234009"    "10002" 2008
      "215725"    "10002" 2008
      "233197"    "10002" 2010
      "233197"    "10002" 2011
      "2159"      "10004" 2003
      "2159"      "10004" 2004
      "5179"      "10004" 2004
      "9855"      "10004" 2005
      "5179"      "10004" 2005
      "222664"    "10005" 2003
      "1426396"   "10005" 2003
      "6774"      "10005" 2003
      "8548"      "10005" 2003
      "11168"     "10005" 2003
      "228445"    "10005" 2003
      "9551"      "10005" 2003
      "6667"      "10005" 2003
      "3885"      "10005" 2003
      "7188"      "10005" 2003
      "9886"      "10005" 2003
      "9364"      "10005" 2003
      "6217"      "10005" 2004
      "11168"     "10005" 2004
      "9886"      "10005" 2004
      "8548"      "10005" 2004
      "228445"    "10005" 2004
      "15601"     "10005" 2004
      "3885"      "10005" 2004
      "257401"    "10005" 2004
      "1426396"   "10005" 2004
      "7188"      "10005" 2004
      "9364"      "10005" 2004
      "222664"    "10005" 2004
      "6774"      "10005" 2004
      "11257"     "10005" 2005
      "228445"    "10005" 2005
      "11168"     "10005" 2005
      "222664"    "10005" 2005
      "257401"    "10005" 2005
      "8548"      "10005" 2005
      "6217"      "10005" 2005
      "6774"      "10005" 2005
      "3885"      "10005" 2005
      "5172"      "10005" 2005
      "9364"      "10005" 2005
      "9886"      "10005" 2005
      "11257"     "10005" 2006
      "228445"    "10005" 2006
      "257401"    "10005" 2006
      "6774"      "10005" 2006
      "9886"      "10005" 2006
      "3885"      "10005" 2006
      "1740357"   "10005" 2006
      "743"       "10005" 2006
      "5172"      "10005" 2006
      "8548"      "10005" 2006
      "229955"    "10005" 2007
      "9886"      "10005" 2007
      "1740357"   "10005" 2007
      "228445"    "10005" 2007
      "6774"      "10005" 2007
      "257401"    "10005" 2007
      "8548"      "10005" 2007
      "7124"      "10005" 2007
      "743"       "10005" 2007
      "15813"     "10007" 2003
      "244091"    "10007" 2003
      "245"       "10007" 2003
      "971"       "10007" 2003
      "6259"      "10007" 2003
      "244091"    "10007" 2004
      "971"       "10007" 2004
      "6259"      "10007" 2004
      "7649"      "10007" 2005
      "6259"      "10007" 2005
      "971"       "10007" 2005
      "7649"      "10007" 2006
      "6259"      "10007" 2006
      "216210"    "1001"  2003
      "8061"      "1001"  2003
      "233066"    "1001"  2003
      "2816"      "1001"  2003
      "3325"      "1001"  2003
      "242027"    "1001"  2003
      "5906"      "1001"  2003
      "3305"      "1001"  2003
      "1660327"   "1001"  2004
      "216210"    "1001"  2004
      "233066"    "1001"  2004
      "1443631"   "1001"  2004
      "8061"      "1001"  2004
      "242027"    "1001"  2004
      "1723"      "1001"  2004
      "5463"      "1001"  2004
      "3325"      "1001"  2004
      "2816"      "1001"  2004
      "9170"      "1001"  2005
      "233066"    "1001"  2005
      "5463"      "1001"  2005
      "1093"      "1001"  2005
      "1722384"   "1001"  2005
      "242027"    "1001"  2005
      "2939"      "1001"  2005
      "663"       "1001"  2005
      "1723"      "1001"  2005
      "4937"      "1001"  2005
      "5505"      "1001"  2005
      "2816"      "1001"  2005
      "1620"      "1001"  2005
      "3325"      "1001"  2005
      "3488"      "1001"  2005
      "1443631"   "1001"  2005
      "8061"      "1001"  2005
      "11012"     "1001"  2005
      "2939"      "1001"  2006
      "11012"     "1001"  2006
      "1723"      "1001"  2006
      "3488"      "1001"  2006
      "1722384"   "1001"  2006
      "1093"      "1001"  2006
      "8061"      "1001"  2006
      "663"       "1001"  2006
      "2816"      "1001"  2006
      "233066"    "1001"  2006
      "1620"      "1001"  2006
      "5505"      "1001"  2006
      "1576198"   "1001"  2006
      "227050"    "1001"  2006
      "5463"      "1001"  2006
      "242027"    "1001"  2006
      "9170"      "1001"  2006
      "5885"      "1001"  2007
      "1723"      "1001"  2007
      "1722384"   "1001"  2007
      "233066"    "1001"  2007
      "1093"      "1001"  2007
      "11012"     "1001"  2007
      "242027"    "1001"  2007
      "663"       "1001"  2007
      "9170"      "1001"  2007
      "215142"    "1001"  2007
      "8061"      "1001"  2007
      "2937"      "1001"  2007
      "1576198"   "1001"  2007
      "2939"      "1001"  2007
      "1723"      "1001"  2008
      "1576198"   "1001"  2008
      "663"       "1001"  2008
      "6255"      "1001"  2008
      "215142"    "1001"  2008
      "2937"      "1001"  2008
      "2939"      "1001"  2008
      "9170"      "1001"  2008
      "8061"      "1001"  2008
      "1093"      "1001"  2008
      "5885"      "1001"  2008
      "1722384"   "1001"  2008
      "233066"    "1001"  2008
      "242027"    "1001"  2008
      "11012"     "1001"  2008
      "1723"      "1001"  2009
      "6255"      "1001"  2009
      "8061"      "1001"  2009
      "2937"      "1001"  2009
      "7106"      "1001"  2009
      "11012"     "1001"  2009
      "663"       "1001"  2009
      "233066"    "1001"  2009
      "7106"      "1001"  2010
      "1723"      "1001"  2010
      "233066"    "1001"  2010
      "11012"     "1001"  2010
      "2937"      "1001"  2010
      "663"       "1001"  2010
      "227050"    "1001"  2010
      "6255"      "1001"  2010
      "231184"    "1001"  2011
      "663"       "1001"  2011
      "7106"      "1001"  2011
      end


      Dataset 2:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str9(id1 id2) int year
      "10001" "1603201"  2011
      "10001" "1629140"  2011
      "10001" "1629142"  2011
      "10001" "1629144"  2011
      "10001" "234125"   2011
      "10001" "9588"     2011
      "10001" "11122"    2012
      "10001" "1603201"  2012
      "10001" "1629142"  2012
      "10001" "1629144"  2012
      "10001" "1825013"  2012
      "10001" "234125"   2012
      "10001" "9588"     2012
      "10001" "11122"    2013
      "10001" "1603201"  2013
      "10001" "1629142"  2013
      "10001" "1825013"  2013
      "10001" "234125"   2013
      "10001" "9588"     2013
      "10001" "11122"    2014
      "10001" "1603201"  2014
      "10001" "1629142"  2014
      "10001" "1825013"  2014
      "10001" "234125"   2014
      "10001" "9588"     2014
      "10001" "11122"    2015
      "10001" "1603201"  2015
      "10001" "1629142"  2015
      "10001" "1825013"  2015
      "10001" "234125"   2015
      "10001" "80059014" 2015
      "10001" "9588"     2015
      "10001" "11122"    2016
      "10001" "1629142"  2016
      "10001" "1825013"  2016
      "10001" "234125"   2016
      "10001" "76913976" 2016
      "10001" "80059014" 2016
      "10001" "1629142"  2017
      "10001" "1825013"  2017
      "10001" "234125"   2017
      "10001" "7096940"  2017
      "10001" "76913976" 2017
      "10001" "80059014" 2017
      "10001" "1825013"  2018
      "10001" "234125"   2018
      "10001" "6105"     2018
      "10001" "7096940"  2018
      "10001" "76913976" 2018
      "10001" "80059014" 2018
      "10001" "1825013"  2019
      "10001" "234125"   2019
      "10001" "6105"     2019
      "10001" "7096940"  2019
      "10001" "76913976" 2019
      "10001" "1825013"  2020
      "10001" "234125"   2020
      "10001" "7096940"  2020
      "10001" "76913976" 2020
      "10001" "1825013"  2021
      "10001" "234125"   2021
      "10001" "7096940"  2021
      "10001" "76913976" 2021
      "10001" "1825013"  2022
      "10001" "234125"   2022
      "10001" "7096940"  2022
      "10001" "76913976" 2022
      "10002" "10564"    2003
      "10002" "1533793"  2003
      "10002" "21281"    2003
      "10002" "218975"   2003
      "10002" "22258"    2003
      "10002" "2400"     2003
      "10002" "245966"   2003
      "10002" "3033"     2003
      "10002" "317"      2003
      "10002" "4205"     2003
      "10002" "4828"     2003
      "10002" "6259"     2003
      "10002" "7302"     2003
      "10002" "7384"     2003
      "10002" "1533793"  2004
      "10002" "21281"    2004
      "10002" "218975"   2004
      "10002" "235671"   2004
      "10002" "2400"     2004
      "10002" "3033"     2004
      "10002" "317"      2004
      "10002" "4828"     2004
      "10002" "4990"     2004
      "10002" "7302"     2004
      "10002" "7384"     2004
      "10002" "11545"    2005
      "10002" "1253"     2005
      "10002" "21281"    2005
      "10002" "218975"   2005
      "10002" "235671"   2005
      "10002" "2400"     2005
      "10002" "317"      2005
      "10002" "4828"     2005
      "10002" "4990"     2005
      "10002" "7384"     2005
      "10002" "11545"    2006
      "10002" "1253"     2006
      "10002" "15847"    2006
      "10002" "1754387"  2006
      "10002" "21281"    2006
      "10002" "218975"   2006
      "10002" "231167"   2006
      "10002" "2400"     2006
      "10002" "317"      2006
      "10002" "4828"     2006
      "10002" "7384"     2006
      "10002" "11545"    2007
      "10002" "1558211"  2007
      "10002" "15847"    2007
      "10002" "1754387"  2007
      "10002" "21281"    2007
      "10002" "218975"   2007
      "10002" "231167"   2007
      "10002" "2400"     2007
      "10002" "317"      2007
      "10002" "4828"     2007
      "10002" "6116"     2007
      "10002" "7161"     2007
      "10002" "7384"     2007
      "10002" "1558211"  2008
      "10002" "15847"    2008
      "10002" "1754387"  2008
      "10002" "218975"   2008
      "10002" "234064"   2008
      "10002" "317"      2008
      "10002" "6116"     2008
      "10002" "7161"     2008
      "10002" "7384"     2008
      "10002" "1754387"  2009
      "10002" "234064"   2009
      "10002" "6116"     2009
      "10002" "7161"     2009
      "10002" "1754387"  2010
      "10002" "234064"   2010
      "10002" "6116"     2010
      "10002" "7161"     2010
      "10002" "1754387"  2011
      "10002" "7457"     2012
      "10004" "1522428"  2003
      "10004" "4091"     2003
      "10004" "7158"     2003
      "10004" "1522428"  2004
      "10004" "15906"    2004
      "10004" "4091"     2004
      "10004" "5594"     2004
      "10004" "7158"     2004
      "10004" "1522428"  2005
      "10004" "15906"    2005
      "10004" "213121"   2005
      "10004" "392"      2005
      "10004" "5594"     2005
      "10004" "7158"     2005
      "10004" "7390"     2005
      "10004" "1522428"  2006
      "10004" "15906"    2006
      "10004" "213121"   2006
      "10004" "392"      2006
      "10004" "5594"     2006
      "10004" "7158"     2006
      "10004" "7390"     2006
      "10004" "1522428"  2007
      "10004" "15906"    2007
      "10004" "392"      2007
      "10004" "7158"     2007
      "10004" "9855"     2007
      "10004" "1522428"  2008
      "10004" "15906"    2008
      "10004" "392"      2008
      "10004" "7158"     2008
      "10004" "9855"     2008
      "10005" "1516122"  2003
      "10005" "216304"   2003
      "10005" "231068"   2003
      "10005" "5179"     2003
      "10005" "1516122"  2004
      "10005" "21002"    2004
      "10005" "216304"   2004
      "10005" "3179"     2004
      "10005" "5179"     2004
      "10005" "6624"     2004
      "10005" "6975"     2004
      "10005" "1516122"  2005
      "10005" "2402"     2005
      "10005" "3179"     2005
      "10005" "5179"     2005
      "10005" "6624"     2005
      "10005" "6975"     2005
      "10005" "7106"     2005
      "10005" "7649"     2005
      "10005" "11168"    2006
      "10005" "2402"     2006
      "10005" "3179"     2006
      "10005" "6624"     2006
      end



      • #4
        Thanks for trying. Unfortunately, as best I can tell, this expanded example also does not contain any linkages back to the primary partner.

        In addition, I am taken aback that this data set is so large. The solution I had in mind definitely will not work in a data set that large--it will greatly exceed available memory. So I'm going to have to ponder this one a while and see if I can figure out a method that is feasible in a large data set. It's not obvious to me that this can be done at all. But give me some time to ponder it.



        • #5
          Clyde,

          I share your frustration and I very much appreciate you giving it some thought. I know for sure that the common relationships exist in the datasets - I see them when I perform separate merges for each of the two dyadic partners and eyeball the data. However, the file size and the infrequency of such matches make it much harder to pinpoint an exact sample. As an aside, what are your thoughts on an m:m merge? I'm wondering whether this is that one rare case that requires it.
          Would something along the following lines make sense?
          • Instead of using a master file with one unique record (id) per primary entity, I keep all records of the id-partner dyads as is. For example, instead of only one variable containing unique id values, I retain all dyads in the master file and use both the id and partner id variables.
          • Assign each record an observation id (obs), as in Table 1.
          • Use an m:m merge to match with dataset 2 – Table 2. This way I can retain all dyads and identify them in the master file by obs id.
          • Then, probably, I can use a second merge with only the partner id as the master and compare the partner ids in the two. However, I’m not confident that this actually solves the problem.
          Table 1. Before merge (Master)
          obs  id         partner_id
          1    10001      4598
          2    10006      2186
          3    100180561  1631797
          4    100180561  22575357
          5    100180561  6244
          6    100181219  114386029
          7    100181219  122334162
          8    10023      127323000
          9    10023      49
          10   100284708  10082
          Table 2. After merge
          obs  id (id1)   partner_id  match (id2)  _merge
          1    10001      4598        1825013      Matched (3)
          1    10001      4598        234125       Matched (3)
          1    10001      4598        7096940      Matched (3)
          1    10001      4598        76913976     Matched (3)
          2    10006      2186        7863092      Matched (3)
          2    10006      2186        2186         Matched (3)
          2    10006      2186        8751378      Matched (3)
          3    100180561  1631797     20721363     Matched (3)
          4    100180561  22575357    441          Matched (3)
          5    100180561  6244        117389962    Matched (3)
          5    100180561  6244        63715665     Matched (3)
          6    100181219  114386029   .            Master only (1)
          7    100181219  122334162   .            Master only (1)
          8    10023      127323000   .            Master only (1)
          9    10023      49          .            Master only (1)
          10   100284708  10082       6204         Matched (3)
          I appreciate your time and help with this issue!



          • #6
            Aside, what are your thought on an m:m merge?
            Easiest question in the world: don't do it. It will give you incorrect results. I have written often in this Forum about the dangers of -merge m:m-. I won't repeat it all here. Suffice it to say that it is almost never the right solution, and when it does produce correct results it is almost always because a 1:1, 1:m, or m:1 merge would have produced the same thing. The situations where -merge m:m- is correct are so rare that, for practical purposes, you should make it a rule to never use it.

            It is true that your problem requires associating multiple observations in one data set with multiple observations in another. But -merge m:m- does not do that correctly. The command that does is -joinby-, and it is at the root of the approach I have been thinking about. The problem is that the results of that combined data set will be too large. So I'm trying to come up with a different approach that won't require creating a gargantuan intermediate data set.
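            A minimal toy example of what -joinby- does (made-up data, not from the thread):

            ```stata
            * Entity A has two contacts in year 1...
            clear
            input str1 id int year str1 contact
            "A" 1 "P"
            "A" 1 "Q"
            end
            tempfile contacts
            save `contacts'

            * ...and two dyadic partners in the same year.
            clear
            input str1 id int year str1 id_partner
            "A" 1 "B"
            "A" 1 "C"
            end

            * -joinby- forms ALL pairwise combinations within each id-year group:
            * here, 4 observations (each partner paired with each contact). That
            * full m-to-m association is exactly what -merge m:m- does not give.
            joinby id year using `contacts'
            list, noobs
            ```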



            • #7
              Thanks, Clyde. Yes, I have seen the threads advising against the use of -merge m:m-. That's exactly why I'm not confident using it as a solution. Hopefully there is some efficient way to find common matches to a dyad across more than one dataset.

              On a side note: do you think using joinby can work more efficiently after I create two files with matches for each of the two partners in the dyad, i.e. -- (1) first, find matches for main entity (id) in dataset 2 and save as one set, (2) then find matches for the partner entity in dataset 2 and save as another set, (3) and then use joinby on the two sets? That way, the total number of combinations I have to work with is smaller than the whole sample. I understand the joinby creates pairwise combinations of all records, but can't tell how efficient it will be when used on the smaller (matched) datasets.

              Thanks for your time and for giving it some thought. I hope there's a solution, as the problem of finding common matches to a dyad may not be so rare.
              Last edited by Ashish Sharma; 23 Jun 2023, 11:59.



              • #8
                On a side note: do you think using joinby can work more efficiently after I create two files with matches for each of the two partners in the dyad, i.e. -- (1) first, find matches for main entity (id) in dataset 2 and save as one set, (2) then find matches for the partner entity in dataset 2 and save as another set, (3) and then use joinby on the two sets? That way, the total number of combinations I have to work with is smaller than the whole sample. I understand the joinby creates pairwise combinations of all records, but can't tell how efficient it will be when used on the smaller (matched) datasets.
                Yes, that is the general direction in which my own thoughts have been moving. It's definitely worth a try. It's a little tricky to actually set up the -joinby- commands because, as with -merge-, the key variables have to have identical names in the data sets being joined, and in this case we are sometimes trying to pair up variables that have different names in the two data sets. So some dancing with -rename- is required. Anyway, yes, I would encourage you to pursue this approach. I still worry whether this will reduce the size of the intermediate data set required enough, but it's well worth a try.
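                A sketch of that pipeline, with the -rename- dance included (the file names and the variable name pmatch are assumptions, not tested on the real data):

                ```stata
                * Assumed inputs, built beforehand by the two Step-1 merges:
                *   focal_matches.dta:   id, year, match         (focal's contacts)
                *   partner_matches.dta: id_partner, year, match (partner's contacts)
                use partner_matches, clear
                rename match pmatch                 // avoid a name clash on join
                tempfile pm
                save `pm'

                use dataset1, clear                 // id, id_partner, year
                joinby id year using focal_matches, unmatched(master)
                drop _merge                         // -joinby- refuses if _merge exists
                joinby id_partner year using `pm', unmatched(master)
                drop _merge

                * Tag each focal match that also appears among the partner's contacts,
                * then collapse back to one row per dyad-year-match.
                gen byte shared = (match == pmatch) & !missing(match)
                bysort id id_partner year match: egen byte tag = max(shared)
                duplicates drop id id_partner year match, force
                drop pmatch shared
                ```

                The main caveat, as noted above, is the size of the intermediate data set: the second -joinby- forms all combinations of the focal's and the partner's matches within each dyad-year before the collapse.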

