  • Seeking alternative for merge m:m or vlookup multiple files - identify values from triad of relationships

    Dear Statalisters,

    I’m seeking help to resolve an annoying problem – it may be trivial for some of you, but the dyadic nature of the data is making my head spin. I have two unbalanced panel datasets:

    Dataset 1 – dyadic relationships recorded between two partners for multiple years.
    Dataset 2 – relationships of partners in dataset 1 with others (excluding their dyadic relationships) over multiple years.

    Both entities in dataset 1 (id and id_partner) may have multiple other relationships with other entities in dataset 2. I’m interested in identifying the following:

    Step 1. Check whether the focal entity in dataset 1 (id) has any other relationships with other entities in dataset 2 (id1-id2) for a given year.
    • The focal entity (id) can appear multiple times in both datasets because it can have dyadic relations with multiple other entities, e.g. A is related to B, A is related to E, and so on. The same holds for the partner entity (id_partner), which can also be related to multiple focal entities. In short, I’m dealing with an m:m case.
    • Similarly, Dataset 2 can have multiple records of focal and partner entities over multiple years.
    • As my main focus is the focal entity, I can remove duplicate id-year records from dataset 1 and use merge 1:m id year using dataset 2 to find matches for each focal entity-year record, which is fairly straightforward. The focal entity (id) from dataset 1 can appear in either id1 or id2 in dataset 2, but that is not a challenge as I can merge twice – once matching id-year against id1-year and then against id2-year.
    Step 2. If step 1 is True, check whether any of the identified partners of the focal entity also have any relationships with focal entity’s partner (id_partner - as identified in dataset 1).
    • This is where it gets tricky because the partner (id_partner) in dataset 1 is also related with multiple focal entities (id) – think of the worst case of infidelity where everyone in the society is having an affair with (almost) everyone else!
    • As in step 1, I can use merge to find matches of id_partner from dataset 1 in dataset 2. However, I’m then back to having two separate datasets – one with the focal entity (id in dataset 1) and its matched relationships from dataset 2, and another with the focal entity’s partner (id_partner in dataset 1) and its matched relationships from dataset 2.
    • I tried the community-contributed vlookup command, but my understanding is that it only finds matches within the current dataset and cannot look up matches in another file.
    • Eventually, I need to get the data in a format where I have focal entity (from dataset 1) + its identified matches (from dataset 2) + a tag for whether the particular match is also related to the focal entity’s partner entity (id_partner in dataset 1).
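    For concreteness, Step 1 as I currently do it looks roughly like this (the file names and tempfile names are placeholders, not my actual files). Stacking dataset 2 so the focal side always sits in one variable lets a single 1:m merge replace the two separate merges:

    ```stata
    * Reshape dataset 2 to "long" on the focal side (file names assumed).
    use dataset2, clear
    rename (id1 id2) (id other)        // focal entity appeared in id1
    tempfile side1
    save `side1'

    use dataset2, clear
    rename (id1 id2) (other id)        // focal entity appeared in id2
    append using `side1'
    tempfile d2long
    save `d2long'

    * Step 1: one merge finds all dataset-2 contacts of each focal entity-year.
    use dataset1, clear
    keep id year
    duplicates drop                    // one record per focal entity-year
    merge 1:m id year using `d2long', keep(master match)
    ```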
    In brief, the relationships in datasets 1 and 2 look like this:
    • Adam is the focal entity (dataset 1 - id).
    • Adam is dating Bella in year 1 and 2, Chelsea in year 2, and Dakota in year 1 and 3 (dataset 1 - id_partner).
    • Adam attends college with Brad in year 1 and 3, and with Chandler in year 2. Chandler is also related to Bella in year 2.
    Dataset 1
    year  id       id_partner
    1     Adam     Bella
    1     Adam     Dakota
    2     Adam     Bella
    2     Chelsea  Adam
    3     Adam     Dakota
    Dataset 2 (the last column is the tag I want to construct: is this contact also related to the focal entity's id_partner?)
    year  id1       id2   tag
    1     Adam      Brad  No
    2     Chandler  Adam  Yes
    3     Adam      Brad  No



    *Example generated by -dataex-. (dataset 1)
    clear
    input float year str10(id id_partner)
    2016 "194535257" "10001"
    2017 "194535257" "10001"
    2012 "10001" "3236"
    2013 "10001" "3236"
    2014 "10001" "3236"
    2015 "10001" "3236"
    2016 "10001" "3236"
    2017 "10001" "3236"
    2018 "10001" "3236"
    2019 "10001" "3236"
    2020 "10001" "3236"
    2014 "10001" "4598"
    2015 "10001" "4598"
    2016 "10001" "4598"
    2017 "10001" "4598"
    end

    *Example generated by -dataex- (dataset 2)
    clear
    input float year str10(id1 id2)
    2010 "1871295" "1000"
    2011 "1871295" "1000"
    2012 "1871295" "1000"
    2013 "1871295" "1000"
    2019 "194528198" "100000012"
    2020 "194528198" "100000012"
    2021 "194528198" "100000012"
    2022 "194528198" "100000012"
    2012 "10001" "11122"
    2013 "10001" "11122"
    2014 "10001" "11122"
    2015 "10001" "11122"
    2016 "10001" "11122"
    2006 "10001" "1508418"
    2007 "10001" "1508418"
    end


    Thanks for your help!

  • #2
    Unfortunately, in the example you show, while there are some secondary relationships found in dataset2, none of them link back to anybody in dataset1. This makes it very difficult to develop and test a solution to your problem. Please post example data that includes all of the kinds of linkages you are looking for.



    • #3
      Hi Clyde,

      Thanks for your quick response. I understand your frustration - unfortunately, both datasets contain almost half a million observations each, making it difficult to capture the variety of linkages in a short dataex excerpt. I'm posting 100 rows of each so as not to make the post look like spam. Please let me know if there's a more efficient way of sharing the nature of the data here.

      Dataset 1:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str10(id id_partner) float year
      "194535257" "10001" 2016
      "194535257" "10001" 2017
      "218119"    "10002" 2003
      "244142"    "10002" 2003
      "2283"      "10002" 2003
      "1679"      "10002" 2003
      "1679"      "10002" 2004
      "244142"    "10002" 2004
      "218119"    "10002" 2004
      "704"       "10002" 2005
      "244142"    "10002" 2005
      "215725"    "10002" 2006
      "1818294"   "10002" 2006
      "704"       "10002" 2006
      "234064"    "10002" 2006
      "244142"    "10002" 2006
      "1818294"   "10002" 2007
      "234064"    "10002" 2007
      "215725"    "10002" 2007
      "234009"    "10002" 2007
      "1818294"   "10002" 2008
      "234009"    "10002" 2008
      "215725"    "10002" 2008
      "233197"    "10002" 2010
      "233197"    "10002" 2011
      "2159"      "10004" 2003
      "2159"      "10004" 2004
      "5179"      "10004" 2004
      "9855"      "10004" 2005
      "5179"      "10004" 2005
      "222664"    "10005" 2003
      "1426396"   "10005" 2003
      "6774"      "10005" 2003
      "8548"      "10005" 2003
      "11168"     "10005" 2003
      "228445"    "10005" 2003
      "9551"      "10005" 2003
      "6667"      "10005" 2003
      "3885"      "10005" 2003
      "7188"      "10005" 2003
      "9886"      "10005" 2003
      "9364"      "10005" 2003
      "6217"      "10005" 2004
      "11168"     "10005" 2004
      "9886"      "10005" 2004
      "8548"      "10005" 2004
      "228445"    "10005" 2004
      "15601"     "10005" 2004
      "3885"      "10005" 2004
      "257401"    "10005" 2004
      "1426396"   "10005" 2004
      "7188"      "10005" 2004
      "9364"      "10005" 2004
      "222664"    "10005" 2004
      "6774"      "10005" 2004
      "11257"     "10005" 2005
      "228445"    "10005" 2005
      "11168"     "10005" 2005
      "222664"    "10005" 2005
      "257401"    "10005" 2005
      "8548"      "10005" 2005
      "6217"      "10005" 2005
      "6774"      "10005" 2005
      "3885"      "10005" 2005
      "5172"      "10005" 2005
      "9364"      "10005" 2005
      "9886"      "10005" 2005
      "11257"     "10005" 2006
      "228445"    "10005" 2006
      "257401"    "10005" 2006
      "6774"      "10005" 2006
      "9886"      "10005" 2006
      "3885"      "10005" 2006
      "1740357"   "10005" 2006
      "743"       "10005" 2006
      "5172"      "10005" 2006
      "8548"      "10005" 2006
      "229955"    "10005" 2007
      "9886"      "10005" 2007
      "1740357"   "10005" 2007
      "228445"    "10005" 2007
      "6774"      "10005" 2007
      "257401"    "10005" 2007
      "8548"      "10005" 2007
      "7124"      "10005" 2007
      "743"       "10005" 2007
      "15813"     "10007" 2003
      "244091"    "10007" 2003
      "245"       "10007" 2003
      "971"       "10007" 2003
      "6259"      "10007" 2003
      "244091"    "10007" 2004
      "971"       "10007" 2004
      "6259"      "10007" 2004
      "7649"      "10007" 2005
      "6259"      "10007" 2005
      "971"       "10007" 2005
      "7649"      "10007" 2006
      "6259"      "10007" 2006
      "216210"    "1001"  2003
      "8061"      "1001"  2003
      "233066"    "1001"  2003
      "2816"      "1001"  2003
      "3325"      "1001"  2003
      "242027"    "1001"  2003
      "5906"      "1001"  2003
      "3305"      "1001"  2003
      "1660327"   "1001"  2004
      "216210"    "1001"  2004
      "233066"    "1001"  2004
      "1443631"   "1001"  2004
      "8061"      "1001"  2004
      "242027"    "1001"  2004
      "1723"      "1001"  2004
      "5463"      "1001"  2004
      "3325"      "1001"  2004
      "2816"      "1001"  2004
      "9170"      "1001"  2005
      "233066"    "1001"  2005
      "5463"      "1001"  2005
      "1093"      "1001"  2005
      "1722384"   "1001"  2005
      "242027"    "1001"  2005
      "2939"      "1001"  2005
      "663"       "1001"  2005
      "1723"      "1001"  2005
      "4937"      "1001"  2005
      "5505"      "1001"  2005
      "2816"      "1001"  2005
      "1620"      "1001"  2005
      "3325"      "1001"  2005
      "3488"      "1001"  2005
      "1443631"   "1001"  2005
      "8061"      "1001"  2005
      "11012"     "1001"  2005
      "2939"      "1001"  2006
      "11012"     "1001"  2006
      "1723"      "1001"  2006
      "3488"      "1001"  2006
      "1722384"   "1001"  2006
      "1093"      "1001"  2006
      "8061"      "1001"  2006
      "663"       "1001"  2006
      "2816"      "1001"  2006
      "233066"    "1001"  2006
      "1620"      "1001"  2006
      "5505"      "1001"  2006
      "1576198"   "1001"  2006
      "227050"    "1001"  2006
      "5463"      "1001"  2006
      "242027"    "1001"  2006
      "9170"      "1001"  2006
      "5885"      "1001"  2007
      "1723"      "1001"  2007
      "1722384"   "1001"  2007
      "233066"    "1001"  2007
      "1093"      "1001"  2007
      "11012"     "1001"  2007
      "242027"    "1001"  2007
      "663"       "1001"  2007
      "9170"      "1001"  2007
      "215142"    "1001"  2007
      "8061"      "1001"  2007
      "2937"      "1001"  2007
      "1576198"   "1001"  2007
      "2939"      "1001"  2007
      "1723"      "1001"  2008
      "1576198"   "1001"  2008
      "663"       "1001"  2008
      "6255"      "1001"  2008
      "215142"    "1001"  2008
      "2937"      "1001"  2008
      "2939"      "1001"  2008
      "9170"      "1001"  2008
      "8061"      "1001"  2008
      "1093"      "1001"  2008
      "5885"      "1001"  2008
      "1722384"   "1001"  2008
      "233066"    "1001"  2008
      "242027"    "1001"  2008
      "11012"     "1001"  2008
      "1723"      "1001"  2009
      "6255"      "1001"  2009
      "8061"      "1001"  2009
      "2937"      "1001"  2009
      "7106"      "1001"  2009
      "11012"     "1001"  2009
      "663"       "1001"  2009
      "233066"    "1001"  2009
      "7106"      "1001"  2010
      "1723"      "1001"  2010
      "233066"    "1001"  2010
      "11012"     "1001"  2010
      "2937"      "1001"  2010
      "663"       "1001"  2010
      "227050"    "1001"  2010
      "6255"      "1001"  2010
      "231184"    "1001"  2011
      "663"       "1001"  2011
      "7106"      "1001"  2011
      end


      Dataset 2:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str9(id1 id2) int year
      "10001" "1603201"  2011
      "10001" "1629140"  2011
      "10001" "1629142"  2011
      "10001" "1629144"  2011
      "10001" "234125"   2011
      "10001" "9588"     2011
      "10001" "11122"    2012
      "10001" "1603201"  2012
      "10001" "1629142"  2012
      "10001" "1629144"  2012
      "10001" "1825013"  2012
      "10001" "234125"   2012
      "10001" "9588"     2012
      "10001" "11122"    2013
      "10001" "1603201"  2013
      "10001" "1629142"  2013
      "10001" "1825013"  2013
      "10001" "234125"   2013
      "10001" "9588"     2013
      "10001" "11122"    2014
      "10001" "1603201"  2014
      "10001" "1629142"  2014
      "10001" "1825013"  2014
      "10001" "234125"   2014
      "10001" "9588"     2014
      "10001" "11122"    2015
      "10001" "1603201"  2015
      "10001" "1629142"  2015
      "10001" "1825013"  2015
      "10001" "234125"   2015
      "10001" "80059014" 2015
      "10001" "9588"     2015
      "10001" "11122"    2016
      "10001" "1629142"  2016
      "10001" "1825013"  2016
      "10001" "234125"   2016
      "10001" "76913976" 2016
      "10001" "80059014" 2016
      "10001" "1629142"  2017
      "10001" "1825013"  2017
      "10001" "234125"   2017
      "10001" "7096940"  2017
      "10001" "76913976" 2017
      "10001" "80059014" 2017
      "10001" "1825013"  2018
      "10001" "234125"   2018
      "10001" "6105"     2018
      "10001" "7096940"  2018
      "10001" "76913976" 2018
      "10001" "80059014" 2018
      "10001" "1825013"  2019
      "10001" "234125"   2019
      "10001" "6105"     2019
      "10001" "7096940"  2019
      "10001" "76913976" 2019
      "10001" "1825013"  2020
      "10001" "234125"   2020
      "10001" "7096940"  2020
      "10001" "76913976" 2020
      "10001" "1825013"  2021
      "10001" "234125"   2021
      "10001" "7096940"  2021
      "10001" "76913976" 2021
      "10001" "1825013"  2022
      "10001" "234125"   2022
      "10001" "7096940"  2022
      "10001" "76913976" 2022
      "10002" "10564"    2003
      "10002" "1533793"  2003
      "10002" "21281"    2003
      "10002" "218975"   2003
      "10002" "22258"    2003
      "10002" "2400"     2003
      "10002" "245966"   2003
      "10002" "3033"     2003
      "10002" "317"      2003
      "10002" "4205"     2003
      "10002" "4828"     2003
      "10002" "6259"     2003
      "10002" "7302"     2003
      "10002" "7384"     2003
      "10002" "1533793"  2004
      "10002" "21281"    2004
      "10002" "218975"   2004
      "10002" "235671"   2004
      "10002" "2400"     2004
      "10002" "3033"     2004
      "10002" "317"      2004
      "10002" "4828"     2004
      "10002" "4990"     2004
      "10002" "7302"     2004
      "10002" "7384"     2004
      "10002" "11545"    2005
      "10002" "1253"     2005
      "10002" "21281"    2005
      "10002" "218975"   2005
      "10002" "235671"   2005
      "10002" "2400"     2005
      "10002" "317"      2005
      "10002" "4828"     2005
      "10002" "4990"     2005
      "10002" "7384"     2005
      "10002" "11545"    2006
      "10002" "1253"     2006
      "10002" "15847"    2006
      "10002" "1754387"  2006
      "10002" "21281"    2006
      "10002" "218975"   2006
      "10002" "231167"   2006
      "10002" "2400"     2006
      "10002" "317"      2006
      "10002" "4828"     2006
      "10002" "7384"     2006
      "10002" "11545"    2007
      "10002" "1558211"  2007
      "10002" "15847"    2007
      "10002" "1754387"  2007
      "10002" "21281"    2007
      "10002" "218975"   2007
      "10002" "231167"   2007
      "10002" "2400"     2007
      "10002" "317"      2007
      "10002" "4828"     2007
      "10002" "6116"     2007
      "10002" "7161"     2007
      "10002" "7384"     2007
      "10002" "1558211"  2008
      "10002" "15847"    2008
      "10002" "1754387"  2008
      "10002" "218975"   2008
      "10002" "234064"   2008
      "10002" "317"      2008
      "10002" "6116"     2008
      "10002" "7161"     2008
      "10002" "7384"     2008
      "10002" "1754387"  2009
      "10002" "234064"   2009
      "10002" "6116"     2009
      "10002" "7161"     2009
      "10002" "1754387"  2010
      "10002" "234064"   2010
      "10002" "6116"     2010
      "10002" "7161"     2010
      "10002" "1754387"  2011
      "10002" "7457"     2012
      "10004" "1522428"  2003
      "10004" "4091"     2003
      "10004" "7158"     2003
      "10004" "1522428"  2004
      "10004" "15906"    2004
      "10004" "4091"     2004
      "10004" "5594"     2004
      "10004" "7158"     2004
      "10004" "1522428"  2005
      "10004" "15906"    2005
      "10004" "213121"   2005
      "10004" "392"      2005
      "10004" "5594"     2005
      "10004" "7158"     2005
      "10004" "7390"     2005
      "10004" "1522428"  2006
      "10004" "15906"    2006
      "10004" "213121"   2006
      "10004" "392"      2006
      "10004" "5594"     2006
      "10004" "7158"     2006
      "10004" "7390"     2006
      "10004" "1522428"  2007
      "10004" "15906"    2007
      "10004" "392"      2007
      "10004" "7158"     2007
      "10004" "9855"     2007
      "10004" "1522428"  2008
      "10004" "15906"    2008
      "10004" "392"      2008
      "10004" "7158"     2008
      "10004" "9855"     2008
      "10005" "1516122"  2003
      "10005" "216304"   2003
      "10005" "231068"   2003
      "10005" "5179"     2003
      "10005" "1516122"  2004
      "10005" "21002"    2004
      "10005" "216304"   2004
      "10005" "3179"     2004
      "10005" "5179"     2004
      "10005" "6624"     2004
      "10005" "6975"     2004
      "10005" "1516122"  2005
      "10005" "2402"     2005
      "10005" "3179"     2005
      "10005" "5179"     2005
      "10005" "6624"     2005
      "10005" "6975"     2005
      "10005" "7106"     2005
      "10005" "7649"     2005
      "10005" "11168"    2006
      "10005" "2402"     2006
      "10005" "3179"     2006
      "10005" "6624"     2006
      end



      • #4
        Thanks for trying. Unfortunately, as best I can tell, this expanded example also does not contain any linkages back to the primary partner.

        In addition, I am taken aback that this data set is so large. The solution I had in mind definitely will not work in a data set that large--it will greatly exceed available memory. So I'm going to have to ponder this one a while and see if I can figure out a method that is feasible in a large data set. It's not obvious to me that this can be done at all. But give me some time to ponder it.



        • #5
          Clyde,

          I share your frustration and I very much appreciate you giving it some thought. I know for sure that the common relationships exist in the datasets - I see them when I perform separate merges for each of the two dyadic partners and eyeball the data. However, the file size and the infrequency of such matches make it much harder to pinpoint an exact sample. As an aside, what are your thoughts on an m:m merge? I'm wondering whether this is that one rare case that requires it.
          Would something along the following lines make sense?
          • Instead of using a master file with one unique record (id) per primary entity, I keep all records of the id-partner dyads as is. For example, instead of only one variable containing unique id values, I retain all dyads in the master file and use both the id and partner id variables.
          • Assign each record an observation id (obs), as in Table 1.
          • Use an m:m merge to match with dataset 2 – Table 2. This way I can retain all dyads and identify them in the master file by obs id.
          • Then, probably, I can use a second merge with only the partner id as the master and compare the partner ids in the two. However, I’m not confident that this actually solves the problem.
          Table 1. Before merge (Master)
          obs  id         partner_id
          1    10001      4598
          2    10006      2186
          3    100180561  1631797
          4    100180561  22575357
          5    100180561  6244
          6    100181219  114386029
          7    100181219  122334162
          8    10023      127323000
          9    10023      49
          10   100284708  10082
          Table 2. After merge
          obs  id (id1)   partner_id  match (id2)  _merge
          1    10001      4598        1825013      Matched (3)
          1    10001      4598        234125       Matched (3)
          1    10001      4598        7096940      Matched (3)
          1    10001      4598        76913976     Matched (3)
          2    10006      2186        7863092      Matched (3)
          2    10006      2186        2186         Matched (3)
          2    10006      2186        8751378      Matched (3)
          3    100180561  1631797     20721363     Matched (3)
          4    100180561  22575357    441          Matched (3)
          5    100180561  6244        117389962    Matched (3)
          5    100180561  6244        63715665     Matched (3)
          6    100181219  114386029   .            Master only (1)
          7    100181219  122334162   .            Master only (1)
          8    10023      127323000   .            Master only (1)
          9    10023      49          .            Master only (1)
          10   100284708  10082       6204         Matched (3)
          I appreciate your time and help with this issue!



          • #6
            Aside, what are your thought on an m:m merge?
            Easiest question in the world: don't do it. It will give you incorrect results. I have written often in this Forum about the dangers of -merge m:m-. I won't repeat it all here. Suffice it to say that it is almost never the right solution, and when it does produce correct results it is almost always because a 1:1, 1:m, or m:1 merge would have produced the same thing. The situations where -merge m:m- is correct are so rare that, for practical purposes, you should make it a rule to never use it.

            It is true that your problem requires associating multiple observations in one data set with multiple observations in another. But -merge m:m- does not do that correctly. The command that does is -joinby-, and it is at the root of the approach I have been thinking about. The problem is that the results of that combined data set will be too large. So I'm trying to come up with a different approach that won't require creating a gargantuan intermediate data set.
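            A minimal toy example of what -joinby- does (made-up data, not from the thread):

            ```stata
            * Entity A has two contacts in year 1...
            clear
            input str1 id int year str1 contact
            "A" 1 "P"
            "A" 1 "Q"
            end
            tempfile contacts
            save `contacts'

            * ...and two dyadic partners in the same year.
            clear
            input str1 id int year str1 id_partner
            "A" 1 "B"
            "A" 1 "C"
            end

            * -joinby- forms ALL pairwise combinations within each id-year group:
            * here, 4 observations (each partner paired with each contact). That
            * full m-to-m association is exactly what -merge m:m- does not give.
            joinby id year using `contacts'
            list, noobs
            ```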



            • #7
              Thanks, Clyde. Yes, I have seen the threads advising against the use of -merge m:m-. That's exactly why I'm not confident using it as a solution. Hopefully there is some efficient way to find common matches to a dyad across more than one dataset.

              On a side note: do you think using joinby can work more efficiently after I create two files with matches for each of the two partners in the dyad, i.e. -- (1) first, find matches for main entity (id) in dataset 2 and save as one set, (2) then find matches for the partner entity in dataset 2 and save as another set, (3) and then use joinby on the two sets? That way, the total number of combinations I have to work with is smaller than the whole sample. I understand the joinby creates pairwise combinations of all records, but can't tell how efficient it will be when used on the smaller (matched) datasets.

              Thanks for your time and for giving it some thought. I hope there's a solution, as the problem of finding common matches to a dyad may not be so rare.
              Last edited by Ashish Sharma; 23 Jun 2023, 11:59.



              • #8
                On a side note: do you think using joinby can work more efficiently after I create two files with matches for each of the two partners in the dyad, i.e. -- (1) first, find matches for main entity (id) in dataset 2 and save as one set, (2) then find matches for the partner entity in dataset 2 and save as another set, (3) and then use joinby on the two sets? That way, the total number of combinations I have to work with is smaller than the whole sample. I understand the joinby creates pairwise combinations of all records, but can't tell how efficient it will be when used on the smaller (matched) datasets.
                Yes, that is the general direction in which my own thoughts have been moving. It's definitely worth a try. It's a little tricky to actually set up the -joinby- commands because, as with -merge-, the key variables have to have identical names in the data sets being joined, and in this case we are sometimes trying to pair up variables that have different names in the two data sets. So some dancing with -rename- is required. Anyway, yes, I would encourage you to pursue this approach. I still worry whether this will reduce the size of the intermediate data set required enough, but it's well worth a try.
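                A sketch of that pipeline, with the -rename- dance included (the file names and the variable name pmatch are assumptions, not tested on the real data):

                ```stata
                * Assumed inputs, built beforehand by the two Step-1 merges:
                *   focal_matches.dta:   id, year, match         (focal's contacts)
                *   partner_matches.dta: id_partner, year, match (partner's contacts)
                use partner_matches, clear
                rename match pmatch                 // avoid a name clash on join
                tempfile pm
                save `pm'

                use dataset1, clear                 // id, id_partner, year
                joinby id year using focal_matches, unmatched(master)
                drop _merge                         // -joinby- refuses if _merge exists
                joinby id_partner year using `pm', unmatched(master)
                drop _merge

                * Tag each focal match that also appears among the partner's contacts,
                * then collapse back to one row per dyad-year-match.
                gen byte shared = (match == pmatch) & !missing(match)
                bysort id id_partner year match: egen byte tag = max(shared)
                duplicates drop id id_partner year match, force
                drop pmatch shared
                ```

                The main caveat, as noted above, is the size of the intermediate data set: the second -joinby- forms all combinations of the focal's and the partner's matches within each dyad-year before the collapse.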

