
  • Merging datasets with m:m

    I know that the Stata user guide recommends not using the merge m:m command, but I believe it may be the best option in this scenario.

    I have two datasets of different sizes (DatasetA contains 1,000,000+ observations; DatasetB contains 500,000+ observations).

    I'd like to join the two datasets with the merge command but run into problems here. The variable common to both datasets is a UniqueID assigned to individuals. However, in each dataset, this UniqueID may appear once, more than once, or not at all. Take the following example of four individuals:

    Dataset1
    UniqueID   Variable1
    1          x
    1          y
    2          z
    2          h
    2          j
    3          k
    4          m

    Dataset2
    UniqueID   Variable2
    1          a
    2          b
    4          c
    2          d

    I would like to merge the two datasets. merge 1:1 is not appropriate, since UniqueID does not uniquely identify observations in both datasets. Neither merge 1:m nor merge m:1 is appropriate, since in neither dataset does UniqueID uniquely identify observations.
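
    A quick way to confirm which merge variant applies is to test the uniqueness of the key in each dataset; here is a minimal sketch, assuming the two datasets are saved under the hypothetical filenames below (isid exits with an error when the listed variables do not uniquely identify observations):

    Code:
    * which merge variant applies? test uniqueness of the key in each dataset
    use "Dataset1.dta", clear
    capture noisily isid UniqueID    // error: Dataset1 cannot be the "1" side
    duplicates report UniqueID       // distribution of copies per UniqueID

    use "Dataset2.dta", clear
    capture noisily isid UniqueID    // error here too, so neither side is unique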

    I'd assumed therefore that the only appropriate merge was merge m:m. However, this causes additional problems as it adds observations to the merged dataset.

    I understand that this process entails a whole new class of problems with the dataset, but I do not see a better way to merge. If anyone has any suggestions or can provide a better understanding of where I'm going wrong, please let me know.

    Thank you.

  • #2
    You could try something like the code below. With more than a million observations in one dataset and a half-million in the other, you might want to look into using one of the user-written fast reshape commands that are floating around (for example, greshape from the gtools package on SSC).

    . version 15.1

    . 
    . clear *

    . 
    . input byte UniqueID str1 Variable1

         UniqueID  Variable1
      1. 1       x
      2. 1       y
      3. 2       z
      4. 2       h
      5. 2       j
      6. 3       k
      7. 4       m
      8. end

    . 
    . tempfile Dataset1

    . quietly save `Dataset1'

    . 
    . drop _all

    . 
    . input byte UniqueID str1 Variable2

         UniqueID  Variable2
      1. 1       a
      2. 2       b
      3. 4       c
      4. 2       d
      5. end

    . 
    . *
    . * Begin here
    . *
    . bysort UniqueID (Variable2): generate long row = _n

    . quietly reshape wide Variable2, i(UniqueID) j(row)

    . 
    . tempfile Dataset2

    . quietly save `Dataset2'

    . 
    . use `Dataset1'

    . bysort UniqueID (Variable1): generate long row = _n

    . quietly reshape wide Variable1, i(UniqueID) j(row)

    . 
    . merge 1:1 UniqueID using `Dataset2', nogenerate noreport

    . 
    . // Recommend to quit here, but you could:
    . quietly reshape long Variable1 Variable2, i(UniqueID) j(discard)

    . quietly drop if missing(Variable1) & missing(Variable2)

    . sort UniqueID discard

    . drop discard

    . list, noobs sepby(UniqueID) abbreviate(20)

      +----------------------------------+
      | UniqueID   Variable1   Variable2 |
      |----------------------------------|
      |        1           x           a |
      |        1           y             |
      |----------------------------------|
      |        2           h           b |
      |        2           j           d |
      |        2           z             |
      |----------------------------------|
      |        3           k             |
      |----------------------------------|
      |        4           m           c |
      +----------------------------------+

    . 
    . exit

    end of do-file


    In any event, I recommend thinking through just what it is you're after. Anytime that you find yourself having to do something like an m:m merge, you've got something terribly wrong in your data model.



    • #3
      . . . or maybe you're looking for joinby instead of an m:m merge?

      . version 15.1

      . 
      . clear *

      . 
      . input byte UniqueID str1 Variable1

           UniqueID  Variable1
        1. 1       x
        2. 1       y
        3. 2       z
        4. 2       h
        5. 2       j
        6. 3       k
        7. 4       m
        8. end

      . 
      . tempfile Dataset1

      . quietly save `Dataset1'

      . 
      . drop _all

      . 
      . input byte UniqueID str1 Variable2

           UniqueID  Variable2
        1. 1       a
        2. 2       b
        3. 4       c
        4. 2       d
        5. end

      . 
      . joinby UniqueID using `Dataset1', unmatched(both)

      . sort UniqueID V*

      . 
      . list U V*, noobs sepby(UniqueID) abbreviate(20)

        +----------------------------------+
        | UniqueID   Variable2   Variable1 |
        |----------------------------------|
        |        1           a           x |
        |        1           a           y |
        |----------------------------------|
        |        2           b           h |
        |        2           b           j |
        |        2           b           z |
        |        2           d           h |
        |        2           d           j |
        |        2           d           z |
        |----------------------------------|
        |        3                       k |
        |----------------------------------|
        |        4           c           m |
        +----------------------------------+

      . 
      . exit

      end of do-file


      You're going to have one unwieldy dataset and, again, I'd recommend thinking things through a little.



      • #4
        Welcome to Statalist.

        I am less confident that I understand what you seek than Joseph is. You have shown us samples of your input datasets, but you have not shown us what you need the output to look like once you combine these two sample datasets. Note that I write "combine" precisely because it is not the name of any built-in Stata command. There are (at least) three commands in Stata that could be relevant to your task.
        • merge is not relevant to your task since, as you note, merge 1:1, merge 1:m, and merge m:1 are not helpful, and there is not an experienced Stata user on this list who would disagree with the Stata documentation's dismissal of merge m:m. Do heed the advice in the Stata Data Management Reference Manual discussion of m:m merges.
        Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in the Troubleshooting m:m merges section of that manual entry.
        • joinby is a command that people who tell Statalist they think they need merge m:m often find to be what they actually need, especially users familiar with SQL.
        • append sticks the two datasets end-to-end.
        Here's your data combined using joinby.
        Code:
        . joinby UniqueID using `Dataset1', unmatched(both)
        
        . sort UniqueID Variable1 Variable2
        
        . order UniqueID Variable1 Variable2
        
        . list, sepby(UniqueID)
        
             +----------------------------------------------------------------+
             | UniqueID   Variab~1   Variab~2                          _merge |
             |----------------------------------------------------------------|
          1. |        1          x          a   both in master and using data |
          2. |        1          y          a   both in master and using data |
             |----------------------------------------------------------------|
          3. |        2          h          b   both in master and using data |
          4. |        2          h          d   both in master and using data |
          5. |        2          j          b   both in master and using data |
          6. |        2          j          d   both in master and using data |
          7. |        2          z          b   both in master and using data |
          8. |        2          z          d   both in master and using data |
             |----------------------------------------------------------------|
          9. |        3          k                         only in using data |
             |----------------------------------------------------------------|
         10. |        4          m          c   both in master and using data |
             +----------------------------------------------------------------+
        Here's your dataset combined using append.
        Code:
        . append using `Dataset1'
        
        . sort UniqueID Variable1 Variable2
        
        . order UniqueID Variable1 Variable2
        
        . list, sepby(UniqueID)
        
             +--------------------------------+
             | UniqueID   Variab~1   Variab~2 |
             |--------------------------------|
          1. |        1                     a |
          2. |        1          x            |
          3. |        1          y            |
             |--------------------------------|
          4. |        2                     b |
          5. |        2                     d |
          6. |        2          h            |
          7. |        2          j            |
          8. |        2          z            |
             |--------------------------------|
          9. |        3          k            |
             |--------------------------------|
         10. |        4                     c |
         11. |        4          m            |
             +--------------------------------+



        • #5
          Thank you for the responses. Unfortunately, joinby would essentially create new observations, pairing values of Variable1 and Variable2 in combinations that do not exist in the data, which I can't allow in my analysis. As it turns out, the only solution that makes sense to my understanding is to append, sort, and sepby as William has shown in #4, which produces

          Code:
          . list, sepby(UniqueID)
          
               +--------------------------------+
               | UniqueID   Variab~1   Variab~2 |
               |--------------------------------|
            1. |        1          x          . |
            2. |        1          y          . |
            3. |        1          .          a |
               |--------------------------------|
            4. |        2          h          . |
            5. |        2          j          . |
            6. |        2          z          . |
            7. |        2          .          b |
            8. |        2          .          d |
               |--------------------------------|
            9. |        3          k          . |
               |--------------------------------|
           10. |        4          m          . |
           11. |        4          .          c |
               +--------------------------------+
          Now, my ultimate goal is to find relationships between Variable1 and Variable2 (e.g., the conditional probability of Variable2 given Variable1). However, I'm unsure how to approach this since, having used append, no single observation contains values of both Variable1 and Variable2. One solution as I see it would be to treat all observations for a given UniqueID as one observation, but I don't know of a way of doing this that retains all the information. Any suggestions?

          Thanks again for your help.



          • #6
            Unfortunately, joinby would essentially create new observations, pairing values of Variable1 and Variable2 in combinations that do not exist in the data, which I can't allow in my analysis.
            It turns out, now that you've explained the objective you are trying to achieve, that either joinby or append is the first step in achieving what you want.

            Starting with the output of joinby:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str1(variable1 variable2)
            1 "x" "a"
            1 "y" "a"
            2 "h" "b"
            2 "h" "d"
            2 "j" "b"
            2 "j" "d"
            2 "z" "b"
            2 "z" "d"
            3 "k" ""
            4 "m" "c"
            end
            list, sepby(uniqueid)
            levelsof variable1
            foreach v in `r(levels)' {
                generate v1_`v' = variable1=="`v'"
            }
            levelsof variable2
            foreach v in `r(levels)' {
                generate v2_`v' = variable2=="`v'"
            }
            collapse (max) v1_* v2_*, by(uniqueid)
            list, clean noobs
            Code:
            . list, sepby(uniqueid)
            
                 +--------------------------------+
                 | uniqueid   variab~1   variab~2 |
                 |--------------------------------|
              1. |        1          x          a |
              2. |        1          y          a |
                 |--------------------------------|
              3. |        2          h          b |
              4. |        2          h          d |
              5. |        2          j          b |
              6. |        2          j          d |
              7. |        2          z          b |
              8. |        2          z          d |
                 |--------------------------------|
              9. |        3          k            |
                 |--------------------------------|
             10. |        4          m          c |
                 +--------------------------------+
            
            . levelsof variable1
            `"h"' `"j"' `"k"' `"m"' `"x"' `"y"' `"z"'
            
            . foreach v in `r(levels)' {
              2.         generate v1_`v' = variable1=="`v'"
              3. }
            
            . levelsof variable2
            `"a"' `"b"' `"c"' `"d"'
            
            . foreach v in `r(levels)' {
              2.         generate v2_`v' = variable2=="`v'"
              3. }
            
            . collapse (max) v1_* v2_*, by(uniqueid)
            
            . list, clean noobs
            
                uniqueid   v1_h   v1_j   v1_k   v1_m   v1_x   v1_y   v1_z   v2_a   v2_b   v2_c   v2_d  
                       1      0      0      0      0      1      1      0      1      0      0      0  
                       2      1      1      0      0      0      0      1      0      1      0      1  
                       3      0      0      1      0      0      0      0      0      0      0      0  
                       4      0      0      0      1      0      0      0      0      0      1      0
            Starting with the output of append:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str2(variable1 variable2)
            1 ""  "a"
            1 "x" ""
            1 "y" ""
            2 ""  "b"
            2 ""  "d"
            2 "h" ""
            2 "j" ""
            2 "z" ""
            3 "k" ""
            4 ""  "c"
            4 "m" ""
            end
            list, sepby(uniqueid)
            levelsof variable1
            foreach v in `r(levels)' {
                generate v1_`v' = variable1=="`v'"
            }
            levelsof variable2
            foreach v in `r(levels)' {
                generate v2_`v' = variable2=="`v'"
            }
            collapse (max) v1_* v2_*, by(uniqueid)
            list, clean noobs
            Code:
            . levelsof variable1
            `"h"' `"j"' `"k"' `"m"' `"x"' `"y"' `"z"'
            
            . foreach v in `r(levels)' {
              2.         generate v1_`v' = variable1=="`v'"
              3. }
            
            . levelsof variable2
            `"a"' `"b"' `"c"' `"d"'
            
            . foreach v in `r(levels)' {
              2.         generate v2_`v' = variable2=="`v'"
              3. }
            
            . collapse (max) v1_* v2_*, by(uniqueid)
            
            . list, clean noobs
            
                uniqueid   v1_h   v1_j   v1_k   v1_m   v1_x   v1_y   v1_z   v2_a   v2_b   v2_c   v2_d  
                       1      0      0      0      0      1      1      0      1      0      0      0  
                       2      1      1      0      0      0      0      1      0      1      0      1  
                       3      0      0      1      0      0      0      0      0      0      0      0  
                       4      0      0      0      1      0      0      0      0      0      1      0
            But truth be told, now that you've described your objective, neither joinby nor append is necessary; you can start by applying reshape to your two datasets before a simple merge 1:1:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str1 variable2
            1 "a"
            2 "b"
            4 "c"
            2 "d"
            end
            generate v_ = 1
            reshape wide v_, i(uniqueid) j(variable2) string
            tempfile v2
            save "`v2'"
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str1 variable1
            1 "x"
            1 "y"
            2 "z"
            2 "h"
            2 "j"
            3 "k"
            4 "m"
            end
            generate v_ = 1
            reshape wide v_, i(uniqueid) j(variable1) string
            merge 1:1 uniqueid using "`v2'"
            drop _merge
            list, clean noobs
            Code:
            . list, clean noobs
            
                uniqueid   v_h   v_j   v_k   v_m   v_x   v_y   v_z   v_a   v_b   v_c   v_d  
                       1     .     .     .     .     1     1     .     1     .     .     .  
                       2     1     1     .     .     .     .     1     .     1     .     1  
                       3     .     .     1     .     .     .     .     .     .     .     .  
                       4     .     .     .     1     .     .     .     .     .     1     .



            • #7
              Thank you again, William. The effect of reshape is exactly what I would like to have. Given the size of my datasets, however, I run into other problems; namely, my combinations of UniqueID and Variable1, as well as my combinations of UniqueID and Variable2, do not uniquely identify observations. To recap, I have over 1,000,000 observations in one dataset and over 500,000 in the other.

              If it helps, I'll give more information on the data itself. Variable1 and Variable2 each represent dates of contact in Month Year format (Variable1 representing one type of contact, Variable2 another). Duplicates in UniqueID Variable1, therefore, represent an individual who received type 1 contact more than once in a given month.

              Duplicates of this type represent about 13% of the observations in the smaller dataset and about 2% of the observations in the larger dataset.

              Since, as stated above, my ultimate goal is to find relationships between Variable1 and Variable2 (e.g., the conditional probability of Variable2 given Variable1), a reshape by UniqueID and month would be great (i.e., where there are duplicates in UniqueID Variable1, reshape). However, it is here that I run into the problem of this combination not uniquely identifying observations.

              I see my options as the following:

              1) collapse duplicates in UniqueID and Variable1 (and/or UniqueID and Variable2) to eliminate the problem, thus allowing a merge 1:1. This would mean losing observations but would allow for an easy merge.

              2) reshape if there is another way around the problem of not uniquely identifying observations in UniqueID and Variable1 (and/or UniqueID and Variable2). In my view, this could theoretically allow me to hold on to observations if only there is a way around the problem of non-unique identification.
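
              For concreteness, option 1 might look something like the sketch below (hypothetical filename; the force option arbitrarily keeps one copy per group, which is where observations would be lost):

              Code:
              * option 1 sketch: de-duplicate so the pair can serve as a merge key
              use "Dataset1.dta", clear
              duplicates drop UniqueID Variable1, force
              isid UniqueID Variable1    // the pair now uniquely identifies observations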

              Which of these options would you recommend?

              Thank you again.



              • #8
                This is a significant expansion on your problem. The examples in post #1 omit this crucial point. That was not helpful. To improve your future posts, both on this topic and on future topics, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

                The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

                With that said, if you have multiple observations in Dataset1 with the same UniqueID and Variable1, how do they differ? Your examples in post #1 suggest there are only two variables in each dataset: is that not the case? In Dataset1, are there other variables besides UniqueID and Variable1, and will they also be the same in both observations, or do they differ?

                In general, take a step back, and follow the FAQ guidelines to help other members accurately and completely understand the problem for which you are asking their help in solving.



                • #9
                  Thanks again and apologies.

                  There are many variables in each dataset. In Dataset1, there is only one other variable, ContactID, that further differentiates the observations (duplicates still arise, but they make up a very small percentage (< 1%) of the total observations).

                  Code:
                  . duplicates report UniqueID Variable1 ContactID
                  
                  Duplicates in terms of UniqueID Variable1 ContactID
                  
                  --------------------------------------
                     copies | observations       surplus
                  ----------+---------------------------
                          1 |      2020687             0
                          2 |         4506          2253
                          3 |          501           334
                          4 |          216           162
                          5 |          125           100
                          6 |           72            60
                          7 |           35            30
                          8 |           32            28
                          9 |           18            16
                         10 |           10             9
                         11 |           11            10
                         16 |           16            15
                         29 |           29            28
                  --------------------------------------
                  Here is a sample of the duplicates list

                  Code:
                  . duplicates list UniqueID Variable1 ContactID, sepby(UniqueID Variable1)
                  
                  Duplicates in terms of UniqueID Variable1 ContactID
                  
                    +---------------------------------------------------+
                    | group:      obs:   UniqueID   Variab~1   Contac~D |
                    |---------------------------------------------------|
                    |      1    608922         22   Aug 2008     443165 |
                    |      1    608923         22   Aug 2008     443165 |
                    |      1    608925         22   Aug 2008     443165 |
                    |---------------------------------------------------|
                    |      2   8188695         33   Mar 2011     151437 |
                    |      2   9104043         33   Mar 2011     151437 |
                    |---------------------------------------------------|
                    |      3    421390         44   Oct 2014     743997 |
                    |      3    421391         44   Oct 2014     743997 |
                    |---------------------------------------------------|
                  In Dataset2 there is another variable, Variable3, that will uniquely identify observations.

                  Code:
                  . duplicates list UniqueID Variable2 Variable3
                  
                  Duplicates in terms of UniqueID Variable2 Variable3
                  
                  (0 observations are duplicates)
                  The trouble as I saw it is that neither Variable3 nor ContactID is present in both datasets. The only variable common to both datasets is UniqueID. For this reason, I was hoping to match and merge the datasets by UniqueID and the date variables (Variable1 and Variable2), which I could rename and match by easily enough, assuming I can be free of duplicates.

                  As I see my options, I can drop the duplicates in UniqueID Variable1 ContactID from Dataset1, meaning I would lose < 1% of my observations in that dataset. Alternatively, I could collapse those duplicates using one of the command's options.
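
                  For example, dropping them could be as simple as this sketch (equivalent to duplicates drop with the force option):

                  Code:
                  * keep one observation per UniqueID Variable1 ContactID group
                  bysort UniqueID Variable1 ContactID: keep if _n == 1
                  duplicates report UniqueID Variable1 ContactID    // all copies should now be 1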

                  In Dataset2 I can uniquely identify observations, but only with a variable, Variable3, which is not present in Dataset1.

                  If possible, I could attempt a reshape, but am unaware of how to accomplish this using three variables that uniquely identify observations. Keeping in mind that, if possible, I'd like to match by UniqueID and the date variables (Variable1 and Variable2).

                  I hope this makes things clearer. I appreciate any help you could provide.


                  • #10
                    As I see my options
                    I'm sorry to say this, but that is part of the problem. You don't necessarily see all the options available to you, and you don't tell us what end result you want. You started by telling us you needed to merge m:m without giving a clue as to what you hope to do with the results.

                    In post #5 you told us

                    my ultimate goal is to find relationships between Variable1 and Variable2 (eg. the conditional probability of Variable2 given Variable1)
                    but in post #7 you then told us

                    Variable1 and Variable2 each represent dates of contact in Month Year format (Variable1 representing one type of contact, Variable2 another)
                    to which I reply: what would "the conditional probability of November 2012 in Dataset2 given July 2011 in Dataset1" mean?

                    Then in post #9 you tell us

                    I was hoping to match and merge the datasets by UniqueID and the date variables
                    but in post #1 you did not indicate that the values of Variable1 and Variable2 were anything like each other, and you did not suggest they could be used to find matching observations in the two datasets. (And, I will add, as monthly dates they no longer seem suited to the reshape-based solutions I outlined earlier.)

                    At this point it is not clear to me what you are trying to use for matching: certainly UniqueID, as you suggested in post #1, but do you also want Variable1 in Dataset1 to match Variable2 in Dataset2, as you seem to suggest in post #9?

                    And then the next question: If you have two observations in Dataset1 that both "match" to the same three observations in Dataset2, what do you want the related observation(s) in the dataset you are preparing for analysis to be?

                    Is it perhaps six observations, comprising the 6 possible combinations of one of the two observations from Dataset1 with one of the three observations from Dataset2?

                    With that said, it would be helpful if you were to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

                    It would be particularly helpful to post small hand-made examples of your data in Dataset1 and in Dataset2, perhaps with just a few of the relevant variables and a few UniqueIDs - the same UniqueIDs in both Datasets, which should be chosen to exhibit the problems you are trying to overcome.

                    Sample data should be extracted and posted using the dataex command. If you are running Stata version 15.1, then dataex is part of your installation already. If running an earlier Stata, run ssc install dataex to get the command. Either way, run help dataex to read the simple instructions for using it. It takes just seconds to use dataex to provide a way for those who want to help you to create a complete and faithful replica of your example data in their own Stata. That, in turn, eliminates all sorts of questions that are left unanswered by descriptions or even by data listings and tables. And it enables people to test out code, so that you get the right answer the first time.



                    • #11
                      I'll do my best to make it clearer.

                      Here is an example of Dataset1:

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input long UniqueID float Variable1 long ContactID
                        18669 574 2342
                        18669 574 2389
                        18669 574 2502
                        12929 575 2730
                        12929 575 2732
                        12929 575 2732
                        12929 575 3024
                      3582222 576 3420
                        25691 576 3850
                        21392 576 3959
                        21392 576 4002
                        25691 576 4165
                        32084 576 4203
                        32084 576 4203
                        32084 576 4203
                        18669 574 4241
                        32084 576 4276
                          171 576 4329
                        18669 574 4340
                          171 576 4359
                        18669 574 4410
                        21392 576 4690
                        32675 576 4747
                      3582222 576 4979
                        32675 576 4988
                        32675 576 5068
                        32675 576 5302
                        32675 576 5353
                        32675 576 5355
                        21392 576 5474
                      end
                      format %tmMon_CCYY Variable1
                      Then, using duplicates I get:
                      Code:
                      . duplicates list UniqueID Variable1
                      
                      Duplicates in terms of UniqueID Variable1
                      
                        +-------------------------------------+
                        | group:   obs:   UniqueID   Variab~1 |
                        |-------------------------------------|
                        |      1     18        171   Jan 2008 |
                        |      1     20        171   Jan 2008 |
                        |      2      4      12929   Dec 2007 |
                        |      2      5      12929   Dec 2007 |
                        |      2      6      12929   Dec 2007 |
                        |-------------------------------------|
                        |      2      7      12929   Dec 2007 |
                        |      3      1      18669   Nov 2007 |
                        |      3      2      18669   Nov 2007 |
                        |      3      3      18669   Nov 2007 |
                        |      3     16      18669   Nov 2007 |
                        |-------------------------------------|
                        |      3     19      18669   Nov 2007 |
                        |      3     21      18669   Nov 2007 |
                        |      4     10      21392   Jan 2008 |
                        |      4     11      21392   Jan 2008 |
                        |      4     22      21392   Jan 2008 |
                        |-------------------------------------|
                        |      4     30      21392   Jan 2008 |
                        |      5      9      25691   Jan 2008 |
                        |      5     12      25691   Jan 2008 |
                        |      6     13      32084   Jan 2008 |
                        |      6     14      32084   Jan 2008 |
                        |-------------------------------------|
                        |      6     15      32084   Jan 2008 |
                        |      6     17      32084   Jan 2008 |
                        |      7     23      32675   Jan 2008 |
                        |      7     25      32675   Jan 2008 |
                        |      7     26      32675   Jan 2008 |
                        |-------------------------------------|
                        |      7     27      32675   Jan 2008 |
                        |      7     28      32675   Jan 2008 |
                        |      7     29      32675   Jan 2008 |
                        |      8      8    3582222   Jan 2008 |
                        |      8     24    3582222   Jan 2008 |
                        +-------------------------------------+
                      and

                      Code:
                      . duplicates list UniqueID Variable1 ContactID
                      
                      Duplicates in terms of UniqueID Variable1 ContactID
                      
                        +------------------------------------------------+
                        | group:   obs:   UniqueID   Variab~1   Contac~D |
                        |------------------------------------------------|
                        |      1      5      12929   Dec 2007       2732 |
                        |      1      6      12929   Dec 2007       2732 |
                        |      2     13      32084   Jan 2008       4203 |
                        |      2     14      32084   Jan 2008       4203 |
                        |      2     15      32084   Jan 2008       4203 |
                        +------------------------------------------------+
                      We see that there are duplicates in terms of UniqueID Variable1 ContactID. This occurs in the entire dataset but accounts for < 1% of total observations. Duplicates in terms of UniqueID Variable1 are much more common in the entire dataset, accounting for about 13% of all observations. If it helps in organizing the data, I don't mind dropping the < 1% of observations that are duplicates in terms of UniqueID Variable1 ContactID.


                      Here is an example of Dataset2:

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input long UniqueID float Variable2 long Variable3
                      3582222 603 33992021
                       105644 599 63578643
                        21392 619 56826249
                       105644 592 46591042
                        21392 577 90913024
                          171 619 97666371
                        21392 593 36899590
                        42758 596 78411587
                        42758 628 53321131
                        21392 590 45797938
                        42758 595 32833038
                      3582222 609 22359564
                        21392 577 94424799
                       105644 593  2232671
                          171 609 11651118
                          171 604 74211695
                        21392 581 48465261
                        21392 619 18428450
                        21392 576 97812574
                        12929 643 64244629
                        12929 586 23459878
                        21392 623 30345123
                        21392 601 25334736
                       105644 582  2299135
                        21392 576  5537417
                        21392 596 63780379
                        42758 584   929190
                        42758 606 74332876
                        21392 590 92067799
                        42758 643 32950881
                      end
                      format %tmMon_CCYY Variable2

                      We can see that Variable3 uniquely identifies every observation (this holds for the entire dataset).

                      Then, using duplicates I get:

                      Code:
                      . duplicates li UniqueID Variable2
                      
                      Duplicates in terms of UniqueID Variable2
                      
                        +-------------------------------------+
                        | group:   obs:   UniqueID   Variab~2 |
                        |-------------------------------------|
                        |      1     19      21392   Jan 2008 |
                        |      1     25      21392   Jan 2008 |
                        |      2      5      21392   Feb 2008 |
                        |      2     13      21392   Feb 2008 |
                        |      3     10      21392   Mar 2009 |
                        |-------------------------------------|
                        |      3     29      21392   Mar 2009 |
                        |      4      3      21392   Aug 2011 |
                        |      4     18      21392   Aug 2011 |
                        +-------------------------------------+
                      You'll notice that there are UniqueIDs present in Dataset1 that are not present in Dataset2 and vice versa. This is representative of the two datasets.

                      Variable1 and Variable2 show the dates for two different types of contact. Here we have more than one occurrence of type 2 contact for a given individual in the same month. When I speak of the conditional probability of Variable2 given Variable1, I mean to find the conditional probability of type 2 contact occurring conditional on type 1 contact having already occurred (If the date in Variable1 is the same as the date in Variable2, I can consider that as having already occurred as well). I also want to find the conditional probability of type 2 contact occurring conditional on type 1 contact having occurred within the same month.

                      As mentioned, UniqueID is the only variable common to both datasets.

                      At this point it is not clear to me what you are trying to use for matching: certainly UniqueID, as you suggested in post #1
                      Definitely matching by UniqueID.

                      but do you also want Variable1 in Dataset1 to match Variable2 in Dataset2, as you seem to suggest in post #9?
                      And then the next question: If you have two observations in Dataset1 that both "match" to the same three observations in Dataset2, what do you want the related observation(s) in the dataset you are preparing for analysis to be?
                      In this scenario, I would want to match one observation in Dataset1 to one observation in Dataset2. It is the possibility of this scenario that led me to consider matching by Variable1 and Variable2. However, I would like a format that retains both Variable1 and Variable2 in the final dataset (this is why I was considering a reshape). If this scenario persists when matching by Variable1 and Variable2, matching the first observation in Dataset1 with the first observation in Dataset2 is fine for my purposes - I do not need to differentiate further. The supposed third observation in Dataset2 I would want to keep as an observation but unmatched with an observation in Dataset1.

                      I would have the same preference in the event where I have two observations in Dataset2 that both match to the same three observations in Dataset1.

                      Is it perhaps six observations, comprising the 6 possible combinations of one of the two observations from Dataset1 with one of the three observations from Dataset2?
                      No. I do not need to consider all possible combinations of these matched observations.

                      Thank you for your help and consideration.



                      • #12
                        Here is something that is at least a start at what you want. For each UniqueID it matches the "first" observation in Dataset1 to the "first" observation in Dataset2, and so on, where "first" is the order in which they appear in the dataset, with no consideration of the dates given by Variable1 and Variable2. I'm using just a subset of the sample data to keep the results reasonably short, but it represents the various possibilities.
                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input long UniqueID float Variable2 long Variable3
                        3582222 603 33992021
                         105644 599 63578643
                          21392 619 56826249
                         105644 592 46591042
                          21392 577 90913024
                            171 619 97666371
                          21392 593 36899590
                          21392 590 45797938
                        3582222 609 22359564
                          21392 577 94424799
                         105644 593  2232671
                            171 609 11651118
                            171 604 74211695
                          21392 581 48465261
                          21392 619 18428450
                          21392 576 97812574
                          12929 643 64244629
                          12929 586 23459878
                          21392 623 30345123
                          21392 601 25334736
                         105644 582  2299135
                          21392 576  5537417
                          21392 596 63780379
                          21392 590 92067799
                        end
                        format %tmMon_CCYY Variable2
                        generate obs_v2 = _n
                        bysort UniqueID (obs_v2): generate seq = _n
                        
                        tempfile v2
                        save "`v2'"
                        
                        clear
                        input long UniqueID float Variable1 long ContactID
                          12929 575 2730
                          12929 575 2732
                          12929 575 2732
                          12929 575 3024
                        3582222 576 3420
                          25691 576 3850
                          21392 576 3959
                          21392 576 4002
                          25691 576 4165
                            171 576 4329
                            171 576 4359
                          21392 576 4690
                        3582222 576 4979
                          21392 576 5474
                        end
                        format %tmMon_CCYY Variable1
                        generate obs_v1 = _n
                        bysort UniqueID (obs_v1): generate seq = _n
                        
                        merge 1:1 UniqueID seq using "`v2'"
                        order UniqueID seq 
                        sort UniqueID seq
                        list, noobs sepby(UniqueID) abbreviate(9)
                        Code:
                        . list, noobs sepby(UniqueID) abbreviate(9)
                        
                          +----------------------------------------------------------------------------------------------------+
                          | UniqueID   seq   Variable1   ContactID   obs_v1   Variable2   Variable3   obs_v2            _merge |
                          |----------------------------------------------------------------------------------------------------|
                          |      171     1    Jan 2008        4329       10    Aug 2011    97666371        6       matched (3) |
                          |      171     2    Jan 2008        4359       11    Oct 2010    11651118       12       matched (3) |
                          |      171     3           .           .        .    May 2010    74211695       13    using only (2) |
                          |----------------------------------------------------------------------------------------------------|
                          |    12929     1    Dec 2007        2730        1    Aug 2013    64244629       17       matched (3) |
                          |    12929     2    Dec 2007        2732        2    Nov 2008    23459878       18       matched (3) |
                          |    12929     3    Dec 2007        2732        3           .           .        .   master only (1) |
                          |    12929     4    Dec 2007        3024        4           .           .        .   master only (1) |
                          |----------------------------------------------------------------------------------------------------|
                          |    21392     1    Jan 2008        3959        7    Aug 2011    56826249        3       matched (3) |
                          |    21392     2    Jan 2008        4002        8    Feb 2008    90913024        5       matched (3) |
                          |    21392     3    Jan 2008        4690       12    Jun 2009    36899590        7       matched (3) |
                          |    21392     4    Jan 2008        5474       14    Mar 2009    45797938        8       matched (3) |
                          |    21392     5           .           .        .    Feb 2008    94424799       10    using only (2) |
                          |    21392     6           .           .        .    Jun 2008    48465261       14    using only (2) |
                          |    21392     7           .           .        .    Aug 2011    18428450       15    using only (2) |
                          |    21392     8           .           .        .    Jan 2008    97812574       16    using only (2) |
                          |    21392     9           .           .        .    Dec 2011    30345123       19    using only (2) |
                          |    21392    10           .           .        .    Feb 2010    25334736       20    using only (2) |
                          |    21392    11           .           .        .    Jan 2008     5537417       22    using only (2) |
                          |    21392    12           .           .        .    Sep 2009    63780379       23    using only (2) |
                          |    21392    13           .           .        .    Mar 2009    92067799       24    using only (2) |
                          |----------------------------------------------------------------------------------------------------|
                          |    25691     1    Jan 2008        3850        6           .           .        .   master only (1) |
                          |    25691     2    Jan 2008        4165        9           .           .        .   master only (1) |
                          |----------------------------------------------------------------------------------------------------|
                          |   105644     1           .           .        .    Dec 2009    63578643        2    using only (2) |
                          |   105644     2           .           .        .    May 2009    46591042        4    using only (2) |
                          |   105644     3           .           .        .    Jun 2009     2232671       11    using only (2) |
                          |   105644     4           .           .        .    Jul 2008     2299135       21    using only (2) |
                          |----------------------------------------------------------------------------------------------------|
                          |  3582222     1    Jan 2008        3420        5    Apr 2010    33992021        1       matched (3) |
                          |  3582222     2    Jan 2008        4979       13    Oct 2010    22359564        9       matched (3) |
                          +----------------------------------------------------------------------------------------------------+
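
                        Once you have a combined dataset along these lines, a rough sketch of the conditional probability you described might look like the following. This is only my guess at the definition you want (treating each UniqueID as one individual, and "type 1 having already occurred" as any Variable2 date on or after the individual's earliest Variable1 date); you will need to refine it:

                        Code:
                        * sketch: P(any type-2 contact | at least one type-1 contact), per individual
                        bysort UniqueID: egen first_type1 = min(Variable1)
                        bysort UniqueID: egen type2_after = max(!missing(Variable2) & Variable2 >= first_type1)
                        generate byte had_type1 = !missing(first_type1)
                        egen tag = tag(UniqueID)
                        summarize type2_after if tag & had_type1    // mean estimates the conditional probability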



                        • #13
                          Hello,
                          I need to merge two data sets.

                          The first data set includes the start date and the end date of a billing period for each house ID.
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long house_id int(bill_start bill_end)
                          555 21923 21951
                          555 21952 21980
                          555 21981 22011
                          555 22012 22042
                          555 22043 22071
                          557 21939 21971
                          557 21972 21998
                          557 21999 22032
                          557 22033 22057
                          557 22058 22085
                          end
                          format %td bill_start
                          format %td bill_end
                          The second data set includes the meter read dates for each house ID.
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long house_id float read_date
                          555 21951
                          555 21976
                          555 21980
                          555 22011
                          555 22042
                          555 22051
                          555 22071
                          556 21951
                          556 21983
                          556 22011
                          556 22041
                          556 22071
                          557 21971
                          557 21998
                          557 22032
                          557 22036
                          557 22050
                          557 22057
                          557 22085
                          end
                          format %td read_date
                          In the first data set, there are multiple billing periods for one house ID; the same applies to the second data set (many meter read dates for a house ID).
                          The thing is that, for a given house, there can be multiple meter read dates within a single billing period.

                          I need to merge the two data sets such that, in the final data set, for each house ID I have the list of billing periods with the corresponding meter read dates (a ‘corresponding’ meter read date is one that falls within the dates of a billing period). In other words, I want the output file to look like this (_merge==3):
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long house_id float(bill_start bill_end read_date)
                          555 21923 21951 21951
                          555 21952 21980 21976
                          555 21952 21980 21980
                          555 21981 22011 22011
                          555 22012 22042 22042
                          555 22043 22071 22051
                          555 22043 22071 22071
                          557 21939 21971 21971
                          557 21972 21998 21998
                          557 21999 22032 22032
                          557 22033 22057 22036
                          557 22033 22057 22050
                          557 22033 22057 22057
                          557 22058 22085 22085
                          end
                          format %td bill_start
                          format %td bill_end
                          format %td read_date
                          I did the m:m merge command, but it did not give me what I wanted: I just got the houses with billing periods from the first data set with shifted meter read dates from the second data set.

                          Could you please help?



                          • #14
                            Your problem fits with the capacities of the community-contributed program rangejoin (-ssc describe rangejoin-), which I haven't used before, but whose help file was such that it was easy to apply to your situation. I presume this does what you want:
                            Code:
                            ssc install rangestat    // required by rangejoin
                            ssc install rangejoin
                            * assumes the billing-period data set is in memory and the
                            * meter-read data set has been saved as read.dta
                            rangejoin read_date bill_start bill_end using "read.dta", by(house_id)
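
                            In case it helps, here is a self-contained sketch using a few of the observations from #13. (I believe rangejoin keeps unmatched billing periods with read_date set to missing; drop those if you want only the matched rows you showed.)

                            Code:
                            * sketch: save the meter reads, load the billing periods, then rangejoin
                            clear
                            input long house_id float read_date
                            555 21951
                            555 21976
                            557 21971
                            end
                            format %td read_date
                            save "read.dta", replace

                            clear
                            input long house_id int(bill_start bill_end)
                            555 21923 21951
                            555 21952 21980
                            557 21939 21971
                            end
                            format %td bill_start bill_end
                            rangejoin read_date bill_start bill_end using "read.dta", by(house_id)
                            drop if missing(read_date)    // keep matched pairs only
                            sort house_id bill_start read_date
                            list, sepby(house_id)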

