
  • Merging datasets with m:m

    I know that the Stata user guide recommends not using the merge m:m command, but I believe it may be the best option in this scenario.

    I have two datasets of different sizes (DatasetA contains 1,000,000+ observations; DatasetB contains 500,000+ observations).

    I'd like to join the two datasets with the merge command but run into problems here. The variable common to both datasets is a UniqueID assigned to individuals. However, in each dataset, this UniqueID may appear once, more than once, or not at all. Take the following example of four individuals:

    Dataset1
    UniqueID   Variable1
    1          x
    1          y
    2          z
    2          h
    2          j
    3          k
    4          m

    Dataset2
    UniqueID   Variable2
    1          a
    2          b
    4          c
    2          d

    I would like to merge the two datasets. merge 1:1 is not appropriate, since UniqueID does not uniquely identify observations in both datasets. Neither merge 1:m nor merge m:1 is appropriate, since in neither dataset does UniqueID uniquely identify observations.
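
    A quick way to confirm which merge variant applies is to test the uniqueness of the key in each dataset; here is a minimal sketch, assuming the two datasets are saved under the hypothetical filenames below (isid exits with an error when the listed variables do not uniquely identify observations):

    Code:
    * which merge variant applies? test uniqueness of the key in each dataset
    use "Dataset1.dta", clear
    capture noisily isid UniqueID    // error: Dataset1 cannot be the "1" side
    duplicates report UniqueID       // distribution of copies per UniqueID

    use "Dataset2.dta", clear
    capture noisily isid UniqueID    // error here too, so neither side is unique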

    I'd assumed therefore that the only appropriate merge was merge m:m. However, this causes additional problems as it adds observations to the merged dataset.

    I understand that this process entails a whole new class of problems with the dataset, but I do not see a better way to merge. If anyone has any suggestions or can provide a better understanding of where I'm going wrong, please let me know.

    Thank you.

  • #2
    You could try something like the code below. With more than a million observations in one dataset and a half-million in the other, you might want to look into using one of the user-written fast reshape commands that are floating around (for example, greshape from the gtools package on SSC).

    . version 15.1

    . 
    . clear *

    . 
    . input byte UniqueID str1 Variable1

         UniqueID  Variable1
      1. 1       x
      2. 1       y
      3. 2       z
      4. 2       h
      5. 2       j
      6. 3       k
      7. 4       m
      8. end

    . 
    . tempfile Dataset1

    . quietly save `Dataset1'

    . 
    . drop _all

    . 
    . input byte UniqueID str1 Variable2

         UniqueID  Variable2
      1. 1       a
      2. 2       b
      3. 4       c
      4. 2       d
      5. end

    . 
    . *
    . * Begin here
    . *
    . bysort UniqueID (Variable2): generate long row = _n

    . quietly reshape wide Variable2, i(UniqueID) j(row)

    . 
    . tempfile Dataset2

    . quietly save `Dataset2'

    . 
    . use `Dataset1'

    . bysort UniqueID (Variable1): generate long row = _n

    . quietly reshape wide Variable1, i(UniqueID) j(row)

    . 
    . merge 1:1 UniqueID using `Dataset2', nogenerate noreport

    . 
    . // Recommend to quit here, but you could:
    . quietly reshape long Variable1 Variable2, i(UniqueID) j(discard)

    . quietly drop if missing(Variable1) & missing(Variable2)

    . sort UniqueID discard

    . drop discard

    . list, noobs sepby(UniqueID) abbreviate(20)

      +----------------------------------+
      | UniqueID   Variable1   Variable2 |
      |----------------------------------|
      |        1           x           a |
      |        1           y             |
      |----------------------------------|
      |        2           h           b |
      |        2           j           d |
      |        2           z             |
      |----------------------------------|
      |        3           k             |
      |----------------------------------|
      |        4           m           c |
      +----------------------------------+

    . 
    . exit

    end of do-file


    In any event, I recommend thinking through just what it is you're after. Anytime that you find yourself having to do something like an m:m merge, you've got something terribly wrong in your data model.



    • #3
      . . . or maybe you're looking for joinby instead of an m:m merge?

      . version 15.1

      . 
      . clear *

      . 
      . input byte UniqueID str1 Variable1

           UniqueID  Variable1
        1. 1       x
        2. 1       y
        3. 2       z
        4. 2       h
        5. 2       j
        6. 3       k
        7. 4       m
        8. end

      . 
      . tempfile Dataset1

      . quietly save `Dataset1'

      . 
      . drop _all

      . 
      . input byte UniqueID str1 Variable2

           UniqueID  Variable2
        1. 1       a
        2. 2       b
        3. 4       c
        4. 2       d
        5. end

      . 
      . joinby UniqueID using `Dataset1', unmatched(both)

      . sort UniqueID V*

      . 
      . list U V*, noobs sepby(UniqueID) abbreviate(20)

        +----------------------------------+
        | UniqueID   Variable2   Variable1 |
        |----------------------------------|
        |        1           a           x |
        |        1           a           y |
        |----------------------------------|
        |        2           b           h |
        |        2           b           j |
        |        2           b           z |
        |        2           d           h |
        |        2           d           j |
        |        2           d           z |
        |----------------------------------|
        |        3                       k |
        |----------------------------------|
        |        4           c           m |
        +----------------------------------+

      . 
      . exit

      end of do-file


      You're going to have one unwieldy dataset and, again, I'd recommend thinking things through a little.



      • #4
        Welcome to Statalist.

        I am less confident that I understand what you seek than Joseph is. You have shown us samples of your input datasets, but you have not shown us what you need the output to look like once you combine these two sample datasets. Note that I write "combine" precisely because it is not the name of any built-in Stata command. There are (at least) three commands in Stata that could be relevant to your task.
        • merge is not relevant to your task since, as you note, merge 1:1, merge 1:m, and merge m:1 are not helpful, and there is not an experienced Stata user on this list who would disagree with the Stata documentation's dismissal of merge m:m. Do heed the advice in the Stata Data Management Reference Manual discussion of m:m merges.
        Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in the Troubleshooting m:m merges section of that manual entry.
        • joinby is a command that people who tell Statalist they think they need merge m:m often find to be what they actually need, especially users familiar with SQL.
        • append sticks the two datasets end-to-end.
        Here's your data combined using joinby.
        Code:
        . joinby UniqueID using `Dataset1', unmatched(both)
        
        . sort UniqueID Variable1 Variable2
        
        . order UniqueID Variable1 Variable2
        
        . list, sepby(UniqueID)
        
             +----------------------------------------------------------------+
             | UniqueID   Variab~1   Variab~2                          _merge |
             |----------------------------------------------------------------|
          1. |        1          x          a   both in master and using data |
          2. |        1          y          a   both in master and using data |
             |----------------------------------------------------------------|
          3. |        2          h          b   both in master and using data |
          4. |        2          h          d   both in master and using data |
          5. |        2          j          b   both in master and using data |
          6. |        2          j          d   both in master and using data |
          7. |        2          z          b   both in master and using data |
          8. |        2          z          d   both in master and using data |
             |----------------------------------------------------------------|
          9. |        3          k                         only in using data |
             |----------------------------------------------------------------|
         10. |        4          m          c   both in master and using data |
             +----------------------------------------------------------------+
        Here's your dataset combined using append.
        Code:
        . append using `Dataset1'
        
        . sort UniqueID Variable1 Variable2
        
        . order UniqueID Variable1 Variable2
        
        . list, sepby(UniqueID)
        
             +--------------------------------+
             | UniqueID   Variab~1   Variab~2 |
             |--------------------------------|
          1. |        1                     a |
          2. |        1          x            |
          3. |        1          y            |
             |--------------------------------|
          4. |        2                     b |
          5. |        2                     d |
          6. |        2          h            |
          7. |        2          j            |
          8. |        2          z            |
             |--------------------------------|
          9. |        3          k            |
             |--------------------------------|
         10. |        4                     c |
         11. |        4          m            |
             +--------------------------------+



        • #5
          Thank you for the responses. Unfortunately, joinby would essentially create new observations, pairing values of Variable1 and Variable2 in combinations that do not exist in the data, which I can't allow in my analysis. As it turns out, the only solution that makes sense to my understanding is to append, sort, and sepby as William has shown in #4, which produces

          Code:
          . list, sepby(UniqueID)
          
               +--------------------------------+
               | UniqueID   Variab~1   Variab~2 |
               |--------------------------------|
            1. |        1          x          . |
            2. |        1          y          . |
            3. |        1          .          a |
               |--------------------------------|
            4. |        2          h          . |
            5. |        2          j          . |
            6. |        2          z          . |
            7. |        2          .          b |
            8. |        2          .          d |
               |--------------------------------|
            9. |        3          k          . |
               |--------------------------------|
           10. |        4          m          . |
           11. |        4          .          c |
               +--------------------------------+
          Now, my ultimate goal is to find relationships between Variable1 and Variable2 (e.g., the conditional probability of Variable2 given Variable1). However, I'm unsure how to approach this since, having used append, no single observation contains values of both Variable1 and Variable2. One solution as I see it would be to treat all observations for a given UniqueID as one observation, but I don't know of a way of doing this that retains all the information. Any suggestions?

          Thanks again for your help.



          • #6
            Unfortunately, joinby would essentially create new observations, pairing values of Variable1 and Variable2 in combinations that do not exist in the data, which I can't allow in my analysis.
            It turns out, now that you've explained the objective you are trying to achieve, that either joinby or append is the first step in achieving what you want.

            Starting with the output of joinby:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str1(variable1 variable2)
            1 "x" "a"
            1 "y" "a"
            2 "h" "b"
            2 "h" "d"
            2 "j" "b"
            2 "j" "d"
            2 "z" "b"
            2 "z" "d"
            3 "k" ""
            4 "m" "c"
            end
            list, sepby(uniqueid)
            levelsof variable1
            foreach v in `r(levels)' {
                generate v1_`v' = variable1=="`v'"
            }
            levelsof variable2
            foreach v in `r(levels)' {
                generate v2_`v' = variable2=="`v'"
            }
            collapse (max) v1_* v2_*, by(uniqueid)
            list, clean noobs
            Code:
            . list, sepby(uniqueid)
            
                 +--------------------------------+
                 | uniqueid   variab~1   variab~2 |
                 |--------------------------------|
              1. |        1          x          a |
              2. |        1          y          a |
                 |--------------------------------|
              3. |        2          h          b |
              4. |        2          h          d |
              5. |        2          j          b |
              6. |        2          j          d |
              7. |        2          z          b |
              8. |        2          z          d |
                 |--------------------------------|
              9. |        3          k            |
                 |--------------------------------|
             10. |        4          m          c |
                 +--------------------------------+
            
            . levelsof variable1
            `"h"' `"j"' `"k"' `"m"' `"x"' `"y"' `"z"'
            
            . foreach v in `r(levels)' {
              2.         generate v1_`v' = variable1=="`v'"
              3. }
            
            . levelsof variable2
            `"a"' `"b"' `"c"' `"d"'
            
            . foreach v in `r(levels)' {
              2.         generate v2_`v' = variable2=="`v'"
              3. }
            
            . collapse (max) v1_* v2_*, by(uniqueid)
            
            . list, clean noobs
            
                uniqueid   v1_h   v1_j   v1_k   v1_m   v1_x   v1_y   v1_z   v2_a   v2_b   v2_c   v2_d  
                       1      0      0      0      0      1      1      0      1      0      0      0  
                       2      1      1      0      0      0      0      1      0      1      0      1  
                       3      0      0      1      0      0      0      0      0      0      0      0  
                       4      0      0      0      1      0      0      0      0      0      1      0
            Starting with the output of append:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str2(variable1 variable2)
            1 ""  "a"
            1 "x" ""
            1 "y" ""
            2 ""  "b"
            2 ""  "d"
            2 "h" ""
            2 "j" ""
            2 "z" ""
            3 "k" ""
            4 ""  "c"
            4 "m" ""
            end
            list, sepby(uniqueid)
            levelsof variable1
            foreach v in `r(levels)' {
                generate v1_`v' = variable1=="`v'"
            }
            levelsof variable2
            foreach v in `r(levels)' {
                generate v2_`v' = variable2=="`v'"
            }
            collapse (max) v1_* v2_*, by(uniqueid)
            list, clean noobs
            Code:
            . levelsof variable1
            `"h"' `"j"' `"k"' `"m"' `"x"' `"y"' `"z"'
            
            . foreach v in `r(levels)' {
              2.         generate v1_`v' = variable1=="`v'"
              3. }
            
            . levelsof variable2
            `"a"' `"b"' `"c"' `"d"'
            
            . foreach v in `r(levels)' {
              2.         generate v2_`v' = variable2=="`v'"
              3. }
            
            . collapse (max) v1_* v2_*, by(uniqueid)
            
            . list, clean noobs
            
                uniqueid   v1_h   v1_j   v1_k   v1_m   v1_x   v1_y   v1_z   v2_a   v2_b   v2_c   v2_d  
                       1      0      0      0      0      1      1      0      1      0      0      0  
                       2      1      1      0      0      0      0      1      0      1      0      1  
                       3      0      0      1      0      0      0      0      0      0      0      0  
                       4      0      0      0      1      0      0      0      0      0      1      0
            But truth be told, now that you've described your objective, neither joinby nor append is necessary; you can start by applying reshape to your two datasets before a simple merge 1:1:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str1 variable2
            1 "a"
            2 "b"
            4 "c"
            2 "d"
            end
            generate v_ = 1
            reshape wide v_, i(uniqueid) j(variable2) string
            tempfile v2
            save "`v2'"
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte uniqueid str1 variable1
            1 "x"
            1 "y"
            2 "z"
            2 "h"
            2 "j"
            3 "k"
            4 "m"
            end
            generate v_ = 1
            reshape wide v_, i(uniqueid) j(variable1) string
            merge 1:1 uniqueid using "`v2'"
            drop _merge
            list, clean noobs
            Code:
            . list, clean noobs
            
                uniqueid   v_h   v_j   v_k   v_m   v_x   v_y   v_z   v_a   v_b   v_c   v_d  
                       1     .     .     .     .     1     1     .     1     .     .     .  
                       2     1     1     .     .     .     .     1     .     1     .     1  
                       3     .     .     1     .     .     .     .     .     .     .     .  
                       4     .     .     .     1     .     .     .     .     .     1     .



            • #7
              Thank you again, William. The effect of reshape is exactly what I would like to have. Given the size of my datasets, however, I run into other problems; namely, my combinations of UniqueID and Variable1, as well as my combinations of UniqueID and Variable2, do not uniquely identify observations. To recap, I have over 1,000,000 observations in one dataset and over 500,000 in the other.

              If it helps, I'll give more information on the data itself. Variable1 and Variable2 each represent dates of contact in Month Year format (Variable1 representing one type of contact, Variable2 another). Duplicates in UniqueID Variable1, therefore, represent an individual who received type 1 contact more than once in a given month.

              Duplicates of this type represent about 13% of the observations in the smaller dataset and about 2% of the observations in the larger dataset.

              Since, as stated above, my ultimate goal is to find relationships between Variable1 and Variable2 (e.g., the conditional probability of Variable2 given Variable1), a reshape by UniqueID and month would be great (i.e., where there are duplicates in UniqueID Variable1, reshape). However, it is here that I run into the problem of this combination not uniquely identifying observations.

              I see my options as the following:

              1) collapse duplicates in UniqueID and Variable1 (and/or UniqueID and Variable2) to eliminate the problem, thus allowing a merge 1:1. This would mean losing observations but would allow for an easy merge.

              2) reshape if there is another way around the problem of not uniquely identifying observations in UniqueID and Variable1 (and/or UniqueID and Variable2). In my view, this could theoretically allow me to hold on to observations if only there is a way around the problem of non-unique identification.
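
              For concreteness, option 1 might look something like the sketch below (hypothetical filename; the force option arbitrarily keeps one copy per group, which is where observations would be lost):

              Code:
              * option 1 sketch: de-duplicate so the pair can serve as a merge key
              use "Dataset1.dta", clear
              duplicates drop UniqueID Variable1, force
              isid UniqueID Variable1    // the pair now uniquely identifies observations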

              Which of these options would you recommend?

              Thank you again.



              • #8
                This is a significant expansion on your problem. The examples in post #1 omit this crucial point. That was not helpful. To improve your future posts, both on this topic and on future topics, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

                The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

                With that said, if you have multiple observations in Dataset1 with the same UniqueID and Variable1, how do they differ? Your examples in post #1 suggest there are only two variables in each dataset: is that not the case? In Dataset1, are there other variables besides UniqueID and Variable1, and will they also be the same in both observations, or do they differ?

                In general, take a step back, and follow the FAQ guidelines to help other members accurately and completely understand the problem for which you are asking their help in solving.



                • #9
                  Thanks again and apologies.

                  There are many variables in each dataset. In Dataset1, there is only one other variable, ContactID, that further differentiates the observations (duplicates still arise, but they make up a very small percentage (< 1%) of the total observations).

                  Code:
                  . duplicates report UniqueID Variable1 ContactID
                  
                  Duplicates in terms of UniqueID Variable1 ContactID
                  
                  --------------------------------------
                     copies | observations       surplus
                  ----------+---------------------------
                          1 |      2020687             0
                          2 |         4506          2253
                          3 |          501           334
                          4 |          216           162
                          5 |          125           100
                          6 |           72            60
                          7 |           35            30
                          8 |           32            28
                          9 |           18            16
                         10 |           10             9
                         11 |           11            10
                         16 |           16            15
                         29 |           29            28
                  --------------------------------------
                  Here is a sample of the duplicates list

                  Code:
                  . duplicates list UniqueID Variable1 ContactID, sepby(UniqueID Variable1)
                  
                  Duplicates in terms of UniqueID Variable1 ContactID
                  
                    +---------------------------------------------------+
                    | group:      obs:   UniqueID   Variab~1   Contac~D |
                    |---------------------------------------------------|
                    |      1    608922         22   Aug 2008     443165 |
                    |      1    608923         22   Aug 2008     443165 |
                    |      1    608925         22   Aug 2008     443165 |
                    |---------------------------------------------------|
                    |      2   8188695         33   Mar 2011     151437 |
                    |      2   9104043         33   Mar 2011     151437 |
                    |---------------------------------------------------|
                    |      3    421390         44   Oct 2014     743997 |
                    |      3    421391         44   Oct 2014     743997 |
                    |---------------------------------------------------|
                  In Dataset2 there is another variable, Variable3, that will uniquely identify observations.

                  Code:
                  . duplicates list UniqueID Variable2 Variable3
                  
                  Duplicates in terms of UniqueID Variable2 Variable3
                  
                  (0 observations are duplicates)
                  The trouble as I saw it is that neither Variable3 nor ContactID is present in both datasets. The only variable common to both datasets is UniqueID. For this reason, I was hoping to match and merge the datasets by UniqueID and the date variables (Variable1 and Variable2), which I could rename and match by easily enough, assuming I can be free of duplicates.

                  As I see my options, I can drop the duplicates in UniqueID Variable1 ContactID from Dataset1, meaning I would lose < 1% of my observations in that dataset. Alternatively, I could collapse those duplicates using one of the command's options.
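
                  For example, dropping them could be as simple as this sketch (equivalent to duplicates drop with the force option):

                  Code:
                  * keep one observation per UniqueID Variable1 ContactID group
                  bysort UniqueID Variable1 ContactID: keep if _n == 1
                  duplicates report UniqueID Variable1 ContactID    // all copies should now be 1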

                  In Dataset2 I can uniquely identify observations, but only with a variable, Variable3, which is not present in Dataset1.

                  If possible, I could attempt a reshape, but am unaware of how to accomplish this using three variables that uniquely identify observations. Keeping in mind that, if possible, I'd like to match by UniqueID and the date variables (Variable1 and Variable2).

                  I hope this makes things clearer. I appreciate any help you could provide.


                  • #10
                    As I see my options
                    I'm sorry to say this, but that is part of the problem. You don't necessarily see all the options available to you, and you don't tell us what end result you want. You started by telling us you needed to merge m:m without giving a clue as to what you hope to do with the results.

                    In post #5 you told us

                    my ultimate goal is to find relationships between Variable1 and Variable2 (eg. the conditional probability of Variable2 given Variable1)
                    but in post #7 you then told us

                    Variable1 and Variable2 each represent dates of contact in Month Year format (Variable1 representing one type of contact, Variable2 another)
                    to which I reply: what would "the conditional probability of November 2012 in Dataset2 given July 2011 in Dataset1" mean?

                    Then in post #9 you tell us

                    I was hoping to match and merge the datasets by UniqueID and the date variables
                    but in post #1 you did not indicate that the values of Variable1 and Variable2 were anything like each other, and you did not suggest they could be used to find matching observations in the two datasets. (And, I will add, as monthly dates they no longer seem suited to the reshape-based solutions I outlined earlier.)

                    At this point it is not clear to me what you are trying to use for matching: certainly UniqueID, as you suggested in post #1, but do you also want Variable1 in Dataset1 to match Variable2 in Dataset2, as you seem to suggest in post #9?

                    And then the next question: If you have two observations in Dataset1 that both "match" to the same three observations in Dataset2, what do you want the related observation(s) in the dataset you are preparing for analysis to be?

                    Is it perhaps six observations, comprising the 6 possible combinations of one of the two observations from Dataset1 with one of the three observations from Dataset2?

                    With that said, it would be helpful if you were to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

                    It would be particularly helpful to post small hand-made examples of your data in Dataset1 and in Dataset2, perhaps with just a few of the relevant variables and a few UniqueIDs - the same UniqueIDs in both Datasets, which should be chosen to exhibit the problems you are trying to overcome.

                    Sample data should be extracted and posted using the dataex command. If you are running Stata version 15.1, then dataex is part of your installation already. If running an earlier Stata, run ssc install dataex to get the command. Either way, run help dataex to read the simple instructions for using it. It takes just seconds to use dataex to provide a way for those who want to help you to create a complete and faithful replica of your example data in their own Stata. That, in turn, eliminates all sorts of questions that are left unanswered by descriptions or even by data listings and tables. And it enables people to test out code, so that you get the right answer the first time.



                    • #11
                      I'll do my best to make it clearer.

                      Here is an example of Dataset1:

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input long UniqueID float Variable1 long ContactID
                        18669 574 2342
                        18669 574 2389
                        18669 574 2502
                        12929 575 2730
                        12929 575 2732
                        12929 575 2732
                        12929 575 3024
                      3582222 576 3420
                        25691 576 3850
                        21392 576 3959
                        21392 576 4002
                        25691 576 4165
                        32084 576 4203
                        32084 576 4203
                        32084 576 4203
                        18669 574 4241
                        32084 576 4276
                          171 576 4329
                        18669 574 4340
                          171 576 4359
                        18669 574 4410
                        21392 576 4690
                        32675 576 4747
                      3582222 576 4979
                        32675 576 4988
                        32675 576 5068
                        32675 576 5302
                        32675 576 5353
                        32675 576 5355
                        21392 576 5474
                      end
                      format %tmMon_CCYY Variable1
                      Then, using duplicates I get:
                      Code:
                      . duplicates list UniqueID Variable1
                      
                      Duplicates in terms of UniqueID Variable1
                      
                        +-------------------------------------+
                        | group:   obs:   UniqueID   Variab~1 |
                        |-------------------------------------|
                        |      1     18        171   Jan 2008 |
                        |      1     20        171   Jan 2008 |
                        |      2      4      12929   Dec 2007 |
                        |      2      5      12929   Dec 2007 |
                        |      2      6      12929   Dec 2007 |
                        |-------------------------------------|
                        |      2      7      12929   Dec 2007 |
                        |      3      1      18669   Nov 2007 |
                        |      3      2      18669   Nov 2007 |
                        |      3      3      18669   Nov 2007 |
                        |      3     16      18669   Nov 2007 |
                        |-------------------------------------|
                        |      3     19      18669   Nov 2007 |
                        |      3     21      18669   Nov 2007 |
                        |      4     10      21392   Jan 2008 |
                        |      4     11      21392   Jan 2008 |
                        |      4     22      21392   Jan 2008 |
                        |-------------------------------------|
                        |      4     30      21392   Jan 2008 |
                        |      5      9      25691   Jan 2008 |
                        |      5     12      25691   Jan 2008 |
                        |      6     13      32084   Jan 2008 |
                        |      6     14      32084   Jan 2008 |
                        |-------------------------------------|
                        |      6     15      32084   Jan 2008 |
                        |      6     17      32084   Jan 2008 |
                        |      7     23      32675   Jan 2008 |
                        |      7     25      32675   Jan 2008 |
                        |      7     26      32675   Jan 2008 |
                        |-------------------------------------|
                        |      7     27      32675   Jan 2008 |
                        |      7     28      32675   Jan 2008 |
                        |      7     29      32675   Jan 2008 |
                        |      8      8    3582222   Jan 2008 |
                        |      8     24    3582222   Jan 2008 |
                        +-------------------------------------+
                      and

                      Code:
                      . duplicates list UniqueID Variable1 ContactID
                      
                      Duplicates in terms of UniqueID Variable1 ContactID
                      
                        +------------------------------------------------+
                        | group:   obs:   UniqueID   Variab~1   Contac~D |
                        |------------------------------------------------|
                        |      1      5      12929   Dec 2007       2732 |
                        |      1      6      12929   Dec 2007       2732 |
                        |      2     13      32084   Jan 2008       4203 |
                        |      2     14      32084   Jan 2008       4203 |
                        |      2     15      32084   Jan 2008       4203 |
                        +------------------------------------------------+
                      We see that there are duplicates in terms of UniqueID Variable1 ContactID. This occurs in the entire dataset but accounts for < 1% of total observations. Duplicates in terms of UniqueID Variable1 are much more common in the entire dataset, accounting for about 13% of all observations. If it helps in organizing the data, I don't mind dropping the < 1% of observations that are duplicates in terms of UniqueID Variable1 ContactID.


                      Here is an example of Dataset2:

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input long UniqueID float Variable2 long Variable3
                      3582222 603 33992021
                       105644 599 63578643
                        21392 619 56826249
                       105644 592 46591042
                        21392 577 90913024
                          171 619 97666371
                        21392 593 36899590
                        42758 596 78411587
                        42758 628 53321131
                        21392 590 45797938
                        42758 595 32833038
                      3582222 609 22359564
                        21392 577 94424799
                       105644 593  2232671
                          171 609 11651118
                          171 604 74211695
                        21392 581 48465261
                        21392 619 18428450
                        21392 576 97812574
                        12929 643 64244629
                        12929 586 23459878
                        21392 623 30345123
                        21392 601 25334736
                       105644 582  2299135
                        21392 576  5537417
                        21392 596 63780379
                        42758 584   929190
                        42758 606 74332876
                        21392 590 92067799
                        42758 643 32950881
                      end
                      format %tmMon_CCYY Variable2

                      We can see that Variable3 uniquely identifies every observation (this holds for the entire dataset).

                      Then, using duplicates I get:

                      Code:
                      . duplicates li UniqueID Variable2
                      
                      Duplicates in terms of UniqueID Variable2
                      
                        +-------------------------------------+
                        | group:   obs:   UniqueID   Variab~2 |
                        |-------------------------------------|
                        |      1     19      21392   Jan 2008 |
                        |      1     25      21392   Jan 2008 |
                        |      2      5      21392   Feb 2008 |
                        |      2     13      21392   Feb 2008 |
                        |      3     10      21392   Mar 2009 |
                        |-------------------------------------|
                        |      3     29      21392   Mar 2009 |
                        |      4      3      21392   Aug 2011 |
                        |      4     18      21392   Aug 2011 |
                        +-------------------------------------+
                      You'll notice that there are UniqueIDs present in Dataset1 that are not present in Dataset2 and vice versa. This is representative of the two datasets.

                      Variable1 and Variable2 show the dates for two different types of contact. Here we have more than one occurrence of type 2 contact for a given individual in the same month. When I speak of the conditional probability of Variable2 given Variable1, I mean to find the conditional probability of type 2 contact occurring conditional on type 1 contact having already occurred (If the date in Variable1 is the same as the date in Variable2, I can consider that as having already occurred as well). I also want to find the conditional probability of type 2 contact occurring conditional on type 1 contact having occurred within the same month.

                      As mentioned, UniqueID is the only variable common to both datasets.

                      At this point it is not clear to me what you are trying to use for matching: certainly UniqueID, as you suggested in post #1
                      Definitely matching by UniqueID.

                      but do you also want Variable1 in Dataset1 to match Variable2 in Dataset2, as you seem to suggest in post #9?
                      And then the next question: If you have two observations in Dataset1 that both "match" to the same three observations in Dataset2, what do you want the related observation(s) in the dataset you are preparing for analysis to be?
                      In this scenario, I would want to match one observation in Dataset1 to one observation in Dataset2. It is the possibility of this scenario that led me to consider matching by Variable1 and Variable2. However, I would like a format that retains both Variable1 and Variable2 in the final dataset (this is why I was considering a reshape). If this scenario persists when matching by Variable1 and Variable2, matching the first observation in Dataset1 with the first observation in Dataset2 is fine for my purposes - I do not need to differentiate further. The supposed third observation in Dataset2 I would want to keep as an observation but unmatched with an observation in Dataset1.

                      I would have the same preference in the event where I have two observations in Dataset2 that both match to the same three observations in Dataset1.

                      Is it perhaps six observations, comprising the 6 possible combinations of one of the two observations from Dataset1 with one of the three observations from Dataset2?
                      No. I do not need to consider all possible combinations of these matched observations.

                      Thank you for your help and consideration.



                      • #12
                        Here is something that is at least a start at what you want. For each UniqueID it matches the "first" observation in Dataset1 to the "first" observation in Dataset2, and so on, where "first" is the order in which they appear in the dataset, with no consideration of the dates given by Variable1 and Variable2. I'm using just a subset of the sample data to keep the results reasonably short, but it represents the various possibilities.
                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input long UniqueID float Variable2 long Variable3
                        3582222 603 33992021
                         105644 599 63578643
                          21392 619 56826249
                         105644 592 46591042
                          21392 577 90913024
                            171 619 97666371
                          21392 593 36899590
                          21392 590 45797938
                        3582222 609 22359564
                          21392 577 94424799
                         105644 593  2232671
                            171 609 11651118
                            171 604 74211695
                          21392 581 48465261
                          21392 619 18428450
                          21392 576 97812574
                          12929 643 64244629
                          12929 586 23459878
                          21392 623 30345123
                          21392 601 25334736
                         105644 582  2299135
                          21392 576  5537417
                          21392 596 63780379
                          21392 590 92067799
                        end
                        format %tmMon_CCYY Variable2
                        generate obs_v2 = _n
                        bysort UniqueID (obs_v2): generate seq = _n
                        
                        tempfile v2
                        save "`v2'"
                        
                        clear
                        input long UniqueID float Variable1 long ContactID
                          12929 575 2730
                          12929 575 2732
                          12929 575 2732
                          12929 575 3024
                        3582222 576 3420
                          25691 576 3850
                          21392 576 3959
                          21392 576 4002
                          25691 576 4165
                            171 576 4329
                            171 576 4359
                          21392 576 4690
                        3582222 576 4979
                          21392 576 5474
                        end
                        format %tmMon_CCYY Variable1
                        generate obs_v1 = _n
                        bysort UniqueID (obs_v1): generate seq = _n
                        
                        merge 1:1 UniqueID seq using "`v2'"
                        order UniqueID seq 
                        sort UniqueID seq
                        list, noobs sepby(UniqueID) abbreviate(9)
                        Code:
                        . list, noobs sepby(UniqueID) abbreviate(9)
                        
                          +----------------------------------------------------------------------------------------------------+
                          | UniqueID   seq   Variable1   ContactID   obs_v1   Variable2   Variable3   obs_v2            _merge |
                          |----------------------------------------------------------------------------------------------------|
                          |      171     1    Jan 2008        4329       10    Aug 2011    97666371        6       matched (3) |
                          |      171     2    Jan 2008        4359       11    Oct 2010    11651118       12       matched (3) |
                          |      171     3           .           .        .    May 2010    74211695       13    using only (2) |
                          |----------------------------------------------------------------------------------------------------|
                          |    12929     1    Dec 2007        2730        1    Aug 2013    64244629       17       matched (3) |
                          |    12929     2    Dec 2007        2732        2    Nov 2008    23459878       18       matched (3) |
                          |    12929     3    Dec 2007        2732        3           .           .        .   master only (1) |
                          |    12929     4    Dec 2007        3024        4           .           .        .   master only (1) |
                          |----------------------------------------------------------------------------------------------------|
                          |    21392     1    Jan 2008        3959        7    Aug 2011    56826249        3       matched (3) |
                          |    21392     2    Jan 2008        4002        8    Feb 2008    90913024        5       matched (3) |
                          |    21392     3    Jan 2008        4690       12    Jun 2009    36899590        7       matched (3) |
                          |    21392     4    Jan 2008        5474       14    Mar 2009    45797938        8       matched (3) |
                          |    21392     5           .           .        .    Feb 2008    94424799       10    using only (2) |
                          |    21392     6           .           .        .    Jun 2008    48465261       14    using only (2) |
                          |    21392     7           .           .        .    Aug 2011    18428450       15    using only (2) |
                          |    21392     8           .           .        .    Jan 2008    97812574       16    using only (2) |
                          |    21392     9           .           .        .    Dec 2011    30345123       19    using only (2) |
                          |    21392    10           .           .        .    Feb 2010    25334736       20    using only (2) |
                          |    21392    11           .           .        .    Jan 2008     5537417       22    using only (2) |
                          |    21392    12           .           .        .    Sep 2009    63780379       23    using only (2) |
                          |    21392    13           .           .        .    Mar 2009    92067799       24    using only (2) |
                          |----------------------------------------------------------------------------------------------------|
                          |    25691     1    Jan 2008        3850        6           .           .        .   master only (1) |
                          |    25691     2    Jan 2008        4165        9           .           .        .   master only (1) |
                          |----------------------------------------------------------------------------------------------------|
                          |   105644     1           .           .        .    Dec 2009    63578643        2    using only (2) |
                          |   105644     2           .           .        .    May 2009    46591042        4    using only (2) |
                          |   105644     3           .           .        .    Jun 2009     2232671       11    using only (2) |
                          |   105644     4           .           .        .    Jul 2008     2299135       21    using only (2) |
                          |----------------------------------------------------------------------------------------------------|
                          |  3582222     1    Jan 2008        3420        5    Apr 2010    33992021        1       matched (3) |
                          |  3582222     2    Jan 2008        4979       13    Oct 2010    22359564        9       matched (3) |
                          +----------------------------------------------------------------------------------------------------+
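
                        Once you have a combined dataset along these lines, a rough sketch of the conditional probability you described might look like the following. This is only my guess at the definition you want (treating each UniqueID as one individual, and "type 1 having already occurred" as any Variable2 date on or after the individual's earliest Variable1 date); you will need to refine it:

                        Code:
                        * sketch: P(any type-2 contact | at least one type-1 contact), per individual
                        bysort UniqueID: egen first_type1 = min(Variable1)
                        bysort UniqueID: egen type2_after = max(!missing(Variable2) & Variable2 >= first_type1)
                        generate byte had_type1 = !missing(first_type1)
                        egen tag = tag(UniqueID)
                        summarize type2_after if tag & had_type1    // mean estimates the conditional probability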



                        • #13
                          Hello,
                          I need to merge two data sets.

                          The first data set includes the start date and the end date of a billing period for each house ID.
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long house_id int(bill_start bill_end)
                          555 21923 21951
                          555 21952 21980
                          555 21981 22011
                          555 22012 22042
                          555 22043 22071
                          557 21939 21971
                          557 21972 21998
                          557 21999 22032
                          557 22033 22057
                          557 22058 22085
                          end
                          format %td bill_start
                          format %td bill_end
                          The second data set includes the meter read dates for each house ID.
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long house_id float read_date
                          555 21951
                          555 21976
                          555 21980
                          555 22011
                          555 22042
                          555 22051
                          555 22071
                          556 21951
                          556 21983
                          556 22011
                          556 22041
                          556 22071
                          557 21971
                          557 21998
                          557 22032
                          557 22036
                          557 22050
                          557 22057
                          557 22085
                          end
                          format %td read_date
                          In the first data set, there are multiple billing periods for one house ID; the same applies to the second data set (many meter read dates for a house ID).
                          The thing is that, for a given house, there can be multiple meter read dates within a single billing period.

                          I need to merge the two data sets such that, in the final data set, for each house ID I have the list of billing periods with the corresponding meter read dates (a ‘corresponding’ meter read date is one that falls within the dates of a billing period). In other words, I want the output file to look like this (_merge==3):
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long house_id float(bill_start bill_end read_date)
                          555 21923 21951 21951
                          555 21952 21980 21976
                          555 21952 21980 21980
                          555 21981 22011 22011
                          555 22012 22042 22042
                          555 22043 22071 22051
                          555 22043 22071 22071
                          557 21939 21971 21971
                          557 21972 21998 21998
                          557 21999 22032 22032
                          557 22033 22057 22036
                          557 22033 22057 22050
                          557 22033 22057 22057
                          557 22058 22085 22085
                          end
                          format %td bill_start
                          format %td bill_end
                          format %td read_date
                          I did the m:m merge command, but it did not give me what I wanted: I just got the houses with billing periods from the first data set with shifted meter read dates from the second data set.

                          Could you please help?



                          • #14
                            Your problem fits with the capacities of the community-contributed program rangejoin (-ssc describe rangejoin-), which I haven't used before, but whose help file was such that it was easy to apply to your situation. I presume this does what you want:
                            Code:
                            ssc install rangestat    // required by rangejoin
                            ssc install rangejoin
                            * assumes the billing-period data set is in memory and the
                            * meter-read data set has been saved as read.dta
                            rangejoin read_date bill_start bill_end using "read.dta", by(house_id)
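
                            In case it helps, here is a self-contained sketch using a few of the observations from #13. (I believe rangejoin keeps unmatched billing periods with read_date set to missing; drop those if you want only the matched rows you showed.)

                            Code:
                            * sketch: save the meter reads, load the billing periods, then rangejoin
                            clear
                            input long house_id float read_date
                            555 21951
                            555 21976
                            557 21971
                            end
                            format %td read_date
                            save "read.dta", replace

                            clear
                            input long house_id int(bill_start bill_end)
                            555 21923 21951
                            555 21952 21980
                            557 21939 21971
                            end
                            format %td bill_start bill_end
                            rangejoin read_date bill_start bill_end using "read.dta", by(house_id)
                            drop if missing(read_date)    // keep matched pairs only
                            sort house_id bill_start read_date
                            list, sepby(house_id)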

