Simple question about merging data sets

Juan Guillermo

Join Date: Jul 2023

Posts: 1
#1

Simple question about merging data sets

10 Jul 2023, 09:47

Hello all, I have a simple question about merging data sets. I have two data sets, the master data set has 625 obs and the using data set has 705. Each data set contains repeating cities with different street names for each one. The master data set contains info from 2022 and the using contains info from 2021. I want to merge this two in order to add some info to the master that only appears in the the using. (I used m:m and the street names of each city). My question is: why does the sum of Not matched from master (_merge ==1) plus the matched observations in stata after the merge not equal the number of obs from the master data set? Ignoring the not matched obs from using (_merge == 2, meaning info that only the 2021 data set contains and that is no real use to me) wouldn´t total the obs from the master? After I merge the data sets and drop the obs with _merge == 2 the resulting data set has 631 obs and not 625 like I thought it would.

Result --------------- Number of obs
-----------------------------------------
Not matched -------------- 394
from master --------------- 160 (_merge==1)
from using ----------------- 234 (_merge==2)

Matched ------------------- 471 (_merge==3)
-----------------------------------------

Thanks,

Last edited by Juan Guillermo; 10 Jul 2023, 10:24.
Tags: merge
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#2

10 Jul 2023, 10:25

-merge m:m- produces data salad in almost all situations and its results are nonsense. So you should not expect anything about them to be sensible. The PDF documentation makes clear that -merge m:m- should not be used.

To help you correctly combine these two data sets, it is important to know more about what is in them and what you are trying to accomplish. It seems, although you do not make it quite explicit, that each data set is a data set of streets, but that streets need to be identified by a combination of street name and city name, since different cities can have streets with the same name. Is it true that the combination of city and street name uniquely identify observations in both data sets?

Or can the same street in a given city have multiple observations in one or both data sets? If the same street in a given city can have multiple observations in both data sets, how do you decide which observation(s) from the 2021 data set should be paired with which observation(s) in the 2022 data set?

Finally, what does "I want to merge this two in order to add some info to the master that only appears in the the using" mean? Is this additional information in the form of different variables that do not appear in the master? Or is this information an update of existing variables that do appear in the master? Or a supplement to fill in missing values of variables that do appear in the master?
Comment

Ken Chui

Join Date: Aug 2014
Posts: 1060

10 Jul 2023, 10:38

The addition rule mentioned does not apply well to merge m:m. It's common to see the Ns not adding up:

Code:

clear
input city y2
1 2
2 5
3 8
4 2
4 5
4 6
4 7
5 8
7 2
7 1
end

save tempfile, replace

clear
input city y1
1 2
1 5
3 5
3 8
4 2
4 5
6 8
6 8
7 1
7 2
7 4
7 5
7 6
end

merge m:m city using tempfile
gsort city y1
list, sepby(city)

Results:

Code:

. merge m:m city using tempfile

    Result                      Number of obs
    -----------------------------------------
    Not matched                             4
        from master                         2  (_merge==1)
        from using                          2  (_merge==2)

    Matched                                13  (_merge==3)
    -----------------------------------------

. gsort city y1

. list, sepby(city)

     +----------------------------------+
     | city   y1   y2            _merge |
     |----------------------------------|
  1. |    1    2    2       Matched (3) |
  2. |    1    5    2       Matched (3) |
     |----------------------------------|
  3. |    2    .    5    Using only (2) |
     |----------------------------------|
  4. |    3    5    8       Matched (3) |
  5. |    3    8    8       Matched (3) |
     |----------------------------------|
  6. |    4    2    2       Matched (3) |
  7. |    4    5    7       Matched (3) |
  8. |    4    5    6       Matched (3) |
  9. |    4    5    5       Matched (3) |
     |----------------------------------|
 10. |    5    .    8    Using only (2) |
     |----------------------------------|
 11. |    6    8    .   Master only (1) |
 12. |    6    8    .   Master only (1) |
     |----------------------------------|
 13. |    7    1    2       Matched (3) |
 14. |    7    2    1       Matched (3) |
 15. |    7    4    1       Matched (3) |
 16. |    7    5    1       Matched (3) |
 17. |    7    6    1       Matched (3) |
     +----------------------------------+

Neither of the data set has 15 cases. According to how the streets are repeated in each data set, the added Ns can change quite considerably.

Also, as you can see, merge m:m seldom is the solution, it tends to produce very weird results. I don't use it at all and from my limited understanding is that it generates a data set showing each uniquely values in their highest frequency (between the two data sets). I.e., if "A" appears in master data 3 times and in using data 7 times, then the merged result will make sure A appears 7 times. But it is not very clear how the other variables would be duplicated; it tends to carry over the last available value (see how value of "1" got repeated for city "7"), though I am not sure.

(Agree with the points raised in #2)

Last edited by Ken Chui; 10 Jul 2023, 10:44.

Announcement

Simple question about merging data sets

Comment

Comment