Problem with merging

Surya Singh

Join Date: Sep 2014

Posts: 54
#1

Problem with merging

16 Mar 2016, 09:45

Hi,
I have two datasets: (1) children_mothers and (2) mothers_info
My first dataset, children_mothers looks like this:
ID_children ID_mother
1 1
2 1
3 2
4 3

and so on, it is a dataset of unique child-mother pairs.
My second dataset has panel information about mothers from years 1990-2011. I want to merge the information to the first dataset based upon ID_mothers.
However, when I tried merge m:m, it does not seem to work and only merges the information for the first child rather than repeating that information for the second child with the same mother.

Any ideas on how to overcome this?
Maybe the problem is that my first dataset is not a panel and my second one is?

Thanks for any help!
Surya
Tags: None
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#2

16 Mar 2016, 10:02

You should use m:1 instead of m:m. A careful reading of [D] merge in the manual will explain.

Code:

use mothers_info sort ID_mother save mother_info_sorted.dta use childern_mothers sort ID_mother merge m:1 ID_mother using mothers_info_sorted.dta

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

16 Mar 2016, 10:45

Carole gives good advice, as usual.

I just want to add that m:m merging is almost never appropriate. If you are thinking about doing an m:m merge, it is a near certainty that at least one of the following is true:

1. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, thereby supporting -merge 1:m /* or m:1 */ expanded merge key-.
3. You are really trying to accomplish what -joinby-, a different command, does.
4. A variant of 2: your merge key actually does uniquely identify the observations in one of your data set, so you can use merge 1:m or m:1 (or maybe even 1:1).

As Carole suggests, the most likely situation here is 4.

A couple of technical points:

A. If you are using the current version of Stata, it is not necessary to -sort- the data sets before you -merge- them (though it doesn't hurt). Stata will do that for you automatically. (But if they are already sorted on ID_mother, then adding the -sorted- option to your -merge- command will save a little execution time.

B. There is a typo in Carole's second -use- command: it should be -use children_mothers-.
1 like
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#4

16 Mar 2016, 10:58

Thanks, Clyde. I didn't know about the auto sort, which apparently was implemented in at least Stata 13! After years of getting "data not sorted" errors, it is hardwired.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Surya Singh

Join Date: Sep 2014

Posts: 54
#5

21 Mar 2016, 07:20

Originally posted by Clyde Schechter View Post

Carole gives good advice, as usual.

I just want to add that m:m merging is almost never appropriate. If you are thinking about doing an m:m merge, it is a near certainty that at least one of the following is true:

1. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, thereby supporting -merge 1:m /* or m:1 */ expanded merge key-.
3. You are really trying to accomplish what -joinby-, a different command, does.
4. A variant of 2: your merge key actually does uniquely identify the observations in one of your data set, so you can use merge 1:m or m:1 (or maybe even 1:1).

As Carole suggests, the most likely situation here is 4.

A couple of technical points:

A. If you are using the current version of Stata, it is not necessary to -sort- the data sets before you -merge- them (though it doesn't hurt). Stata will do that for you automatically. (But if they are already sorted on ID_mother, then adding the -sorted- option to your -merge- command will save a little execution time.

B. There is a typo in Carole's second -use- command: it should be -use children_mothers-.

Thanks Clyde and Carole! I tried join by instead of merge m:m and it did what I wanted!
Surya
Comment

Dr. Iqra Yaseen

Join Date: Jun 2022
Posts: 89

13 Jun 2023, 01:39

Clyde Schechter Carole J. Wilson while using m:1 my dataset does not get updated for the missing values, although the dataset on disk contains the requisite data. plus i am unable to use 1:1 or 1:m with the error that the master data does not contain unique identifiers althought using

Code:

duplicates list

i don't find any duplicate observations. my dataset has composite unique identifier of

year iso3_o iso3_d

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float year str40(iso3_o iso3_d) double tradeflow_imf_d
1995 "ARE" "ARG"                 18
1996 "ARE" "ARG"  338.6659851074219
1997 "ARE" "ARG"                 30
1998 "ARE" "ARG"                 25
1999 "ARE" "ARG"    1594.0009765625
2000 "ARE" "ARG"      2909.00390625
2001 "ARE" "ARG"    32352.396484375
2002 "ARE" "ARG"  2485.860107421875
2003 "ARE" "ARG"    2385.7080078125
2004 "ARE" "ARG"   14217.6962890625
2005 "ARE" "ARG"   16160.2919921875
2006 "ARE" "ARG"  520.6320190429688
2007 "ARE" "ARG"         18021.3125
2008 "ARE" "ARG"    5007.2041015625
2009 "ARE" "ARG"   2818.52392578125
2010 "ARE" "ARG"      26824.6328125
2011 "ARE" "ARG"     54085.22265625
2012 "ARE" "ARG"     33019.37890625
2013 "ARE" "ARG"     49520.51953125
2014 "ARE" "ARG"    32238.396484375
2015 "ARE" "ARG"      40069.9453125
2016 "ARE" "ARG"              70894
2017 "ARE" "ARG"              48704
2018 "ARE" "ARG"             137890
2019 "ARE" "ARG"             186096
2020 "ARE" "ARG"              91.54
2021 "ARE" "ARG"            337.557
2022 "ARE" "ARG"            863.546
2023 "ARE" "ARG"                 .
1995 "ARE" "ATG"                  .
1996 "ARE" "ATG"                  .
1997 "ARE" "ATG"                  .
1998 "ARE" "ATG"                  .
1999 "ARE" "ATG"                  .
2000 "ARE" "ATG"                  .
2001 "ARE" "ATG"                  .
2002 "ARE" "ATG"                  .
2003 "ARE" "ATG"                  .
2004 "ARE" "ATG"                .03
2005 "ARE" "ATG"                  .
2006 "ARE" "ATG"                  .
2007 "ARE" "ATG"  .8339999914169312
2008 "ARE" "ATG"  66.65299987792969
2009 "ARE" "ATG"   57.8390007019043
2010 "ARE" "ATG" 15.515999794006348
2011 "ARE" "ATG"  47.49399948120117
2012 "ARE" "ATG" 114.18399810791016
2013 "ARE" "ATG"  28.04599952697754
2014 "ARE" "ATG"  79.77200317382813
2015 "ARE" "ATG"  691.7440185546875
2016 "ARE" "ATG"   487.072998046875
2017 "ARE" "ATG"   3232.97705078125
2018 "ARE" "ATG"  564.8040161132813
2019 "ARE" "ATG"   632.781982421875
2020 "ARE" "ATG"            .442552
2021 "ARE" "ATG"            .801864
2022 "ARE" "ATG"           1.071888
2023 "ARE" "ATG"                  .
1995 "ARE" "BHR"      97787.3828125
1996 "ARE" "BHR"      93825.2734375
1997 "ARE" "BHR"      104146.046875
1998 "ARE" "BHR"       94635.140625
1999 "ARE" "BHR"         93878.0625
2000 "ARE" "BHR"     114530.7578125
2001 "ARE" "BHR"       108834.78125
2002 "ARE" "BHR"         139448.625
2003 "ARE" "BHR"      169057.296875
2004 "ARE" "BHR"        188481.0625
2005 "ARE" "BHR"      232619.046875
2006 "ARE" "BHR"       221041.53125
2007 "ARE" "BHR"          327693.75
2008 "ARE" "BHR"        499485.1875
2009 "ARE" "BHR"        366853.5625
2010 "ARE" "BHR"       374665.28125
2011 "ARE" "BHR"        577138.5625
2012 "ARE" "BHR"         921739.125
2013 "ARE" "BHR"         1051032.25
2014 "ARE" "BHR"        1071890.875
2015 "ARE" "BHR"        1135789.125
2016 "ARE" "BHR"         1051271.75
2017 "ARE" "BHR"         1368644.25
2018 "ARE" "BHR"        1308318.875
2019 "ARE" "BHR"          918474.25
2020 "ARE" "BHR"         802.987993
2021 "ARE" "BHR"        1307.731112
2022 "ARE" "BHR"        1285.691009
2023 "ARE" "BHR"                  .
1995 "ARE" "BRA"            30552.5
1996 "ARE" "BRA"      43889.1796875
1997 "ARE" "BRA"              19217
1998 "ARE" "BRA"     34471.80078125
1999 "ARE" "BRA"    17474.599609375
2000 "ARE" "BRA"         67546.1875
2001 "ARE" "BRA"     110022.1015625
2002 "ARE" "BRA"      96921.7109375
2003 "ARE" "BRA"     19911.62890625
2004 "ARE" "BRA"      68142.5390625
2005 "ARE" "BRA"      81551.6328125
2006 "ARE" "BRA"        368040.0625
2007 "ARE" "BRA"        339175.1875
end

#2 does not solve my problem.

Code:

 merge m:1 year iso3_o iso3_d using "D:\WTO dispute settlement system\my work\THESIS\data\DG datasets 2\dg HDC 1995-2022 imports only s
> orted .dta", update replace force
(note: variable tradeflow_imf_d was str17 in the using data, but will be double now)

    Result                           # of obs.
    -----------------------------------------
    not matched                         3,217
        from master                     3,217  (_merge==1)
        from using                          0  (_merge==2)

    matched                            46,913
        not updated                    46,913  (_merge==3)
        missing updated                     0  (_merge==4)
        nonmissing conflict                 0  (_merge==5)

Last edited by Dr. Iqra Yaseen; 13 Jun 2023, 01:47.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

13 Jun 2023, 08:47

If Stata is telling you that year iso3_o and iso3_d do not uniquely identify observations in your master data set, then, in fact they don't. I have never known Stata to be wrong about this. -duplicates list- will not show them to you because it will only show you observations that are complete duplicates, that is, that are identical on all variables. To see the observations that are preventing you from using -merge 1:m-, run:

Code:

duplicates tag year iso3_o iso3_d, gen(flag) browse if flag

Stata will show them to you in the data browser. You will then have to figure out what went wrong in the creation of the data set that led to the inclusion of these excess observations and how to fix it.

The example data you show does not illustrate the problem: in the example data year iso3_o and iso3_d do uniquely identify observations. But evidently somewhere in the full data set, there are observations that should not be there.

The -merge- output you show is a successful merge, although it apparently isn't what you were expecting. It shows that the using data set did not actually contain any observations that match any observations in the master data set. If they were supposed to match existing items in the master data set, then you should, again, check how the data sets were created to figure out why that didn't happen. Perhaps the coding of the iso3_* variables was different in the two data sets? Since you only showed example data from one of the data sets, I cannot say anything more specific about this. If you need more specific help on this, please show example data from both data sets when posting back.

Finally, let me comment on the use of the -force- option in your -merge- command. You should not do this. Stata has told you that the variable tradeflow_imf_d is incompatible in the two data sets: the master has double and the using has string. Without the -force- option, Stata will refuse to do the merge and will stop with an error message. That is the better outcome! Because you applied -force-, Stata does its best: since it cannot combine a double and a string, it just omits the values of tradeflow_imf_d from the using data set and sets them to missing values in the -merge- result. If the variable tradeflow_imf_d were of no importance, this would be OK. But I suspect it is an important variable in your project, so you don't want it replaced by missing values. The way to get these data sets merge-d is to first fix the using data set. Use the -destring- command to change tradeflow_imf_d into a numeric value in the using data set, and then do the -merge-. (-force- will not be needed to do that.)

More generally in Stata, avoid the use of -force- options. The fact that you are drawn to use one suggests that there is something wrong with the data. You should fix the data rather than use -force-. Remember that when you do use -force- it always means that you will lose data in the process. So -force- should only be applied if you know exactly what data you are going to lose, and you are sure that that data is of no importance for your purposes.
Comment
Dr. Iqra Yaseen

Join Date: Jun 2022

Posts: 89
#8

13 Jun 2023, 09:00

thank you so much Clyde Schechter for such an elaborative explanation
Comment

Announcement