Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with merging

    Hi,
    I have two datasets: (1) children_mothers and (2) mothers_info
    My first dataset, children_mothers looks like this:
    ID_children ID_mother
    1 1
    2 1
    3 2
    4 3

    and so on, it is a dataset of unique child-mother pairs.
    My second dataset has panel information about mothers from years 1990-2011. I want to merge the information to the first dataset based upon ID_mothers.
    However, when I tried merge m:m, it does not seem to work and only merges the information for the first child rather than repeating that information for the second child with the same mother.

    Any ideas on how to overcome this?
    Maybe the problem is that my first dataset is not a panel and my second one is?

    Thanks for any help!
    Surya

  • #2
    You should use m:1 instead of m:m. A careful reading of [D] merge in the manual will explain.

    Code:
    use mothers_info
    sort ID_mother
    save mother_info_sorted.dta
    
    use childern_mothers
    sort ID_mother
    merge m:1 ID_mother using mothers_info_sorted.dta
    Stata/MP 14.1 (64-bit x86-64)
    Revision 19 May 2016
    Win 8.1

    Comment


    • #3
      Carole gives good advice, as usual.

      I just want to add that m:m merging is almost never appropriate. If you are thinking about doing an m:m merge, it is a near certainty that at least one of the following is true:

      1. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
      2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, thereby supporting -merge 1:m /* or m:1 */ expanded merge key-.
      3. You are really trying to accomplish what -joinby-, a different command, does.
      4. A variant of 2: your merge key actually does uniquely identify the observations in one of your data set, so you can use merge 1:m or m:1 (or maybe even 1:1).

      As Carole suggests, the most likely situation here is 4.

      A couple of technical points:

      A. If you are using the current version of Stata, it is not necessary to -sort- the data sets before you -merge- them (though it doesn't hurt). Stata will do that for you automatically. (But if they are already sorted on ID_mother, then adding the -sorted- option to your -merge- command will save a little execution time.

      B. There is a typo in Carole's second -use- command: it should be -use children_mothers-.

      Comment


      • #4
        Thanks, Clyde. I didn't know about the auto sort, which apparently was implemented in at least Stata 13! After years of getting "data not sorted" errors, it is hardwired.
        Stata/MP 14.1 (64-bit x86-64)
        Revision 19 May 2016
        Win 8.1

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          Carole gives good advice, as usual.

          I just want to add that m:m merging is almost never appropriate. If you are thinking about doing an m:m merge, it is a near certainty that at least one of the following is true:

          1. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
          2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, thereby supporting -merge 1:m /* or m:1 */ expanded merge key-.
          3. You are really trying to accomplish what -joinby-, a different command, does.
          4. A variant of 2: your merge key actually does uniquely identify the observations in one of your data set, so you can use merge 1:m or m:1 (or maybe even 1:1).

          As Carole suggests, the most likely situation here is 4.

          A couple of technical points:

          A. If you are using the current version of Stata, it is not necessary to -sort- the data sets before you -merge- them (though it doesn't hurt). Stata will do that for you automatically. (But if they are already sorted on ID_mother, then adding the -sorted- option to your -merge- command will save a little execution time.

          B. There is a typo in Carole's second -use- command: it should be -use children_mothers-.

          Thanks Clyde and Carole! I tried join by instead of merge m:m and it did what I wanted!
          Surya

          Comment


          • #6
            Clyde Schechter Carole J. Wilson while using m:1 my dataset does not get updated for the missing values, although the dataset on disk contains the requisite data. plus i am unable to use 1:1 or 1:m with the error that the master data does not contain unique identifiers althought using
            Code:
            duplicates list
            i don't find any duplicate observations. my dataset has composite unique identifier of
            year iso3_o iso3_d
            .
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float year str40(iso3_o iso3_d) double tradeflow_imf_d
            1995 "ARE" "ARG"                 18
            1996 "ARE" "ARG"  338.6659851074219
            1997 "ARE" "ARG"                 30
            1998 "ARE" "ARG"                 25
            1999 "ARE" "ARG"    1594.0009765625
            2000 "ARE" "ARG"      2909.00390625
            2001 "ARE" "ARG"    32352.396484375
            2002 "ARE" "ARG"  2485.860107421875
            2003 "ARE" "ARG"    2385.7080078125
            2004 "ARE" "ARG"   14217.6962890625
            2005 "ARE" "ARG"   16160.2919921875
            2006 "ARE" "ARG"  520.6320190429688
            2007 "ARE" "ARG"         18021.3125
            2008 "ARE" "ARG"    5007.2041015625
            2009 "ARE" "ARG"   2818.52392578125
            2010 "ARE" "ARG"      26824.6328125
            2011 "ARE" "ARG"     54085.22265625
            2012 "ARE" "ARG"     33019.37890625
            2013 "ARE" "ARG"     49520.51953125
            2014 "ARE" "ARG"    32238.396484375
            2015 "ARE" "ARG"      40069.9453125
            2016 "ARE" "ARG"              70894
            2017 "ARE" "ARG"              48704
            2018 "ARE" "ARG"             137890
            2019 "ARE" "ARG"             186096
            2020 "ARE" "ARG"              91.54
            2021 "ARE" "ARG"            337.557
            2022 "ARE" "ARG"            863.546
            2023 "ARE" "ARG"                 .
            1995 "ARE" "ATG"                  .
            1996 "ARE" "ATG"                  .
            1997 "ARE" "ATG"                  .
            1998 "ARE" "ATG"                  .
            1999 "ARE" "ATG"                  .
            2000 "ARE" "ATG"                  .
            2001 "ARE" "ATG"                  .
            2002 "ARE" "ATG"                  .
            2003 "ARE" "ATG"                  .
            2004 "ARE" "ATG"                .03
            2005 "ARE" "ATG"                  .
            2006 "ARE" "ATG"                  .
            2007 "ARE" "ATG"  .8339999914169312
            2008 "ARE" "ATG"  66.65299987792969
            2009 "ARE" "ATG"   57.8390007019043
            2010 "ARE" "ATG" 15.515999794006348
            2011 "ARE" "ATG"  47.49399948120117
            2012 "ARE" "ATG" 114.18399810791016
            2013 "ARE" "ATG"  28.04599952697754
            2014 "ARE" "ATG"  79.77200317382813
            2015 "ARE" "ATG"  691.7440185546875
            2016 "ARE" "ATG"   487.072998046875
            2017 "ARE" "ATG"   3232.97705078125
            2018 "ARE" "ATG"  564.8040161132813
            2019 "ARE" "ATG"   632.781982421875
            2020 "ARE" "ATG"            .442552
            2021 "ARE" "ATG"            .801864
            2022 "ARE" "ATG"           1.071888
            2023 "ARE" "ATG"                  .
            1995 "ARE" "BHR"      97787.3828125
            1996 "ARE" "BHR"      93825.2734375
            1997 "ARE" "BHR"      104146.046875
            1998 "ARE" "BHR"       94635.140625
            1999 "ARE" "BHR"         93878.0625
            2000 "ARE" "BHR"     114530.7578125
            2001 "ARE" "BHR"       108834.78125
            2002 "ARE" "BHR"         139448.625
            2003 "ARE" "BHR"      169057.296875
            2004 "ARE" "BHR"        188481.0625
            2005 "ARE" "BHR"      232619.046875
            2006 "ARE" "BHR"       221041.53125
            2007 "ARE" "BHR"          327693.75
            2008 "ARE" "BHR"        499485.1875
            2009 "ARE" "BHR"        366853.5625
            2010 "ARE" "BHR"       374665.28125
            2011 "ARE" "BHR"        577138.5625
            2012 "ARE" "BHR"         921739.125
            2013 "ARE" "BHR"         1051032.25
            2014 "ARE" "BHR"        1071890.875
            2015 "ARE" "BHR"        1135789.125
            2016 "ARE" "BHR"         1051271.75
            2017 "ARE" "BHR"         1368644.25
            2018 "ARE" "BHR"        1308318.875
            2019 "ARE" "BHR"          918474.25
            2020 "ARE" "BHR"         802.987993
            2021 "ARE" "BHR"        1307.731112
            2022 "ARE" "BHR"        1285.691009
            2023 "ARE" "BHR"                  .
            1995 "ARE" "BRA"            30552.5
            1996 "ARE" "BRA"      43889.1796875
            1997 "ARE" "BRA"              19217
            1998 "ARE" "BRA"     34471.80078125
            1999 "ARE" "BRA"    17474.599609375
            2000 "ARE" "BRA"         67546.1875
            2001 "ARE" "BRA"     110022.1015625
            2002 "ARE" "BRA"      96921.7109375
            2003 "ARE" "BRA"     19911.62890625
            2004 "ARE" "BRA"      68142.5390625
            2005 "ARE" "BRA"      81551.6328125
            2006 "ARE" "BRA"        368040.0625
            2007 "ARE" "BRA"        339175.1875
            end
            #2 does not solve my problem.
            Code:
             merge m:1 year iso3_o iso3_d using "D:\WTO dispute settlement system\my work\THESIS\data\DG datasets 2\dg HDC 1995-2022 imports only s
            > orted .dta", update replace force
            (note: variable tradeflow_imf_d was str17 in the using data, but will be double now)
            
                Result                           # of obs.
                -----------------------------------------
                not matched                         3,217
                    from master                     3,217  (_merge==1)
                    from using                          0  (_merge==2)
            
                matched                            46,913
                    not updated                    46,913  (_merge==3)
                    missing updated                     0  (_merge==4)
                    nonmissing conflict                 0  (_merge==5)
            Last edited by Dr. Iqra Yaseen; 13 Jun 2023, 01:47.

            Comment


            • #7
              If Stata is telling you that year iso3_o and iso3_d do not uniquely identify observations in your master data set, then, in fact they don't. I have never known Stata to be wrong about this. -duplicates list- will not show them to you because it will only show you observations that are complete duplicates, that is, that are identical on all variables. To see the observations that are preventing you from using -merge 1:m-, run:
              Code:
              duplicates tag year iso3_o iso3_d, gen(flag)
              browse if flag
              Stata will show them to you in the data browser. You will then have to figure out what went wrong in the creation of the data set that led to the inclusion of these excess observations and how to fix it.

              The example data you show does not illustrate the problem: in the example data year iso3_o and iso3_d do uniquely identify observations. But evidently somewhere in the full data set, there are observations that should not be there.

              The -merge- output you show is a successful merge, although it apparently isn't what you were expecting. It shows that the using data set did not actually contain any observations that match any observations in the master data set. If they were supposed to match existing items in the master data set, then you should, again, check how the data sets were created to figure out why that didn't happen. Perhaps the coding of the iso3_* variables was different in the two data sets? Since you only showed example data from one of the data sets, I cannot say anything more specific about this. If you need more specific help on this, please show example data from both data sets when posting back.

              Finally, let me comment on the use of the -force- option in your -merge- command. You should not do this. Stata has told you that the variable tradeflow_imf_d is incompatible in the two data sets: the master has double and the using has string. Without the -force- option, Stata will refuse to do the merge and will stop with an error message. That is the better outcome! Because you applied -force-, Stata does its best: since it cannot combine a double and a string, it just omits the values of tradeflow_imf_d from the using data set and sets them to missing values in the -merge- result. If the variable tradeflow_imf_d were of no importance, this would be OK. But I suspect it is an important variable in your project, so you don't want it replaced by missing values. The way to get these data sets merge-d is to first fix the using data set. Use the -destring- command to change tradeflow_imf_d into a numeric value in the using data set, and then do the -merge-. (-force- will not be needed to do that.)

              More generally in Stata, avoid the use of -force- options. The fact that you are drawn to use one suggests that there is something wrong with the data. You should fix the data rather than use -force-. Remember that when you do use -force- it always means that you will lose data in the process. So -force- should only be applied if you know exactly what data you are going to lose, and you are sure that that data is of no importance for your purposes.

              Comment


              • #8
                thank you so much Clyde Schechter for such an elaborative explanation

                Comment

                Working...
                X