Really need help,error in 1:m and m:1 when merge two datasets

Spacey Shi

Join Date: Oct 2016
Posts: 47

Really need help,error in 1:m and m:1 when merge two datasets

20 Mar 2017, 05:45

I need to merge file rst and file ID_mc together, key variable is mother's ID----IDind_m + wave,but 1:m and m:1 don't work,what should I do? I have been buffled by this problem several days, can't solve this, really wish help.

Code:

use rst1,clear
(Written by R.              )

. merge m:1 IDind_m wave using ID_mc1
variables IDind_m wave do not uniquely identify observations in the using data
r(459);
merge 1:m IDind_m wave using ID_mc1
variables IDind_m wave do not uniquely identify observations in the master data
r(459);

The first file is:
dataex IDind IDind_m wave in 1/3

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(IDind IDind_m wave)
111101011003 111101011002 2011
111101012003 111101012002 2011
111101013003 111101013002 2011
end

------------------ copy up to and including the previous line ------------------
The second file is:
. dataex in 1/5

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(IDind_m IDind_c wave hhid_m hhid_c commid t1 t2)
111101011002 111101011003 2011 111101011 111101011 111101 11 1
111101012002 111101012003 2011 111101012 111101012 111101 11 1
111101013002 111101013003 2011 111101013 111101013 111101 11 1
111101014001 111101014003 2011 111101014 111101014 111101 11 1
111101015002 111101015003 2011 111101015 111101015 111101 11 1
end

------------------ copy up to and including the previous line ------------------

Tags: None

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

20 Mar 2017, 05:54

Spacey:
like in your previous post (http://www.statalist.org/forums/foru...f-merging-data), the main reason of your problem is a suboptimal approach in creating -id- for your different variables.
That said, you should consider using a different variable (or more than one variable) for -merge-.
The mandatory advice, however, is to stay away from -merge m:m-.

Kind regards,
Carlo
(Stata 19.0)
Comment

Spacey Shi

Join Date: Oct 2016
Posts: 47

20 Mar 2017, 06:21

Originally posted by Carlo Lazzaro View Post

Spacey:
like in your previous post (http://www.statalist.org/forums/foru...f-merging-data), the main reason of your problem is a suboptimal approach in creating -id- for your different variables.
That said, you should consider using a different variable (or more than one variable) for -merge-.
The mandatory advice, however, is to stay away from -merge m:m-.

Thanks so much.
In ID_mc:

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     IDind_m |     59,848    4.00e+11    9.27e+10   1.11e+11   5.52e+11
     IDind_c |     59,848    4.00e+11    9.27e+10   1.11e+11   5.52e+11
        wave |     59,848    2002.596    6.992698       1991       2011
      hhid_m |     59,848    4.00e+08    9.27e+07   1.11e+08   5.52e+08
      hhid_c |     59,848    4.00e+08    9.27e+07   1.11e+08   5.52e+08
-------------+---------------------------------------------------------
      commid |     59,848    399585.8    92687.17     111101     552304
          t1 |     59,848    39.75966    9.267507         11         55
          t2 |     59,848     1.75802    .4282856          1          2

In rst:

Code:

 sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       IDind |    156,645    3.92e+11    9.46e+10   1.11e+11   5.52e+11
          a5 |    151,205    2.144466    1.934509          0         10
        wave |    156,645    2000.816    7.683592       1989       2011
        a5a1 |     29,992    1.019772    .1819866          0          3
        a5c1 |     32,531    1.018137    .1843308          0          9
-------------+---------------------------------------------------------
       adult |     76,720    .5573514    .4967032          0          1
       child |     76,721    .1137759    .3175409          0          1
        hhid |    156,645    3.92e+08    9.46e+07   1.11e+08   5.52e+08
      commid |    156,645    392238.3    94636.55     111101     552304
          t1 |    156,645    39.03139    9.461549         11         55
-------------+---------------------------------------------------------
          t2 |    156,645    1.700533     .458026          1          2
     IDind_f |     35,555    3.97e+11    9.18e+10   1.11e+11   5.52e+11
     IDind_m |     38,601    3.98e+11    9.18e+10   1.11e+11   5.52e+11
     IDind_s |     66,118    3.79e+11    9.89e+10   1.11e+11   5.52e+11

SO, I have to use IDind_m + wave as key, I'm really confused,if this doesn't work, I can't go on the rest steps.
I won't use m:m, it's absolutely wrong. Is there any other methods to merge these two datasets?
Sigh.

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

20 Mar 2017, 06:38

Spacey:
try use -IDind_m + wave- as match variiables.
You should also consider with -_merge- option you need for your analysis (see -help merge-, -merge result table-).

Kind regards,
Carlo
(Stata 19.0)
Comment
Spacey Shi

Join Date: Oct 2016

Posts: 47
#5

20 Mar 2017, 07:01

Originally posted by Carlo Lazzaro View Post

Spacey:
try use -IDind_m + wave- as match variiables.
You should also consider with -_merge- option you need for your analysis (see -help merge-, -merge result table-).

Yes, I use IDind_m + wave from the very beginning, try both 1:m and m:1......I merge other files have no problem....what should I do.....
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

20 Mar 2017, 07:41

Spacey:
as a last resort, you may want considering splitting your datasets by -wave-, -merge- them year per year and then -append-.

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#7

20 Mar 2017, 07:50

Spacey: I can't help here beyond Carlo's excellent advice.

I want to underline something different. Wording such as "really need help" in title or post is not a good idea. Do you suppose that the other people don't need help or are less deserving? I don't imagine you do, but there is no ordering or priority here because people mark their cases as desperate or urgent.

If anything, more people will ignore your question as from someone not understanding how forums work.

We need concise and clear questions without special pleading.

Sorry if this appears brutal or insensitive, but it's vital to successful continued participation on Statalist that you understand this.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

20 Mar 2017, 08:42

Let me add advice you may have been given already in other threads.

Stata is telling you exactly what the problem is - in both of your datasets you have at least one case where two or more observations have the same pair of values for IDind_m and wave. We don't see these in the small sample of your data that you show, but Stata tells you they exist.

Successful merging requires that for one of your two files, there is only 1 observation for any given pair of values for IDind_m and wave, while for the other file there can be multiple observations with any given combination.

That you tried both merge 1:m and merge m:1, and furthermore that they both failed, suggests you do not understand what your files contain.

Your next step is to better understand your data. Use of the duplicates command will help you understand how big the problem is in each dataset, and will allow you to view examples.
Comment
Spacey Shi

Join Date: Oct 2016

Posts: 47
#9

20 Mar 2017, 20:01

Originally posted by William Lisowski View Post

Let me add advice you may have been given already in other threads.

Stata is telling you exactly what the problem is - in both of your datasets you have at least one case where two or more observations have the same pair of values for IDind_m and wave. We don't see these in the small sample of your data that you show, but Stata tells you they exist.

Successful merging requires that for one of your two files, there is only 1 observation for any given pair of values for IDind_m and wave, while for the other file there can be multiple observations with any given combination.

That you tried both merge 1:m and merge m:1, and furthermore that they both failed, suggests you do not understand what your files contain.

Your next step is to better understand your data. Use of the duplicates command will help you understand how big the problem is in each dataset, and will allow you to view examples.

Thanks for your thread, I check duplicates in the ID_mc file:

Code:

bys IDind_m wave: gen dup2 = cond(_N==1,0,_n) . tabulate dup2 dup2 | Freq. Percent Cum. ------------+----------------------------------- 0 | 11,777 19.68 19.68 1 | 17,722 29.61 49.29 2 | 17,722 29.61 78.90 3 | 7,787 13.01 91.91 4 | 3,022 5.05 96.96 5 | 1,149 1.92 98.88 6 | 457 0.76 99.65 7 | 130 0.22 99.86 8 | 51 0.09 99.95 9 | 31 0.05 100.00 ------------+----------------------------------- Total | 59,848 100.00

There are duplicates of mother's IDind, I think the reason is one mother maybe have many children,and the second reason is that she may take part in the survey for several years.
SO, how to solve this and to merge it with the master file? Many thanks.

Last edited by Spacey Shi; 20 Mar 2017, 20:10.
Comment
Spacey Shi

Join Date: Oct 2016

Posts: 47
#10

20 Mar 2017, 20:08

Originally posted by Carlo Lazzaro View Post

Spacey:
as a last resort, you may want considering splitting your datasets by -wave-, -merge- them year per year and then -append-.

Thanks.
I check duplicates of ID_mc file:

Code:

bys IDind_m wave: gen dup2 = cond(_N==1,0,_n) . tabulate dup2 dup2 | Freq. Percent Cum. ------------+----------------------------------- 0 | 11,777 19.68 19.68 1 | 17,722 29.61 49.29 2 | 17,722 29.61 78.90 3 | 7,787 13.01 91.91 4 | 3,022 5.05 96.96 5 | 1,149 1.92 98.88 6 | 457 0.76 99.65 7 | 130 0.22 99.86 8 | 51 0.09 99.95 9 | 31 0.05 100.00 ------------+----------------------------------- Total | 59,848 100.00

And I don't understand how to split datasets by -wave-, -merge- them year per year and then -append-, can I have a small example of codes?
Many thanks.

Last edited by Spacey Shi; 20 Mar 2017, 20:11.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#11

20 Mar 2017, 20:25

Originally posted by Spacey Shi View Post

. . . one mother maybe have many children,and the second reason is that she may take part in the survey for several years. SO, how to solve this and to merge it with the master file?

Try something like this:

Code:

use ID_mc1, clear rename IDind_c IDind // The name of the variable in rst1 isid IDind_m IDind wave // confirmation of your assumption tempfile ID_mc1 quietly save `ID_mc1' // Prevent accidental changes in original // Now use rst1 isid IDind_m IDind wave merge 1:1 IDind_m IDind wave using `ID_mc1'
Comment

Spacey Shi

Join Date: Oct 2016
Posts: 47

#12

20 Mar 2017, 20:47

Originally posted by Joseph Coveney View Post

Try something like this:

Code:

use ID_mc1, clear

rename IDind_c IDind // The name of the variable in rst1

isid IDind_m IDind wave // confirmation of your assumption

tempfile ID_mc1
quietly save `ID_mc1' // Prevent accidental changes in original

// Now
use rst1

isid IDind_m IDind wave

merge 1:1 IDind_m IDind wave using `ID_mc1'

Thanks!
It works, but I don't understand what's the IDind contained now, and I need the IDind_c display in the merged file,but it isn't there.

Code:

. rename IDind_c IDind

. isid IDind_m IDind wave

. tempfile ID_mc1

. quietly save `ID_mc1'

. use rst1
(Written by R.              )

. isid IDind_m IDind wave
variables IDind_m IDind wave should never be missing
r(459);

 merge 1:1 IDind_m IDind wave using `ID_mc1'

    Result                           # of obs.
    -----------------------------------------
    not matched                       161,925
        from master                   129,361  (_merge==1)
        from using                     32,564  (_merge==2)

    matched                            27,284  (_merge==3)
    -----------------------------------------

. sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       IDind |    189,209    3.94e+11    9.42e+10   1.11e+11   5.52e+11
          a5 |    151,205    2.144466    1.934509          0         10
        wave |    189,209    2001.533    7.558792       1989       2011
        a5a1 |     29,992    1.019772    .1819866          0          3
        a5c1 |     32,531    1.018137    .1843308          0          9
-------------+---------------------------------------------------------
       adult |     76,720    .5573514    .4967032          0          1
       child |     76,721    .1137759    .3175409          0          1
        hhid |    156,645    3.92e+08    9.46e+07   1.11e+08   5.52e+08
      commid |    189,209    394099.5    94227.56     111101     552304
          t1 |    189,209    39.21578    9.420671         11         55
-------------+---------------------------------------------------------
          t2 |    189,209     1.71624    .4508231          1          2
     IDind_f |     35,555    3.97e+11    9.18e+10   1.11e+11   5.52e+11
     IDind_m |     71,165    4.00e+11    9.18e+10   1.11e+11   5.52e+11
     IDind_s |     66,118    3.79e+11    9.89e+10   1.11e+11   5.52e+11
      hhid_m |     59,848    4.00e+08    9.27e+07   1.11e+08   5.52e+08
-------------+---------------------------------------------------------
      hhid_c |     59,848    4.00e+08    9.27e+07   1.11e+08   5.52e+08
      _merge |    189,209    1.460507    .7326962          1          3

. tsset IDind wave
repeated time values within panel
r(451);

And I find the problem in the ID_mc file, like this picture:

Click image for larger version

Name: bug.png
Views: 1
Size: 42.6 KB
ID: 1379369

The same mother has two children, so mother's IDind_m appears twice, induce the problem, and I can't figure out how to solve this
.

Last edited by Spacey Shi; 20 Mar 2017, 20:54.

Comment

Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#13

20 Mar 2017, 21:43

Originally posted by Spacey Shi View Post

It works, but I don't understand what's the IDind contained now, and I need the IDind_c display in the merged file,but it isn't there.

Code:

rename IDind Dind_c

(If you wanted the children's ID variable to be named IDind_c, then you could have forgone the renaming in the using dataset and just renamed the variable in the master.)

Originally posted by Spacey Shi View Post

The same mother has two children, so mother's IDind_m appears twice, induce the problem, and I can't figure out how to solve this

What you need to do depends upon what you want to accomplish. You'll probably want to think about that first or ask whoever it was who asked you to do this analysis for guidance—the advice might be as simple as

Code:

bysort IDind_m wave (IDind_c): keep if _n == 1

to get a single observation for the mother's ID per wave, I don't know.

In the meantime, you've got serious problems with your data that you need to deal with: look at the error message from the isid command.
Comment

Announcement

Really need help,error in 1:m and m:1 when merge two datasets

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment