Problems merging data

Eli Omo

Join Date: Aug 2021
Posts: 40

Problems merging data

12 Oct 2021, 08:24

Hi all,

I have been trying to merge my data using key variables but error I keep getting is : " key variable do not uniquely identify observations in the using data"

The is my first data
----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 id float(date days) byte measurement float date1 int(sugar ice smoke sugar2 ice1 smoke1 ice2 cie3) float(number date4)
"44" 21213 130 12 21327 120 70   .  . . . . . 18 21369
"44" 21213 130  8 21290 112 64   .  . . . . . 18 21369
"30" 21213 130  2 21329 140 85   .  . . . . . 19 21369
"30" 21213 130 10 21306 120 70   .  . . . . . 18 21369
"30" 21213 130  4 21251 126 62   .  . . . . . 18 21369
"30" 21213 130  6 21264 110 60   .  . . . . . 18 21369
"30" 21213 144  5 21255 133 62   .  . . . . . 18 21369
""       . 144  3 21241 108 70   .  . . . . . 18 21369
"30" 21213 130 11 21318 130 70 130 60 . . . . 18 21369
"30" 21213 144  2 21235 124 76   .  . . . . . 18 21369
"30" 21213 130  1 21227 118 62   .  . . . . . 18 21369
"30" 21213 130  7 21269 120 58   .  . . . . . 18 21369
"44" 21213 130  1 21284 105 55   .  . . . . . 19 21369
"44" 21213 130  9 21297 124 60   .  . . . . . 18 21369
"44" 21213 130  1 21192 112 64   .  . . . . . 17 21369
end
format %tdDD/NN/CCYY date
format %td date1
format %tdDD/NN/CCYY date4

This is the second data:

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 id int(read1 read2) float(time self days value exp1 obse1)
"419" 110 58             . 1 1 . . .
"3"   120 46 1.8054964e+12 1 4 . . .
"39"   81 44 1.8510804e+12 1 3 . . .
"39"   84 50  1.724345e+12 1 3 . . .
"39"   84 50 1.8505718e+12 1 3 . . .
"39"  109 45             . 1 3 . . .
"39"  143 50  1.850501e+12 1 3 . . .
"39"  153 50 1.8138373e+12 1 4 . . .
"373" 120 48 1.8326545e+12 1 5 . . .
"373" 123 59 1.8765867e+12 1 5 . . .
"373" 120 58   1.85383e+12 1 5 . . .
"87"  120 50 1.8529647e+12 1 5 . . .
"87"  100 50 1.8534015e+12 1 5 . . .
"87"   89 50 1.7746176e+12 1 5 . . .
"87"    9 50 1.8555128e+12 1 5 . . .
"26"  121 48 1.7681596e+12 1 5 . . .
end
format %tcDD/NN/CCYY_HH:MM time
format %tdDD/NN/CCYY value

------------------ copy up to and including the previous line ------------------

Listed 16 out of 16 observations

Any help would be greatly appreaciated

Tags: None

Ken Chui

Join Date: Aug 2014

Posts: 1058
#2

12 Oct 2021, 08:57

The error means what it says, the variables used for merging is not unique in the data set. The question did not show the actual merging command or did not describe the actual plan of the merging so it's hard to tell what went wrong. However, assuming you've attempted a 1:1 merge using id as the matching variable, then it would be caused by the fact that in both data there are multiple copies of the same id.

Again, without knowing the actual intention, I cannot suggest anything. But there is also another command -joinby- that can do multiple merge (if that was the original intention.)
Comment
Eli Omo

Join Date: Aug 2021

Posts: 40
#3

13 Oct 2021, 05:19

Thanks for your reply. I think there are multiple copies of the same id because the data is currently in the long format. But what I intend to do is to add some variables that are missing in the first dataset from the second one. So both data sets are in the long format and have the same id but some variable e.g reading1 from the second data set I'd like to add to the first one
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1058
#4

13 Oct 2021, 05:51

Originally posted by Eli Omo View Post

Thanks for your reply. I think there are multiple copies of the same id because the data is currently in the long format. But what I intend to do is to add some variables that are missing in the first dataset from the second one. So both data sets are in the long format and have the same id but some variable e.g reading1 from the second data set I'd like to add to the first one

Then that would depend on of the variables you plan to add vary within an ID or constant within an ID. If it's constant, then you can check out -collapse- to compress the supplemental data set into one ID a line, and then use m:1 merge to combine them.

If the variable you plan to add vary within ID, then it'd require you to identify additional variable (e.g. id number + year, etc.) so that it's unique enough for the collapsed data to be matched.
Comment
Eli Omo

Join Date: Aug 2021

Posts: 40
#5

13 Oct 2021, 07:10

The variable I want to add varies with ID. Would you be able to assist me with the code to collapse the supplemental data?
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1058
#6

13 Oct 2021, 14:34

Originally posted by Eli Omo View Post

The variable I want to add varies with ID. Would you be able to assist me with the code to collapse the supplemental data?

I'd like to help but I (and many others here) cannot render any help unless the question is clear. Please take a moment to read the FAQ (http://www.statalist.org/forums/help) on how to ask a good question. Basically, you'll need to:

1) Clearly describe what you're trying to achieve, instead of what didn't work.
2) Use -dataex- to show example. Which you did, but your examples are not useful because none of the ID exists in both data sets, so we will not be able even try our code to merge them. Try to use something like:

Code:

dataex var1 var2 var3 if inlist(id, "30", "44", "39", "87")

so that both data sets would have the same people.
3) Try to word the question with more information. For example, "I have this data 1 and this data 2. I'd like to add variable ABC from data 2 to data 1, matched by ID and XYZ."
4) Show the code you used, if any.

Last edited by Ken Chui; 13 Oct 2021, 14:37.
1 like
Comment
Eli Omo

Join Date: Aug 2021

Posts: 40
#7

18 Oct 2021, 12:44

Hi thank you very much for your help. The issue is that I am currently using fake data but I will try again to explain what I mean. So I have to twists of data with the common variable id (which idenitifies the person number) but there are some variables e.g "value " regarding " id" that are only in the second dataset but cannot be found in the first dataset. Currently as you can see above each "id" has multiple lines ( wide data) but I want to collapse each datasets so that each id only contains one line and then merge/join( unsure which one is suitable ) both datasets to have all information about the "id" in one dataset. I hope this is clearer. And once again thanks for all your help
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1058
#8

18 Oct 2021, 14:07

Originally posted by Eli Omo View Post

Hi thank you very much for your help. The issue is that I am currently using fake data but I will try again to explain what I mean. So I have to twists of data with the common variable id (which idenitifies the person number) but there are some variables e.g "value " regarding " id" that are only in the second dataset but cannot be found in the first dataset. Currently as you can see above each "id" has multiple lines ( wide data) but I want to collapse each datasets so that each id only contains one line and then merge/join( unsure which one is suitable ) both datasets to have all information about the "id" in one dataset. I hope this is clearer. And once again thanks for all your help

No, it's not clear. Because in the second data set all data under "value" is missing. It's fine to demonstrate with fake data, but do make sure they make sense.

I'm going to use another example here to show how to do that if the information to be merged does not change within each id number:

Code:

clear input id x1 1 1 1 1 2 1 2 1 3 1 3 1 end save data1, replace clear input id value 1 17 1 17 2 19 2 19 3 21 3 21 end * If the information is constant within each id (e.g. id 1 is all 17, id 2 is all 19...) collapse (mean) value, by(id) merge 1:m id using data1 list

If the information to be merged also vary inside each id (aka, there are many values within id = 1), then this method will not work. And again, I have no idea what the truth is.
Comment
Eli Omo

Join Date: Aug 2021

Posts: 40
#9

18 Oct 2021, 16:23

Thanks Ken, the information to be merged also vary inside each.
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1058
#10

18 Oct 2021, 17:39

Originally posted by Eli Omo View Post

Thanks Ken, the information to be merged also vary inside each.

Then at least two possible next steps:

1) Is there a second, third, etc. variable that if you keep focusing in, will eventually see the "value" becoming constant? For example, here we added "year", while "value" is not the same within id, it's the same within id-year, so we can aggregate them:

Code:

clear input id x1 year 1 1 2020 1 1 2021 1 1 2020 1 1 2021 2 1 2020 2 1 2021 2 1 2020 2 1 2021 end save data1, replace clear input id value year 1 17 2020 1 17 2021 2 19 2020 2 19 2021 end * If the information is constant within each id (e.g. id 1 is all 17, id 2 is all 19...) collapse (mean) value, by(id year) merge 1:m id year using data1 list

2) Another way is to consider extracting one number from this different "values" for each id (could be their mean, min, max, first, last, etc.) and then merge those back to the first data.

Either way, collapse and merge would probably be the major commands. Check out:

Code:

help merge help collapse

to see how they work. Click on the PDF link to see the PDF version, in which you may find more examples.
Comment
Eli Omo

Join Date: Aug 2021

Posts: 40
#11

19 Oct 2021, 03:38

Thanks I see what you mean I think I will have a detailed look at the merge and collapse commands
Comment

Announcement

Problems merging data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment