Problem merging data (can't find solution in previous posts): 'variable caseid does not uniquely identify observations in the using data'

Zach Goldberg

Join Date: Jul 2017
Posts: 184

Problem merging data (can't find solution in previous posts): 'variable caseid does not uniquely identify observations in the using data'

28 May 2018, 12:12

Greetings,

I'm running Stata 15.1 on OSX. I'm working with data from the American National Election Studies' 1948-2012 cumulative time series dataset. I'd like to add subsequent years to this dataset. Most I can add via the 'append' command (as they contain unique observations/respondents). However, one of the additional datasets in question is a panel recontact survey; i.e. a subset of respondents who were sampled in 2012 was included in a follow-up survey in 2013. The respondents in the 2013 follow-up survey retain the same caseids they were assigned in the 2012 edition. I'd thus like to merge the 2013 data using the shared caseid numbers. I began by deleting duplicate observations and sorting each dataset by caseid. I then entered the following command:

Code:

 merge 1:1 caseid using "/Users/zg/Documents/anespanel13-vars.dta"

. I receive the following error message in response:

variable caseid does not uniquely identify observations in the master data

. I then attempted both a 1:m and also a m:1 merge but received the same error message. How can I correct this issue?

This is example data from the master set:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(caseid YEAR ftblacks)
 1 2011   .
 1    .   .
 2 2010   .
 2 2011  40
 2 2011   .
 2 2012 100
 2 2012  70
 3 2010   .
 3 2011  60
 3 2011   .
 3 2012  30
 3 2012 100
 4 2010   .
 4 2011   .
 4 2011  70
 4 2012  70
 4 2012   .
 5 2010   .
 5 2011   .
 5 2011  75
 5 2012 100
 5 2012  40
 6 2010   .
 6 2011   .
 6 2011 100
 6 2012 100
 6 2012  60
 7 2010   .
 7 2011  50
 7 2011   .
 7 2012   .
 7 2012  60
 8 2010   .
 8 2011  40
 8 2011   .
 8 2012  50
 8 2012 100
 9 2010   .
 9 2011   .
 9 2011  85
 9 2012  70
 9 2012 100
10 2010   .
10 2011   .
10 2011 100
10 2012  50
10 2012  50
11 2010   .
11 2011  50
11 2011   .
11 2012  70
11 2012  60
12 2010   .
12 2011   .
12 2011  50
12 2012  50
12 2012  50
13 2010   .
13 2011   .
13 2011  70
13 2012  50
13 2012  70
14 2010   .
14 2011 100
14 2011   .
14 2012  85
14 2012  50
15 2010   .
15 2011   .
15 2011  85
15 2012 100
15 2012  85
16 2010   .
16 2011   .
16 2011 100
16 2012  50
16 2012  50
17 2010   .
17 2011  85
17 2011   .
17 2012  75
17 2012 100
18 2010   .
18 2011  30
18 2011   .
18 2012  70
18 2012  70
19 2010   .
19 2011  15
19 2011   .
19 2012 100
19 2012  70
20 2010   .
20 2011   .
20 2011  60
20 2012  60
20 2012 100
21 2010   .
21 2011  15
21 2011   .
end
label values ftblacks V162312

and here is example data from the using set:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(caseid YEAR ftblacks)
3003 2013  50
3006 2013  50
3007 2013  50
3008 2013  85
3011 2013  30
3012 2013  76
3014 2013  85
3015 2013  85
3016 2013  40
3017 2013  85
3018 2013  40
3019 2013  50
3020 2013  85
3023 2013  70
3024 2013  85
3026 2013 100
3027 2013  50
3030 2013  50
3032 2013  70
3033 2013  30
3035 2013  50
3037 2013  85
3038 2013  50
3041 2013  50
3045 2013  23
3048 2013  30
3049 2013  70
3051 2013  70
3053 2013  60
3059 2013  50
3060 2013  80
3062 2013  60
3063 2013  40
3064 2013   .
3069 2013  50
3070 2013  30
3071 2013  60
3072 2013  70
3073 2013  50
3074 2013  70
3075 2013  65
3077 2013  20
3078 2013  70
3079 2013  50
3080 2013  65
3082 2013  40
3084 2013 100
3086 2013  95
3090 2013  50
3091 2013   .
3094 2013  15
3095 2013  50
3099 2013  50
3101 2013  85
3103 2013   .
3105 2013  50
3106 2013  70
3108 2013  55
3109 2013  50
3110 2013  60
3112 2013  70
3113 2013  70
3114 2013  85
3116 2013   3
3117 2013  75
3118 2013  85
3119 2013  87
3120 2013  50
3121 2013   0
3123 2013  80
3126 2013  50
3130 2013  70
3132 2013  40
3133 2013 100
3134 2013  85
3135 2013  60
3137 2013  60
3139 2013  60
3142 2013  30
3143 2013  70
3145 2013  95
3146 2013 100
3150 2013  60
3151 2013  70
3152 2013 100
3154 2013  85
3156 2013  75
3157 2013  60
3158 2013  20
3161 2013  85
3162 2013  30
3163 2013   .
3164 2013  80
3165 2013  90
3166 2013  35
3170 2013  85
3171 2013 100
3172 2013  50
3174 2013  50
3178 2013   .
end
label values ftblacks C5_N3

Really appreciate any help with this. Thanks!

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30064

28 May 2018, 13:23

Well, it is blatantly obvious that caseid does not uniquely identify observations in dataset1. In fact, it is clear that every caseid appears twice or more in dataset1.

Dataset2 does have its observations uniquely identified by caseid, and when I run

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(caseid YEAR ftblacks)
 1 2011   .
 1    .   .
 2 2010   .
 2 2011  40
 2 2011   .
 2 2012 100
 2 2012  70
 3 2010   .
 3 2011  60
 3 2011   .
 3 2012  30
 3 2012 100
 4 2010   .
 4 2011   .
 4 2011  70
 4 2012  70
 4 2012   .
 5 2010   .
 5 2011   .
 5 2011  75
 5 2012 100
 5 2012  40
 6 2010   .
 6 2011   .
 6 2011 100
 6 2012 100
 6 2012  60
 7 2010   .
 7 2011  50
 7 2011   .
 7 2012   .
 7 2012  60
 8 2010   .
 8 2011  40
 8 2011   .
 8 2012  50
 8 2012 100
 9 2010   .
 9 2011   .
 9 2011  85
 9 2012  70
 9 2012 100
10 2010   .
10 2011   .
10 2011 100
10 2012  50
10 2012  50
11 2010   .
11 2011  50
11 2011   .
11 2012  70
11 2012  60
12 2010   .
12 2011   .
12 2011  50
12 2012  50
12 2012  50
13 2010   .
13 2011   .
13 2011  70
13 2012  50
13 2012  70
14 2010   .
14 2011 100
14 2011   .
14 2012  85
14 2012  50
15 2010   .
15 2011   .
15 2011  85
15 2012 100
15 2012  85
16 2010   .
16 2011   .
16 2011 100
16 2012  50
16 2012  50
17 2010   .
17 2011  85
17 2011   .
17 2012  75
17 2012 100
18 2010   .
18 2011  30
18 2011   .
18 2012  70
18 2012  70
19 2010   .
19 2011  15
19 2011   .
19 2012 100
19 2012  70
20 2010   .
20 2011   .
20 2011  60
20 2012  60
20 2012 100
21 2010   .
21 2011  15
21 2011   .
end
label values ftblacks V162312

tempfile dataset1
save `dataset1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(caseid YEAR ftblacks)
3003 2013  50
3006 2013  50
3007 2013  50
3008 2013  85
3011 2013  30
3012 2013  76
3014 2013  85
3015 2013  85
3016 2013  40
3017 2013  85
3018 2013  40
3019 2013  50
3020 2013  85
3023 2013  70
3024 2013  85
3026 2013 100
3027 2013  50
3030 2013  50
3032 2013  70
3033 2013  30
3035 2013  50
3037 2013  85
3038 2013  50
3041 2013  50
3045 2013  23
3048 2013  30
3049 2013  70
3051 2013  70
3053 2013  60
3059 2013  50
3060 2013  80
3062 2013  60
3063 2013  40
3064 2013   .
3069 2013  50
3070 2013  30
3071 2013  60
3072 2013  70
3073 2013  50
3074 2013  70
3075 2013  65
3077 2013  20
3078 2013  70
3079 2013  50
3080 2013  65
3082 2013  40
3084 2013 100
3086 2013  95
3090 2013  50
3091 2013   .
3094 2013  15
3095 2013  50
3099 2013  50
3101 2013  85
3103 2013   .
3105 2013  50
3106 2013  70
3108 2013  55
3109 2013  50
3110 2013  60
3112 2013  70
3113 2013  70
3114 2013  85
3116 2013   3
3117 2013  75
3118 2013  85
3119 2013  87
3120 2013  50
3121 2013   0
3123 2013  80
3126 2013  50
3130 2013  70
3132 2013  40
3133 2013 100
3134 2013  85
3135 2013  60
3137 2013  60
3139 2013  60
3142 2013  30
3143 2013  70
3145 2013  95
3146 2013 100
3150 2013  60
3151 2013  70
3152 2013 100
3154 2013  85
3156 2013  75
3157 2013  60
3158 2013  20
3161 2013  85
3162 2013  30
3163 2013   .
3164 2013  80
3165 2013  90
3166 2013  35
3170 2013  85
3171 2013 100
3172 2013  50
3174 2013  50
3178 2013   .
end
label values ftblacks C5_N3

tempfile dataset2
save `dataset2'

use `dataset1', clear
merge m:1 caseid using `dataset2'

it runs without any error messages. So I can't reproduce your problem. What I do note, however, is that the caseid values in dataset2 don't even overlap at all with those in dataset1, notwithstanding your claim that they are the same in both data sets, at least with regard to the observations for year 2012.

So there is apparently some considerable confusion about what exactly is in these datasets and what is going on. I suggest you repost with example data that actually reproduces your problem and a clearer explanation of what you want to do.

Comment

Zach Goldberg

Join Date: Jul 2017

Posts: 184
#3

28 May 2018, 14:00

The cumulative datafile has two caseids: one is labeled as 'study respondent number: year-level case id' and the second as 'study variable: unique respondent number (cross-year ID for panel cases). I've been attempting to merge with the former. Perhaps I need to do so with the latter instead? Otherwise, it could be that I only assumed that 2012 caseids were also present in the 2013 dataset. How can I check that this is the case? Thanks!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#4

28 May 2018, 14:18

The cumulative datafile has two caseids: one is labeled as 'study respondent number: year-level case id' and the second as 'study variable: unique respondent number (cross-year ID for panel cases). I've been attempting to merge with the former. Perhaps I need to do so with the latter instead?

I wouldn't know. I have not worked with this data set. But clearly you need to identify which variable is the appropriate one to use as caseid and also understand what corresponds to what in the different years.

Otherwise, it could be that I only assumed that 2012 caseids were also present in the 2013 dataset. How can I check that this is the case? Thanks!

Once you have figured out what the proper caseid variable is, try -merge-ing the 2012 and 2013 datasets on caseid. Look at the results that Stata shows you. _merge == 3 corresponds to exact matches. _merge == 1 and _merge == 2 are cases that appear in one of the data sets but not the other.
Comment
Zach Goldberg

Join Date: Jul 2017

Posts: 184
#5

28 May 2018, 14:46

Okay. I think I figured something out. I was able to do a (m:1) merge, but instead of merging the variables with the caseid of a single survey year, it merges them with every year in the file. Is there a way to fix this? Thanks again!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#6

28 May 2018, 14:55

Well, I just noticed that the variable ftblacks has the same name in the 2013 file as in the other file. So when you -merge- them, the new value is not brought in. It needs a separate variable for that.

Code:

tempfile dataset2 rename ftblacks ftblacks2013 save `dataset2' use `dataset1', clear merge m:1 caseid using `dataset2' replace ftblacks2013 = . if year != 2012

Now that we have finally solved this problem, I have to ask why you want to do this. The resulting data set is very malformed for the purpose of any analysis I can think of. I'm really having trouble imagining how you might use this data that won't make you really regret this. Why didn't you just want to -append- the 2013 data?
Comment

Announcement

Problem merging data (can't find solution in previous posts): 'variable caseid does not uniquely identify observations in the using data'

Comment

Comment

Comment

Comment

Comment