Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem merging data (can't find solution in previous posts): 'variable caseid does not uniquely identify observations in the using data'

    Greetings,

    I'm running Stata 15.1 on OSX. I'm working with data from the American National Election Studies' 1948-2012 cumulative time series dataset. I'd like to add subsequent years to this dataset. Most I can add via the 'append' command (as they contain unique observations/respondents). However, one of the additional datasets in question is a panel recontact survey; i.e. a subset of respondents who were sampled in 2012 was included in a follow-up survey in 2013. The respondents in the 2013 follow-up survey retain the same caseids they were assigned in the 2012 edition. I'd thus like to merge the 2013 data using the shared caseid numbers. I began by deleting duplicate observations and sorting each dataset by caseid. I then entered the following command:
    Code:
     merge 1:1 caseid using "/Users/zg/Documents/anespanel13-vars.dta"
    . I receive the following error message in response:
    variable caseid does not uniquely identify observations in the master data
    . I then attempted both a 1:m and also a m:1 merge but received the same error message. How can I correct this issue?

    This is example data from the master set:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(caseid YEAR ftblacks)
     1 2011   .
     1    .   .
     2 2010   .
     2 2011  40
     2 2011   .
     2 2012 100
     2 2012  70
     3 2010   .
     3 2011  60
     3 2011   .
     3 2012  30
     3 2012 100
     4 2010   .
     4 2011   .
     4 2011  70
     4 2012  70
     4 2012   .
     5 2010   .
     5 2011   .
     5 2011  75
     5 2012 100
     5 2012  40
     6 2010   .
     6 2011   .
     6 2011 100
     6 2012 100
     6 2012  60
     7 2010   .
     7 2011  50
     7 2011   .
     7 2012   .
     7 2012  60
     8 2010   .
     8 2011  40
     8 2011   .
     8 2012  50
     8 2012 100
     9 2010   .
     9 2011   .
     9 2011  85
     9 2012  70
     9 2012 100
    10 2010   .
    10 2011   .
    10 2011 100
    10 2012  50
    10 2012  50
    11 2010   .
    11 2011  50
    11 2011   .
    11 2012  70
    11 2012  60
    12 2010   .
    12 2011   .
    12 2011  50
    12 2012  50
    12 2012  50
    13 2010   .
    13 2011   .
    13 2011  70
    13 2012  50
    13 2012  70
    14 2010   .
    14 2011 100
    14 2011   .
    14 2012  85
    14 2012  50
    15 2010   .
    15 2011   .
    15 2011  85
    15 2012 100
    15 2012  85
    16 2010   .
    16 2011   .
    16 2011 100
    16 2012  50
    16 2012  50
    17 2010   .
    17 2011  85
    17 2011   .
    17 2012  75
    17 2012 100
    18 2010   .
    18 2011  30
    18 2011   .
    18 2012  70
    18 2012  70
    19 2010   .
    19 2011  15
    19 2011   .
    19 2012 100
    19 2012  70
    20 2010   .
    20 2011   .
    20 2011  60
    20 2012  60
    20 2012 100
    21 2010   .
    21 2011  15
    21 2011   .
    end
    label values ftblacks V162312
    and here is example data from the using set:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(caseid YEAR ftblacks)
    3003 2013  50
    3006 2013  50
    3007 2013  50
    3008 2013  85
    3011 2013  30
    3012 2013  76
    3014 2013  85
    3015 2013  85
    3016 2013  40
    3017 2013  85
    3018 2013  40
    3019 2013  50
    3020 2013  85
    3023 2013  70
    3024 2013  85
    3026 2013 100
    3027 2013  50
    3030 2013  50
    3032 2013  70
    3033 2013  30
    3035 2013  50
    3037 2013  85
    3038 2013  50
    3041 2013  50
    3045 2013  23
    3048 2013  30
    3049 2013  70
    3051 2013  70
    3053 2013  60
    3059 2013  50
    3060 2013  80
    3062 2013  60
    3063 2013  40
    3064 2013   .
    3069 2013  50
    3070 2013  30
    3071 2013  60
    3072 2013  70
    3073 2013  50
    3074 2013  70
    3075 2013  65
    3077 2013  20
    3078 2013  70
    3079 2013  50
    3080 2013  65
    3082 2013  40
    3084 2013 100
    3086 2013  95
    3090 2013  50
    3091 2013   .
    3094 2013  15
    3095 2013  50
    3099 2013  50
    3101 2013  85
    3103 2013   .
    3105 2013  50
    3106 2013  70
    3108 2013  55
    3109 2013  50
    3110 2013  60
    3112 2013  70
    3113 2013  70
    3114 2013  85
    3116 2013   3
    3117 2013  75
    3118 2013  85
    3119 2013  87
    3120 2013  50
    3121 2013   0
    3123 2013  80
    3126 2013  50
    3130 2013  70
    3132 2013  40
    3133 2013 100
    3134 2013  85
    3135 2013  60
    3137 2013  60
    3139 2013  60
    3142 2013  30
    3143 2013  70
    3145 2013  95
    3146 2013 100
    3150 2013  60
    3151 2013  70
    3152 2013 100
    3154 2013  85
    3156 2013  75
    3157 2013  60
    3158 2013  20
    3161 2013  85
    3162 2013  30
    3163 2013   .
    3164 2013  80
    3165 2013  90
    3166 2013  35
    3170 2013  85
    3171 2013 100
    3172 2013  50
    3174 2013  50
    3178 2013   .
    end
    label values ftblacks C5_N3
    Really appreciate any help with this. Thanks!


  • #2
    Well, it is blatantly obvious that caseid does not uniquely identify observations in dataset1. In fact, it is clear that every caseid appears twice or more in dataset1.

    Dataset2 does have its observations uniquely identified by caseid, and when I run
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(caseid YEAR ftblacks)
     1 2011   .
     1    .   .
     2 2010   .
     2 2011  40
     2 2011   .
     2 2012 100
     2 2012  70
     3 2010   .
     3 2011  60
     3 2011   .
     3 2012  30
     3 2012 100
     4 2010   .
     4 2011   .
     4 2011  70
     4 2012  70
     4 2012   .
     5 2010   .
     5 2011   .
     5 2011  75
     5 2012 100
     5 2012  40
     6 2010   .
     6 2011   .
     6 2011 100
     6 2012 100
     6 2012  60
     7 2010   .
     7 2011  50
     7 2011   .
     7 2012   .
     7 2012  60
     8 2010   .
     8 2011  40
     8 2011   .
     8 2012  50
     8 2012 100
     9 2010   .
     9 2011   .
     9 2011  85
     9 2012  70
     9 2012 100
    10 2010   .
    10 2011   .
    10 2011 100
    10 2012  50
    10 2012  50
    11 2010   .
    11 2011  50
    11 2011   .
    11 2012  70
    11 2012  60
    12 2010   .
    12 2011   .
    12 2011  50
    12 2012  50
    12 2012  50
    13 2010   .
    13 2011   .
    13 2011  70
    13 2012  50
    13 2012  70
    14 2010   .
    14 2011 100
    14 2011   .
    14 2012  85
    14 2012  50
    15 2010   .
    15 2011   .
    15 2011  85
    15 2012 100
    15 2012  85
    16 2010   .
    16 2011   .
    16 2011 100
    16 2012  50
    16 2012  50
    17 2010   .
    17 2011  85
    17 2011   .
    17 2012  75
    17 2012 100
    18 2010   .
    18 2011  30
    18 2011   .
    18 2012  70
    18 2012  70
    19 2010   .
    19 2011  15
    19 2011   .
    19 2012 100
    19 2012  70
    20 2010   .
    20 2011   .
    20 2011  60
    20 2012  60
    20 2012 100
    21 2010   .
    21 2011  15
    21 2011   .
    end
    label values ftblacks V162312
    
    tempfile dataset1
    save `dataset1'
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(caseid YEAR ftblacks)
    3003 2013  50
    3006 2013  50
    3007 2013  50
    3008 2013  85
    3011 2013  30
    3012 2013  76
    3014 2013  85
    3015 2013  85
    3016 2013  40
    3017 2013  85
    3018 2013  40
    3019 2013  50
    3020 2013  85
    3023 2013  70
    3024 2013  85
    3026 2013 100
    3027 2013  50
    3030 2013  50
    3032 2013  70
    3033 2013  30
    3035 2013  50
    3037 2013  85
    3038 2013  50
    3041 2013  50
    3045 2013  23
    3048 2013  30
    3049 2013  70
    3051 2013  70
    3053 2013  60
    3059 2013  50
    3060 2013  80
    3062 2013  60
    3063 2013  40
    3064 2013   .
    3069 2013  50
    3070 2013  30
    3071 2013  60
    3072 2013  70
    3073 2013  50
    3074 2013  70
    3075 2013  65
    3077 2013  20
    3078 2013  70
    3079 2013  50
    3080 2013  65
    3082 2013  40
    3084 2013 100
    3086 2013  95
    3090 2013  50
    3091 2013   .
    3094 2013  15
    3095 2013  50
    3099 2013  50
    3101 2013  85
    3103 2013   .
    3105 2013  50
    3106 2013  70
    3108 2013  55
    3109 2013  50
    3110 2013  60
    3112 2013  70
    3113 2013  70
    3114 2013  85
    3116 2013   3
    3117 2013  75
    3118 2013  85
    3119 2013  87
    3120 2013  50
    3121 2013   0
    3123 2013  80
    3126 2013  50
    3130 2013  70
    3132 2013  40
    3133 2013 100
    3134 2013  85
    3135 2013  60
    3137 2013  60
    3139 2013  60
    3142 2013  30
    3143 2013  70
    3145 2013  95
    3146 2013 100
    3150 2013  60
    3151 2013  70
    3152 2013 100
    3154 2013  85
    3156 2013  75
    3157 2013  60
    3158 2013  20
    3161 2013  85
    3162 2013  30
    3163 2013   .
    3164 2013  80
    3165 2013  90
    3166 2013  35
    3170 2013  85
    3171 2013 100
    3172 2013  50
    3174 2013  50
    3178 2013   .
    end
    label values ftblacks C5_N3
    
    tempfile dataset2
    save `dataset2'
    
    use `dataset1', clear
    merge m:1 caseid using `dataset2'
    it runs without any error messages. So I can't reproduce your problem. What I do note, however, is that the caseid values in dataset2 don't even overlap at all with those in dataset1, notwithstanding your claim that they are the same in both data sets, at least with regard to the observations for year 2012.

    So there is apparently some considerable confusion about what exactly is in these datasets and what is going on. I suggest you repost with example data that actually reproduces your problem and a clearer explanation of what you want to do.

    Comment


    • #3
      The cumulative datafile has two caseids: one is labeled as 'study respondent number: year-level case id' and the second as 'study variable: unique respondent number (cross-year ID for panel cases). I've been attempting to merge with the former. Perhaps I need to do so with the latter instead? Otherwise, it could be that I only assumed that 2012 caseids were also present in the 2013 dataset. How can I check that this is the case? Thanks!

      Comment


      • #4
        The cumulative datafile has two caseids: one is labeled as 'study respondent number: year-level case id' and the second as 'study variable: unique respondent number (cross-year ID for panel cases). I've been attempting to merge with the former. Perhaps I need to do so with the latter instead?
        I wouldn't know. I have not worked with this data set. But clearly you need to identify which variable is the appropriate one to use as caseid and also understand what corresponds to what in the different years.

        Otherwise, it could be that I only assumed that 2012 caseids were also present in the 2013 dataset. How can I check that this is the case? Thanks!
        Once you have figured out what the proper caseid variable is, try -merge-ing the 2012 and 2013 datasets on caseid. Look at the results that Stata shows you. _merge == 3 corresponds to exact matches. _merge == 1 and _merge == 2 are cases that appear in one of the data sets but not the other.

        Comment


        • #5
          Okay. I think I figured something out. I was able to do a (m:1) merge, but instead of merging the variables with the caseid of a single survey year, it merges them with every year in the file. Is there a way to fix this? Thanks again!

          Comment


          • #6
            Well, I just noticed that the variable ftblacks has the same name in the 2013 file as in the other file. So when you -merge- them, the new value is not brought in. It needs a separate variable for that.

            Code:
            tempfile dataset2
            rename ftblacks ftblacks2013
            save `dataset2'
            
            use `dataset1', clear
            merge m:1 caseid using `dataset2'
            replace ftblacks2013 = . if year != 2012
            Now that we have finally solved this problem, I have to ask why you want to do this. The resulting data set is very malformed for the purpose of any analysis I can think of. I'm really having trouble imagining how you might use this data that won't make you really regret this. Why didn't you just want to -append- the 2013 data?

            Comment

            Working...
            X