Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets

    Hello,

    I am conducting a study to assess the impact of a city-level policy on tobacco use. To answer this question, I have two separate datasets; tobacco use dataset with the outcomes and policy dataset with the exposure variable. The policies are graded on a city level but the problem is that the tobacco use dataset does not have city-level variables but has zipcodes. So I created a policy dataset that has the same zipcodes and linked them to city names with each corresponding policy grade. I planned to merge both datasets using the common zipcode variable to conduct my analysis. Please note the bestzip variable is numeric in both datasets.

    I used merge 1:1, 1:m and m:1 in Stata 15.1 but I kept getting the same error messages: variable bestzip does not uniquely identify observations in the master data
    r(459);

    . merge 1:1 bestzip using "C:\Users\Tobacco_use.dta"
    variable bestzip does not uniquely identify observations in the master data
    r(459);

    . merge 1:m bestzip using "C:\Users\Tobacco_use.dta.dta"
    variable bestzip does not uniquely identify observations in the master data
    r(459);

    . merge m:1 bestzip using "C:\Users\Policydata"
    variable bestzip does not uniquely identify observations in the using data
    r(459);

    So when I checked the data, I realized there are some zipcodes that cut across different cities as seen in the last 2 lines "Los Angeles" and "Commerce" in 90022.

    . dataex bestzip city city_2012, count(23)

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long bestzip str22 city byte city_2012
    90001 "Los Angeles"          4
    90002 "Los Angeles"          4
    90003 "Los Angeles"          4
    90004 "Los Angeles"          4
    90005 "Los Angeles"          4
    90006 "Los Angeles"          4
    90007 "Los Angeles"          4
    90008 "Los Angeles"          4
    90009 "Los Angeles"          4
    90010 "Los Angeles"          4
    90011 "Los Angeles"          4
    90012 "Los Angeles"          4
    90013 "Los Angeles"          4
    90014 "Los Angeles"          4
    90015 "Los Angeles"          4
    90016 "Los Angeles"          4
    90017 "Los Angeles"          4
    90018 "Los Angeles"          4
    90019 "Los Angeles"          4
    90020 "Los Angeles"          4
    90021 "Los Angeles"          4
    90022 "Los Angeles"          4
    90022 "Commerce            " 0
    end
    My question is do I have other options to merge these two datasets using the zipcode variable or is there a way to create a unique id that I can use to merge both of them?


  • #2
    This is not a solvable problem. At least, it's not solvable by playing with -merge- or using an alternative to -merge-. The problem is that you simply don't have compatible data sets to join together, and no matter what you do to put them together, you cannot get around that unpleasant fact. You bring up the example of zip code 9022 which covers part of LA and part of Commerce. Note also that the "policy" variable city_2012 takes on different values for that zip code in the two cities. There is no way for Stata to decide, when it says zip code 90022 in the tobacco use data set whether to join that to the LA, 4, observation or the Commerce 0 observation.

    Other ways of approaching the problem all have limitations. You can simply decide for yourself that you will treat 90022 as if it were just LA (or as if it were just Commerce), modify the policy data set accordingly by dropping one of the 90022 observations and proceed with your merge on bestzip. Those decisions will have to be made based on your understanding of the "facts on the ground" in Commerce and LA. Perhaps the zip code is nearly all in one of the cities and only makes a small dip into the other. You might go with the city that has the larger coverage. Even this can be based on population coverage, area coverage, or some economic measure.

    Or you may have other knowledge about the demographics or economics of the two cities that tells you that 90022 should be considered a 4 or a 0, and pick accordingly.

    Another possibility is to simply exclude altogether from analysis any zip code that overlaps two or more cities.

    Another way is to perhaps obtain an independent source of the tobacco data that has city-level identifiers in addition to the zip codes and patch the tobacco data.

    Anyway, you can't code your way out of this problem. There are commands other than -merge- that join data sets, but what they do will only make matters worse for this.

    Bottom line: your data are broken and you need to fix them as best you can.

    Comment


    • #3
      Clyde Schechter This is very helpful. I appreciate your detailed response. Now I can move on from attempting to patch this data. Thanks again.

      Comment

      Working...
      X