Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets with duplicates..help?

    I have two datasets that match on street address for housing transactions. My master set has duplicate addresses for transactions that happened on the same parcel of land (I want to keep these duplicate observations). I have dropped the duplicates in the geo-spatial dataset (the one I want to match to the master) that contains the address of the parcel and some spatial data. The reason I dropped the duplicate addresses for the second dataset is because all I need is the spatial data for each address. I just need to get the spatial data on the second dataset (the spatial data is distance in feet from a particular landmark) to the master dataset, all while keeping the duplicate transaction observations on the master data.

    I'm sorry if that seems rather vague or poorly explained. If you need clarification, let me know. As I am fairly new to Stata, if you have a suggestion, please make sure to fully explain the reasoning of your code; I would like to learn rather than just copy and paste.

    Thanks!

  • #2
    The code is just
    Code:
    use master_data, clear
    merge m:1 parcel_address using second_dataset
    There isn't really much reasoning to explain in this code. The -merge- command does precisely what you are looking for here. I recommend that you read the -merge- chapter in the [D] volume of the PDF documentation that comes with your Stata installation. (Run -help merge- and then click on the blue link near the top of the Viewer window to get there.) It explains the entire process of -merge-ing data sets. In your instance, your merge key is a single variable, the parcel address. The same parcel address occurs multiple times in your master data, but only once in the using data. So this is precisely the situation for an m:1 merge. When you read the -merge- chapter you will learn about options available with the -merge- command, that will enable you to tailor your results. For example, since all you really want to bring in from the second dataset is the spatial data, you can accomplish that by listing the names of the spatial variables in the -keepusing()- option. And it sounds to me as if you would have no use for observations from the second data set that do not match to any parcel in the master data set. In that case -keep(master match)- would give you that. For that matter, you might decide you don't really want any observations in the master data for which no matching spatial data could be found either. In that case it would be just -keep(master)-.

    Let me also anticipate a problem. One of the commonest threads here on Statalist is triggered when, following -merge-, Stata gives an error message saying that the key variable(s) do not uniquely identify observations in {master or using} data set. This arises in two circumstances. The first is where people are unaware that the merge key has to uniquely identify observations in one or the other data set (and the choice of 1:m or m:1 has to correspond). That probably won't be your situation since you made a point of saying that you removed duplicate parcel observations from the second dataset. So you seem to be aware of the need for that. But almost equally commonly this happens because people believe that there are no duplicates in one of their data sets, but the duplicates really are there. So if Stata gives you this message about the "using" data (your second data set), even though you think you removed the duplicates, go back and look again. I have never known Stata to be wrong about this. A good tool for finding duplicate observations like that is the aptly named -duplicates- command. Run -help duplicates- for instructions if you end up in this situation.

    Comment

    Working...
    X