Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets on 2 variables

    I am trying to merge 2 datasets based on 2 variables. I am using the menu options "Data -> Combine datasets -> Merge two datasets", but only about half of the observations matched. Is there code I could adapt that allows me to merge 2 datasets on 2 variables? This is my first post here, so I'm not sure what other information you'd need from me. Thanks.

  • #2
    Emma:
    welcome to this forum.
    As you di not give us enough details, it's difficult (for me, at any rate) to reply positively.
    Assuming that you want to -merge- your 2 datasets on 1:1 basis, you may want to try:
    Code:
    merge 1:1 <var1> <var2> using <yourusingfile>.dta
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Hello Carlo, and thank you for your reply and patience with me... I am trying to figure out how to use dataex to provide a concrete example. In the meantime, I've been using this blog post:
      https://blog.stata.com/2011/05/27/me...le-key-merges/

      The two variables to merge on are "region" and "epi_week".

      I've determined that when I merge by the variable "region", everything matches:

      Code:
       sort region
      
      . by region: keep if _n==1
      (1,872 observations deleted)
      
      . merge 1:m region using "/WeatherEpiWeek5.dta"
      (label region1 already defined)
      
          Result                           # of obs.
          -----------------------------------------
          not matched                             0
          matched                             2,041  (_merge==3)
          -----------------------------------------
      the problem is with the variable epi_week

      Code:
      use "/CasesEpiWeek.dta", clear
      
      . sort epi_week
      
      . by epi_week: keep if _n==1
      (1,740 observations deleted)
      
      . merge 1:m epi_week using "/WeatherEpiWeek5.dta"
      (label region1 already defined)
      
      Result # of obs.
      -----------------------------------------
      not matched 870
      from master 51 (_merge==1)
      from using 819 (_merge==2)
      
      matched 1,222 (_merge==3)
      -----------------------------------------
      At this point I don't know how to troubleshoot further. epi_week was a variable created in each of the datasets I'm merging using the EPIWEEK module. I don't know why some weeks matched and other didn't in the merge.

      any advice or further questions are welcome.

      Comment


      • #4
        So, what Stata is telling you is that each of the data sets contain some values of epi_week that are not found in the other. I have never known Stata to get this wrong.

        You have not shown example data, nor really described what the data represent and where they come from, so it is not possible to give specific advice. But here are some things to think about.

        1. Why do you expect that both data sets will have exactly the same values of epi_week? Where did you get these data from? Is there documentation accompanying them that says things that lead you to expect that? Maybe they aren't supposed to have exactly the same epi_week values?

        If you have compelling reasons to believe that both data sets should have the same values of epi_week, then you need to explore why, in reality, they don't.

        2. Could one or both of the data sets have erroneous data? Again, where did these come from? If you yourself created them, perhaps there were errors in the data management. If somebody else provided them, perhaps they made errors. Can you contact them to make sure they sent you the right data?

        3. I don't know what kind of variable epi_week is, but if it's a string variable, perhaps there are differences in capitalization or punctuation, or blank padding that are causing values that, to your mind should match, but that aren't actual exact matches. For example "Abc" is not a match to "abc". Neither does "A, BC" match "A; BC" and neither of those will match "A, BC" If epi_week contains names, there might be spelling errors: "Carrol" does not match "Carol".

        A start on pursuing #2 and #3 would be:
        Code:
        browse if _merge == 1
        and after you have a sense of those discrepancies
        Code:
        browse if _merge == 2
        Looking at the unmatched observations in isolation, something may jump out at you as characterizing the nature of the problem.

        Comment


        • #5
          Thanks for those browse codes! I used an old/incorrect do-file to calculate epi_week in one of the datasets. Problem solved Many thanks!

          Comment

          Working...
          X