
  • Doubts in merging two datasets

    Good morning everyone,
    I merged two datasets using a 1:m merge. However, some observations didn't match, mostly from the master dataset.

    Since I struggle to understand the logic of "merge", I have a few questions:
    1) How do you know if the type of merge (1:m or m:1) is correct for your purpose?
    2) Why are some observations not merging, even if the variable name is exactly the same?
    3) Suppose you do a merge and obtain unmatched observations: how would you suggest proceeding with the analysis?

    Thanks.

  • #2
    Originally posted by Fabio Delisio
    1) How do you know if the type of merge (1:m or m:1) is correct for your purpose?
    The character before the colon refers to the dataset currently open (the master), and the character after the colon to the file specified in using. Say you have two datasets. Dataset1 contains survey data collected in different countries, so a row in this dataset is a person. The country is stored in a variable called country, and each country appears multiple times because multiple people from the same country were interviewed. Dataset2 contains the GDP per capita in a given year for a set of countries, so a row now represents a country and each country appears exactly once.

    So if dataset1 (the dataset with individuals) is currently open, then you would type:

    merge m:1 country using dataset2

    because each country can appear multiple times in the dataset in memory but only once in the dataset specified in using.

    If dataset2 is currently open, then you would type

    merge 1:m country using dataset1
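
    If you are not sure which side is the "1" and which is the "m", you can let Stata check before merging. The sketch below uses made-up toy data (the variable names id and gdppc, the placeholder values, and the temporary file are assumptions for illustration, not the original poster's data). isid stops with an error if the key does not uniquely identify the observations, and duplicates report shows how often each key value is repeated.

    ```stata
    * Toy version of dataset2: one row per country (values are placeholders)
    clear
    input str12 country gdppc
    "Germany" 1
    "Italy" 2
    "Vatican City" 3
    end
    * Passes silently: country uniquely identifies rows, so this is the "1" side
    isid country
    tempfile gdp
    save `gdp'

    * Toy version of dataset1: one row per respondent
    clear
    input id str12 country
    1 "Germany"
    2 "Germany"
    3 "Italy"
    end
    * Shows that country is repeated, so this is the "m" side
    duplicates report country

    * Master has many rows per country, using has one: m:1
    merge m:1 country using `gdp'
    ```

    If you typed 1:m here instead, merge would stop with an error, because country does not uniquely identify the observations in the master data. That error is a useful signal that the merge type does not fit the data.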

    Originally posted by Fabio Delisio
    2) Why are some observations not merging, even if the variable name is exactly the same?
    Sometimes there just isn't a match. Continuing the example, the survey was collected in a few countries, while the GDP per capita was collected in many countries. If your survey did not take place in Vatican City, but dataset2 does contain GDP per capita data for Vatican City, then no match will be found for Vatican City. This is usually not a problem, and you can just remove the extra observations the merge command creates.
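
    As a minimal sketch of removing those extra observations (toy data again; id, gdppc, and the placeholder values are assumptions for illustration): merge records the outcome for each observation in a variable called _merge, where 1 means the observation was found only in the master data, 2 only in the using data, and 3 in both.

    ```stata
    * Toy GDP file: includes a country that was not surveyed
    clear
    input str12 country gdppc
    "Germany" 1
    "Vatican City" 2
    end
    tempfile gdp
    save `gdp'

    * Toy survey file: respondents from Germany only
    clear
    input id str12 country
    1 "Germany"
    2 "Germany"
    end

    merge m:1 country using `gdp'
    * _merge == 2: rows that exist only in the using file (here Vatican City)
    drop if _merge == 2

    * Alternative that never adds those rows in the first place:
    * merge m:1 country using `gdp', keep(master match)
    ```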

    Sometimes, there should be a match. For example, the variable country is a string with the country name, and in dataset1 you use Luxemburg as one of the country names, while dataset2 uses Luxembourg for that same country (or Ivory Coast and Côte d'Ivoire, or ... ). In this case there is a problem, and you need to fix it before doing the merge again. The merge command leaves behind a variable called _merge, which identifies what happened to each observation. You just stare at the values of the problem observations long enough until you figure out what the problem is, and then you fix it. Usually this does not take that long, but sometimes it can be tricky. That is not a big problem; you just need to stare at it a little bit longer.
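
    The "stare at it, then fix it" routine could look roughly like this. It is a sketch on toy data: id, gdppc, the placeholder values, and the choice to standardize on the spelling Luxembourg are all assumptions for illustration.

    ```stata
    * Toy GDP file
    clear
    input str12 country gdppc
    "Germany" 1
    "Luxembourg" 2
    end
    tempfile gdp
    save `gdp'

    * Toy survey file with a different spelling of the same country
    clear
    input id str12 country
    1 "Germany"
    2 "Luxemburg"
    3 "Luxemburg"
    end
    tempfile survey
    save `survey'

    * First attempt: cross-tabulate the key against _merge to see what failed
    merge m:1 country using `gdp'
    tabulate country _merge

    * Fix the spelling in the survey data, then merge again from scratch
    use `survey', clear
    replace country = "Luxembourg" if country == "Luxemburg"
    merge m:1 country using `gdp'
    * Stops with an error if anything is still unmatched
    assert _merge == 3
    ```

    In the first tabulate, Luxemburg shows up as master only and Luxembourg as using only, which is the typical signature of a spelling mismatch.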



    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Going off on a tangent from Maarten Buis' outstanding explanation of -merge-: if you ever find yourself confronting the problem of countries with alternative name spellings that he mentions, there is a fantastic tool available for resolving almost all of these problems. Rafal Raciborski's -kountry-, available from SSC, can reconcile nearly all such differences, and it can also crosswalk the commonly used standardized country coding systems if you are confronting a pair of data sets that use different ones.
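
      A rough sketch of how that might look on a toy variable. The from(other) and marker options and the generated variables NAMES_STD and MARKER are written from my recollection of the -kountry- help file, so treat them as assumptions and check -help kountry- after installing.

      ```stata
      * One-time installation of the user-written command from SSC
      ssc install kountry

      * Toy data with free-text country names
      clear
      input str20 country
      "Ivory Coast"
      "Germany"
      end

      * from(other): the names do not follow one of the built-in coding schemes
      * marker: asks for an indicator of which names were standardized
      kountry country, from(other) marker
      list country NAMES_STD MARKER
      ```

      The idea would be to run the same step on both data sets and then merge on the standardized name instead of the original one.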



      • #4
        Thank you both for the clear explanation and help!
