Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with merging datasets - unique ID repeats for twins/triplets for some datasets but not others

    Hello everyone,

    I have run into a problem when attempting to merge datasets on stata 14. I am using panel data, with three cohorts, called the “Child of the new millennium”. It contains data on children born in 2000. I am currently attempting to merge the datasets within each cohort. But the problem is, is that the ID does not uniquely identify each response in some datasets – in particular data-sets asking specific questions about the child. This is because the sample includes twins/triplets so, for about 300-400 observations, there is more than one code. There is a dummy variable included which allows me to identify twins and triplets but apart from that there is no way to identify them.

    But in other datasets, the ID appears only once so I am able to successfully merge in these.

    Unfortunately, I need to keep the twins and triplets in my sample – I was wondering if anyone could help? I'd really appreciate any advice that can be given

    I am new to both stata and this forum, please let me know if any more information is needed!

    Thanks in advance, Kishan
    Last edited by Kishan Chotalia; 28 Feb 2017, 09:18.

  • #2
    Well, if in all of your data sets but one the ID appears only once, and then you have just one data set where the ID sometimes occurs multiply, then it shouldn't be a problem. You can merge together all of the former data sets using -merge 1:1 ID-, and then starting from that result, -merge 1:m ID- using the latter. Everything should go smoothly.

    Now, perhaps you have several data sets where the IDs occur multiply. Then you have more than just a Stata -merge- problem, you have a conceptual dilemma. When you -merge- data sets, you want to put together the observations from one data set that correspond to the same people in the other. So let's say ID 1000 appears twice in each of two data sets, corresponding to two twins in datasets A and B. How do you know which of the two observations for ID 1000 in dataset A goes with which of the two observations for ID 1000 in dataset B? If the data sets you are dealing with were professionally curated, it is likely that there is some additional variable that distinguishes them. This would be a variable such as "birth order" that runs 1, 2, 3... and so ID=1000 birth_order = 1 would uniquely identify observations in each data set and would refer to the same person in every data set where it occurs. If there is such a variable (or perhaps a group of variables that jointly accomplish the same thing), then your sequence of operations would be:

    1. Merge together all the data sets where ID uniquely identifies observations using -merge 1:1 ID-.
    2. Merge that result with the first data set where ID does not uniquely identify observations using -merge 1:m ID-.
    3. Merge that result with each of the remaining data sets using -merge 1:1 ID birth_order-.

    If this doesn't clarify the situation for you, I suggest you post back showing examples of the data from a few of the data sets illustrating the problem. Be sure to use the -dataex- command to do that: see FAQ #12 for how to install and use -dataex-.

    Comment

    Working...
    X