Merging datasets together

Vitor Tardelli

Join Date: Apr 2019

Posts: 2
#1

Merging datasets together

19 Apr 2019, 09:29

I am currently in a tricky situation. I have a dataset with 83,311 observations and 109 variables. I need to add a categorical variable, "year", to it ("year" is either =2015 or =2016). I have created from other files a new Stata dataset with that one variable, "year", only, also with the same 83,311 observations. How do I incorporate this new variables (with matching observations) into my first dataset, with all the other variables?

Best and happy easter!

Vitor
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

19 Apr 2019, 11:50

Welcome to Statalist.

The command you need is the merge command. I will guess that you have an identifier - which I will call ID - which takes a different value for each of the 83.311 observations, and you want to match the observations on the value of ID.

Code:

use dataset1 merge 1:1 ID using dataset2

might start you in a useful direction. First you should read the output of help merge to understand how to use the command with your data and how to interpret its output.

With that said, to improve the likelihood of a more helpful answer to future questions, please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], and to use the dataex command to provide sample data, as described in section 12 of the FAQ.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
Comment
Vitor Tardelli

Join Date: Apr 2019

Posts: 2
#3

19 Apr 2019, 14:32

Dear William,

Thanks for your answer. I am kind of desperate about this. What I now notice having done before is deleting many variables I wouldn't need from the original dataset, so to make it fit into Stata IC. I have a Stata SE license now, so that wouldn't have been necessary at the end. Well, it happens that I forgot to include the ID variable in my final bank, which I did all my analysis with. I deleted some of the individuals from the original dataset as well.

So here's what I am trying to do. I started over from my original dataset, with 114k observations, and deleted the same individuals I did at the first time. It seems to have worked, since I came to the same final number of observations (83,311). This time, I kept the ID variable and the variable I need to merge, as well as many others. I used Data > Combine Datasets > Merge two datasets. It asks for key variables, but even if I insert all variables I have, it says "variable questid2 does not uniquely identify observations in the master data". In the "type of merge" option, it works when I change from "one-to-one on key variables" to "one-to-one by observation", but the logistic regression models using variables from the merged dataset simply won't work. I think the "one-to-one by observation" does not properly match observations from both datasets. Any insights?

Thanks again!
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#4

19 Apr 2019, 15:16

This is a really hard question to answer because you don't provide clear information about what variables you have in the two datasets you are trying to merge. You should reread the FAQ to get some good advice on how to frame questions so we can more easily help.

It's very rare that you would want to try to merge without using an ID variable. You should only do this if you are absolutely positive that every observation in your master data matches an observation in your using data AND that the observations are in the exact same order in both datasets. That second part is crucial and if it isn't true matching by observation will result in garbage data.

In your case it sounds like you need to look carefully at your data and determine what variable or combination of variables uniquely identify observations in your data. You can use the isid command to verify whether a variable (or combination of variables) uniquely identifies your observations. If you have multiple observations per ID value it's crucial that you figure out why that is. Do you expect to have multiple observations within a given ID? Is it a sign of a problem with your data?

If your master data has multiple observations per ID level you may still be able to do a straight-forward merge using a m:1 match. For this to work you need for your ID variable to be a unique ID (each value identifies one and only one observation) in the using data set. If you have multiple observations for each value of ID in both datasets you'll need to take a different tact.

If you give us clear information about your data (include examples of both datasets using dataex) we can help you. As it is now any specific advice I might give would be based on guessing at what your data looks like.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

20 Apr 2019, 09:17

I started over from my original dataset, with 114k observations, and deleted the same individuals I did at the first time. It seems to have worked, since I came to the same final number of observations (83,311). This time, I kept the ID variable and the variable I need to merge, as well as many others.

I agree with Sarah's advice. I would start with your original dataset, and keep all of the variables you need — those you had originally, those you later realized you needed to use to create your new categorical variable, and of course whatever variables are needed to uniquely identify each observation in your data — and thus forgo the need to merge the categorical variable you calculated separately.

With that, then, you need to rerun all the work you did on the original dataset and see that the results are the same. If you were systematic in your work, it just means rerunning the do-files you created to do that work. If you were not systematic, and you did everything using menus without keeping a copy of the commands that Stata actually ran (from which you could construct a do-file to rerun them), then you have a more substantial amount of work ahead of you.
Comment

Announcement

Merging datasets together

Comment

Comment

Comment

Comment