Merging of three different data sets using two id variables

Christer Flatoey

Join Date: Sep 2018

Posts: 4
#1

Merging of three different data sets using two id variables

05 Oct 2018, 07:56

Hi,

I'm trying to merge three different data sets, on the two key variables 'year' and 'country'. These are the two variables that all three data sets have in common. However, when executing the merge neither of the 'approved' functions, 1:1, M:1 or1:M, works. Stata responds that the variables year country do not uniquely identify observations in the using/master data.
I have checked for and removed duplicates.
So, I was wondering if someone could point me in the right direction?

Regards,
Christer A. Flatøy
Tags: cross sectional, large data set, merge, Time Series
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

05 Oct 2018, 08:28

Welcome to Statalist, Christer.

To begin with some general advice on making good use of Statalist, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

With that said, you don't tell us enough about where you are for us to know what direction you need to take to reach your destination. While the FAQ recommends providing data examples, they're often not sufficient in cases where there are non-unique identifiers, because the examples may not have duplicates.

So tell us more about your three datasets. Is it the case that each dataset is supposed to have a single observation for each country/year combination? Or do some of the datasets have multiple observations for each country/year combination? For those datasets, what additional variable allows you to tell the observations apart?

For example, you might have dataset1 with country/year/industry and dataset2 with country/year and dataset3 also with country/year/industry. Then something like

Code:

use dataset1 merge 1:1 country year industry using dataset3, generate(m1) merge m:1 country year using dataset2, generate(m2)

would start you in a useful direction. After the merges, you could use m1 and m2 to discover which datasets each observation appeared in.

My point is that just because the three datasets have only year and country in common does not necessarily mean that your merges are restricted to those two variables. But again, you have not told us enough to know if this misunderstanding is indeed the source of your problem.
Comment
Christer Flatoey

Join Date: Sep 2018

Posts: 4
#3

05 Oct 2018, 10:07

First of all, thank you very much for your reply Mr. Lisowski. Second, I agree completly about the lack of information, and I thought about immidiately after posting. And I'll definetly spend some time getting familiar with the best way of posting question etc. This forum seems like an excellent resource that can really help with different challenges in Stata and statistics.

Regarding the data sets. The largest is from the European Social Surveys (ESS), where I have appended 8 rounds of data for 19 European countries into one data set. I am trying to merge the ESS set with data from Global Terrorism Database (GTD), where GTD contains data for the same countries for the same period. After this merge, I will try to merge the new data set with a Quality of Government (QOG) data set that also has data from the same countries for the same period as the other two original data sets.

My aim is to use this new merged data set in a multi level regression analysis (which is another challenge I've started looking into). The dependent variables is from the ESS set, 'trust in [institution]'. The ESS set also contains different covariates that will be controlled for ('micro' variables), and so does the QOG set ('macro' variables). The GTD set contains the independent variable of interest, 'amount of terrorist incidents/country/year'. In total there's almost 290 000 observations, where about 285 000 comes from the ESS set.

So, your question regarding observations is a valid one. The data sets does not, and probably should not, have a single observation for each country/year combination. Both the ESS and GTD data sets contain multiple observations for each country/year combination, while the QOG has one observation for each country/year combination.

Both the ESS and the GTD sets, the sets that have multiple observations, also contain id variables: 'idno' and 'eventid'. These variables will, I asume, contribute so that the observations will be uniquely identified. However, and this question probably illustrates my lack of experience with merge, and Stata in general, will merge the correct way? I.e. will the correct observations from the one data set, e.g. ESS, be merged with the appropriate observations from the other set, e.g. GTD? Also, the ESS set has 4020 duplicates for idno/year/country, so if I drop these, i.e. 2010 copies, every observation will not be connected to and idno. I don't know if that matters for the merge process.

Regards,
Christer A. Flatøy

Ps. if this is not enough information regarding the data and the merge procedure, I will post the necessary information.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

05 Oct 2018, 11:20

Here are my thoughts.

Your "master data" is the ESS, for which you have a longitudinal dataset of individuals within years within countries, so country/year/idno seems to be the unique identifier in the ESS data. Your objective is to append, to each individual, data from QOG about governance and data from GTD about terrorism.

But first, the duplicates in ESS are a problem, although not an immediate problem. Nevertheless, at this point you should step aside from the merging and first figure out why you have duplicates and then deal with them appropriately. Perhaps you accidentally appended the same ESS dataset twice, or perhaps one of the ESS datasets has a problem. Look at the output of help duplicates for ideas on how to locate the duplicated values. Look at the documentation for ESS to see if duplicates are expected, and if so, what they indicate. If the observations are duplicated in their entirety - so that every variable has the same value in each of the duplicated observations - then dropping all but one of each duplicated observation will suffice. But if the observations with the same country/year/idno have different values for some variables - for example, one is a 23-year-old man and the other is a 32 year old woman - then you have to decide how you want to handle these, based on what the ESS documentation tells you.

Your QOG data has one observation per country/year combination. So - even without first cleaning up the duplicated ESS observations - the following should work.

Code:

use ESS merge m:1 country year using QOG, generate(m1)

That will augment your individual-level survey data from the ESS with country-level data from the QOG describing, for each individual, the governance of their country during the year of the observation.

The problem is your GTD data. It seems to have observations corresponding to terrorist incidents within years within countries, so that country/year/eventid is the unique identifier. You surely can't want to append each incident to each individual within the same country and year as the incident, because you tell us that the GTD variable of interest is 'amount of terrorist incidents/country/year'. That suggests that it has the same value for every incident within the same country/year combination, or else that it doesn't yet exist but you plan on creating such a variable from the data it does contain.

You need to first - before merging - reduce your GTD data to a single observation per country/year combination, creating a new dataset with unique country/year combinations and with the GTD variable of interest, which is at the country/year level. Once you have done this, the following extends the code above to merge it with the results of the first merge.

Code:

merge m:1 country year using GTDreduced, generate(m2)

This will augment your individual-level survey data from the ESS with country-level data from GTD describing, for each individual, the incidence of terrorist events in their country during the year of the observation.
1 like
Comment
Christer Flatoey

Join Date: Sep 2018

Posts: 4
#5

09 Oct 2018, 03:22

I cleaned the data sets some more and changed the design a bit, and with your guidance the merge was successful, with every observation matched. Thank you very much Mr. Lisowski!

Best regards,
Christer A. Flatøy
1 like
Comment

Announcement

Merging of three different data sets using two id variables

Comment

Comment

Comment

Comment