Joinby two variables

Alejandro Torres

Join Date: Jan 2018

Posts: 152
#1

Joinby two variables

27 Dec 2018, 13:55

Dear statalisters,

I am trying to merge two datasets and I have some problems. I started yesterday merging using

Code:

joinby firm

and everything was ok. Today I am trying to use

Code:

joinby country year

but I have a problem, I think I créate duplicate data. My master data has 1 million observation and a size about 1,3 GB and the second database is about 170,000 observation and a size of 10MB. The final database is about 20 GB and 20 millón observations.

Do you know why is that change in size and observations? I think there are some duplicates, how can I see if there are duplicates and what can I do if there are?

Thank you very much for your help.

Alejandro
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#2

27 Dec 2018, 14:23

-duplicates report- will tell you how many observations appear as duplicates and how many copies of them there are. If you want to identify them in your data set so you can do something with them, look -duplicates tag- will do that. Read -help duplicates- for the details of these commands and how to use them.

If there are duplicates in your -joinby- result they necessarily arise from duplicates in the original datasets. -joinby- will not create duplicates that were not there to start with. From what you describe, it sounds like for each observation in the master data, there are, on average, 20 observations in the second data set that agree with it on country and year. Does that seem unreasonably large? If it does, perhaps you meant to also restrict the pairings to those that also agree on some other variable(s). If so, the variable(s) in question must be included in the varlist of the -joinby- command.
Comment
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#3

27 Dec 2018, 14:26

Joinby forms all pairwise combinations within groups. So if the variable is not unique in one of your datasets you will create duplicates. Stata has a duplicates command you could use.

I personally would use a merge and verify beforehand that your data are unique.

Code:

bysort country year: assert _N == 1

If you want to check duplicates after the joinby you could try

Code:

bysort country year: gen var_count = _N br if var_count > 1
Comment
Alejandro Torres

Join Date: Jan 2018

Posts: 152
#4

27 Dec 2018, 14:55

Hello, thank you for your answers.

From what you describe, it sounds like for each observation in the master data, there are, on average, 20 observations in the second data set that agree with it on country and year. Does that seem unreasonably large? If it does, perhaps you meant to also restrict the pairings to those that also agree on some other variable(s)

That is correct, I am trying to merge political parties influencing firms decisions, and I have observation lets say for France 1995 to 2015. But for every observation of firms I have a year and country of observation, lets say firm Renault country france year 2000 income 1,000, then I want a merge a name of a political party for france and year 2000. The second database only have a country, year and a name of the political party.

Thats why I dont understand the reasons for having more observation after the merge.I did the same after a concatenate and I have the same results.

At the other side I tested

Code:

bysort country year: assert _N == 1

and received a message saying
1,463 contradictions in 1,463 by-groups

Hope to be clear.

Thank you again,

Alejandro
Comment
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#5

27 Dec 2018, 15:25

From what you've described, you should have a unique list in your "second database" with the country and year. I would try doing a merge.

Something along the lines of

Code:

use "Dataset1.dta" , clear merge m:1 country year using "Dataset2.dta" tab _merge

Also you should try the code below in your using dataset ("second database")

Code:

bysort country year: assert _N == 1
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#6

27 Dec 2018, 15:37

So, focusing on Renault, in France, year 2000, how many political parties were there in France in year 2000. I would be willing to bet it's at least 3 and if minor parties are included in your data set there could well be many more. For present-day France, Wikipedia lists 3 major national parties and 20 minor national parties. So your observation for Renault, in France, year 2000 would lead to somewhere between 3 and 23 observations in the resulting data set after the -joinby-.

Justin Blasongame and I seem to have different impressions regarding the nature of your data. From my reading of your descriptions, you would not have unique identification of observations by country and year in either of your data sets. But he is certainly correct that if country and year do identify unique observations in one of your data sets, you should use -merge- rather than -joinby-.
Comment
Alejandro Torres

Join Date: Jan 2018

Posts: 152
#7

28 Dec 2018, 06:27

Thank you very much !!!! Justin and Clyde for your advice !!! it Works !! I used the merge that Justin said, since I have only one party on dataset 2, the party with the largest presence in the government.
I would like to ask you just one more question. When I do the merge, I add to the dataset1 countries from dataset2 that I dont have in dataset1, are there any option to merge that allow to merge only if I have a match between dataset1 and dataset2 not adding the extra countries?
Thank you very much, you saved me a lot of time.
Alejandro
Comment

Justin Niakamal

Join Date: Aug 2017
Posts: 760

28 Dec 2018, 08:35

Yes, you could do the following:

Code:

merge m:1 country year using "Dataset2.dta"
keep if _merge == 3
drop _merge

Or in one line:

Code:

merge m:1 country year using "Dataset2.dta" ,  keep(match) nogenerate

Announcement

Joinby two variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment