Joining when the join variable has a different name in two datasets

Michal Dzielinski

Join Date: Mar 2020

Posts: 2
#1

Joining when the join variable has a different name in two datasets

12 Mar 2020, 05:17

I get frustrated time and again by the fact that join variables need to have identical names in both datasets for Stata to perform the join, so finally I decided to ask.

In SQL, names don't matter as long as I specify, which variables to match on e.g. "SELECT * FROM a, b WHERE a.v1 = b.v2"
Why can't it work in similar fashion in Stata? I'd appreciate any explanations that could soothe my pain over this issue.
Tags: data, joining

1 like
Maarten Buis

Join Date: Mar 2014

Posts: 3449
#2

12 Mar 2020, 06:21

It is a design choice: The variables in the two datasets are supposed to be the same, so you emphasize that by forcing equality of names. You almost always have to open and prepare both datasets before merging anyhow. So changing names is not a big deal, and often improves the readability of your .do files. Another issue is that separating the data preparation part before the merging from the actual merging is in my view, the right choice. Putting too many steps in one program often leads to confusing syntax. Other than that, I see no reason why there could not be an option that identifies the names of the id variables in the using and master dataset. You could do this yourself: write a wrapper for merge with such an option. I will not do that, as I am afraid that such an option would encourage sloppy data preparation and thus more errors.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Comment
Michal Dzielinski

Join Date: Mar 2020

Posts: 2
#3

13 Mar 2020, 03:01

Thank you, Maarten - now at least I know that is's literally force majeure :-)
The specific yet quite representative example that got me frustrated this time was the following:

Table A contains annual data about companies, including some information about their CEOs and CFOs, so it includes the variables CEOid and CFOid. Those ids are originally from Table B, which contains annual data about all kinds of corporate executives, so it just has one variable: Execid

Now, I wanted to add some information from B to A and instead of just doing one join on A.CEOid = B.Execid and another on A.CFOid = B.Execid, I had to rename Execid, do the join, and repeat. Not sure that made the whole process less confusing... (yes, I could reshape Table B but it's originally from an external source and periodically updated, so it wouldn't be a one-time operation either...)

Anyway, thanks again for explaining - I feel better now :-)
1 like
Comment
Chris Penfold

Join Date: Jun 2018

Posts: 3
#4

20 Oct 2022, 07:07

I can understand Stata not wanting to encourage sloppy practice, but merging datasets on variables with different names is done routinely in other software.
Also, Stata still allows this merge:

Code:

merge m:m

...!

This is allowed for completeness, but it is difficult to imagine an example of when it would be useful.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#5

20 Oct 2022, 09:06

I can understand Stata not wanting to encourage sloppy practice, but merging datasets on variables with different names is done routinely in other software.

Also, the -frlink- command, linking frames, and highly analogous to -merge-ing data sets allows different variable names. And while that practice can be sloppy, there are times when it is actually quite clarifying to explicitly show in the code that you are linking by matching two different variables, as, for example, when joining two data sets where the link is a parent's person ID in one data set and a child's parent ID in the other. Changing the names of the id variables (or cloning them under a common name) obscures the logic in a situation like this.

As for -merge m:m-, I have railed against its existence often on this forum and don't have the energy to do so yet again today. I have repeatedly asked for its elimination in the wishlist threads. I imagine that they feel the need to retain it for backward compatibility with existing code--yet I am willing to wager that the overwhelming majority of its uses in existing code are, in fact, errors and it would be a good step forward if those code files were broken by eliminating -merge m:m-. It would be, at worst, an inconvenience to the handful of people who have legacy doe in which -merge m:m- is actually correct (if there are any at all!) to have to revise their code.
1 like
Comment

Announcement

Joining when the join variable has a different name in two datasets

Comment

Comment

Comment

Comment