How does David Dorn endorse a m:m merge?

Evan Kim

Join Date: Sep 2021

Posts: 2
#1

24 Sep 2021, 22:26

Hi everyone,

I'm a first-time poster on this message board. I have a question about David Dorn's commuting zone crosswalk files posted on his website, specifically files E5 and E6.
https://ddorn.net/data.htm

Dorn explicitly calls for a m:m merge with the ACS data on his website, but a m:m merge does not correctly convert the Public Use Microdata Areas (PUMAs) to commuting zones. Could someone explain why Dorn endorses this method?
Andrew Musau

Join Date: Oct 2014

Posts: 10262
#2

25 Sep 2021, 04:50

It is unlikely that anyone will invest the time to go through those files and figure out what is going on, unless they directly work with those data. The use of a many-many merge almost always results from a misunderstanding of what the command does. The user notices that he/she has duplicates in both datasets and assumes that a many-to-many merge is the way to combine the datasets. As Clyde Schechter puts it, the result is most often a "data salad". You can try contacting the author for an explanation. It is possible that this seemingly small coding issue may have a profound impact on the subsequent analysis.
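
To make the "data salad" concrete, here is a minimal Python sketch that imitates how a merge m:m-style pairing behaves within a key group: rows are matched by position, and the last row of the shorter group is reused. This is an illustration only, not Stata code. The variable names (puma, person, cz) and the toy data are hypothetical, and unmatched keys are dropped for brevity.

```python
from collections import defaultdict

def merge_mm_style(master, using, key):
    """Pair rows by position within each key group.

    When one group is shorter, its last row is reused. This is a
    simplified imitation of merge m:m, restricted to matched keys.
    """
    groups_m, groups_u = defaultdict(list), defaultdict(list)
    for row in master:
        groups_m[row[key]].append(row)
    for row in using:
        groups_u[row[key]].append(row)

    out = []
    for k in sorted(groups_m.keys() & groups_u.keys()):
        m_rows, u_rows = groups_m[k], groups_u[k]
        for i in range(max(len(m_rows), len(u_rows))):
            m = m_rows[min(i, len(m_rows) - 1)]  # reuse last row if exhausted
            u = u_rows[min(i, len(u_rows) - 1)]
            out.append({**m, **u})
    return out

# Hypothetical toy data: three people in one PUMA that overlaps two CZs.
people = [
    {"puma": 1, "person": "A"},
    {"puma": 1, "person": "B"},
    {"puma": 1, "person": "C"},
]
crosswalk = [
    {"puma": 1, "cz": 100},
    {"puma": 1, "cz": 200},
]

salad = merge_mm_style(people, crosswalk, "puma")
for row in salad:
    print(row)
```

In this sketch each person ends up attached to a single commuting zone chosen only by sort order, so the result changes if either dataset is sorted differently.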
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

25 Sep 2021, 13:51

I have a different interpretation of David Dorn's recommendation.

On the data page linked to from post #1, it is not the case that "Dorn explicitly calls for a m:m merge ...", a phrasing that strongly suggests Stata syntax.

What he writes is "... one has to merge the geographic unit of the Census file to the corresponding CZ crosswalk file using a many-to-many merge." That is a more generic phrasing that references the process of merging rather than a specific merge command. Searching the web finds examples of "many-to-many merge" used to describe the goal of, for example, the SQL JOIN operation on which Stata's joinby command is based.
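
For readers unfamiliar with the generic sense of "many-to-many merge", the following Python sketch shows the relational-join behavior that joinby is meant to produce: every master row is combined with every using row that shares its key. It is an illustration under assumed names (puma, person, cz are hypothetical), not Stata code, and like joinby's default it keeps only matched keys.

```python
from collections import defaultdict

def joinby_style(master, using, key):
    """Form all pairwise combinations of rows that share the same key."""
    using_by_key = defaultdict(list)
    for row in using:
        using_by_key[row[key]].append(row)
    # Each master row is repeated once per matching using row.
    return [{**m, **u} for m in master for u in using_by_key.get(m[key], [])]

# Hypothetical toy data: three people in one PUMA that overlaps two CZs.
people = [
    {"puma": 1, "person": "A"},
    {"puma": 1, "person": "B"},
    {"puma": 1, "person": "C"},
]
crosswalk = [
    {"puma": 1, "cz": 100},
    {"puma": 1, "cz": 200},
]

joined = joinby_style(people, crosswalk, "puma")
for row in joined:
    print(row)
```

Here each of the three people appears once for each of the two commuting zones, giving six rows regardless of how either input is sorted.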

I was able to access online a copy of Dorn's The Growth of Low-Skill Service Jobs and the Polarization ... which he cites as the reference for most of his E-series tables. Nowhere in it does he suggest that his work was done in Stata.

It is unfortunate that the term describing the general process of "merging" two datasets where the key is not a unique identifier in either dataset was used, first by SAS and then by Stata, to describe the results of a not-very-helpful, and easily misunderstood, implementation of their respective "merge" commands.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

25 Sep 2021, 15:01

I am now less confident of my analysis in post #3.

I overlooked the paper-specific archives of Stata files, etc. on the data page linked to from post #1.

I downloaded and searched the files for the paper cited in post #3 (Dorn's P2).

I find the following version of the advice on the data page.

To allocate observations from Census microdata to Commuting Zones, one has to merge the geographic unit of the Census file to the corresponding Commuting Zone crosswalk using a many-to-many merge (command joinby in older versions of Stata).

This seems to recognize the appropriateness of joinby.

I also find in the code several instances of the "old" merge syntax, which does not include the 1:1, 1:m, m:1, or m:m specifier.

That syntax predates my use of Stata, but it continues to function in recent versions of merge that support 1:1 and the other specifiers, and the results of my experimentation suggest it works as though it were merge m:m.

I am unable to understand the intent of the parenthetical "command joinby in older versions of Stata".

I hazard a guess that perhaps a very old merge syntax did not originally support any sort of many-to-many merge but at some point was "enhanced" to do so, and perhaps Dorn misunderstood joinby to have been folded into merge (would that Stata had done so!) rather than that merge had become compromised in the same way as the SAS data step merge. That misunderstanding would have been an easy one to fall into, given the common use of "many-to-many" to refer to the "joinby" outcome.

Last edited by William Lisowski; 25 Sep 2021, 15:04.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

26 Sep 2021, 05:56

Closing this loop, Professor Dorn has told me

The right command to use with my crosswalks is joinby (as everyone who carefully looks through the files on my webpage can easily verify).
The text on my webpage pre-dates the time when Stata introduced the confusing merge m:m command.

and he indicated he would review his web page to reduce the likelihood of the confusion we had.

So I'd say the bulk of what I wrote in post #3 applies: he meant many-to-many as a description of the process rather than as Stata's narrower terminology for its merge m:m command. I'll admit that I did not invest the time to go through those files and figure out what is going on: I was searching for merge m:m and was distracted by finding the older merge syntax.

I will add the following in SAS's defense. Prior to the introduction of SAS in the 1970s, statistical packages generally did not have built-in capabilities for sorting and merging data, and they processed their input datasets one observation at a time. Merging was accomplished by an initial program in the statistical package that prepared output datasets that were then processed by a separate sort/merge utility program provided by the mainframe operating system, and the results would then be read by the statistical package as input in a subsequent program. At that time, relational database concepts exemplified by joinby and merge 1:m had not reached widespread familiarity, and the way merging worked was designed for a process that had access to just the most recently read observation and the prior observation from each of the input datasets. The "merge m:m" style of result in the face of multiple observations with the same merge key in each dataset was, I suspect, rarely the expected or desired output, but just how the merging algorithm worked. And SAS duplicated this, because that was what 1970s users expected of a merge.
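
The one-observation-at-a-time process described above can be sketched as a single forward pass over two key-sorted inputs. The Python below is a hypothetical illustration of that style of algorithm, not the actual internals of SAS, Stata, or any mainframe utility. It assumes both inputs are already sorted by key, peeks only one record ahead on each side, and skips unmatched keys to stay short.

```python
def streaming_match_merge(left, right):
    """Merge two key-sorted lists of (key, value) pairs in one forward pass.

    Only the current record on each side (plus a one-record lookahead)
    is consulted, so no group is ever held in memory as a whole.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # key absent from right: skipped in this sketch
        elif rkey < lkey:
            j += 1  # key absent from left: skipped in this sketch
        else:
            out.append((lkey, left[i][1], right[j][1]))
            left_has_more = i + 1 < len(left) and left[i + 1][0] == lkey
            right_has_more = j + 1 < len(right) and right[j + 1][0] == lkey
            # Advance each side that still has records for this key; a side
            # that has run out keeps contributing its last record.
            if left_has_more:
                i += 1
            if right_has_more:
                j += 1
            if not (left_has_more or right_has_more):
                i += 1
                j += 1
    return out

left = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]
right = [(1, "x"), (1, "y"), (2, "z"), (2, "w")]
merged = streaming_match_merge(left, right)
for row in merged:
    print(row)
```

With duplicate keys on both sides, the positional pairing falls out of the pass itself: nothing in the loop ever revisits an earlier record, so forming all combinations within a key is simply not available to it.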

50 years later, it seems time to put what was never a good idea to rest and retire merge m:m.
