Code for running merge on multiple datasetss seperated by year.

Mads Lillevang-Johansen

Join Date: Sep 2015

Posts: 22
#1

Code for running merge on multiple datasetss seperated by year.

22 Sep 2015, 06:00

Hello Statalist

I have 2 lists of datasets containing a personal identification number (not unique) and a record number (unique), among other variables differing for each of the lists. All the sets in the 2 lists are seperated by year, because the record number shifts each year. Therefore, I have to merge by record number for each year. Thus my code will look something like this:

use lpr_1993
sort recnum
merge m:m recnum using pop_1993

I was wondering if there was a form of "loop"-code that would enable me to run this command for each year? It's merely a question of saving space in the dofile, and time.
I have been looking into using the "foreach" command, but I'm quite new to this, so I'm not sure how it will work, or even if it is applicable in this scenario. I hope you can help me.

Mads
Tags: loop
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

22 Sep 2015, 19:49

Welcome to Statalist!

I (and perhaps others) have hesitated to address your question because I don't entirely understand what you are trying to do.

I think the first step for you is to actually do two or three of the merges without a loop and consider the results and be certain they are what you want. Then show us the sequence of commands you used, and we can advise you on how to use foreach or forvalues to help.

But as you are trying the merges, you should carefully review the documentation on the merge command found in the Stata Data Management Reference Manual PDF included in your Stata installation and accessible through the PDF Documentation entry in the Help menu. You suggest you will use an m:m merge, but the documentation includes the following warning.

merge m:m varlist . . . specifies a many-to-many match merge. This is allowed for completeness, but it is difficult to imagine an example of when it would be useful. ... Use of merge m:m is not encouraged.

This suggests it is likely that you want some other option, most likely 1:1.

Let me add two other pieces of advice.

When I began using Stata in a serious way last fall, I started by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation and are accessible from within Stata.

Also, to increase the likelihood that Statalist readers will be able to assist you, please review the Statalist FAQ linked to from the top of the page, especially sections 9-12 on how to best pose your question. The more you help others understand your problem, the more likely others are to be able to assist you.
Comment
Mads Lillevang-Johansen

Join Date: Sep 2015

Posts: 22
#3

23 Sep 2015, 03:17

Thank you very much for your reply, I'm sorry if I made it hard to understand. I don't think I did a very good job in explaining which problem I am facing, and I shall try again. I made a typo, neither the record number variable nor personal identification number variable are unique. What I meant to say was that the record numbers are unique to each personal identification number, but there are other variables (such as diagnoses) linked to the same record number, thus making for repeats, as well as there are several record numbers for each personal identification number. I have enclosed an image to make this more understandable, if I have failed to clarify. My one dataset holds personal identification numbers and record numbers, while the one i want to merge with only holds record numbers and necessary variables not enclosed in the first set. Because the record number varies per year, I have to merge the datasets for each year separately. Thus I end up with the following command:

use lpr_pop_1994
sort recnum
merge m:m using lpr_diag_1994'
sort pnr

pnr being the personal identification number. Using this approach i get no unmatched observation, this result is the same when I run 1:m. 1:1 merge does not work, seeing that recnum is not unique, again I apologize for not making that clear.

Returning to the problem at hand, I have the same datasets containing the same variables for each year, and my question was if it would be possible to run a command that will merge the corresponding datasets for each year forward e.g. lpr_pop_1995 with lpr_diag_1995 etc., seeing that the variables are the same for each year.

I hope this clarifies things a bit.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

23 Sep 2015, 08:23

I think that William was too deferential to StataCorp's hedging on m:m merge. I can absolutely say that you cannot possibly need to run a m:m merge because the results are non-sensical.

You appear to be saying that 1:m merge works in which case that's what you should use. This formulation will check that recnum uniquely identifies observations in the master. In order to merge properly, the key variables must uniquely identify observations in either the master or the using or both. A m:m merge simply disables the checks that verify this requirement and jumbles the observations together in a random way.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

23 Sep 2015, 10:34

The following code will, unless I have made a mistake or misunderstood what you want, use your code to merge each pair of lpr_pop and lpr_diag files creating a new lpr_merged file.

Code:

forvalues year = 1993/1997 { use lpr_pop_`year', clear sort recnum merge 1:m recnum using lpr_diag_`year' sort pnr save lpr_merged_`year', replace }

Again, adding to Robert's comment and my earlier comment, when the Stata documentation itself discourages the use of merge m:m you should take it seriously. Let me quote below the extended description from the reference manual, and note that since your description and picture in post #3 are consistent with a 1:m merge, I have made that change to your code, along with adding the name of the variable being used to match the two datasets.

m:m merges

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
2 likes
Comment
Mads Lillevang-Johansen

Join Date: Sep 2015

Posts: 22
#6

25 Sep 2015, 04:01

Thank you very much, this was exactly what I was looking for.
Comment
Tereza Lokajickova

Join Date: Dec 2016

Posts: 2
#7

01 Dec 2016, 05:25

Let me also thank you, William Lisowski.
Comment

Announcement

Code for running merge on multiple datasetss seperated by year.

Comment

Comment

Comment

Comment

Comment

Comment