Running the same .do-file multiple times results each time in a different amount of observations after merging 3 datasets

LydiaSmit

Join Date: Jul 2014

Posts: 77
#1

Running the same .do-file multiple times results each time in a different amount of observations after merging 3 datasets

09 Jul 2014, 23:02

Dear Statalists,

Running the same .do-file multiple times, without making any adjustments by hand between those runs, results each time in a different amount of observations.
After merging three datasets, the outcome of total obs. keeps changing.

Below is an example of what I do after the 1st merge with the masterfile called Example.dta
Maybe this is the reason of the strange problem.

Save Example1.dta, replace
use Likewise.dta, clear
drop useless variables
rename some variables
sort key variables
save Likewise1.dta, replace
use Example1.dta, clear
sort key variables

merge based on the key variables (of both datasets) using Likewise1.dta

4 'endresults': 29,551 obs; 29532; 29563; 29588

Last edited by LydiaSmit; 09 Jul 2014, 23:07.
Tags: None
Andrew Lover

Join Date: Apr 2014

Posts: 182
#2

10 Jul 2014, 01:22

This is likely due to the -sort- command. Adding the -stable- option is likely to fix it.

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#3

10 Jul 2014, 01:31

I alos suspect -sort- is the problem. But I don't think the -stable- option will solve it. Instead I think the -stable- option will just hide the problem with the identifying variables. So I would strongly recommend not to use the -stable- option. Instead she should take a close look at the identifying variables and find the real problem. She can find useful material on how to diagnose possible problems here:
http://blog.stata.com/2011/04/18/mer...rges-gone-bad/
http://blog.stata.com/2011/05/27/mer...le-key-merges/

Appart from that Lydia is using old syntax for -merge-. She can avoid -sort-ing altogether by using the new syntax.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Andrew Lover

Join Date: Apr 2014

Posts: 182
#4

10 Jul 2014, 03:29

I'll certainly defer to Maarten's expertise on this!

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4987
#5

10 Jul 2014, 04:54

I also recommended -sort, stable- in this thread. It turned out to be bad advice, because the user did have a problem that stable was hiding rather than solving. I won't give that advice again.

http://www.statalist.org/forums/foru...-ttest-command

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
LydiaSmit

Join Date: Jul 2014

Posts: 77
#6

10 Jul 2014, 04:59

Thank you for trying to help me.

Maarten, I already checked those exact two sites before I decided to merge on multiple key variables for my first merge. However, I didn't understand an important part.
by id: assert _N==1 The writer, William Gould, used that command probably to check the uniqueness as he described before, however, when I used that command a few days ago. Nothing happened.

use one, clear . merge 1:1 id sex using two . sort id . by id: assert _N==1 . keep if _merge==3
By the way, I also checked the following two sites for my merges (about a week ago), which Konrad Zdeb advised me today. "Merging in Stata is neatly explained here and here."
http://www.statalist.org/forums/foru...t-consequences

I used http://www.ats.ucla.edu/stat/stata/faq/multmerge.htm to do my merges but I combined that method with adding an extra key variable for 2 of my 3 merges. (So using 2 key variables for those two merges as explained in the links you, Maarten, gave me.

These three merges, of which the first one is highly likely to be succesful, are the first merges I ever made in Stata. (Never learnt anything about merges at college, work etcetera.)
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4987
#7

10 Jul 2014, 05:24

I haven't been following this carefully, but you may want to check out the -isid- command. isid checks whether the specified variables uniquely identify the observations. So, you would do something like

isid var1 var2

If there is a problem, you will get the message

Code:

. isid var1 var2 variables var1 var2 do not uniquely identify the observations r(459);

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35697
#8

10 Jul 2014, 05:24

On one detail: assert says nothing if the statement is true. Try some tautology:

Code:

assert 42 == 42

You will get no output. The flavour of assert is: something should be true, so check if it is. If not, stop me with the news; otherwise let me carry on.

Or more succinctly, no news is good news.

For more, with examples, see William Gould at http://www.stata-journal.com/sjpdf.h...iclenum=dm0003
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#9

10 Jul 2014, 11:35

If the number of merged observations changes at each run of the do-file, then the datasets used for the merge have changed so you must look upstream for what has happened. Since you are saving each dataset before performing the merge, you can use the cf command to quickly check for changes between do-file runs. For example, before saving a new version, you would include in your do-file

Code:

cf _all using Example1.dta, all save Example1.dta, replace

If you do this for all new versions of your datasets, you should quickly be able to identify which dataset has changed.

As to why the dataset has changed, as others have pointed out, this is usually because of an incomplete sort. Look back at all the sort commands in the do-file and replace them with isid varlist, sort commands. For example, replace

Code:

sort country id

with

Code:

isid country id, sort
Comment
LydiaSmit

Join Date: Jul 2014

Posts: 77
#10

10 Jul 2014, 13:07

Thank you for the replies. isid and especially assert seem to be very useful.

Could a m:m merge lead to this problem of mine? (different endresult of obs while running the same .do-file over and over).

Robert, your reply seems to be the solution for my problem. Thank you very much. I'll check it after I learnt more about the joinby-command which you advised me in my other topic.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#11

10 Jul 2014, 13:52

Sure, another problem with m:m merge is that within groups of the key variable used to perform the merge, the order of observations is not taken into account. After any merge, you must perform a sort to get the observations back in order. If you follow a regular merge with a m:m merge, the second merge will sort your data only based on the key variable. That means that at each run, the order of observations within each group will change. Try running the following example many times, the results should change at each run. Yet another reason why you should never use m:m merge!

Code:

clear input str1 id x A 1 A 2 A 3 B 2 end tempfile fx qui save "`fx'" list, sepby(id) noobs clear input str1 id y A 11 A 12 A 13 A 14 B 12 B 13 B 14 end tempfile fy qui save "`fy'" list, sepby(id) noobs use "`fx'", clear bysort id (x): keep if _n == 1 list, sepby(id) noobs merge m:m id using "`fy'", keep(master match) nogen list, sepby(id) noobs merge m:m id using "`fy'", keep(master match) nogen list, sepby(id) noobs
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#12

10 Jul 2014, 16:41

Sorry but the last merge in my previous example did not do anything as I forgot to rename variables. This is what I was trying to illustrate:

Code:

clear
input str1 id x
A 1
B 2
end
tempfile fx
qui save "`fx'"

clear
input str1 id y
A 11
A 12
A 13
A 14
B 12
B 13
B 14
end
tempfile fy
qui save "`fy'"

use "`fx'", clear
merge 1:m id using "`fy'", keep(master match) nogen
list, sepby(id) noobs

rename y y0
merge m:m id using "`fy'", keep(master match) nogen
list, sepby(id) noobs

Comment

LydiaSmit

Join Date: Jul 2014

Posts: 77
#13

10 Jul 2014, 22:04

Thank you for the great help. By the way, your command "isid country id, sort" gave me the following error: variables myvarname1 myvarname 2 should never be missing. r(459).

Based on what you said earlier. "Appart from that Lydia is using old syntax for -merge-. She can avoid -sort-ing altogether by using the new syntax." I only temporarily didn't specify the merge type for my 2nd merge because it seemed to result in a perfect match without chosing a merge type. About 9 of the 10 sites I checked with the new syntax method use the sort command like I did. For example, Indiana University https://kb.iu.edu/d/azck

Could someone tell me why they still use the the risky, unnecessary sort-command?

By the way, even without using the sort-command at all, I'm still getting a different total amount of obs after each run of the same .do-file
cf _all helped me finding the origin of this problem, however, I haven't solved it yet because the cause isn't clear yet.

Last edited by LydiaSmit; 10 Jul 2014, 22:46.
Comment
LydiaSmit

Join Date: Jul 2014

Posts: 77
#14

11 Jul 2014, 00:18

Below is an example of what I do after the 1st merge with the masterfile called Example.dta

Save Example1.dta, replace (so after I merged the masterfile Example.dta with an other dataset I save the new merged file under Example1)
use Likewise.dta, clear

Better example...... I start with the masterfile A.dta then I open B.dta, drop and rename soms vars and save & rename B.dta to B1.dta after that I merge A.dta with B1.dta ....after that, I save & rename the merged file to A1.dta....then I use C.dta, drop and rename some vars and then save & rename C.dta to C1.dta,,,,,,then I use joinby key variable (not unique) using C1.dta

Like Robert suggested, I used the following:

cf _all using A1.dta, all save Example1.dta, replace Most of the time only 8 vars contain mismatches according to the above command cf _all.....6 of those 8 always contain mismatches...1 of those 6 is 1 of the 2 key variables used in my 1st merge. However, sometimes 22 vars contain mismatches.

Hopefully someone can help me further. I haven't slept at all tonight trying to solve this. I didn't even lay down. I'm even thinking about giving a very small cash reward for the person who can solve this annoying problem
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4987
#15

11 Jul 2014, 06:42

By the way, your command "isid country id, sort" gave me the following error: variables myvarname1 myvarname 2 should never be missing. r(459).

That means you have missing data on the variables you are trying to sort on. You can add the -missok- option to -isid- . But is there a way to clean up the missing data?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment

Announcement

Running the same .do-file multiple times results each time in a different amount of observations after merging 3 datasets

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment