When merging datasets I end up with duplicates in terms of my merging variable

Adam Sadi

Join Date: Jul 2022
Posts: 68

When merging datasets I end up with duplicates in terms of my merging variable

05 Sep 2022, 03:56

Dear Statalisters,

I hope I will be able to explain the problem clearly because I believe my problem is due to a small coding mistake that I can't find. Basically, I have a list of countries who each have two to five related datasets (the number varies). I would like to :

1) merge all the datasets related to a country
2) append all the merged country-specific dataset

I am stuck at the first step of my problem. Given the fact that I have a lot of countries, I created a loop to merge every country-specific dataset together. The merge goes fine for the majority of my countries, however, for three countries, namely Chad, Italy and North Macedonia, my unique identifier I used to merge every country-specific file together becomes doubled due to a bad merging.

Code:

global countries1 Albania Armenia Azerbaijan Belarus Bosnia_and_Herzegovina Bulgaria Chad Croatia Cyprus Czech_Republic El_Salvador Estonia Georgia Greece Guatemala Guinea Honduras Hungary Italy Jordan Kazakhstan Latvia Lebanon Lithuania Malta Moldova Mongolia Montenegro Morocco Mozambique Nicaragua Niger North_Macedonia Poland Portugal Romania Russian_Federation Serbia Slovak_Republic Slovenia Somalia South_Africa Togo Zambia Zimbabwe

global countries2 Armenia Bulgaria Croatia Cyprus Czech_Republic El_Salvador Estonia Georgia Greece Guatemala Honduras Hungary Italy Jordan Latvia Lebanon Lithuania Malta Moldova Mongolia Morocco Mozambique Nicaragua North_Macedonia Poland Portugal Romania Russian_Federation Slovak_Republic Slovenia Somalia Zambia Zimbabwe

global countries3 Armenia Bulgaria Croatia Cyprus Czech_Republic Estonia Georgia Greece Hungary Italy Jordan Latvia Lebanon Lithuania Malta Moldova Morocco North_Macedonia Poland Portugal Romania Slovak_Republic Slovenia Somalia Zambia

global countries4 Jordan


forvalues x = 1/4 {
    foreach country of global countries`x' {
        local round_file: dir "$inp_bm_cf/Round_`x'/`country'" files "*.dta"
        
        foreach rf of local round_file {
            tempfile `country'_R`x'_ed /* Ex: Albania_R1_ed */
            use "$inp_bm_cf/Round_`x'/`country'/`rf'", clear
            
            gen round_r`x' = `x'
            
            * Save data in a tempfile
            save ``country'_R`x'_ed', replace
            
        }
    }
}

foreach country of global countries1 {
    local master_file: dir "$inp_bm_bs/`country'/" files "*.dta"
    display `master_file'
    
    foreach mf of local master_file {
        use "$inp_bm_bs/`country'/`mf'", clear
    }
    

    forvalues x = 1/4 {
        capture noisily: merge 1:1 idstd using ``country'_R`x'_ed', generate(merge_round`x')
        tempfile `country'_AR /* Ex : Albania_AR (AR = Allrounds) */
        save ``country'_AR'
    }
}

foreach country of global countries1 {
    append using ``country'_AR', force
}

I can answer questions if the code is unclear, but basically, at the end of the appending process, Chad, Italy and Macedonia have their idstd variable doubled, which is annoying for the rest of my do-file. Have a look at my data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str28 countryx long idstd byte(merge_round1 i1 k3a)
"Chad" 624601 3 2  -9
"Chad" 624601 2 .   .
"Chad" 624602 2 .   .
"Chad" 624602 3 2  90
"Chad" 624603 3 2 100
"Chad" 624603 2 .   .
"Chad" 624604 2 .   .
"Chad" 624604 3 2 100
"Chad" 624605 2 .   .
"Chad" 624605 3 2 100
"Chad" 624606 3 2 100
"Chad" 624606 2 .   .
"Chad" 624607 3 1 100
"Chad" 624607 2 .   .
"Chad" 624608 3 1 100
"Chad" 624608 2 .   .
"Chad" 624609 2 .   .
"Chad" 624609 3 1  50
"Chad" 624610 2 .   .
"Chad" 624610 3 1  30
"Chad" 624611 2 .   .
"Chad" 624611 3 1  60
"Chad" 624612 2 .   .
"Chad" 624612 3 2  50
"Chad" 624613 3 2  20
end
label values idstd IDSTD
label values merge_round1 _merge
label def _merge 2 "Using only (2)", modify
label def _merge 3 "Matched (3)", modify
label values i1 I1
label def I1 1 "Yes", modify
label def I1 2 "No", modify
label values k3a K3A
label def K3A -9 "Don't know (spontaneous)", modify

As you can see, the doubled idstd show missing variables for the rest of the merged variable (i1 k3a). Now of course I could just run a duplicates drop command, but this seems like an unsatisfying solution as I would like to understand why my code didn't work in the first place, and why it didn't work for these 3 countries only (Italy, North Macedonia and Chad).

Thank you for your help!

EDIT : Before anyone asks, I carefully checked that idstd is a unique identifier in both Chad files with the command isid. So I believe the merge 1:1 is appropriate.
EDIT2 : It is definitely a problem of code and not of data as I did the merge just for Chad in a separate do-file and it went well.

Last edited by Adam Sadi; 05 Sep 2022, 04:54.

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

05 Sep 2022, 08:40

I note you use

Code:

capture noisily

on your merge commands but you code doesn't do anything to test if the command failed. What is that all about?

Are you sure the merge ran correctly for Chad, Italy, and North Macedonia when run in your do-file? Running the merge by hand doesn't count.

I would also review the directories

Code:

$inp_bm_cf/Round_1/Chad/ $inp_bm_cf/Round_2/Chad/ $inp_bm_cf/Round_3/Chad/ $inp_bm_cf/Round_4/Chad/ $inp_bm_bs/Chad/

to be sure there aren't unexpected datasets in any of them.

The fact that it happens for only three countries points strongly to data problems for those three countries.
Comment

Adam Sadi

Join Date: Jul 2022
Posts: 68

05 Sep 2022, 09:11

William :

Thank you for your reply. The different countries I'm studying each have a varying number of rounds, with one dataset to merge per round. Therefore, the capture noisily prefix is here to make sure that the code won't stop if a country doesn't have the number of rounds required (up to 4 in addition to a common baseline survey). For instance, Chad only has one round available in addition to the baseline survey, so the loop won't stop for x equal to 2, 3 and 4. There seems to be nothing strange in the directories, but I will double-check.

I could solve the problem by slightly changing the following lines:

Code:

global countries1 Albania Armenia Azerbaijan Belarus Bosnia_and_Herzegovina Bulgaria Chad Croatia Cyprus Czech_Republic El_Salvador Estonia Georgia Greece Guatemala Guinea Honduras Hungary Italy Jordan Kazakhstan Latvia Lebanon Lithuania Malta Moldova Mongolia Montenegro Morocco Mozambique Nicaragua Niger North_Macedonia Poland Portugal Romania Russian_Federation Serbia Slovak_Republic Slovenia Somalia South_Africa Togo Zambia Zimbabwe

global countries2 Armenia Bulgaria Croatia Cyprus Czech_Republic El_Salvador Estonia Georgia Greece Guatemala Honduras Hungary Italy Jordan Latvia Lebanon Lithuania Malta Moldova Mongolia Morocco Mozambique Nicaragua North_Macedonia Poland Portugal Romania Russian_Federation Slovak_Republic Slovenia Somalia Zambia Zimbabwe

global countries3 Armenia Bulgaria Croatia Cyprus Czech_Republic Estonia Georgia Greece Hungary Italy Jordan Latvia Lebanon Lithuania Malta Moldova Morocco North_Macedonia Poland Portugal Romania Slovak_Republic Slovenia Somalia Zambia

global countries4 Jordan


forvalues x = 1/4 {
    foreach country of global countries`x' {
        local round_file: dir "$inp_bm_cf/Round_`x'/`country'" files "*.dta"
        
        foreach rf of local round_file {
            tempfile `country'_R`x'_ed /* Ex: Albania_R1_ed */
            use "$inp_bm_cf/Round_`x'/`country'/`rf'", clear
            
            gen round_r`x' = `x'
            
            * Save data in a tempfile
            save ``country'_R`x'_ed', replace
            
        }
    }
}

foreach country of global countries1 {
    local master_file: dir "$inp_bm_bs/`country'/" files "*.dta"
    display `master_file'
    
    foreach mf of local master_file {
        use "$inp_bm_bs/`country'/`mf'", clear
    }
    
tempfile `country'_AR /* Ex : Albania_AR (AR = Allrounds) */
    forvalues x = 1/4 {
        capture noisily: merge 1:1 idstd using ``country'_R`x'_ed', generate(merge_round`x')

    }
save ``country'_AR'
}

foreach country of global countries1 {
    append using ``country'_AR', force
}

However, given that the merge ran correctly for those three countries, I don't understand what changed when adding the option keep(3) (the merge_round`x' == 1 or 2 were all equal to 0 before). So it's still puzzling, I'd be happy to know if someone knows why this happened.

Last edited by Adam Sadi; 05 Sep 2022, 09:15.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#4

05 Sep 2022, 11:05

Code:

foreach country of global countries1 { local master_file: dir "$inp_bm_bs/`country'/" files "*.dta" display `master_file' foreach mf of local master_file { use "$inp_bm_bs/`country'/`mf'", clear } forvalues x = 1/4 { capture noisily: merge 1:1 idstd using ``country'_R`x'_ed', generate(merge_round`x') tempfile `country'_AR /* Ex : Albania_AR (AR = Allrounds) */ save ``country'_AR' } } foreach country of global countries1 { append using ``country'_AR', force }

As I have no files, nor set of directories with similar structure to work with, I cannot troubleshoot your code in the usual way. I have highlighted (bold face) above some red flags that may be contributing to your problem, or, perhaps are unrelated but may also be introducing errors that you have not yet become aware of.

The -foreach mf of local master_file {- loop is suspicious. It loops over a local that contains, presumably, many filenames. It reads each one in and then does nothing with it. At the end, only the final file remains in memory. If your intent is to use only the last of those files as the starting point for merging, there is no need for a loop. I suspect, however, that is not your intent and that something is missing from this loop.

The contents of the next loop beginning with -capture noisily- is also suspicious. You have indicated that you did this because you don't want the code to break when some country does not have a file for all four rounds. But you have also allowed the code to continue when there is something other than that wrong with the merge, such as perhaps an inability to do a 1:1 merge on idstd. (Although you do say you have verified that idstd uniquely identifies observations in each data set.) Regardless, even if the -merge- is just bypassed because the using file does not exist, you are still trying to save a ``country'_AR' tempfile for it. But for all four iterations of that -forvalues x = 1/4 {- loop you are trying to save the same file, because `country' does not vary within the loop. So the -save- command should be breaking as soon as you reach `x' == 2 because you did not tell Stata it is OK to overwrite the file. So I'm also thinking that the code you are showing is not what you are running, because I do not see how this code can produce results at all.

There is a much safer way to code around the problem of some countries not having data for all four rounds:

Code:

tempfile `country'_AR /* Ex : Albania_AR (AR = Allrounds) */ forvalues x = 1/4 { capture confirm file ``country'_R`x'_ed' // CHECK WHETHER FILE EXISTS if c(rc) == 0 { merge 1:1 idstd using ``country'_R`x'_ed', generate(merge_round`x') save ``country'_AR', replace } else if c(rc) == 601 { // FILE NOT FOUND continue } else { // OTHER, UNEXPECTED ERROR; BAIL OUT!!! display as error "Unexpected problem x = `x'" exit c(rc) } }

Notes on above code:
The declaration of tempfile `country'_AR is moved outside the forvalues loop, as it should only be declared once, since it is a single file to be used repeatedly in the loop.

This code will break if the -merge- files because the -capture- is finely honed to only let "file not found" problems pass through. This way if there is a problem with the -merge-s, or any other problem inside that -if c(rc) == 0- block, you will find out about it immediately and not blunder on.

Finally, the use of the -force- option with -append- is very unsafe. If your files are all completely compatible for appending, there is no need for -force- and it does nothing. -force- only comes into play when the files are incompatible. By -force-ing the -append- you lose data in the variables that are incompatible, so the resulting data set is inherently corrupted and should not be used. The only exception is if you know with 100% certainty that the only variables that exhibit incompatibilities are variables that you will never need to use. But if that is the case, a safer way to deal with the problem is to either -drop- those variables before you do the -append-ing, or use -append-'s -keep()- option so that those variables are just ignored anyway. In short, the -force- option is a trap for the unwary. Bad enough that Stata has it. Don't use it!

I hope this is helpful, whether it solves the question you originally posed or not.

Last edited by Clyde Schechter; 05 Sep 2022, 11:08.
1 like
Comment
Adam Sadi

Join Date: Jul 2022

Posts: 68
#5

05 Sep 2022, 14:57

Clyde:

Thanks a lot for your insightful reply. I did not know the if function works, so your code is a real value added for my work. I am grateful.

Regarding the loop you wrote in bald, indeed I am aware that such a loop will only deal with the last file in the folder. One thing I did not mention was that there is exactly one file per folder, hence I am not concerned by this problem.

I will now double-check the append, but it seems that a lot of variables were un-append-able because they had a different format across datasets. Is there a way to convert fastly every variable before appending or is it a case-by-case situation?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#6

05 Sep 2022, 18:21

Is there a way to convert fastly every variable before appending or is it a case-by-case situation?

You have to review all the data to decide for each variable whether to make it string in all the data sets or numeric in all the data sets. But once you have made that decision you can do it in a loop. Suppose you want to make UNIT all string. Then it would be:

Code:

local filenames: dir whatever_directory files "*.dta" foreach f of local filenames { use `"`f'"', clear tostring UNIT, replace save `"`f'_clean"', replace }

And if there are several variables that need to be fixed, you can put as many -tostring- or -destring- commands in there as you need. You can also nest loops over groups of variables if their -tostring- or -destring- options are all the same. At the end of that you will have new files with _clean attached to their names. Then you can go ahead and do a loop to -append- or -merge- them all as appropriate.
1 like
Comment

Announcement

When merging datasets I end up with duplicates in terms of my merging variable

Comment

Comment

Comment

Comment

Comment