Problem with merging multiple csv files using merge 1:1

Shreya Dutt

Join Date: Dec 2019

Posts: 34
#1

Problem with merging multiple csv files using merge 1:1

30 Dec 2019, 14:49

Hi,

I am a beginner in Stata (using Stata 16) and after going through many of the posts regarding merging multiple files from a folder, I tried to write the following code but I received an error. I will describe the data, folder structure, code and error messages below:

Data: I have quarterly bank data from FDIC where each csv file corresponds to one quarter and within each file, different banks are identified using a variable called 'cert'. For every file, there is also a column named 'repdte' which lists the quarter for the particular file (so for eg, I will have a file named All_Reports_20170930_U.S. Government Obligations.csv which will have many columns giving data regarding US Govt Obligations and there will also be two additional columns cert and repdte listing the bank ID and 20170930 respectively for the entire file).
Sample csv files may be downloaded from: https://www7.fdic.gov/sdi/download_l...st_outside.asp For my testing, I am using the 2018, 2017 files for quarters 1231 and 0930 for the files "Unused Commitments Securitization" and "U.S. Government Obligations".
What I want to do: I want to merge all the bank data across banks and quarter (panel data) and to do this, i figured I should use the command: merge 1:1 cert repdte using filename
Code:
clear all
pwd
cd "C:\Users\HP\Dropbox\Data\Test2"

tempfile mbuild
clear
save `mbuild', emptyok

foreach year in 2018 2017{
foreach dm in 1231 0930 {
foreach name in "Unused Commitments Securitization" "U.S. Government Obligations"{
import delimited "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`y ear' `dm'_`name'", clear
gen source = "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`y ear' `dm'_`name'"
merge 1:1 cert repdte using `mbuild'
save `mbuild', replace

}
}
}
Error:
.
. foreach year in 2018 2017{
2. foreach dm in 1231 0930 {
3. foreach name in "Unused Commitments Securitization" "U.S. Governmen
> t Obligations"{
4. import delimited "C:\Users\HP\Dropbox\Data\Test2\All_Report
> s_`year'`dm'_`name'", clear
5. gen source = "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`y
> ear'`dm'_`name'"
6. merge 1:1 cert repdte using `mbuild'
7. save `mbuild', replace
8.
. }
9. }
10. }
(52 vars, 5,415 obs)
no variables defined
r(111);

Could someone please help me understand what i am doing wrong and how I can achieve what I am trying to do? Additionally, I also want to be able to retrieve the merged file to do further analysis on Stata and also export it to a folder on my computer - how should I do that?

Last edited by Shreya Dutt; 30 Dec 2019, 15:01.
Tags: data, foreach, loop, merge, panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#2

30 Dec 2019, 15:15

This approach fails because, unlike -append-, -merge- does not allow you to start with an empty data set in memory.

So you have to first import one of the files into memory and then loop over all the rest of them to merge them all together.

One other thing, when you do that, each time you use -merge-, it tries to create a variable _merge that identifies which observations came from which data sets (or which from both), and on the third try that will fail because _merge will already exist. So you need to add a -drop _merge- command after the -merge- command itself. (Or you can add a -nogenerate- option to the -merge- command.)

And another thing: you cannot usefully do -gen source = …- in this code, because only the values created the first time through will be preserved: after that Stata will recognize source as a pre-existing variable and the values in the -using- data set will be ignored.

Overall, I would add that the entire approach seems misguided. I don't really understand this data, and I'm not going to download the data sets to familiarize myself with them in detail. But it seems to me that while it probably makes sense to -merge- each year's and bank's Unused Commitments Securitization and US Government Obligations data together, combining the banks and years looks like something that would be better done with -append-, to create a long layout data set that will be useful for analysis in Stata.
Comment
Shreya Dutt

Join Date: Dec 2019

Posts: 34
#3

30 Dec 2019, 15:23

Thanks Clyde - I was only trying to extrapolate solutions from existing posts but I understand that that may not be helpful given my own limited understanding of the language. Would you be able to help me with a sample code on how to merge and then append data the way you mentioned? The reason I was not using append is because if I merge all the files for a particular quarter, then between 2 different quarters there are different numbers of columns (essentially in 1992 for eg, less data was being collected compared to 2018). Can I use append in that case?

I was also looking at another post of yours: https://www.statalist.org/forums/for...to-one-dataset I tried to write the below code to replicate that because it seemed closer to what I wanted but that too did not work and errored out with the no variables error:

foreach year in 2018 2017{
foreach dm in 1231 0930 {
foreach name in "Unused Commitments Securitization" "U.S. Government Obligations"{
{
import delimited "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`year' `dm'_`name'.csv", clear
capture confirm file `"`mbuild'"'
if c(rc) == 0 {
use `mbuild', clear
merge 1:1 cert repdte using `"\Users\HP\Dropbox\Data\Test2\All_Reports_`year'` dm'_`name'.csv"', nogenerate
save `"`mbuild'"', replace
}
else if c(rc) == 601{
use `"\Users\HP\Dropbox\Data\Test2\All_Reports_`year'` dm'_`name'.csv"', clear
save /*`"`mbuild'"'*/ `mbuild'
}
else {
display as error "Unexpected problem in directory"
exit c(rc)
}

}
}
}
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#4

30 Dec 2019, 16:17

So, something like this:

Code:

clear* tempfile building save `building', emptyok frame create secondary local prefix C:\Users\HP\Dropbox\Data\Test2\All_Reports_ foreach year in 2018 2017 { foreach dm in 1231 0930 { import delimited using "`prefix'`year'`dm'_Unused Commitments Securitization.csv", clear // OPTIONS??? frame change secondary import delimited using "`prefix'`year'`dm'_U.S. Government Obligations.csv" // ??? OPTIONS?? frlink 1:1 cert repdte, frame(default) frget _all, from(default) gen year = `year' gen dm = `dm' append using `building' save `"`building'"', replace frame change default } } frame change secondary

At the end of this code, the data set in tempfile `building', and in the active frame, secondary, will be the combined results. The Unused Commitments Securitization will be "to the right" of the US Government Obligations data, and separate year's and dm's will be stacked "vertically." This is probably the best layout for analysis of the data in most situations.

Notes:
1. Untested as I am not working with the actual files.

2. You may need to specify some options on your -import delimited- commands to get the data to come in the way you want it. For example,often you want the first row of the .csv files to be used as variable names, which requires that you specify the -varnames()- option. You may have to specify special handling of quotes, forcing certain columns to be string, forcing other columns to be numeric etc. Do study -help import delimited- to figure out the options that will work best with the data you have.

3. I have created variables year and dm to mark those variables as the source in the resulting file. This assumes that those variables don't already exist in the CSV files themselves, in which case, you can skip these steps.

4. There is a good chance that all of this will fail anyway due to inconsistencies in the construction of these data sets. You have already alluded to the idea that different variables are collected in different years. That is not an obstacle to using append. But if the same variables are collected under different names, or as strings in one year and numbers in another, -append- will choke on that. In my experience problems like this are found more often than not. If you really have only 8 data sets to put together, well, you might get lucky. But if your real problem involves many years and dm, it would be surprising if you did not encounter inconsistencies across the data sets. That is why I generally do not undertake this kind of mass importation of data sets. I generally import each data set separately and familiarize myself with them. I then decide on a single organization for all the files to be put together: what variable names to use, what to have as strings, what to have as numeric variables, what value labels (if any) to use with numeric variables, what variable labels to use, and then I enforce conformity to that organization in each of those data sets. Only when that is done do I contemplate putting the data sets together. I know that some people disagree with this approach and build those "corrections" of the data into the code that puts the data sets together. I suppose that's OK if there are only a few little wrinkles that need to be ironed out. But the data sets I work with are usually very messy and that approach would be unworkable, leading to incomprehensible, tangled code.

Last edited by Clyde Schechter; 30 Dec 2019, 16:19.
Comment
Shreya Dutt

Join Date: Dec 2019

Posts: 34
#5

31 Dec 2019, 13:50

Thanks Clyde - I was able to get the merge to work but I am facing two problems:
1. When I import the csv files, I am specifying varnames(1) as a delimiter but the dta file is not getting created with the first row as variable names.
2. My code is failing when I try to append the merged files (merged by cert) so that all quarter's data is available in one file

Code:
tempfile appendfile
save `appendfile', emptyok replace

foreach year in 2018 2017{
foreach dm in 1231 0930 {
foreach name in "Unused Commitments Securitization" "U.S. Government Obligations" {
import delimited "C:\Users\HP\Documents\ShreyaDutt\Personal\Berkele y\Test2\All_Reports_`year'`dm'_`name'.csv", varnames(1) clear
save "C:\Users\HP\Documents\ShreyaDutt\Personal\Berkele y\Test2\All_Reports_`year'`dm'_`name'.dta", replace
capture confirm file `"C:\Users\HP\Documents\ShreyaDutt\Personal\Berkel ey\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta"'
if c(rc) == 0 {
use `"C:\Users\HP\Documents\ShreyaDutt\Personal\Berkel ey\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta"', clear
merge 1:1 cert using `"\Users\HP\Documents\ShreyaDutt\Personal\Berkeley \Test2\All_Reports_`year'`dm'_`name'.dta"', nogenerate
save `"C:\Users\HP\Documents\ShreyaDutt\Personal\Berkel ey\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta"', replace
}
else if c(rc) == 601{
use `"\Users\HP\Documents\ShreyaDutt\Personal\Berkeley \Test2\All_Reports_`year'`dm'_`name'.dta"', clear
save `C:\Users\HP\Documents\ShreyaDutt\Personal\Berkele y\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta', replace
}
else {
display as error "Unexpected problem in directory"
exit c(rc)
}

}

append using `appendfile'
save `"`appendfile'"', replace
}
}

I am doing the append in the outer loop and the merge in the inner loop (so the merge is happening across the filenames for a particular quarter and the append is happening across the merged file for the different quarters). The error I am getting is below for the line append using `appendfile':

invalid file specification
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#6

31 Dec 2019, 15:04

1. I don't know what to tell you. I've never encountered this situation. Your use of the -varnames()- option appears to be correct. Perhaps the variable names are not actually in the first row of the .csv files? What are you getting for variable names when you do the import? Provide more information.

2. Again, your use of tempfile appendfile looks OK. One common problem when people run code with local macros is that they do not run the entire code from beginning to end without interruption. If you are running the code in "chunks" or if the code is interrupted inside the loops and you get Stata to "resume" it from where it left off, then the local macro goes out of scope (i.e. ceases to be defined) which can lead to the -invalid file specification- when you try to access the tempfile. If it's not that, I don't see the problem.

As an aside, in your pathname, Bertkeley is spelled with an embedded blank space, and the blank space does not consistently appear in the same place. It's hard for me to imagine that this isn't causing some kind of problem--perhaps interrupting the code in the loops to set up the problem you reported as 2.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 789

03 Jan 2020, 15:24

Shreya Dutt, I suggest rereading the previous posts #4 and #6 carefully before continuing.

Below are some additional comments and suggestions, with a updated version of your example code #5 tested on data from SDI.

When posting code,

Code:

please use CODE delimiters

described in the FAQ 12.3 How to use CODE delimiters

Always use code indenting see #4, and show the code used. The code example in #5 cannot be exact what was used given the spaces in the file paths (or maybe all error messages was not reported).

I was only trying to extrapolate solutions from existing posts but I understand that that may not be helpful given my own limited understanding of the language.

I suggest considering the following NetCourses:

and the following books:

An Introduction to Stata Programming. Baum, C. F. 2016.
The Workflow of Data Analysis Using Stata. Long, J. S. 2009.

I want to merge all the bank data

Restrict the data to what you need, and validate data as early as possible, possibly using the SDI Glossary with definitions of variables contained within the SDI System which also can be used to add descriptions of data.

Below you find a update of your code in #5. The main changes are:

use of local macros to avoid repeating path etc.
merge and append is separated: (1) first merging, (2) then appending all merged files. (my preference)
added some minimum validation (isid)

Some comments to the revised code/data:

merge: overlapping variables exists (same variable names in master and using)
variable name "IDSCAS" in SDI definition is "idscas" in SDI data. Ref #4 point 4. on inconsistencies in data.
the reported problem in #5 is not reproduced

1. When I import the csv files, I am specifying varnames(1) as a delimiter but the dta file is not getting created with the first row as variable names.

(thus, the first problem reported in #5 could be a consequence of the extra spaces, or we are not using the same data)

I added two extra elements: a small ad-hoc program for adding metadata to characteristics, and locals for selecting variables. This can be deleted; delete the definitions typed in small size and the lines

Code:

capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"
* keep `id' `vars_common' `vars_from_UCS' /* uncomment to run */
capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"
* keep `id' `vars_from_USGO' /* uncomment to run */

Revised code: The following code can possibly be in a separate do file, to be called from a master do file (https://www.stata.com/manuals/u16.pdf).

Code:

version 16
 
********************************************************************************
* define locals
********************************************************************************

local path "C:\Users\HP\Documents\ShreyaDutt\Personal\Berkeley\Test2\"
local pre "All_Reports_" // common first part of filename

local years 2018 2017
local quarters 1231 0930  
local types `" "Unused Commitments Securitization" "U.S. Government Obligations" "'

********************************************************************************
* EXTRA uncomment the two keep commands to apply the restrictions
local id = "cert"
local vars_from_UCS = "SZUCRES"  
local vars_from_USGO = "scage idscas" // WARNING: "IDSCAS" in SDI definition
local vars_common "docket name" // WARNING: common vars in UCS and USGO

********************************************************************************
* EXTRA define prog to add some metadata to characteristics
capt prog drop extra_add_metadata
prog define extra_add_metadata, nclass
args year quarter t csv
qui ds  
char define _dta[file_`year'_`quarter'_`t'] "`csv' : `r(varlist)'"
end // define extra_add_metadata
********************************************************************************

********************************************************************************
* (0) start in directory with READ ONLY csv files to be used  
********************************************************************************

cd "`path'"

********************************************************************************
* (1) merge UCS USGO for each year-quarter  
********************************************************************************
 
foreach year of local years {

  foreach quarter of local quarters {
        
        local fn = "`pre'`year'`quarter'"
        local t = 0
        
      foreach type of local types {
        
            local ++t
                        
            import delim "`fn'_`type'.csv", case(preserve) clear
            
            isid `id' // assert variable cert uniquely identify observations
            
            if ( `t' == 1 ) { /* "Unused Commitments Securitization" */
            
                capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"
                
                * keep `id' `vars_common' `vars_from_UCS' /* uncomment to run */
                
                tempfile UCS
                save "`UCS'"
            }
            
            else { /* "U.S. Government Obligations" */
            
                capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"

                * NB master data variables (USGO) kept if overlapping variables
                
                * keep `id' `vars_from_USGO' /* uncomment to run */
                
               merge 1:1 `id' using "`UCS'" , assert(match) nogen
                
                save "`fn'_merged.dta"
                
                erase "`UCS'" /* early deletion of tempfile */
            }
            
      } // types  
    } // quarters    
} // years 

********************************************************************************
* (2) append merged year-quarter files
********************************************************************************

local filenames : dir "." files "*_merged.dta" , respectcase

clear

foreach filename of local filenames {

    append using "`filename'"
    
    capture replace data_source = "`filename'" if mi(data_source)
    
    if ( _rc == 111 ) {
    
        gen data_source = "`filename'"
    }
}

isid `id' data_source /* local id = "cert" */
compress
save "All_Reports_combined" , replace

Describe data:

Code:

use "All_Reports_combined" , clear
tab data_source
char dir
misstable summarize,  all showzeros

Examine closely:

Code:

. use "All_Reports_combined" , clear

. tab data_source

                    data_source |      Freq.     Percent        Cum.
--------------------------------+-----------------------------------
All_Reports_20170930_merged.dta |      5,747       25.74       25.74
All_Reports_20171231_merged.dta |      5,679       25.44       51.18
All_Reports_20180930_merged.dta |      5,486       24.57       75.75
All_Reports_20181231_merged.dta |      5,415       24.25      100.00
--------------------------------+-----------------------------------
                          Total |     22,327      100.00

. char dir
  _dta[file_2017_0930_2]:     All_Reports_20170930_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
  _dta[file_2017_0930_1]:     All_Reports_20170930_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..
  _dta[file_2017_1231_2]:     All_Reports_20171231_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
  _dta[file_2017_1231_1]:     All_Reports_20171231_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..
  _dta[file_2018_0930_2]:     All_Reports_20180930_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
  _dta[file_2018_0930_1]:     All_Reports_20180930_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..
  _dta[file_2018_1231_2]:     All_Reports_20181231_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
  _dta[file_2018_1231_1]:     All_Reports_20181231_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..

. misstable summarize,  all showzeros 
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
          cert |         0         0    22,327  |   >500         14       91325
        docket |         0         0    22,327  |   >500          0       82171
      fed_rssd |         0         0    22,327  |   >500         37     5336928
       rssdhcr |         0         0    22,327  |   >500          0     5292707
          name | (string variable)              |
          city | (string variable)              |
         stalp | (string variable)              |
           zip |         0         0    22,327  |   >500        802       99901
        repdte | (string variable)              |
       rundate | (string variable)              |
       bkclass | (string variable)              |
       address | (string variable)              |
       namehcr | (string variable)              |
        offdom |         0         0    22,327  |    304          0        6067
        offfor |         0         0    22,327  |     13          0         286
        stmult |         0         0    22,327  |      2          0           1
       specgrp |         0         0    22,327  |      9          1           9
      subchaps |         0         0    22,327  |      2          0           1
        county | (string variable)              |
    cbsa_metro |    10,253         0    12,074  |    366      10180       49740
  cbsa_metro~e | (string variable)              |
        estymd | (string variable)              |
       insdate | (string variable)              |
       effdate | (string variable)              |
        mutual | (string variable)              |
       parcert |         0         0    22,327  |      9          0       32188
         trust |         0         0    22,327  |      2          0           1
       regagnt | (string variable)              |
      insagnt1 | (string variable)              |
       fdicdbs | (string variable)              |
      fdicsupv | (string variable)              |
        fldoff | (string variable)              |
           fed | (string variable)              |
       occdist | (string variable)              |
      otsregnm | (string variable)              |
         offoa |         0         0    22,327  |     20          0         182
            cb |         0         0    22,327  |      2          0           1
   instwebaddr | (string variable)              |
         scage |         0         0    22,327  |   >500          0    3.40e+08
        idscas |         0         0    22,327  |   >500          0    1.08e+07
         scaot |    10,919         0    11,408  |   >500          0     2860773
         scspn |    10,919         0    11,408  |   >500          0    1.06e+07
         scfmn |        36         0    22,291  |   >500          0    1.91e+08
         scgnm |     7,280         0    15,047  |   >500          0    1.22e+08
         sccol |        36         0    22,291  |   >500          0    3.36e+07
        SCCPTG |        36         0    22,291  |   >500          0    2.25e+07
        SCCMOG |        36         0    22,291  |   >500          0    2.81e+07
       SZUCRES |    17,979         0     4,348  |      1          0           0
       SZUCHEL |    17,979         0     4,348  |      1          0           0
      SZUCCRCD |    17,979         0     4,348  |      1          0           0
      SZUCAUTO |    17,979         0     4,348  |      1          0           0
       SZUCCON |    17,979         0     4,348  |      1          0           0
        SZUCCI |    17,979         0     4,348  |      5          0      235000
       SZUCOTH |    17,979         0     4,348  |      5          0       34241
       UCSZRES |    14,490         0     7,837  |      9          0      396000
       UCSZHEL |    18,147         0     4,180  |      1          0           0
      UCSZCRCD |    18,147         0     4,180  |      2          0         336
      UCSZAUTO |    17,979         0     4,348  |      4          0      419000
       UCSZCON |    17,979         0     4,348  |      1          0           0
        UCSZCI |    17,979         0     4,348  |      4          0       80000
       UCSZOTH |    14,490         0     7,837  |      7          0      519000
   data_source | (string variable)              |
  -----------------------------------------------------------------------------

Last edited by Bjarte Aagnes; 03 Jan 2020, 16:06.

Comment

Shreya Dutt

Join Date: Dec 2019

Posts: 34
#8

15 Jan 2020, 13:22

Thank you for all your help. I was able to make it work with help from your inputs.
Comment

Announcement