Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with merging multiple csv files using merge 1:1

    Hi,

    I am a beginner in Stata (using Stata 16) and after going through many of the posts regarding merging multiple files from a folder, I tried to write the following code but I received an error. I will describe the data, folder structure, code and error messages below:

    Data: I have quarterly bank data from FDIC where each csv file corresponds to one quarter and within each file, different banks are identified using a variable called 'cert'. For every file, there is also a column named 'repdte' which lists the quarter for the particular file (so for eg, I will have a file named All_Reports_20170930_U.S. Government Obligations.csv which will have many columns giving data regarding US Govt Obligations and there will also be two additional columns cert and repdte listing the bank ID and 20170930 respectively for the entire file).
    Sample csv files may be downloaded from: https://www7.fdic.gov/sdi/download_l...st_outside.asp For my testing, I am using the 2018, 2017 files for quarters 1231 and 0930 for the files "Unused Commitments Securitization" and "U.S. Government Obligations".
    What I want to do: I want to merge all the bank data across banks and quarter (panel data) and to do this, i figured I should use the command: merge 1:1 cert repdte using filename
    Code:
    clear all
    pwd
    cd "C:\Users\HP\Dropbox\Data\Test2"

    tempfile mbuild
    clear
    save `mbuild', emptyok

    foreach year in 2018 2017{
    foreach dm in 1231 0930 {
    foreach name in "Unused Commitments Securitization" "U.S. Government Obligations"{
    import delimited "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`y ear' `dm'_`name'", clear
    gen source = "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`y ear' `dm'_`name'"
    merge 1:1 cert repdte using `mbuild'
    save `mbuild', replace

    }
    }
    }
    Error:
    .
    . foreach year in 2018 2017{
    2. foreach dm in 1231 0930 {
    3. foreach name in "Unused Commitments Securitization" "U.S. Governmen
    > t Obligations"{
    4. import delimited "C:\Users\HP\Dropbox\Data\Test2\All_Report
    > s_`year'`dm'_`name'", clear
    5. gen source = "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`y
    > ear'`dm'_`name'"
    6. merge 1:1 cert repdte using `mbuild'
    7. save `mbuild', replace
    8.
    . }
    9. }
    10. }
    (52 vars, 5,415 obs)
    no variables defined
    r(111);


    Could someone please help me understand what i am doing wrong and how I can achieve what I am trying to do? Additionally, I also want to be able to retrieve the merged file to do further analysis on Stata and also export it to a folder on my computer - how should I do that?
    Last edited by Shreya Dutt; 30 Dec 2019, 16:01.

  • #2
    This approach fails because, unlike -append-, -merge- does not allow you to start with an empty data set in memory.

    So you have to first import one of the files into memory and then loop over all the rest of them to merge them all together.

    One other thing, when you do that, each time you use -merge-, it tries to create a variable _merge that identifies which observations came from which data sets (or which from both), and on the third try that will fail because _merge will already exist. So you need to add a -drop _merge- command after the -merge- command itself. (Or you can add a -nogenerate- option to the -merge- command.)

    And another thing: you cannot usefully do -gen source = …- in this code, because only the values created the first time through will be preserved: after that Stata will recognize source as a pre-existing variable and the values in the -using- data set will be ignored.

    Overall, I would add that the entire approach seems misguided. I don't really understand this data, and I'm not going to download the data sets to familiarize myself with them in detail. But it seems to me that while it probably makes sense to -merge- each year's and bank's Unused Commitments Securitization and US Government Obligations data together, combining the banks and years looks like something that would be better done with -append-, to create a long layout data set that will be useful for analysis in Stata.

    Comment


    • #3
      Thanks Clyde - I was only trying to extrapolate solutions from existing posts but I understand that that may not be helpful given my own limited understanding of the language. Would you be able to help me with a sample code on how to merge and then append data the way you mentioned? The reason I was not using append is because if I merge all the files for a particular quarter, then between 2 different quarters there are different numbers of columns (essentially in 1992 for eg, less data was being collected compared to 2018). Can I use append in that case?

      I was also looking at another post of yours: https://www.statalist.org/forums/for...to-one-dataset I tried to write the below code to replicate that because it seemed closer to what I wanted but that too did not work and errored out with the no variables error:

      foreach year in 2018 2017{
      foreach dm in 1231 0930 {
      foreach name in "Unused Commitments Securitization" "U.S. Government Obligations"{
      {
      import delimited "C:\Users\HP\Dropbox\Data\Test2\All_Reports_`year' `dm'_`name'.csv", clear
      capture confirm file `"`mbuild'"'
      if c(rc) == 0 {
      use `mbuild', clear
      merge 1:1 cert repdte using `"\Users\HP\Dropbox\Data\Test2\All_Reports_`year'` dm'_`name'.csv"', nogenerate
      save `"`mbuild'"', replace
      }
      else if c(rc) == 601{
      use `"\Users\HP\Dropbox\Data\Test2\All_Reports_`year'` dm'_`name'.csv"', clear
      save /*`"`mbuild'"'*/ `mbuild'
      }
      else {
      display as error "Unexpected problem in directory"
      exit c(rc)
      }



      }
      }
      }

      Comment


      • #4
        So, something like this:

        Code:
        clear*
        tempfile building
        save `building', emptyok
        
        frame create secondary
        
        local prefix C:\Users\HP\Dropbox\Data\Test2\All_Reports_
        foreach year in 2018 2017 {
            foreach dm in 1231 0930 {
                import delimited using "`prefix'`year'`dm'_Unused Commitments Securitization.csv", clear // OPTIONS???
                frame change secondary
                import delimited using "`prefix'`year'`dm'_U.S. Government Obligations.csv" // ??? OPTIONS??
                frlink 1:1 cert repdte, frame(default)
                frget _all, from(default)
                gen year = `year'
                gen dm = `dm'
                append using `building'
                save `"`building'"', replace
                frame change default
            }
        }
        frame change secondary
        At the end of this code, the data set in tempfile `building', and in the active frame, secondary, will be the combined results. The Unused Commitments Securitization will be "to the right" of the US Government Obligations data, and separate year's and dm's will be stacked "vertically." This is probably the best layout for analysis of the data in most situations.

        Notes:
        1. Untested as I am not working with the actual files.

        2. You may need to specify some options on your -import delimited- commands to get the data to come in the way you want it. For example,often you want the first row of the .csv files to be used as variable names, which requires that you specify the -varnames()- option. You may have to specify special handling of quotes, forcing certain columns to be string, forcing other columns to be numeric etc. Do study -help import delimited- to figure out the options that will work best with the data you have.

        3. I have created variables year and dm to mark those variables as the source in the resulting file. This assumes that those variables don't already exist in the CSV files themselves, in which case, you can skip these steps.

        4. There is a good chance that all of this will fail anyway due to inconsistencies in the construction of these data sets. You have already alluded to the idea that different variables are collected in different years. That is not an obstacle to using append. But if the same variables are collected under different names, or as strings in one year and numbers in another, -append- will choke on that. In my experience problems like this are found more often than not. If you really have only 8 data sets to put together, well, you might get lucky. But if your real problem involves many years and dm, it would be surprising if you did not encounter inconsistencies across the data sets. That is why I generally do not undertake this kind of mass importation of data sets. I generally import each data set separately and familiarize myself with them. I then decide on a single organization for all the files to be put together: what variable names to use, what to have as strings, what to have as numeric variables, what value labels (if any) to use with numeric variables, what variable labels to use, and then I enforce conformity to that organization in each of those data sets. Only when that is done do I contemplate putting the data sets together. I know that some people disagree with this approach and build those "corrections" of the data into the code that puts the data sets together. I suppose that's OK if there are only a few little wrinkles that need to be ironed out. But the data sets I work with are usually very messy and that approach would be unworkable, leading to incomprehensible, tangled code.
        Last edited by Clyde Schechter; 30 Dec 2019, 17:19.

        Comment


        • #5
          Thanks Clyde - I was able to get the merge to work but I am facing two problems:
          1. When I import the csv files, I am specifying varnames(1) as a delimiter but the dta file is not getting created with the first row as variable names.
          2. My code is failing when I try to append the merged files (merged by cert) so that all quarter's data is available in one file

          Code:
          tempfile appendfile
          save `appendfile', emptyok replace

          foreach year in 2018 2017{
          foreach dm in 1231 0930 {
          foreach name in "Unused Commitments Securitization" "U.S. Government Obligations" {
          import delimited "C:\Users\HP\Documents\ShreyaDutt\Personal\Berkele y\Test2\All_Reports_`year'`dm'_`name'.csv", varnames(1) clear
          save "C:\Users\HP\Documents\ShreyaDutt\Personal\Berkele y\Test2\All_Reports_`year'`dm'_`name'.dta", replace
          capture confirm file `"C:\Users\HP\Documents\ShreyaDutt\Personal\Berkel ey\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta"'
          if c(rc) == 0 {
          use `"C:\Users\HP\Documents\ShreyaDutt\Personal\Berkel ey\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta"', clear
          merge 1:1 cert using `"\Users\HP\Documents\ShreyaDutt\Personal\Berkeley \Test2\All_Reports_`year'`dm'_`name'.dta"', nogenerate
          save `"C:\Users\HP\Documents\ShreyaDutt\Personal\Berkel ey\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta"', replace
          }
          else if c(rc) == 601{
          use `"\Users\HP\Documents\ShreyaDutt\Personal\Berkeley \Test2\All_Reports_`year'`dm'_`name'.dta"', clear
          save `C:\Users\HP\Documents\ShreyaDutt\Personal\Berkele y\Test2\All_Reports_`year'`dm'_Unused Commitments Securitization.dta', replace
          }
          else {
          display as error "Unexpected problem in directory"
          exit c(rc)
          }

          }

          append using `appendfile'
          save `"`appendfile'"', replace
          }
          }

          I am doing the append in the outer loop and the merge in the inner loop (so the merge is happening across the filenames for a particular quarter and the append is happening across the merged file for the different quarters). The error I am getting is below for the line append using `appendfile':

          invalid file specification

          Comment


          • #6
            1. I don't know what to tell you. I've never encountered this situation. Your use of the -varnames()- option appears to be correct. Perhaps the variable names are not actually in the first row of the .csv files? What are you getting for variable names when you do the import? Provide more information.

            2. Again, your use of tempfile appendfile looks OK. One common problem when people run code with local macros is that they do not run the entire code from beginning to end without interruption. If you are running the code in "chunks" or if the code is interrupted inside the loops and you get Stata to "resume" it from where it left off, then the local macro goes out of scope (i.e. ceases to be defined) which can lead to the -invalid file specification- when you try to access the tempfile. If it's not that, I don't see the problem.

            As an aside, in your pathname, Bertkeley is spelled with an embedded blank space, and the blank space does not consistently appear in the same place. It's hard for me to imagine that this isn't causing some kind of problem--perhaps interrupting the code in the loops to set up the problem you reported as 2.

            Comment


            • #7
              Shreya Dutt, I suggest rereading the previous posts #4 and #6 carefully before continuing.

              Below are some additional comments and suggestions, with a updated version of your example code #5 tested on data from SDI.

              When posting code,
              Code:
              please use CODE delimiters
              described in the FAQ 12.3 How to use CODE delimiters​​​​​

              Always use code indenting see #4, and show the code used. The code example in #5 cannot be exact what was used given the spaces in the file paths (or maybe all error messages was not reported).

              I was only trying to extrapolate solutions from existing posts but I understand that that may not be helpful given my own limited understanding of the language.
              I suggest considering the following NetCourses:and the following books:
              I want to merge all the bank data
              Restrict the data to what you need, and validate data as early as possible, possibly using the SDI Glossary with definitions of variables contained within the SDI System which also can be used to add descriptions of data.

              Below you find a update of your code in #5. The main changes are:
              1. use of local macros to avoid repeating path etc.
              2. merge and append is separated: (1) first merging, (2) then appending all merged files. (my preference)
              3. added some minimum validation (isid)
              Some comments to the revised code/data:
              1. merge: overlapping variables exists (same variable names in master and using)
              2. variable name "IDSCAS" in SDI definition is "idscas" in SDI data. Ref #4 point 4. on inconsistencies in data.
              3. the reported problem in #5 is not reproduced
              1. When I import the csv files, I am specifying varnames(1) as a delimiter but the dta file is not getting created with the first row as variable names.
              (thus, the first problem reported in #5 could be a consequence of the extra spaces, or we are not using the same data)

              I added two extra elements: a small ad-hoc program for adding metadata to characteristics, and locals for selecting variables. This can be deleted; delete the definitions typed in small size and the lines
              Code:
              capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"
              * keep `id' `vars_common' `vars_from_UCS' /* uncomment to run */
              capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"
              * keep `id' `vars_from_USGO' /* uncomment to run */

              Revised code: The following code can possibly be in a separate do file, to be called from a master do file (https://www.stata.com/manuals/u16.pdf).
              Code:
              version 16
               
              ********************************************************************************
              * define locals
              ********************************************************************************
              
              local path "C:\Users\HP\Documents\ShreyaDutt\Personal\Berkeley\Test2\"
              local pre "All_Reports_" // common first part of filename
              
              local years 2018 2017
              local quarters 1231 0930  
              local types `" "Unused Commitments Securitization" "U.S. Government Obligations" "'
              
              ********************************************************************************
              * EXTRA uncomment the two keep commands to apply the restrictions
              local id = "cert"
              local vars_from_UCS = "SZUCRES"  
              local vars_from_USGO = "scage idscas" // WARNING: "IDSCAS" in SDI definition
              local vars_common "docket name" // WARNING: common vars in UCS and USGO
              
              ********************************************************************************
              * EXTRA define prog to add some metadata to characteristics
              capt prog drop extra_add_metadata
              prog define extra_add_metadata, nclass
              args year quarter t csv
              qui ds  
              char define _dta[file_`year'_`quarter'_`t'] "`csv' : `r(varlist)'"
              end // define extra_add_metadata
              ********************************************************************************
              
              ********************************************************************************
              * (0) start in directory with READ ONLY csv files to be used  
              ********************************************************************************
              
              cd "`path'"
              
              ********************************************************************************
              * (1) merge UCS USGO for each year-quarter  
              ********************************************************************************
               
              foreach year of local years {
              
                foreach quarter of local quarters {
                      
                      local fn = "`pre'`year'`quarter'"
                      local t = 0
                      
                    foreach type of local types {
                      
                          local ++t
                                      
                          import delim "`fn'_`type'.csv", case(preserve) clear
                          
                          isid `id' // assert variable cert uniquely identify observations
                          
                          if ( `t' == 1 ) { /* "Unused Commitments Securitization" */
                          
                              capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"
                              
                              * keep `id' `vars_common' `vars_from_UCS' /* uncomment to run */
                              
                              tempfile UCS
                              save "`UCS'"
                          }
                          
                          else { /* "U.S. Government Obligations" */
                          
                              capt extra_add_metadata `year' `quarter' `t' "`fn'_`type'.csv"
              
                              * NB master data variables (USGO) kept if overlapping variables
                              
                              * keep `id' `vars_from_USGO' /* uncomment to run */
                              
                             merge 1:1 `id' using "`UCS'" , assert(match) nogen
                              
                              save "`fn'_merged.dta"
                              
                              erase "`UCS'" /* early deletion of tempfile */
                          }
                          
                    } // types  
                  } // quarters    
              } // years 
              
              ********************************************************************************
              * (2) append merged year-quarter files
              ********************************************************************************
              
              local filenames : dir "." files "*_merged.dta" , respectcase
              
              clear
              
              foreach filename of local filenames {
              
                  append using "`filename'"
                  
                  capture replace data_source = "`filename'" if mi(data_source)
                  
                  if ( _rc == 111 ) {
                  
                      gen data_source = "`filename'"
                  }
              }
              
              isid `id' data_source /* local id = "cert" */
              compress
              save "All_Reports_combined" , replace


              Describe data:
              Code:
              use "All_Reports_combined" , clear
              tab data_source
              char dir
              misstable summarize,  all showzeros
              Examine closely:
              Code:
              . use "All_Reports_combined" , clear
              
              . tab data_source
              
                                  data_source |      Freq.     Percent        Cum.
              --------------------------------+-----------------------------------
              All_Reports_20170930_merged.dta |      5,747       25.74       25.74
              All_Reports_20171231_merged.dta |      5,679       25.44       51.18
              All_Reports_20180930_merged.dta |      5,486       24.57       75.75
              All_Reports_20181231_merged.dta |      5,415       24.25      100.00
              --------------------------------+-----------------------------------
                                        Total |     22,327      100.00
              
              . char dir
                _dta[file_2017_0930_2]:     All_Reports_20170930_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
                _dta[file_2017_0930_1]:     All_Reports_20170930_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..
                _dta[file_2017_1231_2]:     All_Reports_20171231_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
                _dta[file_2017_1231_1]:     All_Reports_20171231_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..
                _dta[file_2018_0930_2]:     All_Reports_20180930_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
                _dta[file_2018_0930_1]:     All_Reports_20180930_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..
                _dta[file_2018_1231_2]:     All_Reports_20181231_U.S. Government Obligations.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundate bkcl..
                _dta[file_2018_1231_1]:     All_Reports_20181231_Unused Commitments Securitization.csv : cert docket fed_rssd rssdhcr name city stalp zip repdte rundat..
              
              . misstable summarize,  all showzeros 
                                                                             Obs<.
                                                              +------------------------------
                             |                                | Unique
                    Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
                -------------+--------------------------------+------------------------------
                        cert |         0         0    22,327  |   >500         14       91325
                      docket |         0         0    22,327  |   >500          0       82171
                    fed_rssd |         0         0    22,327  |   >500         37     5336928
                     rssdhcr |         0         0    22,327  |   >500          0     5292707
                        name | (string variable)              |
                        city | (string variable)              |
                       stalp | (string variable)              |
                         zip |         0         0    22,327  |   >500        802       99901
                      repdte | (string variable)              |
                     rundate | (string variable)              |
                     bkclass | (string variable)              |
                     address | (string variable)              |
                     namehcr | (string variable)              |
                      offdom |         0         0    22,327  |    304          0        6067
                      offfor |         0         0    22,327  |     13          0         286
                      stmult |         0         0    22,327  |      2          0           1
                     specgrp |         0         0    22,327  |      9          1           9
                    subchaps |         0         0    22,327  |      2          0           1
                      county | (string variable)              |
                  cbsa_metro |    10,253         0    12,074  |    366      10180       49740
                cbsa_metro~e | (string variable)              |
                      estymd | (string variable)              |
                     insdate | (string variable)              |
                     effdate | (string variable)              |
                      mutual | (string variable)              |
                     parcert |         0         0    22,327  |      9          0       32188
                       trust |         0         0    22,327  |      2          0           1
                     regagnt | (string variable)              |
                    insagnt1 | (string variable)              |
                     fdicdbs | (string variable)              |
                    fdicsupv | (string variable)              |
                      fldoff | (string variable)              |
                         fed | (string variable)              |
                     occdist | (string variable)              |
                    otsregnm | (string variable)              |
                       offoa |         0         0    22,327  |     20          0         182
                          cb |         0         0    22,327  |      2          0           1
                 instwebaddr | (string variable)              |
                       scage |         0         0    22,327  |   >500          0    3.40e+08
                      idscas |         0         0    22,327  |   >500          0    1.08e+07
                       scaot |    10,919         0    11,408  |   >500          0     2860773
                       scspn |    10,919         0    11,408  |   >500          0    1.06e+07
                       scfmn |        36         0    22,291  |   >500          0    1.91e+08
                       scgnm |     7,280         0    15,047  |   >500          0    1.22e+08
                       sccol |        36         0    22,291  |   >500          0    3.36e+07
                      SCCPTG |        36         0    22,291  |   >500          0    2.25e+07
                      SCCMOG |        36         0    22,291  |   >500          0    2.81e+07
                     SZUCRES |    17,979         0     4,348  |      1          0           0
                     SZUCHEL |    17,979         0     4,348  |      1          0           0
                    SZUCCRCD |    17,979         0     4,348  |      1          0           0
                    SZUCAUTO |    17,979         0     4,348  |      1          0           0
                     SZUCCON |    17,979         0     4,348  |      1          0           0
                      SZUCCI |    17,979         0     4,348  |      5          0      235000
                     SZUCOTH |    17,979         0     4,348  |      5          0       34241
                     UCSZRES |    14,490         0     7,837  |      9          0      396000
                     UCSZHEL |    18,147         0     4,180  |      1          0           0
                    UCSZCRCD |    18,147         0     4,180  |      2          0         336
                    UCSZAUTO |    17,979         0     4,348  |      4          0      419000
                     UCSZCON |    17,979         0     4,348  |      1          0           0
                      UCSZCI |    17,979         0     4,348  |      4          0       80000
                     UCSZOTH |    14,490         0     7,837  |      7          0      519000
                 data_source | (string variable)              |
                -----------------------------------------------------------------------------
              Last edited by Bjarte Aagnes; 03 Jan 2020, 17:06.

              Comment


              • #8
                Thank you for all your help. I was able to make it work with help from your inputs.

                Comment

                Working...
                X