Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using a nested foreach loop to merge and then append multiple files from multiple folders into one dataset.

    Essentially, I have a few thousand text files, for which I have a nested foreach loop to import them to .dta files. Here is that code:

    foreach f in list_of_folders <all the folders names> {
    foreach x in list_of_filenames <all the varying text file names> {
    capture import delimited using `f'/`x', clear delimiter(tab)
    if c(rc) == 0 { // SUCCESSFUL IMPORT
    save `f'/`x'.dta , replace
    }
    else if c(rc) != 601 { // UNANTICIPATED ERROR
    display as error "import delimited failed for unexpected reasons"
    exit c(rc)
    }
    else {
    display as text "file `f'/`x' not found -- skipped"
    }
    }
    }

    The data that I have is broken up quarterly across 12 years. Within each folder, I need to merge each text file onto each other. They all have a common, identifying variable labelled idrssd. Then, on a successful merge, I need to append each quarter onto the other into one, master dataset. The primary problem that I'm having in doing this is that I'm not sure how, in one loop, to indicate what the files should merge onto because the file name is different for ever quarter. It doesn't matter what the merging dataset is but for each quarter the file lists look like this:

    POR_03312006.txt.dta CI_03312006.txt.dta etc....
    POR_06302006.txt.dta CI_06302006.txt.dta etc....
    all the way to quarter 4 of 2017 so
    POR_12312017.txt.dta CI_12312017.txt.dta etc....

    I imagined some string command might help overcome this, but this is my first semester using stata, and I am only a sophomore undergraduate, so I feel a bit over my head. Any help would be appreciated, and I'm happy to provide any additional help/information I can. Thank you!

  • #2
    Does each folder contain all and only the data from one quarter? And are the text files you need to work with all of the text files that appear in each folder, no others?

    Comment


    • #3
      Yes each folder contains all and only the data from one quarter, and yes they’re the only ones I need.

      Comment


      • #4
        I believe this will do it. The code is not tested, however.

        Code:
        clear*
        
        local dirlist: dir "." dirs "*"
        
        tempfile mbuild
        save abuild
        save `abuild', emptyok
        
        foreach d of local dirlist { // LOOP OVER DIRECTORIES
            local filelist: dir "`d'" files "*.txt.dta"
            foreach f of local filelist { // LOOP OVER FILES WITHIN `d'
                capture confirm file `"`mbuild'"'
                if c(rc) == 0 {
                    use `mbuild', clear
                    merge 1:1 idrssd using `"`d'/`f/'"', nogenerate
                    save `"`mbuild'"', replace
                }
                else if c(rc) == 601 {  // FIRST FILE IN THIS DIRECTORY
                    use `"`d'/`f'"', clear
                    save `mbuild'
                }
                else {  // UNEXPECTED ERROR
                    display as error "Unexpected problem in directory `d'"
                    exit c(rc)
                }
            }
            //    ALL FILES THIS DIRECTORY NOW MERGED; APPEND IS NEXT
            append using `abuild'
            save `"`abuild'"', replace
            erase `mbuild'
        }
        Note that unless these various files you are merging and appending contain variables that themselves identify the file, the final result will not enable you to trace back where anything in it came from. Also, it is in the nature of -merge- that if the file being merged in at any point has common variables (other than idrssd) with what has already been merged, the pre-existing values of those common variables are retained and the ones in the current file are lost.

        I would also caution that when you have a large number of files to put together like this, even if they come from a very careful data curator, they often contain incompatibilities that make this kind of mass agglomeration problematic. Different coding of "the same" variable may occur in different files. There may be different spellings of the variable names. A variable might be string in one file but numeric in another. A string variable might be padded with blanks in one data set but not another. There may be one or more files where idrssd is missing, or where it does not uniquely identify observations. All of these things will either lead to the program aborting with an error message, or perhaps lead to blundering through and building a useless data salad as a result.

        So I would advise you not to undertake this until you have gone through all the data sets and looked for and resolved discrepancies among them.

        Comment


        • #5
          Thank you for your help! Though, when I run this code at save abuild I get a no variables defined error.

          Comment


          • #6
            So that suggests that the inner loop of -merge-s is not actually producing a merged file to feed to the append. I wonder if the filelist local macro isn't being properly defined. Try putting -display `"`filelist'"'- in right after the -local filelist:...- command and see if it is picking up the files.

            Comment


            • #7
              I'm afraid that also did not work. From what I can tell, it is not making it to that point in the script, and is having problems at the save abuild prior to the loop itself.

              Comment


              • #8
                Oh, sorry, the block of commands
                Code:
                tempfile mbuild
                save abuild
                save `abuild', emptyok
                is wrong. It should be:

                Code:
                tempfile mbuild
                tempfile abuild
                save `abuild', emptyok

                Comment


                • #9
                  That did overcome it, and it's now moved on to the loop, though I'm getting an r(100) "{ required" error at the end of the first iteration of the loop. I've spent a bit trying to work out what's missing, but I haven't been able to.

                  Thank you so much for all your help.

                  Comment


                  • #10
                    I tried setting up some directories with some folders in them along the lines of what you are dealing with. I copied the code from #4 and pasted it into the do-editor, and then made the changes outlined in #8 and ran the code. I got no error messages, and the code ran and produced the expected results. I don't know what to tell you. Are you sure that you didn't modify the code in some other way that has led to this problem?

                    Comment


                    • #11
                      This is what I have written out

                      Code:
                      clear*
                      
                      local dirlist: dir "." dirs "*"
                      
                      tempfile mbuild
                      tempfile abuild
                      save `abuild', emptyok
                      
                      foreach d of local dirlist 03312006 03312007 03312008 03312009 03312010 03312011 03312012 03312013 03312014 03312015 03312016 03312017 06302006 06302007 06302008 06302009 06302010 06302011 06302012 06302013 06302014 06302015 06302016 06302017 09302006 09302007 09302008 09302009 09302010 09302011 09302012 09302013 09302014 09302015 09302016 09302017 12312006 12312007 12312008 12312009 12312010 12312011 12312012 12312013 12312014 12312015 12312016 12312017 { // LOOP OVER DIRECTORIES
                      local filelist: dir "`d'" files "*.txt.dta"
                      foreach f of local filelist { // LOOP OVER FILES WITHIN `d'
                      capture confirm file `"`mbuild'"'
                      if c(rc) == 0 {
                      use `mbuild', clear
                      merge 1:1 idrssd using `"`d'/`f/'"', nogenerate
                      save `"`mbuild'"', replace
                      }
                      else if c(rc) == 601 { // FIRST FILE IN THIS DIRECTORY
                      use `"`d'/`f'"', clear
                      save `mbuild'
                      }
                      else { // UNEXPECTED ERROR
                      display as error "Unexpected problem in directory `d'"
                      exit c(rc)
                      }
                      }
                      // ALL FILES THIS DIRECTORY NOW MERGED; APPEND IS NEXT
                      append using `abuild'
                      save `"`abuild'"', replace
                      erase `mbuild'
                      }
                      That list of numbers not in the original code are the folder names, but other than that they haven't been modified. Am I not suppose to list them out like this?

                      Comment


                      • #12
                        Well, not exactly like that.

                        First of all, notice that local macro dirlist is created in the second line of the code. It should contain those numbers. If it doesn't, then that's a problem.

                        But either way,
                        Code:
                        foreach d of local dirlist 03312006 03312007 03312008 03312009 03312010 03312011 03312012 03312013 03312014 03312015 03312016 03312017 06302006 06302007 06302008 06302009 06302010 06302011 06302012 06302013 06302014 06302015 06302016 06302017 09302006 09302007 09302008 09302009 09302010 09302011 09302012 09302013 09302014 09302015 09302016 09302017 12312006 12312007 12312008 12312009 12312010 12312011 12312012 12312013 12312014 12312015 12312016 12312017 { // LOOP OVER DIRECTORIES
                        is illegal syntax. I'm surprised that Stata did not throw a syntax error message and quit. But in any case, Stata probably mangled the loop if it tried to execute this.

                        If `dirlist' already contains these numbers, then you should just have it as:
                        Code:
                        foreach d of local dirlist {
                        and let Stata iterate accordingly.

                        If `dirlist' does not contain those numbers, then, as I said, something is wrong further up in the code. But since you have already written out the list of directories, rather than trying to fix dirlist, let's just forget about it and work with your list. The syntax for doing it that way is:

                        Code:
                        foreach d in 03312006 03312007 03312008 03312009 03312010 03312011 03312012 03312013 03312014 03312015 03312016 03312017 06302006 06302007 06302008 06302009 06302010 06302011 06302012 06302013 06302014 06302015 06302016 06302017 09302006 09302007 09302008 09302009 09302010 09302011 09302012 09302013 09302014 09302015 09302016 09302017 12312006 12312007 12312008 12312009 12312010 12312011 12312012 12312013 12312014 12312015 12312016 12312017 { // LOOP OVER DIRECTORIES
                        I suggest you review the manual sections for -foreach- to be clear on the use of the various alternative syntaxes:
                        Code:
                        foreach x of local name_of_a_local_macro
                        foreach x of global name_of_a_global_macro
                        foreach x of varlist some_variable_list
                        foreach x of numlist some_Stata_numlist
                        foreach x in some_arbitrary_list_of_anything
                        Pay attention to the distinction between -of- and -in-, and note that you cannot mix the different syntaxes in one command.
                        Last edited by Clyde Schechter; 30 Sep 2019, 15:57.

                        Comment


                        • #13
                          Thanks for all your help! I realized one of the files had, for some reason, had all its titles as string variables while the others weren't, thus the idrssd variable "wasn't found" despite it being there. Destringing the files after import and reformatting a little fixed my problem.

                          Comment


                          • #14
                            Thanks for closing the thread with your solution.

                            It does seem to illustrate the point made in #4 that in a large set of files there are likely to be some inconsistencies or incompatibilities that can trip up the process of putting them all together.

                            Comment


                            • #15
                              That it does.

                              Comment

                              Working...
                              X