
  • Looping over files in a folder

    Hi all,

    I'm trying to loop through files from the Quarterly Census of Employment and Wages -- all the files for a year (all metro areas in the US) are downloaded into one of my folders. Unfortunately, when I try to loop, Stata tells me that there are too many files!

    My code is below -- anyone know what could be going wrong, or alternatives?

    local files : dir "$data\qcew\2014" files "*.csv"
    foreach file in `files' {
        quietly insheet using "`file'", clear
        quietly keep area_fips industry_code agglvl_code year qtr area_title industry_title agglvl_title ///
            qtrly_estabs_count month* total_qtrly_wages oty_qtrly_estabs_pct_chg ///
            oty_month1_emplvl_pct_chg oty_month2_emplvl_pct_chg oty_month3_emplvl_pct_chg
        quietly keep if agglvl_code==74 // keep at naics sector level
        quietly bysort area_fips qtr year: egen demp=rowmean(oty_month1_emplvl_pct_chg oty_month2_emplvl_pct_chg oty_month3_emplvl_pct_chg)
        quietly save "`file'.dta", replace
    }

  • #2
    Please run the following commands and then show exactly what Stata did in response (see section 12 in the FAQ).

    Code:
    local files : dir "$data\qcew\2014" files "*.csv"
    foreach file in `files' {
      dir `file'
    }



    • #3
      I downloaded the Quarterly CSVs for 2014 from this page and unzipped the archive. The archive is very big (4,427 csv files, 4.23GB) and the file names are quite long. Depending on your version of Stata, you may hit the macro length limit. I suggest that you use filelist (from SSC) for this type of task. Here's an example that converts all the files in the zip archive into Stata datasets.
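
      As a quick check, you can display the macro length limit for your Stata flavor directly; thousands of long file names stored in one local easily exceed it:

      Code:
      display c(macrolen)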

      Code:
      * -filelist- is from SSC
      filelist , dir(2014.q1-q4.by_area) pattern(*.csv) norecur
      
      * extract the identifier code
      gen shortname = regexs(1) if regexm(filename,"q1-q4 ([^ ]+) ")
      isid shortname
      
      * save a copy so that we can load each observation in the loop below
      tempfile files
      save "`files'"
      
      local obs = _N
      
      * create a directory to put the Stata dta if it does not already exist
      cap mkdir QCEW_stata
      
      * loop over each file and do whatever's needed
      forvalues i=1/`obs' {
      
          use "`files'" in `i', clear
          local source = shortname
          local f = dirname + "/" + filename
          
          insheet using "`f'", clear
          gen source = "`source'"
          
          save "QCEW_stata/`source'.dta", replace
          
      }
      This takes about 4 minutes to process all 4,426 files. If you want to append the files, follow the example in filelist's help file.



      • #4
        Thank you both for the replies, and fantastic advice, Robert! That's just about perfect, but I think I may have misunderstood something -- is there a line that needs to be amended for my context? When I run it, Stata goes through the loop as if it finished, but nothing actually happens; in other words, it gets to the "create directory" step and runs through the loop without doing anything. I will keep looking at it -- I must have missed something, as this looks spot-on.

        Friedrich: just FYI, the same error comes up when I only ask it to list the directory: "too many filenames".



        • #5
          The code should run as is but the following must be true:
          • you execute the code from a do-file (e.g. do "qcwe.do")
          • the do-file is located in Stata's current directory (help cd)
          • the zip archive was uncompressed and saved in a directory called "2014.q1-q4.by_area", located in the current directory
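
          For example, a session along these lines (the path here is only illustrative):

          Code:
          * the current directory contains both the do-file and the
          * unzipped "2014.q1-q4.by_area" directory
          cd "C:/data/qcew"
          do "qcwe.do"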



          • #6
            I should also note that I do not like long file names with spaces like "2014.q1-q4 01001 Autauga County, Alabama.csv", so I included code to extract the identifier based on the pattern found in the list of files. If the file names do not all follow the same pattern, then the following line should be adjusted

            Code:
            gen shortname = regexs(1) if regexm(filename,"q1-q4 ([^ ]+) ")
            I doubt this is your problem since the

            Code:
            isid shortname
            command would throw an error if the short identifiers are missing or not unique.
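
            For instance, here is a quick interactive check of what the pattern captures, using one of the file names mentioned in this thread:

            Code:
            display regexm("2014.q1-q4 01001 Autauga County, Alabama.csv", "q1-q4 ([^ ]+) ")
            display regexs(1)

            The first display shows 1 (the pattern matched) and the second shows 01001, the identifier that becomes shortname.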

            I suggest that you run the first line

            Code:
            filelist , dir(2014.q1-q4.by_area) pattern(*.csv) norecur
            by itself and browse the data in memory afterwards to verify that it contains the list of files in the "2014.q1-q4.by_area" directory.

            You can also try to process just a few files by changing the number of observations line to

            Code:
            local obs = 5



            • #7
              Sorry for being slow on this, and part of it is because I don't understand all of the steps in the process, but there must be something else that I'm not doing right.

              1) I'm working from the current directory: cd "$data\qcew\2014.q1-q4.by_area". (The directory was previously just "$data\qcew\2014", but I changed it to what you had in case that was the culprit behind the loop not working.)

              2) The .do file is now also in the current directory "$data\qcew\2014.q1-q4.by_area". (Before it was in a separate folder where I keep .do files for analysis involved on a particular paper; I copied it to the current directory in case that was the culprit.)

              3) All the QCEW files are located in the current directory "$data\qcew\2014.q1-q4.by_area".

              This seems to all line up with what you said?

              Thank you for your patience!



              • #8
                No, if the do-file is in the same directory as the extracted csv files, you need to scan for files in the current directory. This would be

                Code:
                filelist , dir(.) pattern(*.csv) norecur
                Just do that interactively and check that the data in memory contains the list of files.



                • #9
                  Thanks again, it turned out that the current directory was mis-specified -- precisely as you mentioned in the most recent post above.

                  Now just trying to figure out how to append the files. It's a very valuable command you created!



                  • #10
                    If you are going to append all these files, I suggest that you reprocess them in Stata using import delimited instead of insheet. The problem is that at least one variable (area_fips) will be imported as numeric by insheet for some files, because all of its observations happen to be numeric. In other files, area_fips will contain letters, so it will be imported as a string. When you then try to append these files, you will not be able to resolve the type mismatch easily.

                    The following example shows the basic process and should work if your current directory is correctly aligned with your files. Basically, you import everything into string variables and only destring once the data has been combined. Before running the whole thing (import delimited is slower than insheet), try it on a small subset of files by changing the observation line to local obs = 10

                    Code:
                    * -filelist- is from SSC
                    * make sure that the current directory is the one that
                    * contains the "2014.q1-q4.by_area" directory. 
                    filelist , dir(2014.q1-q4.by_area) pattern(*.csv) norecur
                    
                    * extract the identifier code
                    gen shortname = regexs(1) if regexm(filename,"q1-q4 ([^ ]+) ")
                    isid shortname
                    
                    * save a copy so that we can load each observation in the loop below
                    tempfile files
                    save "`files'"
                    
                    * to try this out on the first 10 files, change to local obs = 10
                    local obs = _N
                    
                    * create a directory to put the Stata dta if it does not already exist
                    cap mkdir QCEW_stata
                    
                    * loop over each file and input each csv file into temporary datasets.
                    forvalues i=1/`obs' {
                    
                        use "`files'" in `i', clear
                        local source = shortname
                        local f = dirname + "/" + filename
                        
                        clear
                        import delimited using "`f'", varnames(1) stringcols(_all)
                        gen source = "`source'"
                        
                        tempfile save`i'
                        qui save "`save`i''"
                        
                    }
                    
                    clear
                    forvalues i=1/`obs' {
                        append using "`save`i''"
                    }
                    
                    * Now that all the data is combined, time to destring numeric vars
                    destring, replace
                    Note that when processing files like this, it is important not to rely on old Statalist posts (or other resources on the internets) that advocate solutions like

                    Code:
                    cap erase mybigfile.dta
                    local files : dir . files "*.csv"
                    
                    foreach f of local files {
                        insheet using "`f'", clear
                        append using mybigfile
                        save mybigfile, replace
                    }
                    The extra cycle of saving "mybigfile" each time a new file is input will take forever because the size increases at each save. By my calculation, this particular exercise would require writing over 10 terabytes of data by the time the last 4GB mybigfile is saved.
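
                    A rough sketch of that arithmetic, assuming the 4,426 files total about 4.23GB as noted above: iteration i rewrites a file that is roughly i/4426 of the final size, so the write volume alone is on the order of 9 terabytes, before counting the re-reads that each append performs.

                    Code:
                    * total written ~ sum over i of (i/N) * 4.23GB ~ (N+1)/2 * 4.23GB
                    display 4427/2 * 4.23 / 1024 " TB"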



                    • #11
                      That's great. While I got it working earlier, your solution looks more efficient and I'll try it out. That'll be important for using additional years of the QCEW! Thank you again for taking the time to help me figure it out!
