  • Loop over a huge number of files

    I am trying to loop over a large number of files (360,000) to convert them from .csv to .dta files (the idea is then to append the data files later).

    When I try to process all the files in one loop, I get an error (too many file names), so I wrote the following outer loop to convert them in groups of 10,000 files.

    The problem is that after 10,000 files are converted and saved, I get the error message "observation numbers out of range".

    Here is my code:

    Code:
    clear all
    
    global maindir "C:/Users/soha_/Desktop/MorningStar/1-Inputs"
    set trace on
    *------------------------------------------------------------------------------*
    // File list
    filelist , dir("$maindir/Historical Assets/Assets_csv") pattern(*.csv) norecur
    
    // save a copy so that we can load each observation in the loop below
    tempfile files
    save "`files'"
    
    local rangeinitial = 1
    forvalues r = 0(10000)360000 {
    
        local range1 = `rangeinitial' + `r'
        local range2 = `range1' + 9999
    
        cd "$maindir/Historical Assets"
    
        forvalues i = `range1'/`range2' {
    
            use "`files'" in `i', clear
            local f = dirname + "/" + filename
            local filename = filename
    
            insheet using "`f'", clear
            local name : subinstr local filename ".csv" ""
    
            save "Assets_dta/`name'.dta", replace
        }
    }
    When I set trace on, I get the following:


    Code:
    - local range1 = `rangeinitial' + `r'
    = local range1 = 1 + 10000
    - local range2 = `range1' + 9999
    = local range2 = 10001 + 9999
    - cd "$maindir/Historical Assets"
    = cd "C:/Users/soha_/Desktop/MorningStar/1-Inputs/Historical Assets"
    C:\Users\soha_\Desktop\MorningStar\1-Inputs\Historical Assets
    - forvalues i=`range1'/`range2' {
    = forvalues i=10001/20000 {
    - use "`files'" in `i', clear
    = use "C:\Users\soha_\AppData\Local\Temp\ST_00000001.tmp" in 10001, clear
    observation numbers out of range
      local f = dirname + "/" + filename
      local filename = filename
      insheet using "`f'", clear
      local name: subinstr local filename ".csv" ""
      save "Assets_dta/`name'.dta", replace
      }
      }
    Thanks,
    Soha
    Last edited by S. IAAIA; 17 May 2019, 10:02.

  • #2
    Each observation of your file called `files' is supposed to contain the name of a csv file in your directory. You believe (presumably correctly) that there are more than 10,000 files in that directory, so the observation numbers in that file should exceed 10,000; the error message appears to indicate that is false. The -filelist- command (a user-written command from SSC, as you are supposed to indicate per the FAQ) by default reports in the Results window how many files it found. Was that number greater than 10,000? If not, I'd browse the file produced by -filelist- and see what's in it. My version of -filelist- (which -which filelist- reports as 2.0.7) indicates that there is no limit on the number of filenames, and that this was a change from an earlier version. Perhaps your problem can be solved by simply updating -filelist- with -adoupdate-.

    I suspect you *are* getting the "too many file names" report from -filelist-, and the "limited loop" you have can't solve that problem, as -filelist- is outside that loop. You will need to get the updated version of -filelist-.
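
    A minimal sketch of that check and update (assuming -filelist- was installed from SSC):

    Code:
    which filelist                // report the installed version
    adoupdate filelist, update    // update it if SSC has a newer version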



    • #3
      -filelist- only finds 10,000 files, which is strange given that I am using Stata 14 and the most recent version of -filelist-, whose help also says there is no limit.


      Code:
      . filelist , dir("$maindir/Historical Assets/Assets_csv") pattern(*.csv) norecur
      Number of files found = 10000



      • #4
        You might well be running into a limit related to the size of a local macro in your flavor of Stata and in v. 14; that is a guess I'm making based on checking the help file for -filelist-.
        Is there some systematic structure to the names of the files? If so, it should be possible to pick out batches of them to work on at a time, though not necessarily with -filelist-; see the sketch below.
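
        As a sketch of that batching idea (hypothetical: it assumes the file names begin with a known set of characters, that each batch of names fits within the macro length limit, and the folder layout from #1), Stata's extended macro function -dir- could pull one batch at a time:

        Code:
        cd "$maindir/Historical Assets/Assets_csv"
        foreach p in a b c d e f g h i j k l m n o p q r s t u v w x y z {
            // grab only the csv files whose names start with `p'
            local batch : dir . files "`p'*.csv"
            foreach f of local batch {
                insheet using "`f'", clear
                local name : subinstr local f ".csv" ""
                save "../Assets_dta/`name'.dta", replace
            }
        }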

        Also, there is at least one other user-written package (the name escapes me) for importing multiple csv files. You might look for that.



        • #5
          I note that filelist depends on the Mata built-in function dir() to retrieve filenames, and although that function's documentation does not mention any limitation on the number of files that will be returned, an undocumented limit there seems likely to be the source of the current problem. Perhaps Robert Picard, the author of filelist, will be dismayed to learn of this unexpected issue.
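
          A quick way to confirm that suspicion (a diagnostic sketch, using the directory from #1) is to ask dir() directly how many names it returns:

          Code:
          mata: rows(dir("$maindir/Historical Assets/Assets_csv", "files", "*.csv"))

          If this prints 10000 while the operating system reports 360,000 files, the limit is confirmed.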



          • #6
            I first wrote filelist in pure Stata, but then a user hit the macro-length limit, so I rewrote filelist in Mata. The current version of filelist can recursively scan a directory and return an unlimited number of files (it will happily scan a whole hard disk if you ask it to). Unfortunately, Mata's dir() function still has a hard-coded limit of 10,000 files returned from a single directory.
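
            Since the limit applies per directory, one workaround (a hedged sketch, assuming the files could be moved into subdirectories of fewer than 10,000 files each) is to let filelist scan recursively, i.e. drop the norecur option:

            Code:
            * hypothetical layout: Assets_csv/batch01, Assets_csv/batch02, ...
            filelist , dir("$maindir/Historical Assets/Assets_csv") pattern(*.csv)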

            This was noticed back in 2015 (with great chagrin), and Hua Peng (StataCorp) acknowledged here that -unicode translate- is also affected.

            I have posted a request for this limit to be lifted here.

            In retrospect, I should have revised the help file to mention this limitation. I'll do that ASAP.



            • #7
              The 10,000-file limitation is mentioned in several places. One might do something like the following, then append the dta files using a similar strategy (see the sketch after the code).
              Code:
              tempfile files
              
              // capture the directory listing in a log file
              log using "`files'" , text replace
              dir "*.csv"
              log close
              
              // read the log back in and keep only the csv file names
              infile str100 files using "`files'" , clear
              keep if strpos(files,".csv")
              compress
              
              putmata files = files , replace
              
              local resdir "D:\dta\"
              
              mata:
              
                  for (i=1; i<=rows(files); i++) {
                      // import each csv file and save it as dta
                      fn = subinstr(files[i],".csv","",1)
                      readcsv = "import delimited using " + files[i] + " , clear"
                      savedta = "save " + "`resdir'" + fn + ", replace"
                      stata(readcsv)
                      stata(savedta)
                  }
              
              end
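
              For the appending step mentioned above, a sketch in the same style (reusing the Mata files vector and the `resdir' local from the code above; the three-argument subinstr() replaces every occurrence, which is fine when ".csv" appears only as the extension):

              Code:
              clear
              
              mata:
              
                  for (i=1; i<=rows(files); i++) {
                      // append each saved dta file in turn
                      stata("append using " + "`resdir'" + subinstr(files[i],".csv",""))
                  }
              
              end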



              • #8
                (crossed in the ether with Bjarte's essentially similar approach.)

                Noting Bill's point about the limitations of Mata's dir(), here's a sketch of a hack that should work, given that you're on a Windows system. I don't have a good way to test it, but it should be close.

                Code:
                clear
                // Shell out to the Windows command line and get a list of CSV files
                // as observations in the Stata data editor.
                // Stata's -ls- can't send the file names to any saved location.
                tempfile temp
                !dir /b $maindir\*.csv > "`temp'"
                import delimited using "`temp'", delim("|") // assume no "|" in filenames
                //
                // Put this list of names into a Mata matrix, quick and dirty.
                putmata name = v1
                local nfiles = _N
                // Process each file whose name is in the Mata matrix
                clear
                forval i = 1/`nfiles' {
                    mata: st_local("fname", name[`i',1])
                    di "Working on `fname'"
                    insheet using "`fname'", clear
                    // do whatever.... with the current CSV file
                }
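
                To mirror the goal in #1, the "do whatever" step could simply be the convert-and-save from the original code (a sketch, assuming the Assets_dta target folder and working directory from #1):

                Code:
                local name : subinstr local fname ".csv" ""
                save "Assets_dta/`name'.dta", replace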



                • #9
                  An alternative, avoiding 300,000+ imports and saves of data, could be to concatenate the csv files before importing; after import, any extra rows of field headers can be deleted. Using the same example data (300,001 csv files, each holding the first record of the auto data set), the code below runs in 60 seconds, compared with the timings shown below for the strategy of posts #7/#8.
                  Code:
                  ********************************************************************************
                  tempfile files
                  
                  // concatenate all csv files into one temporary file
                  shell 2>NUL type *.csv > "`files'"
                  
                  import delim "`files'" , ///
                          clear ///
                          varnames(1) ///
                          stringcols(1 12) ///
                          numericcols(2(1)11)
                  
                  drop if mi(price) /* delete extra rows of field headers */
                  compress
                  ********************************************************************************
                  The code above ran in 60 seconds; the timings for the alternative were:
                  Code:
                     1:      2.05 /        1 =       2.0540  making list of files
                     2:   7769.01 /        1 =    7769.0140  import csv, saving dta
                     3:    477.37 /        1 =     477.3690  appending dta files
                  Code:
                  ********************************************************************************
                  set maxvar 120000
                  
                  local csvdir "E:\csv\"
                  local dtadir "E:\dta\"
                  
                  cd "`csvdir'"
                  
                  // list the csv files via the Windows shell (not subject to the Mata dir() limit)
                  tempfile files
                  shell dir /b *.csv > "`files'"
                  mata: files = subinstr(cat("`files'"),".csv","",.)
                  
                  cd "`dtadir'"
                  
                  mata: st_numscalar("nfiles", rows(files))
                  
                  // import each csv file and save it as dta
                  qui forvalues i = 1/`=nfiles' {
                  
                      mata: st_local("fn", files[`i'])
                      import delim "`csvdir'`fn'.csv"
                      save "`dtadir'`fn'", replace
                      clear
                  }
                  
                  // append all dta files; getmata works here because each file holds one record
                  mata: st_strscalar("fn", invtokens(files'))
                  append using `=fn' , gen(indexfilename)
                  getmata filename = files
                  scalar drop fn
                  ********************************************************************************
                  Last edited by Bjarte Aagnes; 18 May 2019, 10:21.

