
  • #16
    Try

    Code:
    filelist, pat("*.TXT")



    • #17
      Wow, Robert, thank you so much! It seems that there is a limit of 10,000 files, though, because I have 10,903 txt files in this directory.

      Code:
      . filelist, pat("*.TXT")
      Number of files found = 10000



      • #18
        If you have more than 10,000 files that match the pattern in the same directory, filelist will return only 10,000. This is apparently a hard-coded limit (not documented anywhere) of the Mata function dir(). See this post that mentions the limit. A workaround is to do it in parts using a more specific pattern, e.g.

        Code:
        filelist, pattern("1*.TXT")
        filelist, pattern("2*.TXT")



        • #19
          Robert, thank you for filelist. Stata commands like use and dir are not case sensitive. Would it be possible to add an option to filelist that allows users to find files regardless of whether the filenames are uppercase or lowercase?



          • #20
            Hello all. Because of this 10,000-file limit, I had to do as below (all my file names start with 1, 2, 3, 4 or 5):

            Code:
            foreach k in 1 2 3 4 5 {
            
            filelist, pat("`k'*.TXT") save("name_files_`k'.dta") replace
            use "name_files_`k'.dta", clear
            local obs = _N
            forvalues i=1/`obs' {
                use "name_files_`k'.dta" in `i', clear
                local f = filename
                clear
                infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
                gen source = "`f'"
                save "mydata`k'_`i'.dta", replace
            }
            }

            But now the append code must be changed, because the observation count restarts for every `k' in the outer foreach loop. What should replace `obs' so that all 10,903 files are appended, given that they are not equally divided among the five values of k? Below is the old code:

            Code:
            clear
            forvalues i=1/`obs' {
                append using "mydata`i'.dta"
            }
            save "mydatacombo.dta", replace



            • #21
              I would think that the following would work:

              Code:
              clear
              save "my_txt_files.dta", emptyok replace
              foreach k in 1 2 3 4 5 {
                  filelist, pat("`k'*.TXT")
                  append using "my_txt_files.dta"
                  save "my_txt_files.dta", replace
              }
              
              use "my_txt_files.dta", clear
              local obs = _N
              forvalues i=1/`obs' {
                  use "my_txt_files.dta" in `i', clear
                  local f = filename
                  clear
                  infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 ///
                      str lati 322-336 str longe 337-351 ///
                      tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 ///
                      using "`f'"
                  gen source = "`f'"
                  save "mydata_`i'.dta", replace
              }
              
              clear
              forvalues i=1/`obs' {
                  append using "mydata_`i'.dta"
              }
              save "mydatacombo.dta", replace



              • #22
                Add k to the obs counter.
                Code:
                foreach k in 1 2 3 4 5 {
                  filelist, pat("`k'*.TXT") save("name_files_`k'.dta") replace
                  use "name_files_`k'.dta", clear
                  local obs`k' = _N
                  forvalues i=1/`obs`k'' {
                    use "name_files_`k'.dta" in `i', clear
                    local f = filename
                    clear
                    infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                    tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
                    gen source = "`f'"
                    save "mydata`k'_`i'.dta", replace
                  }
                }
                Then add k to the append loop.
                Code:
                clear
                foreach k in 1 2 3 4 5 {
                  forvalues i=1/`obs`k'' {
                    append using "mydata`k'_`i'.dta"
                  }
                }
                save "mydatacombo.dta", replace



                • #23
                  Friedrich Huebler,

                  The pattern() option in filelist is the equivalent of using the strmatch() function and of course text matching is case sensitive. The great thing about filelist is that you can leverage all the power of Stata's data management features since it creates a dataset of files. Just use it without any pattern and do the matching yourself. Something like:

                  Code:
                  filelist
                  keep if regexm(lower(filename),"\.txt$")
                  The only exception is the limit encountered above when there are more than 10,000 files in a single directory. I show in #21 one way to circumvent the limitation.
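
                  Since pattern() behaves like strmatch(), the same case-insensitive selection can also be written with strmatch() on an upper-cased copy of the file name. This is only a sketch of the idea, assuming the files of interest end in .txt in any mix of upper and lower case:

                  Code:
                  filelist
                  * upper() removes case differences; "*.TXT" is an ordinary strmatch() wildcard pattern
                  keep if strmatch(upper(filename), "*.TXT")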



                  • #24
                    Thank you again!!

                    Finally I ran this shorter code below and it worked!

                    Code:
                    foreach k in 1 2 3 4 5 {
                    
                    clear
                    tempfile temp
                    save `temp', emptyok
                    
                    filelist, pat("`k'*.TXT") save("name_files_`k'.dta") replace
                    use "name_files_`k'.dta", clear
                    local obs = _N
                    forvalues i=1/`obs' {
                        use "name_files_`k'.dta" in `i', clear
                        local f = filename
                        clear
                        infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                        tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
                        gen source = "`f'"
                        append using `temp'
                        display "`k' & `i'"
                        save `"`temp'"', replace
                    }
                    use `temp', clear
                    save "mydata_`k'.dta", replace
                    }



                    • #25
                      Paula de Souza Leao Spinola you don't need to have the files listed anywhere a priori (http://www.ats.ucla.edu/stat/stata/f...many_files.htm). This assumes all the files have the exact same specifications.

                      Code:
                      // Move to the directory where the files are located
                      cd C:\Users\Paula\Pesquisa\DADOS\dados_georreferenciados\CEPs_Censo_2010\dados_bruto
                      
                      // List the files in the directory and pipe them into a text file
                      ! DIR *.txt /a-d /b > C:\Users\Paula\Pesquisa\DADOS\dados_georreferenciados\CEPs_Censo_2010\dados_bruto\files.txt
                      
                      // Open a connection to the file
                      file open flist using C:\Users\Paula\Pesquisa\DADOS\dados_georreferenciados\CEPs_Censo_2010\dados_bruto\files.txt, r
                      
                      // Read the first line of the file
                      file read flist fname
                      
                      // Load the data and clear any data currently in memory
                      infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///  
                      tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using `"`fname'"', clear
                      
                      // Reserve namespace for a temp file
                      tempfile mytempfile
                      
                      // Save the data temporarily
                      qui: save `mytempfile'.dta, replace
                      
                      // Read the second line; r(eof) becomes 1 once a read goes past the end of the file
                      file read flist fname
                      
                      // Loop over the remaining lines of the file
                      while r(eof) == 0 {
                      
                          // Load the data and clear any data currently in memory
                          infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                          tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using `"`fname'"', clear
                      
                          // Append the data accumulated so far
                          append using `mytempfile'.dta
                      
                          // Save over the temp file
                          qui: save `mytempfile'.dta, replace
                      
                          // Read the next line, immediately before r(eof) is tested again
                          file read flist fname
                      
                      } // End WHILE Block
                      
                      // Close the open file connection
                      file close flist
                      
                      // save the data permanently
                      save finalfile.dta, replace



                      • #26
                        wbuchanan

                        The UCLA FAQ is a terrible example of how to combine a large number of files. Aside from the fact that direct file I/O (using the file command) is way above the skill set of many Stata users and that it requires shelling out, it continues to sell the input/append/save method which is terribly inefficient. Consider the following scenario where you are trying to append 10,000 files, each at 500K:

                        Code:
                        . clear
                        
                        . set obs 10000
                        number of observations (_N) was 0, now 10,000
                        
                        . gen fsize = 500000
                        
                        .
                        . * the size of the append dataset each time it is saved
                        . gen double appendsize = sum(fsize)
                        
                        .
                        . * the total number of bytes written
                        . gen double totalbytes = sum(appendsize)
                        
                        .
                        . dis %21.0fc totalbytes[_N]
                           25,002,500,000,000
                        That's 25 terabytes of I/O to create an appended dataset of 5GB!
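
                        The figure also follows from a closed form. With n files of s bytes each, the i-th save writes i·s bytes, so the total written is

                        $$\sum_{i=1}^{n} i\,s \;=\; s\,\frac{n(n+1)}{2} \;=\; 500{,}000 \times \frac{10{,}000 \times 10{,}001}{2} \;=\; 25{,}002{,}500{,}000{,}000 \text{ bytes}.$$

                        By contrast, if each input is saved once as its own dataset and the combined data is saved once at the end, the writes total roughly 2ns = 10 GB. That rough figure assumes each Stata dataset is about as large as the file it was read from.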



                        • #27
                          Because of the back and forth on the 10,000 file limit and the confusion about case sensitive matching, it may appear that the task of appending a lot of files is more complicated than it is. Here's a summary of how I would handle the task:

                          First, make a list of files to append. This is easy to do with filelist (from SSC).

                          Code:
                          filelist, pat("*.TXT")
                          local obs = _N
                          save "myfiles.dta", replace
                          Next, use a loop to input each text file into Stata datasets. It may appear clumsy to reload the dataset of files ("myfiles.dta") at each pass but the data in memory needs to be cleared anyways to input the next text file so this adds very little overhead:

                          Code:
                          forvalues i=1/`obs' {
                              use "myfiles.dta" in `i', clear
                              local f = filename
                              insheet using "`f'", clear
                              gen source = "`f'"
                              save "mydata_`i'.dta", replace
                          }
                          Finally, use another loop to append the Stata datasets:

                          Code:
                          clear
                          forvalues i=1/`obs' {
                              append using "mydata_`i'.dta"
                          }



                          • #28
                            How does the code in post #27 work with the 10,903 files that Paula has? The command
                            Code:
                            filelist, pat("*.TXT")
                            reads at most 10,000 files.



                            • #29
                              It doesn't; I was just summarizing the general steps of appending lots of files. Another approach to deal with the 10,000 file limit per directory is to split the files into sub-directories, each with fewer than 10,000 files. Since filelist will search for files recursively throughout sub-directories of the current directory, all that is needed is something like:

                              Code:
                              * make a list of files, include all sub-directories
                              filelist, pat("*.TXT")
                              local obs = _N
                              save "myfiles.dta", replace
                              
                              * input each file and save Stata datasets
                              forvalues i=1/`obs' {
                                  use "myfiles.dta" in `i', clear
                                  local f = dirname + "/" + filename
                                  insheet using "`f'", clear
                                  gen source = "`f'"
                                  save "mydata_`i'.dta", replace
                              }
                              
                              * Append!
                              clear
                              forvalues i=1/`obs' {
                                  append using "mydata_`i'.dta"
                              }
                              save "mydatacombo.dta", replace
                              Again, it's best to use Stata's data management features to first build the list of files to input (via creating a dataset of files to target).
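
                              Before inputting anything, the dataset of files can also be used to check that no single directory ran into the limit. A minimal sketch, assuming that exactly 10,000 matches in one directory means the list was cut off there (nfiles is just an illustrative variable name):

                              Code:
                              filelist, pat("*.TXT")
                              * count the matching files found in each directory
                              bysort dirname: gen long nfiles = _N
                              * stop with an error if any directory reached the 10,000 mark
                              assert nfiles < 10000
                              drop nfiles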

