
  • handling 10,000 files named with 11-digit numbers

    Hello. I have over 10,000 zipped .txt files which I would like to unzip, convert to Stata format, and append into a single .dta dataset. The problem is that the zip files are named with 11-digit numbers, with big gaps between them, so using forvalues combined with capture confirm file takes forever. Stata has already been running for 20 hours and it hasn't even unzipped half of the files. I hope there is a more efficient way of doing this.

    Actually, is it possible to work with zipped files in Stata without having to unzip them?

    Below is the code which is taking forever:

    Code:
    forvalues i = 11000000000/54000000000 {
        capture confirm file "`i'.zip"
        if _rc == 0 {
            unzipfile `i'.zip, replace
        }
    }
    
    clear
    tempfile temp
    save `temp', emptyok
    
    forvalues i = 11000000000/54000000000 {
        capture confirm file "`i'.dta"
        if _rc == 0 {
            use `i'.dta, clear
    
            infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using 11000150500.txt
    
            append using `temp'
            display "`i'"
            save `"`temp'"', replace
        }
    }
    
    save all.dta, replace

  • #2
    Dear Paula,
    I think the main problem with your code is the append command; that is likely where most of the time is spent, especially the farther you get in the process.
    Each time you append a file, Stata has to load the full contents of the accumulated file, and your `temp' file grows on every pass, so each iteration takes longer than the last. My recommendation would be to do this in parts: say, go from file 11000000000 to 22000000000, then 22000000001 to 33000000000, or any other partition you have in mind, and then append the resulting larger files at the end. This should speed up the merging process.
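    A minimal sketch of that partitioned idea, assuming each number's data has already been converted to `i'.dta as in the original post; the range boundaries and the names part_1.dta, part_2.dta are only placeholders:

    Code:
    * accumulate one partition into its own (smaller) file
    clear
    tempfile part
    save `part', emptyok
    forvalues i = 11000000000/22000000000 {
        capture confirm file "`i'.dta"
        if _rc == 0 {
            use "`i'.dta", clear
            append using `part'
            save `part', replace
        }
    }
    use `part', clear
    save "part_1.dta", replace
    
    * ... repeat for the remaining ranges, saving part_2.dta, part_3.dta, ...
    
    * final pass: combine the handful of partition files
    clear
    append using "part_1.dta"
    append using "part_2.dta"
    save "all.dta", replace
    This only addresses the cost of the repeated append; the replies below deal with avoiding the huge forvalues scan itself.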
    Also, assuming you have enough space on your computer, perhaps you could unzip all the files yourself instead of having Stata do it.
    Hope this helps.
    Fernando



    • #3
      Your loop has to go through 43 billion possibilities to find 10,000 files. Try filelist (from SSC). Follow the example at the end of the help file. Heed Fernando's word of caution on how to append the data (again, follow the model in the filelist example).
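      If filelist is not installed yet, getting it and opening that help file is just:

      Code:
      * one-time install from SSC, then open the help file, which ends with
      * a worked example of reading and appending many files
      ssc install filelist
      help filelist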



      • #4
        Originally posted by FernandoRios View Post
        I think the main problem with your code is the append command; that is likely where most of the time is spent, especially the farther you get in the process.
        Paula wrote that unzipping takes a long time. Her code hasn't gotten to the append part.

        Paula, your loop spends almost all of its time checking for files that don't exist. Try filelist from SSC instead.
        Code:
        ssc d filelist
        Code:
        filelist, pat("*.zip")
        levelsof filename, local(files)
        foreach f of local files {
          unzipfile "`f'", replace
        }



        • #5
          Thank you all very much!
          In the end, I performed the unzip part manually and it was very quick.
          I will definitely use the filelist command in the append section! What a great command!

          I have just installed filelist in Stata and have written my new code as follows:

          Code:
          clear
          tempfile temp
          save `temp', emptyok
          
          filelist, pat("*.txt")
          levelsof filename, local(files)
          
          foreach f of local files {
          
              capture confirm file "`f'"
              if _rc == 0 {
          
                  clear
                  infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                      tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
          
                  append using `temp'
                  display "`f'"
                  save `"`temp'"', replace
          
              }
          }
          
          save all.dta, replace


          Code:
          . clear
          
          . tempfile temp
          
          . save `temp', emptyok
          (note: dataset contains 0 observations)
          file C:\Users\Paula\AppData\Local\Temp\ST_0j000001.tmp saved
          
          . 
          . filelist, pat("*.txt")
          Number of files found = 0
          
          . levelsof filename, local(files)
          no observations
          r(2000);
          
          end of do-file
          
          r(2000);
          However, I don't know why Stata does not recognize my .txt files. I have 10,903 .txt files in the directory in which I am currently working.



          • #6
            Originally posted by Paula de Souza Leao Spinola View Post
            However, I don't know why Stata does not recognize my .txt files. I have 10,903 .txt files in the directory in which I am currently working.
            Are you looking in the right directory? filelist has a directory() option. See also help cd.
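            For example (the path below is only a placeholder for wherever the .txt files actually live):

            Code:
            * point filelist at the folder explicitly ...
            filelist, dir("C:\path\to\txt_files") pat("*.txt")
            
            * ... or change Stata's working directory first and use the default
            cd "C:\path\to\txt_files"
            filelist, pat("*.txt")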



            • #7
              You are not following the example in the help file for filelist, in particular with respect to how to append the data. Here's a simple port that matches your needs:

              Code:
              filelist, dir(".") pat("*.txt") save("files_to_insheet.dta") replace
              use "files_to_insheet.dta", clear
              local obs = _N
              forvalues i=1/`obs' {
                  use "files_to_insheet.dta" in `i', clear
                  local f = filename
                  clear
                  infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                  tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
                  gen source = "`f'"
                  save "mydata`i'.dta", replace
              }
              
              clear
              forvalues i=1/`obs' {
                  append using "mydata`i'.dta"
              }
              save "mydatacombo.dta", replace
              With respect to why filelist is not finding your files, make sure that Stata's current directory is set to the directory that contains your text files. Type

              Code:
              pwd
              to display the current directory.



              • #8
                It might be easier to -ls- the file names into a text file and then read the file names from that file in a loop. This would avoid any potential limit on the amount of text that can be stored in a single local macro, and it also has the advantage of providing a single file describing the archive contents if the data ever need to be relocated or sent to anyone else.
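                A rough sketch of that idea on Windows; the list file name filenames.lst is made up, and shell dir /b just writes the bare file names, one per line:

                Code:
                * write the bare .txt file names to a plain text file (Windows shell)
                shell dir /b *.txt > filenames.lst
                
                * read the list back one line at a time and process each file
                file open flist using "filenames.lst", read text
                file read flist fname
                while r(eof) == 0 {
                    display `"processing `fname'"'
                    * e.g. infix ... using `"`fname'"', then save, as in the posts above
                    file read flist fname
                }
                file close flist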



                • #9
                  I don't know what happened. I definitely have .txt files in my directory. The commands below show that when I run infix on one specific .txt file, Stata recognizes it, but filelist does not find any .txt file.


                  Code:
                  . clear
                  
                  . 
                  . infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
                  > tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using 11000150500.txt
                  (9409 observations read)
                  
                  . 
                  . filelist, pat("*.txt")
                  Number of files found = 0
                  
                  . 
                  end of do-file
                  
                  .



                  • #10
                    wbuchanan, I don't have all the file names listed anywhere; they only exist as file names in a single directory. Is it possible to have Stata list them for me?

                    Actually, is there an easier way of making a loop in Stata do the same thing to all the .txt files in a specific directory?
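                    For what it's worth, a minimal sketch of a built-in way to do that, using the dir extended macro function (bearing in mind wbuchanan's caveat in #8 about how much fits in a single local when the list is very long):

                    Code:
                    * collect the .txt file names in the current directory into a local macro
                    local txtfiles : dir "." files "*.txt"
                    
                    * loop over them and process each one
                    foreach f of local txtfiles {
                        display `"found `f'"'
                        * e.g. infix ... using `"`f'"', then append/save as needed
                    }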



                    • #11
                      You would get what's described in #9 if the file extension has one or more extra spaces (e.g. "11000150500.txt "). Is that the problem?



                      • #12
                        Hi Robert. Actually, the extension is not displayed, but they are all txt files. When I look at the file properties, under type of file I see: Documento de Texto (.TXT), i.e. Text Document.

                        I have just tried the commands below to see if this could be the problem, but it doesn't seem to be. Could it be a problem of having too many files? In this same directory I have 10,903 .txt files and 10,903 .zip files with the same names except for the extension.

                        Code:
                        . filelist, pat("*.txt")
                        Number of files found = 0
                        
                        . filelist, pat("*.txt ")
                        Number of files found = 0
                        
                        . filelist, pat("*.txt  ")
                        Number of files found = 0



                        • #13
                          What do you get when you enter the commands below?
                          Code:
                          pwd
                          dir *.txt
                          dir



                          • #14
                            Code:
                            . pwd
                            C:\Users\Paula\Pesquisa\DADOS\dados_georreferenciados\CEPs_Censo_2010\dados_bruto
                             I manually wrote "..." to show there are a lot more lines above those which I copied.

                            Code:
                             
                             dir *.txt
                            ....
                             8352.4k  12/06/11 17:13  52146060500.TXT
                              951.0k  12/06/11 17:13  52146061000.TXT
                              959.2k  12/06/11 17:13  52146061500.TXT
                              898.0k  12/06/11 17:13  52146062500.TXT
                            ....
                               21.0M  12/06/11 17:05  53001080525.TXT
                               18.9M  12/06/11 17:05  53001080530.TXT

                            Code:
                             
                             dir
                            ....
                             1083.4k  12/06/11 17:14  52214520500.TXT
                               27.5k  11/03/15 15:34  52214520500.zip
                             1450.9k  12/06/11 17:14  52215020500.TXT
                               36.2k  11/03/15 15:34  52215020500.zip
                            ....
                               18.9M  12/06/11 17:05  53001080530.TXT
                              342.6k  11/03/15 15:33  53001080530.zip
                               <dir>  11/18/15 13:16  dtas_consolidados



                            • #15
                              It seems you have to use
                              Code:
                              filelist, pat("*.TXT")
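                              If the directory mixes upper- and lower-case extensions, another option is to run filelist without a pattern and filter the result afterwards (filelist stores the names in the variable filename):

                              Code:
                              * list everything, then keep names ending in .txt regardless of case
                              filelist
                              keep if lower(substr(filename, -4, .)) == ".txt"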

