  • Loop over different folders

    Hi,

    I have around 30000 files located in different folders. The name of each folder is the name of a country. I would like Stata to pick all the files contained in each of the 27 folders and append them into one dataset. All the files have "1_201" at the beginning of the name, so I have tried the following:

    cd "\\s-jrcsvqfs007p\Data_BigRepo\JRC.J.3\Estrella\itunes_ eu_us\time_series\data_eu"

    foreach x in at be bg cy cz de dk ee es fi fr gr hu ie it lt lu lv mt nl pl pt se si sk gb us {
    forv y=30828/41217{
    preserve
    insheet using "TopMovies\`x'\1_201`y'.txt", delimiter ("|") clear
    save tmp, replace
    restore
    append using tmp
    }
    }

    but I obtain the following error message:

    "file TopMovies`x'\1_20130828.txt not found"

    It seems I cannot use the name of the folder in the loop. Another problem is that the files in the different folders all share the same names (each name is actually a date), so I cannot put all the files into one single folder.
    Any idea?

    Thank you,
    Estrella Gomez

  • #2
    On the first problem: Don't use backslashes here, use forward slashes. This is documented several times over, most emphatically at http://www.stata-journal.com/sjpdf.h...iclenum=pr0042
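
    A minimal sketch of why the backslash fails, assuming only the folder code at from the original list: Stata reads a backslash placed directly before a left quote as an escape, so the macro `x' is never expanded.

    Code:
    local x at
    // backslash before the left quote: the macro is NOT expanded
    display "TopMovies\`x'\1_20130828.txt"
    // forward slashes: the macro is expanded as intended
    display "TopMovies/`x'/1_20130828.txt"

    The first display line reproduces the unexpanded name seen in the error message; the second shows the path with at substituted.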



    • #3
      See also filelist (from SSC). There's an example that shows a more efficient way to do this.
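
      A sketch of how that might look here, adapted loosely from the pattern in the filelist help file. It assumes filelist has been installed (ssc install filelist) and that the working directory is the one containing TopMovies; the names flist and source are made up for illustration.

      Code:
      // build a dataset with one row per matching file (variables dirname, filename, fsize)
      filelist, dir("TopMovies") pattern("1_201*.txt")
      tempfile flist
      save "`flist'"
      local obs = _N

      // read each file and park it in its own tempfile
      forvalues i = 1/`obs' {
          use "`flist'" in `i', clear
          local f = dirname + "/" + filename
          insheet using "`f'", delimiter("|") clear
          generate source = "`f'"
          tempfile save`i'
          save "`save`i''"
      }

      // stack everything in a single pass
      clear
      forvalues i = 1/`obs' {
          append using "`save`i''"
      }

      Because the list comes from what is actually on disk, the loop does not need to know the country codes or the dates in advance, and the source variable records which folder each row came from.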



      • #4
        After you fix the problem that Nick pointed out, you might want to think about streamlining your loop. You are doing a lot of -preserve- and -restore- work that isn't necessary for the task. The following loop will run faster and do the same thing:

        Code:
        save tmp, replace 
        foreach x in at be bg cy cz de dk ee es fi fr gr hu ie it lt lu lv mt nl pl pt se si sk gb us {
             forv y=30828/41217{
                 insheet using "TopMovies\`x'\1_201`y'.txt", delimiter ("|") clear
                 append using tmp
                 save tmp, replace 
             }
        }
        NOTE: If there will be no data in memory when you get to this part of the code, then the -save tmp, replace- command at the top can be omitted. If there may or may not be data in memory at this point, you should add the -emptyok- option to that top -save- command.

        Finally, you might (or might not) want to use a Stata tempfile rather than creating a permanent file with the name tmp.
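
        A minimal sketch of the tempfile and -emptyok- variant, assuming there is nothing in memory worth keeping at this point. The macro name building is hypothetical, and the loop is cut down to two folders and four dates purely for illustration; forward slashes are used in the path.

        Code:
        clear
        tempfile building
        // -emptyok- lets -save- write a dataset with no variables and no observations
        save "`building'", emptyok

        foreach x in at be {
            forvalues y = 30828/30831 {
                insheet using "TopMovies/`x'/1_201`y'.txt", delimiter("|") clear
                append using "`building'"
                save "`building'", replace
            }
        }

        The tempfile is erased automatically when the do-file or program finishes, so no tmp.dta is left behind.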




        • #5
          Excellent comments by Robert and Clyde, but naturally with the proviso that backward slashes in #4 should be forward slashes, which is where we came in.



          • #6
            In addition to Nick's very important point about not using backslashes in file paths, Clyde's solution is still quite inefficient, particularly if 30,000 files are in play, because it repeatedly saves a single dataset that grows larger at each pass. Assuming that each pass reads in just 1K of data, then after 30K saves the loop will have written about 430GB to disk when only 30MB are really needed. Here's a much more efficient approach using two loops (based on the example in filelist from SSC):

            Code:
            local i 0
            foreach x in at be bg cy cz de dk ee es fi fr gr hu ie it lt lu lv mt nl pl pt se si sk gb us {
                forv y=30828/41217{
                    insheet using "TopMovies/`x'/1_201`y'.txt", delimiter ("|") clear
                    local i = `i' + 1
                    tempfile save`i'
                    save "`save`i''"
                }
            }
            
            clear
            local obs `i'
            forvalues i=1/`obs' {
                append using "`save`i''"
            }
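
            The disk-write estimate above can be checked with a quick calculation, assuming save number k writes k KB (1 KB per file read so far) and 1 GB = 1024^2 KB:

            Code:
            // total written by the save-every-pass loop: 1 + 2 + ... + 30000 KB, shown in GB
            display %6.1f (30000 * 30001 / 2) / 1024^2
            // size of the final combined dataset: 30000 KB, shown in MB
            display %6.1f 30000 / 1024

            This gives roughly 429 GB against roughly 29 MB, in line with the figures quoted.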



            • #7
              Excellent point, Robert! I think you just saved Ms. Gomez a lot of time, and possibly a lot of headaches too.



              • #8
                Very useful comments! Thanks a lot. Now I have a new problem: the names of the files are actually dates, so when I run the code, I obtain this error:

                "file TopMovies/at/1_20130832.txt not found"

                There is a gap of 70 numbers from 30831 to 30901, since after 31/08/2013 next day is 01/09/2013. This happens every 30 files, so I cannot use

                local i=`i' + 1

                Is there any command to tell Stata to combine all the files in each of the folders into one, without specifying the names? Alternatively, a command to combine all files whose names start with "1_201" would also be useful...

                Thank you very much
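
                One built-in way to pick up files by pattern, offered as a sketch rather than a tested solution for this data: the extended macro function dir returns the names of the files in a folder that match a pattern. It assumes the working directory contains TopMovies and that every matching file has the same layout.

                Code:
                local i 0
                foreach x in at be bg cy cz de dk ee es fi fr gr hu ie it lt lu lv mt nl pl pt se si sk gb us {
                    // names of all files in this folder that start with 1_201
                    local files : dir "TopMovies/`x'" files "1_201*.txt"
                    foreach f of local files {
                        insheet using "TopMovies/`x'/`f'", delimiter("|") clear
                        local ++i
                        tempfile save`i'
                        save "`save`i''"
                    }
                }

                clear
                forvalues j = 1/`i' {
                    append using "`save`j''"
                }

                Only files that exist are listed, so gaps in the dates do not matter.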



                • #9
                  Estrella,

                  Perhaps you can create a macro containing the dates and use that within your loops. An example:

                  Code:
                  clear
                  set more off
                  
                  set obs 20
                  
                  // create variable with "usual" date format
                  gen date = _n
                  format %td date
                  
                  // your format
                  clonevar date2 = date
                  format %tdCC+YY+NN+DD date2
                  
                  // to string
                  generate textdate = string(date2, "%tdCC+YY+NN+DD")
                  
                  list
                  describe
                  
                  // create local macro
                  levelsof textdate, local(mydates)
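
                  The same idea can be applied directly in the loop by iterating over Stata daily dates instead of raw numbers. This is a sketch under assumptions: the start and end dates (28aug2013 and 17dec2014) are inferred from the original range 30828/41217, and any date without a file in a given folder is simply skipped.

                  Code:
                  local i 0
                  foreach x in at be bg cy cz de dk ee es fi fr gr hu ie it lt lu lv mt nl pl pt se si sk gb us {
                      forvalues d = `=td(28aug2013)'/`=td(17dec2014)' {
                          // turn the daily date into text such as 20130828
                          local ymd : display %tdCCYYNNDD `d'
                          local f "TopMovies/`x'/1_`ymd'.txt"
                          // skip dates for which this folder has no file
                          capture confirm file "`f'"
                          if _rc continue
                          insheet using "`f'", delimiter("|") clear
                          local ++i
                          tempfile save`i'
                          save "`save`i''"
                      }
                  }

                  clear
                  forvalues j = 1/`i' {
                      append using "`save`j''"
                  }

                  Stepping through real dates means the loop never asks for 20130832, and the counter i only advances when a file was actually read.
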



                  • #10
                    Ok, solved. Thank you very much!
