Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • creating variables based on file names for +12,000 files

    Apologies, I am not very good at coding.

    I have +12,000 files in which I need to add the file name (the file names include information about the names of investment funds and the quarters of the fund holdings) as a variable in each file and then append these files into one large file. I have a problem with the first part, saving the file names as names.

    The following is an example of one of the file names:
    Adrigo Hedge portfolio 2006-12-31.dta

    I split the files into two folders as they exceed 10,000 which I understand is the maximum when using 'foreach' to loop over files. I then tried the following code to to save the file names as a substrings (dropping '.dta')

    local allfiles : dir . files "*.dta"

    foreach file of local allfiles {
    preserve
    use `file'.dta, clear
    gen str fund_date = substr(`file', 1, strlen(file') - 4)
    save temp, replace
    restore
    append using temp, force
    }

    I do not receive any error message so I do not know what is wrong with the code. Can anybody please help me with this?

    Mari Paananen

  • #2
    Actually, I found your code quite good, despite some "grammatical" errors.

    Code:
    local allfiles : dir . files "*.dta"
    foreach file of local allfiles {
    preserve
    use `file', clear                                                  // .dta to be cut out
    gen str fund_date = substr("`file'", 1, strlen("`file'") - 4)      // "" to be added
    save temp, replace
    restore
    append using temp, force
    }
    Besides, I suggest to add this part before starting your code.
    Code:
    clear
    capture erase temp.dta
    Last edited by Romalpa Akzo; 15 Nov 2017, 05:34.

    Comment


    • #3
      Thank you so much Romalpa, I added the citations but it is not working and still no error message.

      local allfiles : dir . files "*.dta"
      clear
      capture erase temp.dta
      foreach file of local allfiles {
      preserve
      use `file'.dta, clear
      gen str fund_date = substr("`file'", 1, strlen("`file'") - 4)
      save temp, replace
      restore
      append using temp, force
      }

      Comment


      • #4
        Hi Mari,

        you provide little information about the source data layout, so it is hard to guess what the problem's solution could be. I assume you will have to change the "." in the very first line of your code to the name of the (sub-) directory holding the dataset files. Also, I never heard of a limit of 10,000 iterations in a foreach loop (which doesn't mean it does not exist); where did you get this information from? I don't see it in Stata's documentation.

        Anyways, I think your code should generally work, but is not the most efficient way to solve the task. I made up a small example to illustrate a major surplus in efficiency: You can append all the files one after each other without saving the result dataset to disk. With lots of files (as is the case in your scenario) this should be much quicker.

        How about this:
        Code:
        // make up example data
        clear
        input contentvar1 contentvar2
        1 2
        3 4
        end
        save `"EXAMPLEFILE_Adrigo Hedge portfolio 2006-12-31.dta"' , replace
        clear
        input contentvar1 contentvar2
        5 6
        7 8
        end
        save `"EXAMPLEFILE_Adrigo Hedge portfolio 2007-01-01.dta"' , replace
        // end of example data
        
        // here comes the append-code
        tempvar appendsource
        local allfiles : dir . files "*.dta"
        clear
        generate filename=""
        foreach file of local allfiles {
            display `"adding file: {it:`file'}"'
            append using `"`file'"' , generate(`appendsource')
            quietly : replace filename=substr(`"`file'"', 1,strlen(`"`file'"')-4) if (`appendsource'==1)
            drop `appendsource'
            display `"{tab}...done"'
        }
        
        // add-on: extract date variable from file name
        generate fund_date=date(substr(filename,-10,.),"YMD")
        format fund_date %td
        I think the main issue still is the same: The code looks for all files with file extension .dta in the current working directory ("." in your "local allfiles" statement). I assume that is not the directory where your load of files is saved, thus the local allfiles is empty, and the foreach loop is a loop over nothing, so nothing happens.

        Kind regards
        Bela

        Comment


        • #5
          I think that the best tools for this type of task are filelist and runby, both from SSC.

          To illustrate, the following code creates 740 datasets with filenames that follow the example in #1, split between two subdirectories:
          Code:
          clear all
          sysuse auto
          keep make price foreign
          expand 10
          bysort make: gen date = dofq(yq(2000, 1) + _n)
          format %tdCCYY-NN-DD date
          gen fname = make + " " + string(date,"%tdCCYY-NN-DD")
          decode foreign, gen(dirname)
          gen relpath = dirname + "/" + fname
          isid relpath
          
          cap mkdir Domestic
          cap mkdir Foreign
          
          program makedata
            local f = relpath
            keep price
            save "`f'.dta", replace
          end
          runby makedata, by(relpath)
          Now that we have a data setup to work with, here's how to combine all files at once and generate variables for both the fund name and fund date, as extracted from the file name. The first part creates a dataset of files to read and append. Because filelist finds files in subdirectories, all you need is to fine tune the list. In this example, if there are no other datasets in the current directory, it's sufficient to prune down the list to files that contain ".dta". With the list of files to operate on in memory, the next lines create fdta, the relative path to each dataset. The fund's date and name are also extracted. I'm using simple string functions based on the observation that the date has a fixed length. If that's not the case, regular expressions would be a better tool.

          The program datacombo is called for each filename in the list. The first 3 lines copy the relevant information into locals. The next 3 lines load the dataset and creates the two desired variables. With runby, anything that is left in memory when the program terminates is considered results. These are collected (appended) and left in memory when runby terminates.

          Code:
          clear all
          
          filelist
          keep if strpos(filename,".dta")
          tab dirname
          gen fdta = dirname + "/" + filename
          gen funddate  = substr(filename, strlen(filename)- 13, 10)
          gen fundname = substr(filename, 1, strlen(filename)-15)
          
          program datacombo
            local f = fdta
            local fdate = funddate
            local fname = fundname
            use "`f'", clear
            gen fund_date = "`fdate'"
            gen fund_name = "`fname'"
          end
          runby datacombo, by(filename) verbose
          Last edited by Robert Picard; 15 Nov 2017, 10:02. Reason: switched the code to extract fund's name and date out of the datacombo program

          Comment


          • #6
            1. Thanks Daniel and Robert for providing out brilliant sample of effective solutions. I do learn a lot from those code.

            2. Anyhow, Mari, I still believe your code is working and not bad at all. It seems you still keep the mistake at
            Code:
            use `file'.dta, clear
            It should be
            Code:
            use `file', clear
            3. Another issue: this part
            Code:
            clear
            capture erase temp.dta
            should be at the very beginning,

            4. Then the full code, which I have tested and everything seems OK is:

            Code:
            clear
            capture erase temp.dta
            local allfiles : dir . files "*.dta"
            foreach file of local allfiles {
            preserve
            use `file', clear
            gen str fund_date = substr("`file'", 1, strlen("`file'") - 4)
            save temp, replace
            restore
            append using temp, force
            }
            Just give it one more try.
            Last edited by Romalpa Akzo; 15 Nov 2017, 19:55.

            Comment


            • #7
              Mari, one more thing. As noticed by Daniel Bela at #4, you should be sure you are working on the right folder. Good luck.
              Last edited by Romalpa Akzo; 15 Nov 2017, 22:30.

              Comment


              • #8
                Thank you so much for the help and suggestions. I was indeed working with the wrong directory. Also, the 10,000 limit on number of files that can be handled using ‘foreach’ comes from a Stata error message.

                Anyway, I used Romalpa’s code as follows:

                tempvar appendsource
                local allfiles : dir . files "*.dta"
                clear
                generate filename=""
                foreach file of local allfiles {
                display `"adding file: {it:`file'}"'
                cap noi append using `"`file'"' , generate(`appendsource') force
                quietly : replace filename=substr(`"`file'"', 1,strlen(`"`file'"')-15) if (`appendsource'==1)
                drop `appendsource'
                display `"{tab}...done"'
                }

                However, many files were omitted from the appending process and I cannot figure out why.

                Here is an example of a successfully appended file named Adrigo Hedge portfolio 2017-03-31.dta:

                [CODE]
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str20 Company long SumNumofShares double(ValueMSEK Verified)
                "Scandic Hotels Group" 387921 36.56155425 20909.083333333336
                "AAK" 35653 21.03527 20909.083333333336


                Here is a file not appended named nordea swedish stars portfolio 2017-04-14.dta:

                [CODE]
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str13 Company long SumNumofShares double(ValueMSEK Verified)
                "Atlas Copco" 3082435 961.5028965 20819.041666666664
                "Volvo" 7151532 906.8142576 20878.041666666664


                Have anybody encountered this before and know why one file of these was appended and the other not?

                Many thanks!

                Mari

                Comment

                Working...
                X