Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing CSV files from multiple folders and saving to a single folder

    I have several folders containing CSV files that I want to import into a single folder named "CSV Import". The folder structure I have looks like this: the main folder "C:/Main" has multiple subfolders, each in turn containing one or more subfolders with CSV files.

    Currently, the code saves each imported CSV file as a ".dta" file in the same folder where the original CSV file resides. How can I modify the code to save all the ".dta" files in the "C:/CSV Import" folder instead?

    Code:
    clear
    
    local save_dir "C:/CSV Import"
    filelist, dir("C:/Main") pattern("*.csv") save("csv_filelist.dta") replace
    
    use "C:/Main/csv_filelist.dta", clear
    gen filepath = dirname + "/" + filename
    levelsof filepath, local(csv_file)
    
    // Importing all csv files and saving as .dta
    foreach x in `csv_file' {
        import delimited using "`x'", clear
        local dta_file = subinstr("`x'", ".csv", ".dta", .)
        save "`dta_file'", replace
    }

  • #2
    How can I modify the code to save all the ".dta" files in the "C:/CSV Import" folder instead?
    I will assume that the way you wish to combine these into a single large file is by -append-ing them together. You don't say that, but that is the most common situation. If you want to combine them by -merge-ing or a combination of -merge- and -append- then it is more complicated and you will need to provide much more detail about the files themselves and how they fit together with each other.

    Code:
    clear
    
    tempfile building
    save `building', emptok
    
    local save_dir "C:/CSV Import"
    filelist, dir("C:/Main") pattern("*.csv") save("csv_filelist.dta") replace
    
    use "C:/Main/csv_filelist.dta", clear
    gen filepath = dirname + "/" + filename
    levelsof filepath, local(csv_file)
    
    // Importing all csv files and saving as .dta
    foreach x in `csv_file' {
        import delimited using "`x'", clear
        local dta_file = subinstr("`x'", ".csv", ".dta", .)
        save "`dta_file'", replace
       append using `building'
        save `"`building'"', replace
    }
    
    save "C:/CSV Import/CSV Import.dta", replace
    That said, I recommend against doing this in the first place. With a large number of files, it is highly unlikely that you can successfully append them together without further preparation. The probability is high that there will be incompatibilities among them: a variable that is a string in some files but numeric in others, a variable whose name is misspelled in some of the files, variables that are value labeled, but with different label-value correspondences in different files. This is true even if the source data sets are a series curated by a highly reputable purveyor. If you are lucky, Stata will detect the incompatibilities, give you an error message, and refuse to proceed. If you are unlucky, Stata will pass it all through, and at the end you have not so much a data file as data salad.

    My advice is to stick with the original plan of saving the separate data files. Then use the -precombine- command, by Mark Chatfield, available from SSC, to identify all of the incompatibilities that exist among them. Then fix up the incompatibilities -precombine- tells you about in the separate files. Finally, you can just append all the cleaned files together with:
    Code:
    clear
    filelist, dir("C:/Main") pattern("*.dta")
    gen filepath = dirname + "/" + filename
    levelsof filepath, local(dta_files)
    
    clear
    append using `dta_files'
    
    save "C:/CSV Import/CSV Import.dta", replace
    If storage capacity is at a premium, you can always delete the individual .dta files once the combining is complete.

    Added: The Forum FAQ, which all are supposed to read before posting, requests that when user-written programs are mentioned, an indication of where they can be found should be included in the post. Since this was not done in #1, I'll just note here that -filelist- is written by Robert Picard and is available from SSC.
    Last edited by Clyde Schechter; 11 Apr 2023, 15:02.

    Comment


    • #3
      Thank you, Clyde! Yes, the CSV files are not all identical. I am going to use your suggestion on using the -precombine- command to detect differences in variables before appending them. Is there a way to output all individual .dta files to the same folder? In my case, as shown in #1, to the folder "C:/CSV Import".

      Also, I apologize for not mentioning about the user-written program - the -filelist- is indeed from Robert Picard.

      Comment


      • #4
        Code:
        clear
        filelist, dir("C:/Main") pattern("*.csv")
        
        capture program drop one_csv_file
        program define one_csv_file
            local directory = dirname[1]
            local filename = filename[1]
            import delimited using `"`dirname'/`filename'"', clear
            local filename: subinstr local filename ".csv" ".dta"
            save "C:/CSV Import/`filename'", replace
            clear
            exit
        end
        
        runby one_csv_file, by(dirname filename)
        will read in all of the CSV files and save them as separate .dta files, all in the C:/CSV Import directory.

        -runby- is written by Robert Picard and me. It is available from SSC.

        Added: Caution--in order for this to work as intended, there must not be two (or more) CSV files that have the same filename in different directories. If that situation arises, each of them will clobber the preceding one with the -save- command, and only the last one in the -filelist- results will survive. A simple way to detect this problem is to put -isid filename- in immediately following the -filelist- command. That way, Stata will halt with an error message if it encounters this situation, and, if not, will just proceed.
        Last edited by Clyde Schechter; 11 Apr 2023, 16:53.

        Comment


        • #5
          As you had warned, I notice that there are multiple files with the same filename in different directories. I get the following message when I run the above code:

          Number of by-groups = 128
          by-groups with errors = 128
          by-groups with no data = 0
          Observations processed = 128
          Observations saved = 0

          Please suggest any workarounds for this problem. Thank you for your time.


          Comment


          • #6
            Oh, I see my mistake. Sorry, it should be:
            Code:
            clear
            filelist, dir("C:/Main") pattern("*.csv")
            
            capture program drop one_csv_file
            program define one_csv_file
                local directory = dirname[1]
                local filename = filename[1]
                import delimited using `"`directory'/`filename'"', clear
                local filename: subinstr local filename ".csv" ".dta"
                save "C:/CSV Import/`filename'", replace
                clear
                exit
            end
            
            runby one_csv_file, by(dirname filename)

            Comment


            • #7
              That works perfectly. Thank you, Clyde!

              Comment

              Working...
              X