Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • import multiple csv/xls with INVALID FILE NAMES

    Hello,
    I am using a loop to import multiple dataset with different extensions and append them. The problem is that some of these files have invalid characters in their names (blanks, $ etc.).
    I am wondering if there is a way to overcome this, without:
    -affecting the original filename (so no !ren fileold filenew)
    -having to create a duplicate of the original file and doing the above
    -resorting to Python
    I know this might be easier this Python (or, at least, I only know how to do it with Python), but I would love to prove that Stata can make it.
    Kind Regards,
    G.

  • #2
    Off hand, I can't think of a way to handle this that doesn't involve renaming the files (directly or indirectly. Maybe somebody else will suggest one.

    But I am very curious why you want to do this. If the filenames are illegal, you are just going to repeatedly run into problems with any attempts to use them, and you will spend more of your time working around this than you will on actually managing your data and analyzing it. Why don't you want to just rename them once and for all, and move forward? Evidently, in doing that, you should keep an audit trail of what you've done so you can ultimately trace everything back to its source. But what reason do you have for wanting to retain these names?

    Finally, my experience over the decades is that the presence of any irregularity in a data source often presages finding more irregularities elsewhere. If the source of this data is producing invalid file names, do you really trust the data inside those files?

    Comment


    • #3
      I see your skepticism on possible irregularities, but here there is no link between invalid charachters in the title name and data quality. As for why I don't want to change the file names, I believe that, especially when working with thousands of different source files, it is good practice to have the original input files unchanged, to the extent to which this is possible of course (which it is with Python). However, I have good enough reasons (not least because I am a stata fan) to be willing to replicate the exercise using stata.

      Comment


      • #4
        I'm not sure which system you are working on, but blanks and "$" are not necessarily invalid characters in file names.

        In Windows, for example:
        Code:
        sysuse auto, clear
        local myfile `"my file name $$$.csv"'
        export delimited "`myfile'", replace
        clear
        import delimited using "`myfile'"
        Stata/MP 14.1 (64-bit x86-64)
        Revision 19 May 2016
        Win 8.1

        Comment


        • #5
          Are you putting double quotation marks around all the names, as in Carole's example?
          Steve Samuels
          Statistical Consulting
          [email protected]

          Stata 14.2

          Comment


          • #6
            Carole J. Wilson ,

            if you run the code you posted you may end up with an error, at least in Stata 13.0 (Revision 02 Jul 2013) / Windows:

            Code:
            . export delimited "`myfile'", replace
            invalid 'file' 
                             stata():  3598  Stata returned error
            export_delim_write_file():     -  function returned error
            export_delim_export_file():     -  function returned error
                             <istmt>:     -  function returned error
            r(3598);
            
            end of do-file
            
            r(3598);

            But you are right that the file names with dollar signs or spaces are valid file names in Windows.

            Note, however, that save command has no problem saving to the same file where export delimited fails.

            Giovanni Di Prete has mentioned that he is importing the files, and perhaps has encountered the same problem. Unfortunately Giovanni didn't follow the FAQ and didn't specify which version of Stata he is running, nor exactly the commands he typed and Stata's error messages. Had he done so, we could be more helpful, and decide, whether this is the known bug with export delimited command or some sort of a new issue.

            Best, Sergiy Radyakin

            Comment


            • #7
              Sergiy Radyakin, indeed, a bug was fixed in Stata 13 (update 15jan2014); it should run correctly in up-to-date 13 & 14:

              9. export delimited failed to respect double quotes embedded in string data. export delimited now
              correctly respects double quotes embedded in string data by exporting two double quotes for each
              real double quote.
              Stata/MP 14.1 (64-bit x86-64)
              Revision 19 May 2016
              Win 8.1

              Comment


              • #8
                Thanks Carole,
                Very useful, your code works fine. However if you try to change the name of the file into "my file name$co.csv" (which is similar to one of the instances that creates me problems), typing

                Code:
                sysuse auto, clear
                local myfile `"my file name$co.csv"'
                export delimited "`myfile'", replace
                clear
                import delimited using "`myfile'"
                will still work fine. However, if you check your working directory you will see that the file has been exported as "my file name.csv", insteaad of "my file name$co.csv". If you then change it manually to "my file name$co.csv" and type again

                Code:
                local myfile `"my file name$co.csv"'
                import delimited using "`myfile'"
                you will get an error "file my file name.csv not found". Thus, it seems to me that the problem lies in the fact that the local myfile register only the string until the $ symbol.
                Any idea of why this might happen?
                Many thanks
                G.

                Comment


                • #9
                  So, strange characters are not the problem here, but it's rather $. If the file is named "my file name┬úÔé¼-)(.__.csv", everything works fine

                  Code:
                  sysuse auto, clear
                  local myfile `"my file name┬úÔé¼-)(.__.csv"'
                  export delimited "`myfile'", replace
                  clear
                  import delimited using "`myfile'"
                  However, if you manually change the name into "my file name┬úÔ$é¼-)(.__.csv" and run

                  Code:
                  local myfile `"my file name┬úÔ$é¼-)(.__.csv"'
                  import delimited using "`myfile'"
                  will return the error "file my file name┬úÔ¼-)(.__.csv not found", meaning that $é are skipped when defining the local.

                  Comment


                  • #10
                    You definitely need to escape the $ (\$), but that doesn't solve the problem b/c each time it encounters the macro, Stata will attempt to expand the global macro $co. Perhaps someone else has an idea.

                    Code:
                    local myfile = "my file name\$co.csv"
                    mac dir
                    *macro is correct
                    
                    *but will attempt to expand the global again (which is empty)
                    noi di "`myfile'"
                    Here's an old reference to using scalars for a similar situation, but I didn't have any luck getting scalars to work in this situation:
                    http://www.stata.com/statalist/archi.../msg00070.html
                    Stata/MP 14.1 (64-bit x86-64)
                    Revision 19 May 2016
                    Win 8.1

                    Comment


                    • #11
                      You can prevent the expansion of macros using macval(). Using Carole's example:

                      Code:
                      . local myfile = "my file name\$co.csv"
                      
                      . mac list _myfile
                      _myfile:        my file name$co.csv
                      
                      . noi di "`macval(myfile)'"
                      my file name$co.csv
                      But this is of little help when you hand off such a string to a program. The program handles arguments internally using macros and when these macros are expanded, pouf, the $ sign triggers the expansion of a global macro. This appears to be the case with export delimited:

                      Code:
                      . mac list _myfile
                      _myfile:        my file name$co.csv
                      
                      . export delimited "`macval(myfile)'", replace
                      (note: file my file name.csv not found)
                      file my file name.csv saved
                      One workaround is to use a dummy but valid filename and then rename the file ex post:

                      Code:
                      sysuse auto, clear
                      
                      * this is the desired file name
                      local myfile "my file name\$co.csv"
                      mac list _myfile
                      
                      * call the program with a dummy but valid name
                      export delimited "dummy_name.csv", replace
                      
                      * move the content to the file name you want while controlling expansion
                      copy "dummy_name.csv" "`macval(myfile)'"
                      
                      * read back the file
                      clear
                      import delimited using "`macval(myfile)'"
                      Keen eyes will notice that import delimited does not have problems with the filename. A quick look at both import.ado and export.ado shows that import delimited handles the filename this way:

                      Code:
                          if `"`subcmd'"' == "delim" {
                              `version' ImpDelim `macval(0)'
                          }
                      
                      [...]
                      
                      program ImpDelim
                          version 13
                          local version : di "version " string(_caller()) ":"
                          `version' import_delimited `macval(0)'
                      end
                      while export delimited handles the filename like this:

                      Code:
                          if `"`subcmd'"' == "delim" {
                              ExpDelim `0'
                          }
                      
                      [...]
                      
                      program ExpDelim
                          version 13
                      
                          export_delimited `0'
                      end
                      So import guards against expansion and export does not.

                      Comment


                      • #12
                        Many thanks Carole and Robert.

                        Code:
                        local myfile "my file name\$co.csv"
                        import delimited using "`macval(myfile)'", clear
                        would be a great solution, if only I was able to insert the \ when naming the local, as I am looping over all the files in each given folder (see an extract below)

                        Code:
                        global path "mypath"
                        global csv delimited                                                     
                        global xlsx excel
                        global xls excel
                        ...
                        ...
                        forvalues t=1/52{                
                        cd "$path\Week `t'\XXXXXXX"
                        fs *`ext'
                        foreach f in `r(files)'{
                        import $`ext' using "`f'", clear
                        ....
                        ....
                        }
                        }
                        Any idea about how to replace $ with \$ when "`f'" contains $? I am failing to manipulate the string "`f'" precisely b/c of the macro expansion, but as I am not confiden using macval, I may be missing something.


                        Comment


                        • #13
                          May I suggest that you start by not using global macros. Also, I'm not a fan of the macro manipulation loops that people who want to process a batch of files have been historically steered to. I always use filelist (from SSC) to capture and refine the list of files that I need to work with. To install it, type in Stata's Command window:

                          Code:
                          ssc install filelist
                          Here's an example, derived from the help file. Before running the code, I changed Stata's current directory to the one that contains the "test" directory. In it, I placed two sub-directories called "week1" and "week2". In each of these, 3 files with a name that contains a $ character.

                          Code:
                          filelist , dir("test") pattern("*.csv") save("csv_datasets.dta") replace
                          
                          use "csv_datasets.dta", clear
                          local obs = _N
                          forvalues i=1/`obs' {
                              use "csv_datasets.dta" in `i', clear
                              local f = dirname + "/" + filename
                              import delimited using "`macval(f)'", clear
                              gen source = "`macval(f)'"
                              tempfile save`i'
                              save "`save`i''"
                          }
                          
                          use "`save1'", clear
                          forvalues i=2/`obs' {
                              append using "`save`i''"
                          }
                          
                          tab source
                          and here's the result:
                          Code:
                          . tab source
                          
                                                   source |      Freq.     Percent        Cum.
                          --------------------------------+-----------------------------------
                           test/week1/my file name$co.csv |         74       16.67       16.67
                          test/week1/my file name$co2.csv |         74       16.67       33.33
                          test/week1/my file name$co3.csv |         74       16.67       50.00
                           test/week2/my file name$co.csv |         74       16.67       66.67
                          test/week2/my file name$co2.csv |         74       16.67       83.33
                          test/week2/my file name$co3.csv |         74       16.67      100.00
                          --------------------------------+-----------------------------------
                                                    Total |        444      100.00

                          Comment


                          • #14
                            Many thanks, filelist is great!

                            Comment

                            Working...
                            X