import multiple csv/xls with INVALID FILE NAMES

Giovanni Di Prete

Join Date: Feb 2015

Posts: 19
#1

import multiple csv/xls with INVALID FILE NAMES

26 Apr 2016, 11:00

Hello,
I am using a loop to import multiple dataset with different extensions and append them. The problem is that some of these files have invalid characters in their names (blanks, $ etc.).
I am wondering if there is a way to overcome this, without:
-affecting the original filename (so no !ren fileold filenew)
-having to create a duplicate of the original file and doing the above
-resorting to Python
I know this might be easier this Python (or, at least, I only know how to do it with Python), but I would love to prove that Stata can make it.
Kind Regards,
G.
Tags: data, import
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

26 Apr 2016, 11:16

Off hand, I can't think of a way to handle this that doesn't involve renaming the files (directly or indirectly. Maybe somebody else will suggest one.

But I am very curious why you want to do this. If the filenames are illegal, you are just going to repeatedly run into problems with any attempts to use them, and you will spend more of your time working around this than you will on actually managing your data and analyzing it. Why don't you want to just rename them once and for all, and move forward? Evidently, in doing that, you should keep an audit trail of what you've done so you can ultimately trace everything back to its source. But what reason do you have for wanting to retain these names?

Finally, my experience over the decades is that the presence of any irregularity in a data source often presages finding more irregularities elsewhere. If the source of this data is producing invalid file names, do you really trust the data inside those files?
Comment
Giovanni Di Prete

Join Date: Feb 2015

Posts: 19
#3

26 Apr 2016, 16:51

I see your skepticism on possible irregularities, but here there is no link between invalid charachters in the title name and data quality. As for why I don't want to change the file names, I believe that, especially when working with thousands of different source files, it is good practice to have the original input files unchanged, to the extent to which this is possible of course (which it is with Python). However, I have good enough reasons (not least because I am a stata fan) to be willing to replicate the exercise using stata.
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#4

26 Apr 2016, 17:13

I'm not sure which system you are working on, but blanks and "$" are not necessarily invalid characters in file names.

In Windows, for example:

Code:

sysuse auto, clear local myfile `"my file name $$$.csv"' export delimited "`myfile'", replace clear import delimited using "`myfile'"

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

26 Apr 2016, 18:35

Are you putting double quotation marks around all the names, as in Carole's example?

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

26 Apr 2016, 18:48

Carole J. Wilson ,

if you run the code you posted you may end up with an error, at least in Stata 13.0 (Revision 02 Jul 2013) / Windows:

Code:

. export delimited "`myfile'", replace invalid 'file' stata(): 3598 Stata returned error export_delim_write_file(): - function returned error export_delim_export_file(): - function returned error <istmt>: - function returned error r(3598); end of do-file r(3598);

But you are right that the file names with dollar signs or spaces are valid file names in Windows.

Note, however, that save command has no problem saving to the same file where export delimited fails.

Giovanni Di Prete has mentioned that he is importing the files, and perhaps has encountered the same problem. Unfortunately Giovanni didn't follow the FAQ and didn't specify which version of Stata he is running, nor exactly the commands he typed and Stata's error messages. Had he done so, we could be more helpful, and decide, whether this is the known bug with export delimited command or some sort of a new issue.

Best, Sergiy Radyakin
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#7

26 Apr 2016, 19:31

Sergiy Radyakin, indeed, a bug was fixed in Stata 13 (update 15jan2014); it should run correctly in up-to-date 13 & 14:

9. export delimited failed to respect double quotes embedded in string data. export delimited now
correctly respects double quotes embedded in string data by exporting two double quotes for each
real double quote.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Giovanni Di Prete

Join Date: Feb 2015

Posts: 19
#8

27 Apr 2016, 02:06

Thanks Carole,
Very useful, your code works fine. However if you try to change the name of the file into "my file name$co.csv" (which is similar to one of the instances that creates me problems), typing

Code:

sysuse auto, clear local myfile `"my file name$co.csv"' export delimited "`myfile'", replace clear import delimited using "`myfile'"

will still work fine. However, if you check your working directory you will see that the file has been exported as "my file name.csv", insteaad of "my file name$co.csv". If you then change it manually to "my file name$co.csv" and type again

Code:

local myfile `"my file name$co.csv"' import delimited using "`myfile'"

you will get an error "file my file name.csv not found". Thus, it seems to me that the problem lies in the fact that the local myfile register only the string until the $ symbol.
Any idea of why this might happen?
Many thanks
G.
Comment
Giovanni Di Prete

Join Date: Feb 2015

Posts: 19
#9

27 Apr 2016, 02:22

So, strange characters are not the problem here, but it's rather $. If the file is named "my file name┬úÔé¼-)(.__.csv", everything works fine

Code:

sysuse auto, clear local myfile `"my file name┬úÔé¼-)(.__.csv"' export delimited "`myfile'", replace clear import delimited using "`myfile'"

However, if you manually change the name into "my file name┬úÔ$é¼-)(.__.csv" and run

Code:

local myfile `"my file name┬úÔ$é¼-)(.__.csv"' import delimited using "`myfile'"

will return the error "file my file name┬úÔ¼-)(.__.csv not found", meaning that $é are skipped when defining the local.
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#10

27 Apr 2016, 06:54

You definitely need to escape the $ (\$), but that doesn't solve the problem b/c each time it encounters the macro, Stata will attempt to expand the global macro $co. Perhaps someone else has an idea.

Code:

local myfile = "my file name\$co.csv" mac dir *macro is correct *but will attempt to expand the global again (which is empty) noi di "`myfile'"

Here's an old reference to using scalars for a similar situation, but I didn't have any luck getting scalars to work in this situation:
http://www.stata.com/statalist/archi.../msg00070.html

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#11

27 Apr 2016, 08:50

You can prevent the expansion of macros using macval(). Using Carole's example:

Code:

. local myfile = "my file name\$co.csv"

. mac list _myfile
_myfile:        my file name$co.csv

. noi di "`macval(myfile)'"
my file name$co.csv

But this is of little help when you hand off such a string to a program. The program handles arguments internally using macros and when these macros are expanded, pouf, the $ sign triggers the expansion of a global macro. This appears to be the case with export delimited:

Code:

. mac list _myfile
_myfile:        my file name$co.csv

. export delimited "`macval(myfile)'", replace
(note: file my file name.csv not found)
file my file name.csv saved

One workaround is to use a dummy but valid filename and then rename the file ex post:

Code:

sysuse auto, clear

* this is the desired file name
local myfile "my file name\$co.csv"
mac list _myfile

* call the program with a dummy but valid name
export delimited "dummy_name.csv", replace

* move the content to the file name you want while controlling expansion
copy "dummy_name.csv" "`macval(myfile)'"

* read back the file
clear
import delimited using "`macval(myfile)'"

Keen eyes will notice that import delimited does not have problems with the filename. A quick look at both import.ado and export.ado shows that import delimited handles the filename this way:

Code:

    if `"`subcmd'"' == "delim" {
        `version' ImpDelim `macval(0)'
    }

[...]

program ImpDelim
    version 13
    local version : di "version " string(_caller()) ":"
    `version' import_delimited `macval(0)'
end

while export delimited handles the filename like this:

Code:

    if `"`subcmd'"' == "delim" {
        ExpDelim `0'
    }

[...]

program ExpDelim
    version 13

    export_delimited `0'
end

So import guards against expansion and export does not.

Comment

Giovanni Di Prete

Join Date: Feb 2015

Posts: 19
#12

27 Apr 2016, 11:50

Many thanks Carole and Robert.

Code:

local myfile "my file name\$co.csv" import delimited using "`macval(myfile)'", clear

would be a great solution, if only I was able to insert the \ when naming the local, as I am looping over all the files in each given folder (see an extract below)

Code:

global path "mypath" global csv delimited global xlsx excel global xls excel ... ... forvalues t=1/52{ cd "$path\Week `t'\XXXXXXX" fs *`ext' foreach f in `r(files)'{ import $`ext' using "`f'", clear .... .... } }

Any idea about how to replace $ with \$ when "`f'" contains $? I am failing to manipulate the string "`f'" precisely b/c of the macro expansion, but as I am not confiden using macval, I may be missing something.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#13

27 Apr 2016, 12:58

May I suggest that you start by not using global macros. Also, I'm not a fan of the macro manipulation loops that people who want to process a batch of files have been historically steered to. I always use filelist (from SSC) to capture and refine the list of files that I need to work with. To install it, type in Stata's Command window:

Code:

ssc install filelist

Here's an example, derived from the help file. Before running the code, I changed Stata's current directory to the one that contains the "test" directory. In it, I placed two sub-directories called "week1" and "week2". In each of these, 3 files with a name that contains a $ character.

Code:

filelist , dir("test") pattern("*.csv") save("csv_datasets.dta") replace

use "csv_datasets.dta", clear
local obs = _N
forvalues i=1/`obs' {
    use "csv_datasets.dta" in `i', clear
    local f = dirname + "/" + filename
    import delimited using "`macval(f)'", clear
    gen source = "`macval(f)'"
    tempfile save`i'
    save "`save`i''"
}

use "`save1'", clear
forvalues i=2/`obs' {
    append using "`save`i''"
}

tab source

and here's the result:

Code:

. tab source

                         source |      Freq.     Percent        Cum.
--------------------------------+-----------------------------------
 test/week1/my file name$co.csv |         74       16.67       16.67
test/week1/my file name$co2.csv |         74       16.67       33.33
test/week1/my file name$co3.csv |         74       16.67       50.00
 test/week2/my file name$co.csv |         74       16.67       66.67
test/week2/my file name$co2.csv |         74       16.67       83.33
test/week2/my file name$co3.csv |         74       16.67      100.00
--------------------------------+-----------------------------------
                          Total |        444      100.00

Comment

Giovanni Di Prete

Join Date: Feb 2015

Posts: 19
#14

27 Apr 2016, 15:15

Many thanks, filelist is great!
Comment

Announcement