As with some of the scenarios listed in a previous thread (see link below), I have a number of .xlsx files in multiple subdirectories that I would like to pull, save, and append in Stata. Defining a macro that could be used in a loop to pull all .xlsx files is difficult because of the nesting of the subdirectories. My original solution was to take the easy way out using filelist (SSC) that Robert Picard wrote, but the problem here is that is translates all strings to lowercase, meaning that when I try to import these files later they are not recognized in Windows. I know that when defining a macro, one could add a restriction for respectcase, and I wanted to see if there was a way with either filelist or a systematically with macros to capture all the nested subdirectories without having to define these individually (best case scenario I have one subdirectory, worst case scenario I have four). One not-so-intelligent solution was to manipulate these folders and their contents (first in DOS, then using Altap Salamander when the former did not achieve what I needed) such that all filepaths in their entirety would be in lowercase starting with the first level of directories until the file name in .xlsx itself. The reason I am trying to avoid this fix is that all the data is hosted on Dropbox, and it appears that all users would have to rename all folders and files to lowercase on each of their computers. Just as an example, I renamed everything in the directory to lowercase on my laptop, and obviously this changed does not appear on my desktop). In addition to having a number of nested files, plenty of the files contain multiple spreadsheets. Assuming that renaming files and folders was the way out, this is the code I wrote based on Robert Picard's suggested code (I am not on my work computer and have not yet rename files to lowercase, but the error is the same regardless):
The problem with the above is that within the macros, it seems that Stata gets confused when a spreadsheet/tab in Excel contains quotation marks. If, before trying to open the sheet, I ask to display the values in r(worksheet_`s'), it displays the spreadsheet name correctly from r() (eg. ALUMNOS INSCRITOS A 1RO "G"), but refuses to display the value held in the macro (`sheet')
(output from trace):
- local sheet = r(worksheet_`s')
= local sheet = r(worksheet_1)
- dis "`sheet'"
= dis "ALUMNOS INSCRITOS A 1RO "G""
ALUMNOS INSCRITOS A 1RO G"" invalid name
(on the last line, you can see that the quotation mark before the G disappeared)
and even if I run through the import excel command, I get a consistent error in the import excel using "`f'", sheet("`sheet'") line of code.
worksheet ALUMNOS INSCRITOS A 1RO G"" not found
I realize this problem could probably be resolved with programming but I am not a very good programmer. So my questions are:
(1) In the event that this is my best bet in terms of time consumption (renaming all directories and files to lowercase and using filelist), is their a way to deal with the problem of disappearing quotation marks?
(2) If (1) is not possible or my renaming strategy is not "intelligent", is there a way to pull all files within nested subdirectories without loosing the case of the files and folders (without the need to define more than a handful of subdirectories)?
(3) Robert: any chance you could add some kind of a respectcase option to filelist (SSC)? I realize I could rewrite a program myself with this restriction, but I am not a very good programmer...
Thanks!
Code:
** Define macro that captures all xls, xlsx files in directory $dir and pulls them out of subfolders cd "$dir" ** Using filelist below is more systematic, but converts all filenames and paths to lower case filelist, p("*.xlsx") list gen fileid = _n save "myxlsx", replace * loop over each file and save a copy of the imported data; use "myxlsx", clear local obs = _N forval i=1/`obs' { use "myxlsx" in `i', clear local f = dirname + "/" + filename import excel "`f'", describe dis "`f'" local num = r(N_worksheet) forval s = 1/`num' { local sheet = r(worksheet_`s') dis "`sheet'" import excel using "`f'", sheet("`sheet'") cellrange(A7:K350) clear //K350 is arbitrary and will create blank cells, not sure how else to deal with this tempfile s`s' save `s`s'' } use `s1', clear if `sheet' > 1 { forval s = 2/`sheet' { append using `s`s'' } } gen source = "`f'" tempfile save`i' save "`save`i''" } * loop over the saved datasets and append them clear use "`save1'" forval i=2/`obs' { append using "`save`i''" }
(output from trace):
- local sheet = r(worksheet_`s')
= local sheet = r(worksheet_1)
- dis "`sheet'"
= dis "ALUMNOS INSCRITOS A 1RO "G""
ALUMNOS INSCRITOS A 1RO G"" invalid name
(on the last line, you can see that the quotation mark before the G disappeared)
and even if I run through the import excel command, I get a consistent error in the import excel using "`f'", sheet("`sheet'") line of code.
worksheet ALUMNOS INSCRITOS A 1RO G"" not found
I realize this problem could probably be resolved with programming but I am not a very good programmer. So my questions are:
(1) In the event that this is my best bet in terms of time consumption (renaming all directories and files to lowercase and using filelist), is their a way to deal with the problem of disappearing quotation marks?
(2) If (1) is not possible or my renaming strategy is not "intelligent", is there a way to pull all files within nested subdirectories without loosing the case of the files and folders (without the need to define more than a handful of subdirectories)?
(3) Robert: any chance you could add some kind of a respectcase option to filelist (SSC)? I realize I could rewrite a program myself with this restriction, but I am not a very good programmer...
Thanks!
Comment