Using multiple wildcards to identify stata files

Tim Evans

Join Date: May 2014

Posts: 96
#1

Using multiple wildcards to identify stata files

11 Jul 2017, 05:24

Hi, I'm using Stata 14 on Windows. I am trying to export results of analysis that I save into Stata data files into Excel. When I originally started the analysis I broke the analysis into incidence ("inc") and prevalence ("pr") .dta files. This meant that I could set up a local macro to look for files beginning with "inc" or "pr". This is the code I used successfully for this part of the export routine:

Code:

* export_to_excel.do * * TE 20170425 * * projectname * * Export .dta files into .xlsx worksheet, import .png files to worksheet cd "${outdir}" local outputtypes "inc pr" foreach outputtype of local outputtypes { local files : dir "${outdir}" files "`outputtype'*.dta" foreach f of local files { use `f', replace local tabname = regexr("`f'","\.dta","") di "`tabname'" export excel using "`outputtype'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables) } }

I originally did this for one disease area - stroke, but now I have included a category of "TIA" and "STIA". Each file is saved with "pr" or "inc" in the filename along with "stroke", "TIA" or "STIA". I am trying to amend my code so that not only do I search for files with the "pr" or "inc" prefix, but also to loop through "stroke", "STIA" and "TIA", so that I can have an excel file each for "pr_stroke", "pr_STIA", "pr_TIA", "inc_stroke", "inc_TIA" and "inc_TIA".

I have amended my code to this:

Code:

cd "${outdir}" local outputtypes "inc pr" foreach outputtype of local outputtypes { local datafiles: dir "${outdir}" files "*stroke*.dta *TIA*.dta *STIA*.dta" foreach datafile of local datafiles { local files : dir "${outdir}" files "`outputtype'`datafile'" use `f', replace local tabname = regexr("`f'","\.dta","") di "`tabname'" export excel using "`outputtype'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables) } }

At the moment I run the code and no output is produced or an error message for me to chase down.

Any help appreciated.

Last edited by Tim Evans; 11 Jul 2017, 05:30.
Tags: None
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

11 Jul 2017, 07:22

What happens when you display the local datafiles? I'm not 100% sure, but I believe you cannot do

Code:

local datafiles: dir "${outdir}" files "*stroke*.dta *TIA*.dta *STIA*.dta"

That is, you can only define one pattern at a time. You should either find something that identifies all of them (but not the others) or do it separately and then combine them.

Code:

// At once local datafiles : dir . files "*`outputtype'*".dta // Separately local datafiles1 : dir . files "*stroke*".dta local datafiles2 : dir . files "*TIA*".dta local allDatafiles : list datafiles1 | datafiles2

(untested code)

Two more notes. If you have set your current directory, you don't need to add it to the dir.. files command, you can just put a dot. Also, there is a subinstr(var,StringToFind, StringToPutIn, IntHowOften) command in Stata which replaces "StringToFind" by "StringToPutIn". Of course the regex function works fine, but it's less familiar to most Stata users.
Comment

Tim Evans

Join Date: May 2014
Posts: 96

11 Jul 2017, 08:48

Thanks for the suggestion Jesse,

My problem I have at the moment is that I want to save off the files for stroke incidence in a inc_stroke excel file and incidence TIA in another excel file called inc_TIA.

With my current code, everything .dta file gets saved into its own named excel file, without actually being saved into an inc_TIA or inc_stroke excel file.

Code:

cd "${outdir}"
local outputtypes  "inc pr"
// Separately
local datafiles1 : dir . files "*stroke*.dta"
local datafiles2 : dir . files "*STIA*.dta"
di `datafiles1'
local allDatafiles : list datafiles1 | datafiles2

foreach outputtype of local outputtypes {
 foreach allDatafile of local allDatafiles {
 use `allDatafile', replace
 local tabname = regexr("`allDatafile'","\.dta","")
 di "`tabname'"
 export excel using "`outputtype'_`allDatafile'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
  }
}

Any ideas on where I might be going wrong?

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

11 Jul 2017, 09:25

For this type of problem, I find that filelist (from SSC) is simpler and more convenient since it creates a dataset of filenames and you can use the full complement of Stata data management tools to whittle down the list of files. The following picks up all datasets in the "main" subdirectory and creates an indicator if the filename matches 3 patterns:

Code:

filelist, dir("main") pattern("*.dta") gen tokeep = strmatch(filename,"*stroke*") | strmatch(filename,"*TIA*") | /// strmatch(filename,"*STIA*")

See the example in the help file on how to create a loop that will process the list.
2 likes
Comment

Tim Evans

Join Date: May 2014
Posts: 96

11 Jul 2017, 16:54

Thanks Robert for your suggestion. I think this is my solution using the -filelist- command. Not sure what I have done is entirely clean and I have a feeling I am entering into a never ending loop as the routine seems to last longer than I would expect, but here it is:

Code:

filelist, dir("$outdir") pattern("*.dta") norecursive

generate group1 = 1 if regexm(filename, "^pr") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
replace group1 = 2 if  regexm(filename, "^pr") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
replace group1 = 3 if  regexm(filename, "^pr") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "stroke_per") | regexm(filename, "d_stroke"))
replace group1 = 4 if  regexm(filename, "^inc") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
replace group1 = 5 if  regexm(filename, "^inc") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
replace group1 = 6 if  regexm(filename, "^inc") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "_stroke_per") | regexm(filename, "d_stroke"))

label define Group 4 "inc_STIA" 5 "inc_TIA" 6 "inc_stroke" 1 "pr_STIA" 2 "pr_TIA" 3 "pr_stroke"
label values group1 Group
drop if group1==.
decode group1, gen(group2)

save "${outputs}dta_excel_export.dta", replace

use "${outputs}dta_excel_export.dta", clear
         local obs = _N
   forvalues i=1/`obs'  {
    forval z=1/6 {
    preserve
    keep if group1 == `z'
    local group = group2
    use "${outputs}dta_excel_export.dta" in `i', clear
    local name = filename
    use "`name'", replace
    local tabname = regexr("`name'","\.dta","")
    di "`group'"
    di "`name'"
    export excel using "`group'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
           restore
    }
     }

Happy to receive any advice to clean this up!

Comment

Tim Evans

Join Date: May 2014
Posts: 96

11 Jul 2017, 17:39

Having tweaked and played around with this a little more, I've managed to remove an unnecessary loop and this means I don't have the never ending loop. Here is the finalised code

Code:

filelist, dir("$outdir") pattern("*.dta") norecursive

generate group1 = 1 if regexm(filename, "^pr") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
replace group1 = 2 if  regexm(filename, "^pr") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
replace group1 = 3 if  regexm(filename, "^pr") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "stroke_per") | regexm(filename, "d_stroke"))
replace group1 = 4 if  regexm(filename, "^inc") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
replace group1 = 5 if  regexm(filename, "^inc") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
replace group1 = 6 if  regexm(filename, "^inc") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "_stroke_per") | regexm(filename, "d_stroke"))

label define Group 4 "inc_STIA" 5 "inc_TIA" 6 "inc_stroke" 1 "pr_STIA" 2 "pr_TIA" 3 "pr_stroke"
label values group1 Group
drop if group1==.
decode group1, gen(group2)

save "${outputs}dta_excel_export.dta", replace

 use "${outputs}dta_excel_export.dta", clear
    gen obs = _n
    qui su obs
    forvalues i=1(1)60 {
     preserve
     keep if obs ==`i'
     local group = group2
     local name = filename
     use "`name'", replace
     count
     local tabname = regexr("`name'","\.dta","")
     di "`tabname'"
     export excel using "`group'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
      restore
     }

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

12 Jul 2017, 10:11

It's hard to tell without knowing the exact name of each file and the logic behind the file naming convention but I think your code conflates two diseases. Since "TIA_y" is a subset of "STIA_y", a file that falls in group 1 can be turned into a group 2 file. Here's code that creates a bunch of files that match the patterns you spelled out and code to create groups. I think that generally, it's better to use standard string functions like strpos() and limit regular expressions to when there's a clear advantage. The ability to specify an alternate pattern makes for simpler code in this case. As I recommended earlier, the loop avoids needless preserve/restore cycles and follows the model for looping over files in the help file for filelist.

Code:

cap mkdir main

sysuse auto, clear
keep in 1

save "main/pr_STIAper.dta", replace
save "main/prSTIA_y.dta", replace
save "main/pr_STIA_per.dta", replace
save "main/prd_TIA.dta", replace

save "main/pr_TIAper.dta", replace
save "main/prTIA_y.dta", replace
save "main/pr_TIA_per.dta", replace
save "main/prd_STIA.dta", replace

save "main/prstrokeper.dta", replace
save "main/prstroke_y.dta", replace
save "main/prstroke_per.dta", replace
save "main/prd_stroke.dta", replace

save "main/inc_STIAper.dta", replace
save "main/incSTIA_y.dta", replace
save "main/inc_STIA_per.dta", replace
save "main/incd_TIA.dta", replace

save "main/inc_TIAper.dta", replace
save "main/incTIA_y.dta", replace
save "main/inc_TIA_per.dta", replace
save "main/incd_STIA.dta", replace

save "main/incstrokeper.dta", replace
save "main/incstroke_y.dta", replace
save "main/inc_stroke_per.dta", replace
save "main/incd_stroke.dta", replace

global outdir main

filelist, dir("$outdir") pattern("*.dta") norecursive

* identify incidence and prevalence
gen group = "pr" if strpos(filename, "pr") == 1
replace group = "inc" if strpos(filename, "inc") == 1
keep if !mi(group)

* identify diseases: "STIA", "TIA", and "stroke"
replace group = group + "_STIA" if regexm(filename, "(_STIAper|STIA_y|_STIA_per|d_TIA)")
replace group = group + "_TIA" if regexm(filename, "(_TIAper|[^S]TIA_y|_TIA_per|d_STIA)")
replace group = group + "_stroke" if strpos(filename, "stroke") > 0

tab group

save "dta_excel_export.dta", replace
local obs = _N

forvalues i=1/`obs' {
    use in `i' using "dta_excel_export.dta", clear
    local group = group
    local tabname = regexr(filename,"\.dta","")
    local fullname = dirname + "/" + filename
    use "`fullname'", clear
     export excel using "`group'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
}

Comment

Tim Evans

Join Date: May 2014

Posts: 96
#8

14 Jul 2017, 08:01

Thanks Robert Picard I've subsequently updated my naming convention to remove all of the replace group section and use a direct string approach. I'll take a look at the loop section to see if I can benefit from that.
Comment

Tim Evans

Join Date: May 2014
Posts: 96

14 Jul 2017, 09:13

Robert Picard This is what I've ended up with:

Code:

clear
filelist, dir("$outdir") pattern("*.dta") norecursive
generate group1 = 1 if regexm(filename, "^incOfSTIA")
replace group1  = 2 if regexm(filename, "^incOfTIA")
replace group1  = 3 if regexm(filename, "^incOfstroke")
replace group1  = 4 if regexm(filename, "^prOfSTIA")
replace group1  = 5 if regexm(filename, "^prOfTIA")
replace group1  = 6 if regexm(filename, "^prOfstroke")

label define Group 1 "inc_STIA" 2 "inc_TIA" 3 "inc_stroke" 4 "pr_STIA" 5 "pr_TIA" 6 "pr_stroke"
label values group1 Group
drop if group1==.
decode group1, gen(group2)

save "${outputs}dta_excel_export.dta", replace

 use "${outputs}dta_excel_export.dta", clear
    local obs = _N
    gen obs2 = _n
      forvalues i=1(1)`obs' {
     preserve
     keep if obs2 ==`i'
     local group = group2
     local name = filename
     use "`name'", replace
     local tabname = regexr("`name'","\.dta","")
     di "`tabname'"
     export excel using "`group'.xlsx", sheet("`tabname'") sheetmodify firstrow(variables)
     putexcel set "`group'", sheet("`tabname'") modify
     putexcel A2:I50, nformat(number_sep)
     putexcel A1:H1, bold border(bottom)
      restore

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#10

14 Jul 2017, 09:25

In my loop, the command

Code:

use in `i' using "dta_excel_export.dta", clear

indicates to clear the memory and load line `i' from "dta_excel_export.dta". That's all that you need and is simpler and more efficient than using preserve/keep/restore.
Comment
Tim Evans

Join Date: May 2014

Posts: 96
#11

14 Jul 2017, 16:46

Okay will have another luck!
Comment

Announcement