Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using multiple wildcards to identify stata files

    Hi, I'm using Stata 14 on Windows. I am trying to export results of analysis that I save into Stata data files into Excel. When I originally started the analysis I broke the analysis into incidence ("inc") and prevalence ("pr") .dta files. This meant that I could set up a local macro to look for files beginning with "inc" or "pr". This is the code I used successfully for this part of the export routine:

    Code:
    * export_to_excel.do
    *
    * TE 20170425
    *
    * projectname
    *
    * Export .dta files into .xlsx worksheet, import .png files to worksheet
    
    cd "${outdir}"
    local outputtypes  "inc pr"
    foreach outputtype of local outputtypes {
     local files : dir "${outdir}" files "`outputtype'*.dta"
     foreach f of local files {
        use `f', replace
        local tabname = regexr("`f'","\.dta","")
        di "`tabname'"
        export excel using "`outputtype'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
       }
    }

    I originally did this for one disease area - stroke, but now I have included a category of "TIA" and "STIA". Each file is saved with "pr" or "inc" in the filename along with "stroke", "TIA" or "STIA". I am trying to amend my code so that not only do I search for files with the "pr" or "inc" prefix, but also to loop through "stroke", "STIA" and "TIA", so that I can have an excel file each for "pr_stroke", "pr_STIA", "pr_TIA", "inc_stroke", "inc_TIA" and "inc_TIA".

    I have amended my code to this:

    Code:
    cd "${outdir}"
    local outputtypes   "inc pr"
    foreach outputtype of local outputtypes {
     local datafiles: dir "${outdir}" files "*stroke*.dta *TIA*.dta *STIA*.dta"
      foreach datafile of local datafiles {
      local files : dir "${outdir}" files "`outputtype'`datafile'"
      use `f', replace
      local tabname = regexr("`f'","\.dta","")
      di "`tabname'"
     export excel using "`outputtype'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
      }
    }
    At the moment I run the code and no output is produced or an error message for me to chase down.

    Any help appreciated.
    Last edited by Tim Evans; 11 Jul 2017, 05:30.

  • #2
    What happens when you display the local datafiles? I'm not 100% sure, but I believe you cannot do

    Code:
     local datafiles: dir "${outdir}" files "*stroke*.dta *TIA*.dta *STIA*.dta"
    That is, you can only define one pattern at a time. You should either find something that identifies all of them (but not the others) or do it separately and then combine them.

    Code:
    // At once
    local datafiles : dir . files "*`outputtype'*".dta
    
    // Separately
    local datafiles1 : dir . files "*stroke*".dta
    local datafiles2 : dir . files "*TIA*".dta
    
    local allDatafiles : list datafiles1 | datafiles2
    (untested code)

    Two more notes. If you have set your current directory, you don't need to add it to the dir.. files command, you can just put a dot. Also, there is a subinstr(var,StringToFind, StringToPutIn, IntHowOften) command in Stata which replaces "StringToFind" by "StringToPutIn". Of course the regex function works fine, but it's less familiar to most Stata users.

    Comment


    • #3
      Thanks for the suggestion Jesse,

      My problem I have at the moment is that I want to save off the files for stroke incidence in a inc_stroke excel file and incidence TIA in another excel file called inc_TIA.

      With my current code, everything .dta file gets saved into its own named excel file, without actually being saved into an inc_TIA or inc_stroke excel file.

      Code:
      cd "${outdir}"
      local outputtypes  "inc pr"
      // Separately
      local datafiles1 : dir . files "*stroke*.dta"
      local datafiles2 : dir . files "*STIA*.dta"
      di `datafiles1'
      local allDatafiles : list datafiles1 | datafiles2
      
      foreach outputtype of local outputtypes {
       foreach allDatafile of local allDatafiles {
       use `allDatafile', replace
       local tabname = regexr("`allDatafile'","\.dta","")
       di "`tabname'"
       export excel using "`outputtype'_`allDatafile'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
        }
      }

      Any ideas on where I might be going wrong?

      Comment


      • #4
        For this type of problem, I find that filelist (from SSC) is simpler and more convenient since it creates a dataset of filenames and you can use the full complement of Stata data management tools to whittle down the list of files. The following picks up all datasets in the "main" subdirectory and creates an indicator if the filename matches 3 patterns:
        Code:
        filelist, dir("main") pattern("*.dta")
        
        gen tokeep = strmatch(filename,"*stroke*") | strmatch(filename,"*TIA*") | ///
            strmatch(filename,"*STIA*")
        See the example in the help file on how to create a loop that will process the list.

        Comment


        • #5
          Thanks Robert for your suggestion. I think this is my solution using the -filelist- command. Not sure what I have done is entirely clean and I have a feeling I am entering into a never ending loop as the routine seems to last longer than I would expect, but here it is:

          Code:
          filelist, dir("$outdir") pattern("*.dta") norecursive
          
          generate group1 = 1 if regexm(filename, "^pr") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
          replace group1 = 2 if  regexm(filename, "^pr") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
          replace group1 = 3 if  regexm(filename, "^pr") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "stroke_per") | regexm(filename, "d_stroke"))
          replace group1 = 4 if  regexm(filename, "^inc") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
          replace group1 = 5 if  regexm(filename, "^inc") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
          replace group1 = 6 if  regexm(filename, "^inc") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "_stroke_per") | regexm(filename, "d_stroke"))
          
          label define Group 4 "inc_STIA" 5 "inc_TIA" 6 "inc_stroke" 1 "pr_STIA" 2 "pr_TIA" 3 "pr_stroke"
          label values group1 Group
          drop if group1==.
          decode group1, gen(group2)
          
          save "${outputs}dta_excel_export.dta", replace
          
          use "${outputs}dta_excel_export.dta", clear
                   local obs = _N
             forvalues i=1/`obs'  {
              forval z=1/6 {
              preserve
              keep if group1 == `z'
              local group = group2
              use "${outputs}dta_excel_export.dta" in `i', clear
              local name = filename
              use "`name'", replace
              local tabname = regexr("`name'","\.dta","")
              di "`group'"
              di "`name'"
              export excel using "`group'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
                     restore
              }
               }
          Happy to receive any advice to clean this up!

          Comment


          • #6
            Having tweaked and played around with this a little more, I've managed to remove an unnecessary loop and this means I don't have the never ending loop. Here is the finalised code

            Code:
            filelist, dir("$outdir") pattern("*.dta") norecursive
            
            generate group1 = 1 if regexm(filename, "^pr") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
            replace group1 = 2 if  regexm(filename, "^pr") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
            replace group1 = 3 if  regexm(filename, "^pr") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "stroke_per") | regexm(filename, "d_stroke"))
            replace group1 = 4 if  regexm(filename, "^inc") & (regexm(filename, "_STIAper") | regexm(filename, "STIA_y") | regexm(filename, "_STIA_per") | regexm(filename, "d_TIA"))
            replace group1 = 5 if  regexm(filename, "^inc") & (regexm(filename, "_TIAper") | regexm(filename, "TIA_y") | regexm(filename, "_TIA_per") | regexm(filename, "d_STIA"))
            replace group1 = 6 if  regexm(filename, "^inc") & (regexm(filename, "strokeper") | regexm(filename, "stroke_y") | regexm(filename, "_stroke_per") | regexm(filename, "d_stroke"))
            
            label define Group 4 "inc_STIA" 5 "inc_TIA" 6 "inc_stroke" 1 "pr_STIA" 2 "pr_TIA" 3 "pr_stroke"
            label values group1 Group
            drop if group1==.
            decode group1, gen(group2)
            
            save "${outputs}dta_excel_export.dta", replace
            
             use "${outputs}dta_excel_export.dta", clear
                gen obs = _n
                qui su obs
                forvalues i=1(1)60 {
                 preserve
                 keep if obs ==`i'
                 local group = group2
                 local name = filename
                 use "`name'", replace
                 count
                 local tabname = regexr("`name'","\.dta","")
                 di "`tabname'"
                 export excel using "`group'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
                  restore
                 }

            Comment


            • #7
              It's hard to tell without knowing the exact name of each file and the logic behind the file naming convention but I think your code conflates two diseases. Since "TIA_y" is a subset of "STIA_y", a file that falls in group 1 can be turned into a group 2 file. Here's code that creates a bunch of files that match the patterns you spelled out and code to create groups. I think that generally, it's better to use standard string functions like strpos() and limit regular expressions to when there's a clear advantage. The ability to specify an alternate pattern makes for simpler code in this case. As I recommended earlier, the loop avoids needless preserve/restore cycles and follows the model for looping over files in the help file for filelist.

              Code:
              cap mkdir main
              
              sysuse auto, clear
              keep in 1
              
              save "main/pr_STIAper.dta", replace
              save "main/prSTIA_y.dta", replace
              save "main/pr_STIA_per.dta", replace
              save "main/prd_TIA.dta", replace
              
              save "main/pr_TIAper.dta", replace
              save "main/prTIA_y.dta", replace
              save "main/pr_TIA_per.dta", replace
              save "main/prd_STIA.dta", replace
              
              save "main/prstrokeper.dta", replace
              save "main/prstroke_y.dta", replace
              save "main/prstroke_per.dta", replace
              save "main/prd_stroke.dta", replace
              
              save "main/inc_STIAper.dta", replace
              save "main/incSTIA_y.dta", replace
              save "main/inc_STIA_per.dta", replace
              save "main/incd_TIA.dta", replace
              
              save "main/inc_TIAper.dta", replace
              save "main/incTIA_y.dta", replace
              save "main/inc_TIA_per.dta", replace
              save "main/incd_STIA.dta", replace
              
              save "main/incstrokeper.dta", replace
              save "main/incstroke_y.dta", replace
              save "main/inc_stroke_per.dta", replace
              save "main/incd_stroke.dta", replace
              
              global outdir main
              
              filelist, dir("$outdir") pattern("*.dta") norecursive
              
              * identify incidence and prevalence
              gen group = "pr" if strpos(filename, "pr") == 1
              replace group = "inc" if strpos(filename, "inc") == 1
              keep if !mi(group)
              
              * identify diseases: "STIA", "TIA", and "stroke"
              replace group = group + "_STIA" if regexm(filename, "(_STIAper|STIA_y|_STIA_per|d_TIA)")
              replace group = group + "_TIA" if regexm(filename, "(_TIAper|[^S]TIA_y|_TIA_per|d_STIA)")
              replace group = group + "_stroke" if strpos(filename, "stroke") > 0
              
              tab group
              
              save "dta_excel_export.dta", replace
              local obs = _N
              
              forvalues i=1/`obs' {
                  use in `i' using "dta_excel_export.dta", clear
                  local group = group
                  local tabname = regexr(filename,"\.dta","")
                  local fullname = dirname + "/" + filename
                  use "`fullname'", clear
                   export excel using "`group'.xlsx", sheet("`tabname'") sheetreplace firstrow(variables)
              }

              Comment


              • #8
                Thanks Robert Picard I've subsequently updated my naming convention to remove all of the replace group section and use a direct string approach. I'll take a look at the loop section to see if I can benefit from that.

                Comment


                • #9
                  Robert Picard This is what I've ended up with:

                  Code:
                  clear
                  filelist, dir("$outdir") pattern("*.dta") norecursive
                  generate group1 = 1 if regexm(filename, "^incOfSTIA")
                  replace group1  = 2 if regexm(filename, "^incOfTIA")
                  replace group1  = 3 if regexm(filename, "^incOfstroke")
                  replace group1  = 4 if regexm(filename, "^prOfSTIA")
                  replace group1  = 5 if regexm(filename, "^prOfTIA")
                  replace group1  = 6 if regexm(filename, "^prOfstroke")
                  
                  label define Group 1 "inc_STIA" 2 "inc_TIA" 3 "inc_stroke" 4 "pr_STIA" 5 "pr_TIA" 6 "pr_stroke"
                  label values group1 Group
                  drop if group1==.
                  decode group1, gen(group2)
                  
                  save "${outputs}dta_excel_export.dta", replace
                  
                   use "${outputs}dta_excel_export.dta", clear
                      local obs = _N
                      gen obs2 = _n
                        forvalues i=1(1)`obs' {
                       preserve
                       keep if obs2 ==`i'
                       local group = group2
                       local name = filename
                       use "`name'", replace
                       local tabname = regexr("`name'","\.dta","")
                       di "`tabname'"
                       export excel using "`group'.xlsx", sheet("`tabname'") sheetmodify firstrow(variables)
                       putexcel set "`group'", sheet("`tabname'") modify
                       putexcel A2:I50, nformat(number_sep)
                       putexcel A1:H1, bold border(bottom)
                        restore

                  Comment


                  • #10
                    In my loop, the command
                    Code:
                    use in `i' using "dta_excel_export.dta", clear
                    indicates to clear the memory and load line `i' from "dta_excel_export.dta". That's all that you need and is simpler and more efficient than using preserve/keep/restore.

                    Comment


                    • #11
                      Okay will have another luck!

                      Comment

                      Working...
                      X