Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Appending multiple excel files from an infinite number of subdirectories

    As with some of the scenarios listed in a previous thread (see link below), I have a number of .xlsx files in multiple subdirectories that I would like to pull, save, and append in Stata. Defining a macro that could be used in a loop to pull all .xlsx files is difficult because of the nesting of the subdirectories. My original solution was to take the easy way out using filelist (SSC) that Robert Picard wrote, but the problem here is that is translates all strings to lowercase, meaning that when I try to import these files later they are not recognized in Windows. I know that when defining a macro, one could add a restriction for respectcase, and I wanted to see if there was a way with either filelist or a systematically with macros to capture all the nested subdirectories without having to define these individually (best case scenario I have one subdirectory, worst case scenario I have four). One not-so-intelligent solution was to manipulate these folders and their contents (first in DOS, then using Altap Salamander when the former did not achieve what I needed) such that all filepaths in their entirety would be in lowercase starting with the first level of directories until the file name in .xlsx itself. The reason I am trying to avoid this fix is that all the data is hosted on Dropbox, and it appears that all users would have to rename all folders and files to lowercase on each of their computers. Just as an example, I renamed everything in the directory to lowercase on my laptop, and obviously this changed does not appear on my desktop). In addition to having a number of nested files, plenty of the files contain multiple spreadsheets. Assuming that renaming files and folders was the way out, this is the code I wrote based on Robert Picard's suggested code (I am not on my work computer and have not yet rename files to lowercase, but the error is the same regardless):

    Code:
     ** Define macro that captures all xls, xlsx files in directory $dir and pulls them out of subfolders
        cd                     "$dir"
    
    ** Using filelist below is more systematic, but converts all filenames and paths to lower case
        filelist,             p("*.xlsx") list
        gen                 fileid = _n
        save               "myxlsx", replace
    
    * loop over each file and save a copy of the imported data;
        use                 "myxlsx", clear
        local                obs = _N
        forval               i=1/`obs' {
            use             "myxlsx" in `i', clear
            local             f = dirname + "/" + filename
            import excel  "`f'", describe
            dis                "`f'"
            local             num = r(N_worksheet)
            forval             s = 1/`num' {
                local        sheet = r(worksheet_`s')
                dis          "`sheet'"
                import excel using "`f'", sheet("`sheet'") cellrange(A7:K350) clear //K350 is arbitrary and will create blank cells, not sure how else to deal with this
                tempfile     s`s'
                save        `s`s''
            }
            use              `s1', clear
            if                  `sheet' > 1 {
                forval        s = 2/`sheet' {
                    append using `s`s''
                }
            }
            gen             source = "`f'"
            tempfile       save`i'
            save           "`save`i''"
        }
    
    * loop over the saved datasets and append them
        clear
        use                 "`save1'"
        forval              i=2/`obs' {
          append using "`save`i''"
        }
    The problem with the above is that within the macros, it seems that Stata gets confused when a spreadsheet/tab in Excel contains quotation marks. If, before trying to open the sheet, I ask to display the values in r(worksheet_`s'), it displays the spreadsheet name correctly from r() (eg. ALUMNOS INSCRITOS A 1RO "G"), but refuses to display the value held in the macro (`sheet')

    (output from trace):
    - local sheet = r(worksheet_`s')
    = local sheet = r(worksheet_1)
    - dis "`sheet'"
    = dis "ALUMNOS INSCRITOS A 1RO "G""
    ALUMNOS INSCRITOS A 1RO G"" invalid name

    (on the last line, you can see that the quotation mark before the G disappeared)

    and even if I run through the import excel command, I get a consistent error in the import excel using "`f'", sheet("`sheet'") line of code.

    worksheet ALUMNOS INSCRITOS A 1RO G"" not found

    I realize this problem could probably be resolved with programming but I am not a very good programmer. So my questions are:

    (1) In the event that this is my best bet in terms of time consumption (renaming all directories and files to lowercase and using filelist), is their a way to deal with the problem of disappearing quotation marks?
    (2) If (1) is not possible or my renaming strategy is not "intelligent", is there a way to pull all files within nested subdirectories without loosing the case of the files and folders (without the need to define more than a handful of subdirectories)?
    (3) Robert: any chance you could add some kind of a respectcase option to filelist (SSC)? I realize I could rewrite a program myself with this restriction, but I am not a very good programmer...

    Thanks!
    I have a folder (with subfolders) containing many excel files. Each file has two columns: Name and Code. I'm trying to figure out a way to import and combine
    Last edited by Jonathan Karver; 12 Nov 2014, 09:29.

  • #2
    I don't really have answers to your questions, but I have a few thoughts.

    First, with regard to your macro manipulations choking when a contained word has embedded quotation marks, you need in your coding to be using Stata's compound double quotes (`" to open, "' toclose). When you do, the embedded quotation marks problem should disappear.

    Second, I'm perplexed by your difficulties with case. Windows, as far as I know, is case-insensitive with respect to filenames. So I don't understand why this is causing you problems.

    Third, it should be a possible to write a Stata program that will call itself recursively to pull together a complete file-list throughout the original directory structure. Stata programs can call themselves. You would be responsible for coding the details of keeping track of where you are and where you are going next, but, in principle it can be done. That said, it's a bit complicated and not something that I could just spit out quickly here. Whether it is worth the trouble depends on whether this entire situation is a one-off for you, or whether you will be doing things like this repeatedly. If the latter, it might well be worth your time to go this route and have done with it. If the former, then probably just flattening the directory structure this once makes more sense.

    Comment


    • #3
      Thanks for your quick reply, Clyde. As it relates to your thoughts:

      1. I had tried using double quotes before and was not getting anywhere but I must have been specifying it wrong. I changed the code to

      Code:
      import excel using "`f'", sheet(`"`sheet'"')
      And it seems this error is no longer present.

      2. I am incredibly confused now about the case insensitivity because I read somewhere that Windows is in fact case sensitive after reading elsewhere that by definition it is case insensitive. It might have been that the macro quotes problem is what was initially causing the error before I realized it and not the case of the filepaths, and I did not put two and two together. In any event, this resolves the case issue when using filelist (sorry), and my laptop directory is now a mess thanks to my indiscretion.

      3. Thanks for your suggestion on creating programs to deal with nested directories. This is an issue which came up because non-data people sent us a non-user-friendly directory. Though I imagine this will not happen again, I will look into the possibility of doing something like this because working with the public sector on data undoubtedly lends itself to these kind of problems. I also realize that it should not be difficult in terms of programming, only time consuming (for me at least). For the time being, though, Robert Picard's program is extremely helpful.

      As is common in these threads, one solution uncovered another problem. As you can see in my earlier code, I put an arbitrary end cell for each sheet because this varies by sheet and file (but will never exceed 350 rows by definition and will never contain variables beyond the column K), but Stata rejects this saying

      Code:
      invalid row range in cellrange() option
      lower right row is after data area in worksheet
      I know I can manipulate the macro that would be defined using the matrix r(range_) to keep only the end range and not the start range (because of weird formatting in these files, I need to start at A7). I will post the resolved code once I get this to run, as I imagine someone else out there will find it useful.

      On an unrelated note, I did not want my first post to include an obvious solution that I could have figured out on my own. Sorry for this, beginners misfortune...

      Comment


      • #4
        Originally posted by Clyde Schechter View Post
        Third, it should be a possible to write a Stata program that will call itself recursively to pull together a complete file-list throughout the original directory structure. Stata programs can call themselves.
        Clyde, this is exactly what filelist does.

        The help file for the respectcase option says:

        In Windows only, the respectcase option specifies that dir respect the case of filenames when performing matches. Unlike other operating systems, Windows has, by default, case-insensitive filenames. respectcase is ignored in operating systems other than Windows.
        My interpretation is that respectcase is useful in terms of which files are matched and thus returned. Regardless, case is ignored when files are accessed. While it would be relatively easy to include a respectcase option to filelist, I'm not fond of operating-system-specific functionality. Maybe in a future version.

        Comment


        • #5
          Thanks, Robert, and as you said, the writing of programs is what filelist does, so writing them would be mostly redundant (I was thinking, however, if I wanted only a subset of the nested directories, for example). Also, since this would be Windows specific and many people use Unix and other operating systems, you make a good point to not include it since it is not an immediate issue.

          The code above was filled with incorrect macros being called. I am pasting the code that works below for reference if anyone is interested. note that I had to add restrictions to disallow Stata from doing anything with empty spreadsheets (i.e., the "Sheet2" variety that people inadvertantely create in Excel...the stub "Hoja" is use below because these files are coming from Latin America). Thanks again to you both!

          Code:
              ssc install         filelist
              
          ** Define macro that captures all xls, xlsx files in directory $dir and pulls them out of subfolders
              cd                     "$dir"
          
          ** Using filelist below is more systematic, but converts all filenames and paths to lower case
              filelist,             p("*.xlsx") list
              gen                 fileid = _n
              drop if                filename == "~.s.t. 102 listas oficiales piloto (24-09-14).xlsx" //I am not sure why this file is being duplicated...
              save                 "myxlsx", replace
          
          * loop over each file and save a copy of the imported data;
              use                 "myxlsx", clear
              local                 obs = _N
              forval                 i=1/`obs' {
                  use             "myxlsx" in `i', clear
                  local             f = dirname + "/" + filename
                  import excel     "`f'", describe
                  local             num = r(N_worksheet)
                  forval             s = 1/`num' { //for some reason, the results in r() get repeated on line 78, and since we are dealing with a single sheet at that point, all other r(worksheet_*) are missing
                      local        sheet`s' = r(worksheet_`s')
                      local        check = substr("`sheet`s''",1,4)
                      local        r`s' = r(range_`s')
                      local        range`s' = "A7" + substr("`r`s''",3,.)
                  }
                  forval             s = 1/`num' {
                      if            "`r`s''" != "." & "`check'" != "Hoja" {
                          import excel using "`f'", sheet(`"`sheet`s''"') cellrange(`range`s'') allstring clear
                      }
                      tempfile     sheet`s'
                      save        `sheet`s'', emptyok
                  }
                  use                `sheet1', clear
                  if                `num' > 1 {
                      forval        s = 2/`num' {
                          append using `sheet`s''
                      }
                  }
                  gen             source = "`f'"
                  tempfile         save`i'
                  save             `save`i''
              }
          
          * loop over the saved datasets and append them
              use                 `save1', clear
              forval                i=2/`obs' {
                append using         `save`i''
              }

          Comment

          Working...
          X