Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Appending excel files considering sheet names

    Hi,

    I have attendance data for many schools, stored in separate Excel files. Each file contains multiple sheets corresponding to different months—for example, data for February 2022 is stored in a sheet named "02_2022". Since the number of days varies by month, the number of columns also differs. For instance, January sheets span from column A to column AP, while April sheets go from column A to column AO. I’d like to use the sheet names to identify and handle these differences in column count.

    In addition, some schools don’t have attendance data and instead contain a placeholder or dummy sheet, often named simply "M". These sheets also go up to column AP. I want to include these dummy sheets in the final dataset as well, appending them alongside the rest.

    Currently I am using the following code. It is working but taking a long time (3 hours) to process my entire list of excels, I am hoping the approach above is more efficient. Feel free to share your thoughts on this as well.

    Code:
    forvalues s = 1/`nschool' {
        local eno = `eiin`s''
    
        /** Get sheet info **/
        import excel using "***********directory to excel files**************", describe
        local nsheet = r(N_worksheet)
        forvalues i = 1/`nsheet' {
            local sheetname`i' = r(worksheet_`i')
        }
    
        forvalues i = 1/`nsheet' {
            clear
            local imported = 0
    
            // Try column ranges from AP (42) down to AM (39)
            foreach col in AP AO AN AM {
                capture {
                    import excel using "**********directory to excel files**********", ///
                        sheet("`sheetname`i''") cellrange(A1:`col'2) firstrow allstring
                }
                if _rc == 0 {
                    local imported = 1
                    continue, break
                }
            }
    
            if `imported' == 0 {
                display as error "‼️ EIIN `eno', sheet `sheetname`i'' could not be read (39–42 cols failed)."
                continue
            }
    
            if (`i' > 1 | `s' > 1) {
                append using `tmp'
            }
            save `tmp', replace
        }
    }

  • #2
    It is easy to obtain the exact number of days (and thus the name of the last column) for a sheet with a name like "02_2022". Consider:
    Code:
    . dis daysinmonth(date("02_2022","MY"))
    28
    . dis "`:word `=daysinmonth(date("02_2022","MY"))-27' of AM AN AO AP'"
    AM
    Or the more reader- and programmer-friendly version:
    Code:
    local num_days = daysinmonth(date("02_2022","MY"))
    local last_columns AM AN AO AP
    local last_col: word `=`num_days'-27' of `last_columns'
    That said, I am not sure if all this work is even needed. Why do you necessarily need to specify the cellrange option to import excel at all? Let Stata import the sheet; you can then decide to
    Code:
    keep in 1
    to just keep the second row (i.e. one observation after the variable names). The append command should allow you to append sheets with different numbers of columns without throwing errors, especially if all the variables are numeric (empty columns will by default be read as numeric, so if the same column in other sheets is actually a string, this may need a little bit of extra code) -- all the more so if you can start the process off with a sheet that has the maximal set of columns.
    Last edited by Hemanshu Kumar; 17 Jun 2025, 06:15.

    Comment


    • #3
      Nomin Byambaa I don't think you need the cell range and if that's right you can simplify your loop to import and combine your data without respect to the varying # of days per month (or any errors in data input of the sheet name that would signal the # of days in a month).
      The example loop below creates fake files and then loops over them and combines them. Notice that even though the ranges & number of cols varies and the names of the sheets all change, the loop can use the saved/background elements of -import excel, describe- to flexibly loop over the files with something like

      Code:
              import excel using `"${myfolder}/`f'"', sheet(`"`r(worksheet_`n')'"') clear  
      in a full working example this might look like:

      Code:
      **create fake data**
      global myfolder `"/folderpath/here//"'
      cap mkdir `"${myfolder}"'
      
      
      forval n = 1/50 { //create files with various names and sizes
      clear
      set obs 10000
          forval var = 1/150 {
          if `=rbinomial(1,`=runiform()')'==1  g v`var' = runiform()   //random addit. cols for example
      } //end var loop
          export excel using `"${myfolder}/file`n'.xlsx"', sheet(02_`n') replace
          export excel using `"${myfolder}/file`n'.xlsx"', sheetreplace sheet(03_`n')  //multiple sheets per file
          
      } //end file n loop
      
      ls ${myfolder}
      
      
      
      
      **read in all files and sheets in the folder, no range needed**
      clear
      sa `"${myfolder}/master.dta"', replace emptyok //if you want to append them during this process!
      **
      global files: dir `"${myfolder}"' file "*.xlsx", respectcase nofail
      di `"${files}"'
      **
      foreach f of global files {
          di `"`f'"'
          import excel using  `"${myfolder}/`f'"', describe
          forval n = 1/`r(N_worksheet)' {
              di `"sheet: `n' / range: `r(range_`n')' "' //helps with diagnosing issues
             import excel using `"${myfolder}/`f'"', sheet(`"`r(worksheet_`n')'"') clear  
              g file = `"`f'"'
              g date = `"`n'"'
               **if yiou want to append them:
               append using `"${myfolder}/master.dta"'
               sa `"${myfolder}/master.dta"', replace
          } //end n loop
      } //end f loop
      
      
      
      u `"${myfolder}/master.dta"', clear
      ta file
      Last edited by eric_a_booth; 17 Jun 2025, 07:23.
      Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

      Comment


      • #4
        eric_a_booth Hemanshu Kumar thanks a lot for replies. When I do not define the cell range, STATA captures bunch of empty columns as variables, which pushes me over the limit of number of variables.

        Comment


        • #5
          Nomin Byambaa So I guess it's capturing surplus blank columns that might be an after effect of some kind of output from the system? I would probably figure out the maximum range for actual data and specify that range in the option (some maximum that is larger than you'd ever expect to receive (with padding) but smaller than your Stata version's limits) . Then add a command for -dropmiss- *(from SSC) to drop empty columns after importing.
          Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

          Comment


          • #6
            When I do not define the cell range, STATA captures bunch of empty columns as variables, which pushes me over the limit of number of variables.
            Really? Even if you are using Stata BE, which is the smallest flavor of Stata, you can have up to 2,048 variables. With columns out as far as AP you are only at 42 variables. Even assuming you have additional variables for each school, it is hard for me to believe that you will get up anywhere close to 2,048 variables. So either I'm misunderstanding something, or something else is wrong that has not been discussed, and more information about this data is needed to solve the problem.

            Comment

            Working...
            X