Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing multiple excel sheets using file directory

    I would like to import multiple sheets of an excel file. I'm using a the current directory to select the files for import, so that I can use a fuzzy match with the file names. I see the correct file in the display of the filelist macro, here called "populationlinelist_healthstats 2024-09-15t19_13_04.111z.xlsx"

    I'm receiving an error on the import step, r(106) file populationlinelist_healthstats 2024-09-15t19_13_04.111z.xlsx" not found

    Code:
    *** 0.A.1 Folder Navigation
    local fldr_pipeline        "I:/EandE/5. Pipeline"
    local fldr_refresh        "I:/EandE/4. Refreshes"
    local fldr_PEAR_HAPI     "`fldr_refresh'/Human API (HAPI) Files/Imms_PEAR"
    
    ***************************************************************************************************
    *** PART 1:    Changing working directory and Load Population Line List
    ****************************************************************************************************
    
    *** 0.a Change working directory
    cd "`fldr_PEAR_HAPI'"
    
    ***0.b. Load HAPI Population Line List
    
    dir *.xlsx
    local filelist: dir "." files "PopulationLineList_HealthSTATS*.xlsx"
    di `"`filelist'"'
    
    *** 1.a Append each worksheet for Population Line List
    
    forvalues i=1/3 {
    tempfile population
    save `population', emptyok
    import excel `"`filelist'"', sheet("Page1_`i'") firstrow clear
      append using `population'
    }
    save `"`population'"', replace
    Any suggestions on improvement are appreciated.

  • #2
    Code:
    import excel `"`filelist'"', sheet("Page1_`i'") firstrow clear
    is a mistake because `filelist' contains the names of all three files, but the -import excel- command expects only a single filename there. The error message, admittedly, is not quite accurate: the problem is not that the file can't be found but that your syntax is wrong.

    That said, there are also some errors in the logic of your code. You are -save-ing `population' at the wrong place in the loop. And it is, I think, better when you need parallel looping between the filename and the page number, to loop over the filename and update the page number inside the loop.

    So here's how I would do this:
    Code:
    clear
    
    local filelist: dir "." files "PopulationLineList_HealthSTATS*.xlsx"
    di `"`filelist'"'
    
    tempfile population
    save `population', emptyok
    local i = 1
    foreach f of local filelist {
        import excel `"`filelist'"', sheet("Page1_`i'") firstrow clear
        append using `population'
        local ++i
        save `"`population'"', replace
    }
    All of that said, I wouldn't actually do this at all. Given a batch of files, the likelihood that there will be incompatibilities between the files that will result either in -append- not accepting them at all or in loss of data when it does. My workflow when I need to do this is to save each of the worksheets as a separate .dta file first. I then use Mark Chatfield's -precombine- program (available from SSC) to find any incompatibilities. Then I fix them in the separate .dta files. And, as the final step, I run another program to append the files together.

    While with only 3 files, there is a reasonable probability that no incompatibilities will turn up, Excel spreadsheets are high risk files for problems like a variable being imported as string from one worksheet and numeric from another. But better to play it safe. If -precombine- finds no incompatibilities, then you just move on to the -append-ing program and have lost only the few seconds it takes to run it.


    Comment


    • #3
      Thank you Clyde Schechter for your reply. When I apply the code below, only the first excel sheet was added (Page1_1). In my test case, there is one excel file, PopulationLineList_HealthSTATSDate.xlsx with sheets Page1_1, Page1_2 and Page1_3. I'm going to investigate -precombine, thank you for the suggestion.

      Code:
      dir *.xlsx
      
      local filelist: dir "." files "PopulationLineList_HealthSTATS*.xlsx"
      di `"`filelist'"'
      
      *** 1.a Append each worksheet for Population Line List
      
      tempfile population
      save `population', emptyok
      local i = 1
      foreach file of local filelist {
          
          import excel "`file'", sheet("Page1_`i'") firstrow clear
          append using `population'
          local ++i
          save `"`population'"', replace
      }

      Comment


      • #4
        In my test case, there is one excel file
        Oh, I misunderstood. I thought there were three separate excel files. That's why the code isn't working properly. You refer to this in #3 as a "test case." I will assume, then, that in your ultimate application of this code, there are multiple Excel files, each with multiple worksheets.

        To do this with 3 worksheets in a single Excel file, the code is different.
        Code:
        clear*
        
        dir *.xlsx
        
        local filelist: dir "." files "PopulationLineList_HealthSTATS*.xlsx"
        di `"`filelist'"'
        
        tempfile population
        save `population', emptyok
        foreach file of local filelist {
            import excel `"`file'"', describe
            local n_sheets `r(N_worksheet)'
            forvalues i = 1/`n_sheets' {
                local sheet_`i' `r(worksheet_`i')'
            }
            forvalues i = 1/`n_sheets' {
                import excel "`file'", sheet(`"`sheet_`i''"') firstrow clear
                gen filename = "`file'"
                gen worksheet = `"`sheet_`i''"'
                append using `population'
                save `"`population'"', relace
            }
        }
        This will go through the PopulationLineList_HealthSTATS*.xlsx files one at a time, and will pull every worksheet out and save it in `population'. Note that so you can distinguish which observations in the resulting data set come from which of the files, I have added a new variable to the data set, called filename, and another indicating the source worksheet as well.

        My advice regarding not directly appending, but rather saving each worksheet in a separate .dta file, checking with -precombine-, and then writing a separate short program to -append- them after fixing any incompatibilities among them, remains unchanged.
        Last edited by Clyde Schechter; 16 Sep 2024, 15:02.

        Comment

        Working...
        X