Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to Import many excel files with multiple sheets and appending each sheet into one dta file for each file with the same excel file name

    Hello everyone,

    I just joined Statalist. So this is my first post.

    I would like to give you a brief idea about what I want to do and then explain the problem.

    I have daily meter reading data of many firms stored in excel files of 3 sheets for each firms. I want to make a single dta file for each firm or file which has the same name as the excel file. I had to clean each excel sheets for a suitable data structure.

    Following some posts in this forum, I was able to make a do file that worked partially:-

    - cd "${january_T1}/J"

    // number of sheets 3, so i=3
    foreach f of local filenames {
    forval i=1/3 {
    display `"Importing `f'"'

    import excel using "${january_T1}/J/`f'" ,sheet("Table `i'") cellrange(A6:R31) clear // the cellrange for sheets 1 and 2 is same
    tostring _all, replace // making all observations string

    replace A = "int_" + A if _n!=1 // adding prefix to intervals

    rename A intervals // rename first column as intervals which is hourly intervals

    ** add prefix to first row as they start with numbers

    foreach v of var B-Q {
    replace `v' = "date_" + `v' if _n == 1

    }
    *** Replacing first row as variable names

    foreach x of varlist B-Q {
    local vname = strtoname(`x'[1])
    rename `x' `vname'
    }

    drop in 1 // dropping first row

    sxpose2, clear varname // transpose columns to rows for appending sheets

    *** Replacing first row as variable names
    foreach x of varlist _varname-_var24 {
    local v_name = strtoname(`x'[1])
    rename `x' `v_name'
    }

    cap gen filename = "`f'_Sheet`i'"

    save "${meter_temp}\January\T1b/`f'_sheet`i'.dta", replace //
    }

    // loop over all sheets in each data file, while appending them and converting them to one dta for each firm for the month

    *I am struggling to do the appending of 3 sheets using the code below, as I am not sure how to put it inside the loop above or even outside the loop.*

    // number of sheets 3, so i=3

    foreach f of local filenames {

    use "${meter_temp}\January\T1b/`f'_sheet1.dta" , clear
    append using "${meter_temp}\January\T1b/`f'_sheet2.dta"
    *append using "${meter_temp}\January\T1b/`f'_sheet3.dta"

    save "${meter_work}\January\T1b/`f'.dta", replace

    }

    Issues:

    1) in ${meter_temp}\January\T1b folder, sheet 1 and 2 dta files are separately generated (for example t-6_8639746_january.xlsx_sheet1.dta and t-6_8639746_january.xlsx_sheet2.dta) however in the file names of sheet 1 and 2 ".xlsx" gets captured which I do not want. How can I remove this .xlsx part?
    2) How can I append data of sheet 1-3 for each firm from the same folder keeping the same name? - shared an image file of all the files in the folder (https://ibb.co/ZHnJpyX)

  • #2
    1) in ${meter_temp}\January\T1b folder, sheet 1 and 2 dta files are separately generated (for example t-6_8639746_january.xlsx_sheet1.dta and t-6_8639746_january.xlsx_sheet2.dta) however in the file names of sheet 1 and 2 ".xlsx" gets captured which I do not want. How can I remove this .xlsx part?
    Before your -save- command insert a command:
    Code:
    local target: subinstr local f ".xlsx" ""
    and in your -save- command, replace `f' by `target'.

    2) How can I append data of sheet 1-3 for each firm from the same folder keeping the same name?
    Looking at the code you wrote, the only problem I can see is that asterisk at the start of the last -append- command. That turns that command into a comment, so the third sheet will never get appended. But if you take out that asterisk, I don't see any reason why it won't do the job. While it is possible to incorporate it into the other loop, there is no benefit to doing so. Just leave it on its own at the end. In my experience, mass importation and appending of a large number of files usually doesn't work out well, because even after carefully cleaning the original files, there are often problems that escaped notice. So I think breaking the task into smaller groups of files to append, as you have done here, is a better approach.
    Last edited by Clyde Schechter; 22 Feb 2023, 21:34.

    Comment


    • #3
      Dear Clyde,

      Thanks alot for your valuable suggestion and quick response. The subinstr code worked like magic. However, I am currently facing a new error variable date__ already defined. I am sharing a snap of the data below:


      input str15(intervals date_04_01_2023 date_03_01_2023 date_02_01_2023 date_01_01_2023 date_31_12_2022) str6(date__ Q) str7 R
      "" "date_04/01/2023" "date_03/01/2023" "date_02/01/2023" "date_01/01/2023" "date_31/12/2022" "date_." "date_." "8639599"
      "int_00:00-01:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_01:00-02:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_02:00-03:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_03:00-04:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_04:00-05:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_05:00-06:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_06:00-07:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_07:00-08:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_08:00-09:00" "0.061" "0.085" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_09:00-10:00" "0.070" "0.118" "0.086" "0.000" "0.000" "." "." "8639599"
      "int_10:00-11:00" "0.072" "0.113" "0.131" "0.000" "0.000" "." "." "8639599"
      "int_11:00-12:00" "0.069" "0.131" "0.114" "0.000" "0.000" "." "." "8639599"
      "int_12:00-13:00" "0.031" "0.082" "0.110" "0.000" "0.000" "." "." "8639599"
      "int_13:00-14:00" "0.000" "0.000" "0.001" "0.000" "0.000" "." "." "8639599"
      "int_14:00-15:00" "0.012" "0.000" "0.130" "0.000" "0.005" "." "." "8639599"
      "int_15:00-16:00" "0.065" "0.000" "0.110" "0.000" "0.000" "." "." "8639599"
      "int_16:00-17:00" "0.141" "0.000" "0.101" "0.000" "0.000" "." "." "8639599"
      "int_17:00-18:00" "0.000" "0.000" "0.030" "0.000" "0.000" "." "." "8639599"
      "int_18:00-19:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_19:00-20:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_20:00-21:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_21:00-22:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_22:00-23:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      "int_23:00-24:00" "0.000" "0.000" "0.000" "0.000" "0.000" "." "." "8639599"
      end
      [/CODE]

      The error is coming because there is no data in columns P-R in one of the excel sheets as I am asking to import from cell range (A6:R31). How should I overcome this problem as I am running this in a loop for all IDs?

      Comment


      • #4
        What is the code leading up to where you get this error? I'm guessing it happens when you try to rename the variables from B-Q to date_*. If I have that right, the code you want to change is:
        Code:
        ** add prefix to first row as they start with numbers
        
        foreach v of var B-Q {
            replace `v' = "date_" + `v' if _n == 1 & !missing(`v')
        
        }
        *** Replacing first row as variable names
        
        foreach x of varlist B-Q {
            if !missing(`x'[1]) {
                local vname = strtoname(`x'[1])
                rename `x' `vname'
            }
        }
        The way you describe the problem it seems that you have 3 variables, P, Q, and R which are empty in the spreadsheet, and therefore import as variables with all missing values into Stata. When you prefix date_ to the entry in the first observation (row), that leads to just plain date_ in variables (columns) P, Q, and R. You then try to rename each variable to the contents of its first observation. That's fine for P, the first one. But then when you hit Q, you have already renamed P as date_, and so you can't do that again with Q. The above code simply skips the entire process when there is nothing in the first observation of a variable.

        Comment


        • #5
          You are correct. the error comes due to renaming variables. However, !missing(`v') did not work as the dataset is not considering the variables empty instead its considering "." . Therefore I tried this and it seems to be working..

          Code:
          ds intervals, not
              
              foreach v of var `r(varlist)' {
              replace `v' = "date_" + `v' if _n == 1
              replace `v'="" if `v'=="date_." | `v'=="." 
              count if missing(`v')
              if r(N)>0 drop `v'
          
              
          }
          thank you so much for all the help.

          Comment

          Working...
          X