Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to append multiple worksheets in an excel file with local macros?

    My data is in .xls format with multiple worksheets. The worksheets are named by stated (for example, California, Arizona, Texas...). I realize that the import excel command would basically import just the specified worksheet or the first worksheet by default. I have the option of using the xls2dta command that converts each worksheet into dta and then I append each file individually.

    However, I am trying to append all the worksheets into a single excel file using a local macro.

    I have the following command:

    local mylist "California" "Texas" "Arizona"
    foreach x in local `mylist' {
    import excel using "C:\Users\Originaldata.xls", sheet("x") firstrow case(lower) clear
    save "Originaldata", replace
    }

    I keep getting an error message that worksheet "x" cannot be found. I would really appreciate any help or ideas on how I could do this. I use Stata 13.

    Thank you!!
    Debbie

  • #2
    If that really was your code, I don't think you would get nearly as far.

    It should be

    Code:
     
    foreach x of local mylist {
    and then within the loop you need to refer to

    Code:
     
    `x'
    not

    Code:
     
    x
    In fact the indirection here can be reduced.

    Code:
     
    foreach x in California Texas Arizona  {
         import excel using "C:\Users\Originaldata.xls", sheet("`x'") firstrow case(lower) clear
         save "Originaldata", replace
    }
    but that is still problematic. Your data file will come out of the loop just containing data for Arizona, and looking at California and Texas will be pointless. So, you need to change the last line too.

    Comment


    • #3
      It is possible that the names on the tabs of the Excel worksheets are "contaminated" with leading or trailing blank spaces or, less commonly, with embedded non-printing characters. It is also possible that some are simply misspelled.

      You can get around these problems with:

      Code:
      clear
      tempfile building
      save `building', emptyok
      
      import excel using "C:\Users\Originaldata.xls", describe
      return list
      local n_sheets `r(N_worksheet)'
      forvalues j = 1/`n_sheets' {
          local sheet`j' `r(worksheet_`j')'
      }
      
      
      
      forvalues j = 1/`n_sheets' {
          import excel using c":Users]Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear
          // whatever it is you do with this data, perhaps -gen source = `"`sheet`j''"'-
          append using `building'
          save `"`building'"', replace
      }
      At the end of this you should have all of the data sheets appended into `building', which you can then load, clean up, and do whatever you please with.

      I have seen this approach sometimes fail with embedded non-printing characters in the names on the tabs. In that case I know of no solution other than going back to the Excel file and manually naming the tabs correctly, though perhaps someone else on the forum has a solution for this. (If so, I would love to see it!)

      Comment


      • #4
        Also see xls2dta from SSC.

        Best
        Daniel

        Comment


        • #5
          Thank you. I really appreciate the help. So basically, I was getting the quotations wrong. I kept on using ' when I should be using `. I corrected that.
          And Clyde, the codes you suggested worked perfectly!
          I just need to correct some appending errors in each file before appending.

          Comment


          • #6
            I need to correct a statement made earlier. Your code was entirely legal, and it would have stopped when you said, so I was wrong. Sorry about that.

            The explanation is possibly a little helpful or even interesting.

            Code:
             
            local mylist "California" "Texas" "Arizona"
            foreach x in local `mylist' {
            import excel using "C:\Users\Originaldata.xls", sheet("x") firstrow case(lower) clear
            save "Originaldata", replace
            }
            Given the first line, Stata would read the second line as

            Code:
             
            foreach x in local California Texas Arizona
            which at first sight looked illegal to me as I am aware of Stata's of local syntax. That's irrelevant, however, and Stata sees a list with four items,

            local
            California
            Texas
            Arizona


            and it starts looping over them. The first item is local, so Stata starts with that. Now you don't actually refer to the local macro within the loop (which isn't illegal) because as said you need to use quotation marks to do that. So Stata uses a literal x and stops because it can't find a sheet with that name.

            In other words, Stata just took the word local as arbitrary text and didn't pay any attention to its intended syntactic meaning. The use of the syntax element in was necessary and sufficient to alert Stata that a list was coming next. Hence the code was legal, just not what you wanted.

            Comment


            • #7
              Originally posted by Clyde Schechter View Post
              It is possible that the names on the tabs of the Excel worksheets are "contaminated" with leading or trailing blank spaces or, less commonly, with embedded non-printing characters. It is also possible that some are simply misspelled.

              You can get around these problems with:

              Code:
              clear
              tempfile building
              save `building', emptyok
              
              import excel using "C:\Users\Originaldata.xls", describe
              return list
              local n_sheets `r(N_worksheet)'
              forvalues j = 1/`n_sheets' {
              local sheet`j' `r(worksheet_`j')'
              }
              
              
              
              forvalues j = 1/`n_sheets' {
              import excel using c":Users]Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear
              // whatever it is you do with this data, perhaps -gen source = `"`sheet`j''"'-
              append using `building'
              save `"`building'"', replace
              }
              At the end of this you should have all of the data sheets appended into `building', which you can then load, clean up, and do whatever you please with.

              I have seen this approach sometimes fail with embedded non-printing characters in the names on the tabs. In that case I know of no solution other than going back to the Excel file and manually naming the tabs correctly, though perhaps someone else on the forum has a solution for this. (If so, I would love to see it!)

              hi, I have been trying to solve the same problem on my stata for days, and gladly found your solutions actually worked by me! finally! thanks!

              But there are some small questions:

              1, everything works fine except that in the final bigdataset, the data from the first worksheet appeared at the bottom of the final dataset, is there anyway to havea a final big dataset with the data from worksheet1 comes first, then worksheet2, 3, ..... ?

              2,how could I save this tempfile, please? I tried save, replace, it's not working ;P

              Thanks a loooot!
              Last edited by Melanie Fan; 03 Dec 2014, 07:26.

              Comment


              • #8
                Melanie,

                Having the sheets in reverse order is an unavoidable consequence of using the append command. Your best bet is to re-sort the data. If necessary you can save the value of the j macro to your data set so you can use it to sort in sheet order.

                Regarding your second question, you just need to do a final save, replace, giving it a permanent name.

                Code:
                 
                clear 
                tempfile building 
                save `building', emptyok  
                
                import excel using "C:\Users\Originaldata.xls", describe 
                return list local n_sheets `r(N_worksheet)' 
                forvalues j = 1/`n_sheets' { 
                  local sheet`j' `r(worksheet_`j')' 
                }    
                
                forvalues j = 1/`n_sheets' { 
                  import excel using c":Users]Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear 
                  gen sheet = `j' 
                  append using `building' 
                  save `"`building'"', replace 
                } 
                
                sort sheet
                save final_data, replace
                Incidentally, once you are satisfied that this works, there's no particular reason to use a tempfile here if you plan to save it to a permanent file. Just put the name of the permanent file wherever you see `building'.

                Regards,
                Joe

                Comment


                • #9


                  Hi,Joe,

                  thanks a lot for your reply. It's really helpful! Now i could have a final file with all the data in it. The sort function is also useful. But could there be any ideas to sort it in the order of the original worksheets? I mean, for example, my excel has the data for countries, country 1 in worksheet1 could be Danemark, country 2 in worksheet2 could be Austria, 3:Belgium . Though it is not strictly required to sort them in order of Danemark, Austria, Belgium for final analysis, but it would be nice to have a *.dta file in resemblance with the original excel file. I'd be very thankful if you could help further.

                  Best,
                  Melanie
                  Last edited by Melanie Fan; 03 Dec 2014, 13:55.

                  Comment


                  • #10
                    Originally posted by Clyde Schechter View Post
                    It is possible that the names on the tabs of the Excel worksheets are "contaminated" with leading or trailing blank spaces or, less commonly, with embedded non-printing characters. It is also possible that some are simply misspelled.

                    You can get around these problems with:

                    Code:
                    clear
                    tempfile building
                    save `building', emptyok
                    
                    import excel using "C:\Users\Originaldata.xls", describe
                    return list
                    local n_sheets `r(N_worksheet)'
                    forvalues j = 1/`n_sheets' {
                    local sheet`j' `r(worksheet_`j')'
                    }
                    
                    
                    
                    forvalues j = 1/`n_sheets' {
                    import excel using c":Users]Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear
                    // whatever it is you do with this data, perhaps -gen source = `"`sheet`j''"'-
                    append using `building'
                    save `"`building'"', replace
                    }
                    At the end of this you should have all of the data sheets appended into `building', which you can then load, clean up, and do whatever you please with.

                    I have seen this approach sometimes fail with embedded non-printing characters in the names on the tabs. In that case I know of no solution other than going back to the Excel file and manually naming the tabs correctly, though perhaps someone else on the forum has a solution for this. (If so, I would love to see it!)

                    Hi, Clyde,
                    Thanks a lot for the sulotion of reading multiple worksheets. I have always worked quite well with this code. But recently I run into a little problem during applying it to another dataset.
                    The dataset has N worksheets, each numbered 1-9, 10-19, 20-29, ...90-99, an individual worksheet looks like
                    A B C D E
                    1 a 1a wood explanation for wood
                    1 b 1b wood explanation for wood
                    2 a 2a cheese explanation for cheese
                    ...
                    ...

                    After running the code, STATA tells me in red:

                    variable d is byte in master but str25 in using data
                    You could specify append's force option to ignore this numeric/string mismatch. The using variable would then be treated as if it contained
                    numeric missing value.
                    r(106);

                    I wonder:
                    1)what does mean, variable D is byte in master but str25 in using data? "master" means the original excel file?
                    2) how could I speficy append's force option to ignore this mismatch?

                    So far the data read by STATA contains only 20 to 29, which seems quite strange, since it is the 3rd sheet, since usually if the code does not run well, it read only the 1st sheet.
                    And all data are shown in red color. I do have a problem in understanding how STATA treats data with its float, double, byte, int, long system in real application. I hope I could have some further ideas in dealing with this kind of problems.

                    Thank you so much! as always!

                    Melanie

                    Comment


                    • #11
                      Your sample data does not seem consistent with the problems you are having. What are the values of variable D supposed to be? It looks to me like they are things like "wood" or "cheese", etc. Those are clearly strings, so I don't understand why variable D would be "byte" in any of these data sets.

                      In any case, the "master" data set that the error message refers to is the data that you have just read in from Excel at the time the -append-command is invoked.

                      You can specify the force option by just adding -, force- to the append command. But I really recommend strongly against doing that. The consequence is that you will get all missing values for that variable from that particular worksheet.

                      The first thing to do is to figure out what is going on in your Excel spreadsheets. It appears that column D contains numeric data in most of the worksheets, but in some of them it contains strings. Now, a common reason for this to happen is that the "string" variable is mostly numeric but also contains things like "N/A" or "<5" or the like. Actually, in your case, the offending version of variable D is str25, which is pretty long: so it sounds like somebody started to write an essay in the midst of recording numeric data! Those things cannot be imported as a numeric variable.So one thing you should do is identify the offending material. You are likely to find errors in the data that will need to be fixed before you can proceed.

                      Sometimes it is not apparent why Stata thinks a variable should be string instead of numeric: everything looks numeric. This is often due to invisible characters contaminating the data: spaces, em-dashes instead of minus signs, etc. In that case, reading in the offending spreadsheet by itself and running the -charlist- command (followed by -return list- so you can get the ascii codes) will show you what the problem is.

                      Finally, sometimes even after thoroughly reviewing a putatively string variable, there is no apparent reason why it imported from Excel as string. In that case, your best bet is probably to convert it to numeric in the loop itself when this happens. So something like this:

                      Code:
                      forvalues j = 1/`n_sheets' {
                          import excel using c":Users]Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear
                          // whatever it is you do with this data, perhaps -gen source = `"`sheet`j''"'-
                          capture confirm string var D // ASK WHETHER D IS A STRING VARIABLE IN THIS SHEET
                          if c(rc) == 0 {  // IF SO
                              destring D, replace  // CONVERT IT TO NUMERIC
                          }
                          append using `building'
                          save `"`building'"', replace
                      }
                      Note: If D truly contains non-numeric material that you just didn't see when trying to review your spreadsheets, the -destring- command will complain and break. In that case, again -charlist D- followed by -return list- in that immediate setting will be your friend. If you need to specifically identify the offending observations, -list D if missing(real(D))- will do that.

                      Now, after you get around this, you may find that you have similar problems later on with other variables. Excel spreadsheets can be very messy because Excel, by design, tolerates the mingling of numeric and text data in the same columns. If this becomes really burdensome, instead of fixing each problem on the fly as suggested above, you can use the -allstring- option on your -import excel- command. That will cause Stata to import all of the variables in each of the worksheets as a string variable, even if it could be numeric. Then, after you have built the big data set, you can loop over each of the variables that should be numeric and -destring- it, finding and correcting any inappropriate non-numeric values.

                      Good luck! I've been down this road many times--it isn't fun, but trust me, there is light at the end of the tunnel. And often in the process you uncover important errors in the data that would have otherwise been missed and corrupted your later analyses.
                      Last edited by Clyde Schechter; 21 Feb 2015, 09:31.

                      Comment


                      • #12
                        Originally posted by Clyde Schechter View Post
                        Your sample data does not seem consistent with the problems you are having. What are the values of variable D supposed to be? It looks to me like they are things like "wood" or "cheese", etc. Those are clearly strings, so I don't understand why variable D would be "byte" in any of these data sets.

                        In any case, the "master" data set that the error message refers to is the data that you have just read in from Excel at the time the -append-command is invoked.

                        You can specify the force option by just adding -, force- to the append command. But I really recommend strongly against doing that. The consequence is that you will get all missing values for that variable from that particular worksheet.

                        The first thing to do is to figure out what is going on in your Excel spreadsheets. It appears that column D contains numeric data in most of the worksheets, but in some of them it contains strings. Now, a common reason for this to happen is that the "string" variable is mostly numeric but also contains things like "N/A" or "<5" or the like. Actually, in your case, the offending version of variable D is str25, which is pretty long: so it sounds like somebody started to write an essay in the midst of recording numeric data! Those things cannot be imported as a numeric variable.So one thing you should do is identify the offending material. You are likely to find errors in the data that will need to be fixed before you can proceed.

                        Sometimes it is not apparent why Stata thinks a variable should be string instead of numeric: everything looks numeric. This is often due to invisible characters contaminating the data: spaces, em-dashes instead of minus signs, etc. In that case, reading in the offending spreadsheet by itself and running the -charlist- command (followed by -return list- so you can get the ascii codes) will show you what the problem is.

                        Finally, sometimes even after thoroughly reviewing a putatively string variable, there is no apparent reason why it imported from Excel as string. In that case, your best bet is probably to convert it to numeric in the loop itself when this happens. So something like this:

                        Code:
                        forvalues j = 1/`n_sheets' {
                        import excel using c":Users]Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear
                        // whatever it is you do with this data, perhaps -gen source = `"`sheet`j''"'-
                        capture confirm string var D // ASK WHETHER D IS A STRING VARIABLE IN THIS SHEET
                        if c(rc) == 0 { // IF SO
                        destring D, replace // CONVERT IT TO NUMERIC
                        }
                        append using `building'
                        save `"`building'"', replace
                        }
                        Note: If D truly contains non-numeric material that you just didn't see when trying to review your spreadsheets, the -destring- command will complain and break. In that case, again -charlist D- followed by -return list- in that immediate setting will be your friend. If you need to specifically identify the offending observations, -list D if missing(real(D))- will do that.

                        Now, after you get around this, you may find that you have similar problems later on with other variables. Excel spreadsheets can be very messy because Excel, by design, tolerates the mingling of numeric and text data in the same columns. If this becomes really burdensome, instead of fixing each problem on the fly as suggested above, you can use the -allstring- option on your -import excel- command. That will cause Stata to import all of the variables in each of the worksheets as a string variable, even if it could be numeric. Then, after you have built the big data set, you can loop over each of the variables that should be numeric and -destring- it, finding and correcting any inappropriate non-numeric values.

                        Good luck! I've been down this road many times--it isn't fun, but trust me, there is light at the end of the tunnel. And often in the process you uncover important errors in the data that would have otherwise been missed and corrupted your later analyses.


                        Hi, Clyde,

                        Thanks a lot for the detailed instruction for how to read STATA's mind! I have a better idea how stata works when reading excel, but still in this case, it is really confusing why stata would ever take D variable as byte, because as I just looked through every worksheets and browsed through every entry in D column, there are 1) absolutely no numbers (as long as my visible eyes could see), and 2) only words (not more than 10 words for each entry), or simply left blank (70 percent of the entries are empty without words), so it's also weird why stata think excel would have a long essay in one of the entries.

                        Is it possible that the person who made the entries typed in thousands of empty spaces in one of the these entries in D column, so that stata used str25. If this could be the reason, how could we detect that and/or delete that?

                        With these information being provided, I don't know if it is possible to get more hints from you, where the problem should lie?

                        Sure, it is not interesting to see command errors popping out again and again even after millions attempts have been made, but surely it reflects a missing point in the previous thinking or solution finding. I really appreciate it that here one could receive very valuable suggestions when going down this dark tunnel, before light appears on the other end

                        Comment


                        • #13
                          It may be that in those spreadsheets where there are only blank entries in column D, Stata imports that as numeric. If that is the problem, then it should be the case that D is always missing when D is not a string. If so,

                          Code:
                          forvalues j = 1/`n_sheets' {
                               import excel using c":Users/Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear
                          
                               // whatever it is you do with this data, perhaps -gen source = `"`sheet`j''"'-
                          
                               capture confirm string var D // ASK WHETHER D IS A STRING VARIABLE IN THIS SHEET
                               if c(rc) != 0 {      // IF NOT
                                    assert missing(D) // VERIFY THAT THERE ARE ONLY MISSING VALUES HERE
                                    tostring D, replace // CONVERT IT TO STRING
                               }
                               append using `building'
                               save `"`building'"', replace
                          }
                          should do the trick. This code will verify that there is in fact no real numeric information in those files where D is not a string. If you omit the -assert- statement, the code will run without this check--but I don't recommend doing that because if there is a worksheet where there are numbers in the D column, then that probably represents a data error and you should know about it--not just turn it into a string variable.

                          Comment


                          • #14
                            Originally posted by Clyde Schechter View Post
                            It may be that in those spreadsheets where there are only blank entries in column D, Stata imports that as numeric. If that is the problem, then it should be the case that D is always missing when D is not a string. If so,

                            Code:
                            forvalues j = 1/`n_sheets' {
                            import excel using c":Users/Originaldata.xls", sheet(`"`sheet`j''"') firstrow case(lower) clear
                            
                            // whatever it is you do with this data, perhaps -gen source = `"`sheet`j''"'-
                            
                            capture confirm string var D // ASK WHETHER D IS A STRING VARIABLE IN THIS SHEET
                            if c(rc) != 0 { // IF NOT
                            assert missing(D) // VERIFY THAT THERE ARE ONLY MISSING VALUES HERE
                             tostring D, replace // CONVERT IT TO STRING
                            }
                            append using `building'
                            save `"`building'"', replace
                            }
                            should do the trick. This code will verify that there is in fact no real numeric information in those files where D is not a string. If you omit the -assert- statement, the code will run without this check--but I don't recommend doing that because if there is a worksheet where there are numbers in the D column, then that probably represents a data error and you should know about it--not just turn it into a string variable.
                            Thanks for the further tip! Looks a possible reason!

                            So I have tried it, but unfortunately, the stata tells me in red that:

                            D not found
                            r(111);


                            How could this happen?
                            Do I have to specify something or rename the variable D in the code somehow? I think, stata treats each column of the excel worksheet as an individual variable already, right? as variables A B C D ... , just like how they were shown above the entries in each column in excel.

                            Comment


                            • #15
                              Clyde's code checks whether D is a string variable in your dataset and takes an error as meaning that D is a numeric variable in your dataset. But it would also be an error if D was not a variable in your dataset at all. So,

                              Code:
                               
                              assert missing(D)
                              cannot be executed whenever there is no D to check.

                              You need code for the non-existence of that variable. I've not read back far in the thread to follow what you are doing, but offer this diagnosis.

                              Comment

                              Working...
                              X