Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing excel files that contain an "enter" in a string variable

    Hi all

    I will try my best to explain the problem I am facing. I am importing multiple excel files using import excel, no issues here. Each column contains the variable name plus the year, for example Accounts2016. Since I import all the excel files and then append, I do not use the firstrow option as the variable name in the second excel file would be Accounts2015.

    The code works perfectly and I then use the autorename function to get the variable names from the firstrow, for example Accounts 2016 (the fact that it is 2016 is not an issue). But this where things get weird. The string in excel is actually "accounts(enter)2016" and not "accounts2016". If I copy paste the variable name from STATA browser I notice it is "Accounts(enter)2016" and the variable becomes useless as I cannot type the enter when writing additional code.

    I therefore need one of two solutions. 1) Append the datasets by just stacking them on top of each other (ignore the variable name). This will work as I can the use the firstrow opttion.
    2) get rid of the enter in the "string", either in the firstrow before autorenaming or after changing it to the variable name.

    Here is the code I have used:
    set more off
    cd "C:\...."
    local allfiles : dir "C:\..." files "*.xlsx"
    tempfile building
    clear
    save `building', emptyok
    foreach f of local allfiles {
    import excel using "`f'", sheet("Results") case(pre) all clear // You may need to apply certain other options here such
    // as -firstrow- or -cellrange()- depending on how these spreadsheets are laid out
    gen source = "`f'"
    append using `building', force
    save `building', replace
    }



    autorename
    Thank you very much


  • #2
    I think the simplest solution is to use the -firstrow- option when you import. Then Stata will deal automatically with turning those strings into legal variable names. Presumably you will then also need to do the -rename-ing in each iteration of the loop so that the different files have the same variable names and are ready to -append- nicely.

    Comment


    • #3
      I agree with Clyde that you should use firstrow when importing the files. I am also a bit surprised that you are not facing additional problems that I would expect from looking at your code. First, if you do not specify firstrow each dataset will have the variable names in its first row and all values will be imported as strings. When you append the files you will end up with variable names in the middle of the dataset. You could fix this by moving autorename (from SSC, I suppose) inside the loop before you call append.

      As usual, I will point to xls2dta (SSC).

      Best
      Daniel

      Comment


      • #4
        Thanks for your help, I had to write code to remove all integers and special characters from the first row. That seemed to do the trick.
        daniel klein the xls2dta would help a lot but I receive the following error message:
        PHP Code:
        xls2dtaclear firstrow append
        option firstrow not allowed 
        For reference, here is the code for removing integers and special characters from multiple files:
        PHP Code:
        cd "C:\Users\..."
        local allfiles dir "C:\Users\..." files "*.xls*"

        foreach f of local allfiles {
        import excel "`f'"sheet("Results"allstring clear

        foreach x of varlist _all {
        forvalues i 0/255 {
            if !
        inrange(`i', 65, 90) /// UPPER CASE LETTERS
                & !inrange(
        `i', 97, 122) { // LOWER CASE LETTERS
                    quietly replace `x' 
        subinstr(`x', `= `"char(`i')"' ', "", .) if _n==1
            }
        }
        }
        export excel using "C:\Users\....\temp2\\`f'
        .xls", replace

        Comment


        • #5
          Thanks for the interest in xls2dta. Check the help file again. This would be something like

          Code:
          xls2dta : import excel ./*.xlsx , sheet("Results") firstrow
          xls2dta : xeq data_cleaning_commands
          xls2dta , clear : append
          In the first step, you import all files. In the second you clean the individual datasets, i.e., do the renaming stuff. Then in a third step you append the files. I believe in your case it might be easier to set up the loop yourself and try doing it all in one run.

          Best
          Daniel

          Comment


          • #6
            Thought that it might be the case. xls2dta is great though, thanks!

            Comment


            • #7
              My reading of #1 is that "(enter)" is used to describe line breaks in Excel cells, something you can do by typing control-option-return on a Mac and something equivalent on a PC. So a column header for the variable Account(enter)2015 would have been created by typing Account followed by control-option-return followed by 2015.

              As Clyde and daniel have suggested, this type of issue is best handled by the firstrow option. That leaves the problem of how to remove the year from the variable name when appending files. The following example shows how to use runby (from SSC) to import such Excel files. The list of files to process is obtained using filelist (also from SSC). With runby, you run a series of commands (defined within a program) on by-group data subsets. The my_import program handles all the specific tweaks needed to get the data in shape. In this case, I assumed that all files have an Account variable and used that to extract the year.

              Code:
              clear all
              
              program my_import
              
                  local f = filename[1]
                  import excel "`f'", firstrow clear
                  
                  // get the year from a variable that occurs in all files
                  unab basevar : Account*
                  local year = subinstr("`basevar'", "Account", "" ,1)
                  
                  // group rename all variables to remove the year suffix
                  rename *`year' *
                  
                  // add variables for the year and the source filename
                  gen year = `year'
                  gen source = "`f'"
                  
                  list
                  
              end
              
              * make a dataset of Excel files to process from the current directory
              filelist , norecur
              keep if strpos(filename,".xlsx")
              
              runby my_import, by(filename) verbose
              
              list, sepby(year)
              and here are the results when used with 2 demonstration datasets I created in Excel:
              Code:
              . * make a dataset of Excel files to process from the current directory
              . filelist , norecur
              Number of files found = 5
              
              . keep if strpos(filename,".xlsx")
              (3 observations deleted)
              
              . list
              
                   +-----------------------------------+
                   | dirname   filename          fsize |
                   |-----------------------------------|
                1. | .         data_2015.xlsx   29,521 |
                2. | .         data_2016.xlsx   29,599 |
                   +-----------------------------------+
              
              . 
              . 
              . runby my_import, by(filename) verbose
              
                   +----------------------------------------------------+
                   | price   quantity   Account   year           source |
                   |----------------------------------------------------|
                1. |     1          2         3   2015   data_2015.xlsx |
                2. |     4          5         6   2015   data_2015.xlsx |
                   +----------------------------------------------------+
              
                   +----------------------------------------------------+
                   | price   quantity   Account   year           source |
                   |----------------------------------------------------|
                1. |     7          8         9   2016   data_2016.xlsx |
                2. |    10         11        12   2016   data_2016.xlsx |
                   +----------------------------------------------------+
              
              --------------------------------------
              Number of by-groups    =             2
              by-groups with errors  =             0
              by-groups with no data =             0
              Observations processed =             2
              Observations saved     =             4
              --------------------------------------
              
              . 
              . list, sepby(year)
              
                   +----------------------------------------------------+
                   | price   quantity   Account   year           source |
                   |----------------------------------------------------|
                1. |     1          2         3   2015   data_2015.xlsx |
                2. |     4          5         6   2015   data_2015.xlsx |
                   |----------------------------------------------------|
                3. |     7          8         9   2016   data_2016.xlsx |
                4. |    10         11        12   2016   data_2016.xlsx |
                   +----------------------------------------------------+

              Comment

              Working...
              X