Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Excel files: Merge worksheets within Excel files and append Excel files with a loop

    Dear Statalist,

    We have collected annual data on minorities in several countries for 25 years. Now I would like to import all data into Stata and create a master dataset: 160 variables with data on 25 years for almost 800 minorities. The variables are in numeric and string format.

    The tricky part:
    The data for each minority is stored in a separate Excel file (i.e. 800 Excel files).
    Each Excel file consists of five worksheets.
    Each worksheet contains different variables (and a different number of variables, hence columns).
    In the first row of each worksheet the years are listed.
    Each worksheet contains two fields indicating country and group identifier.
    The design of the worksheets is unfavourable, because country and group identifier and year are in the same row. Similarly, the group identifier and the variables are in the same row as well.

    I would like to
    1. Merge all worksheets within each Excel file by year.
    2. Create variable “year”.
    3. Create variable “ID” which informs about country-group per year.
    4. Append all Excel files in order to have one master dataset.

    Unfortunately, I was not able to find help for my particular problem. Is there a code that I could use for a loop over all my 800 Excel files?

    This is how the worksheets look like:

    Worksheet 1 in Excel file A
    Country: FRA
    Group: B2 Var_1 Var_2 Var_3
    1990
    1991
    1992
    1993
    1994
    Worksheet 2 in Excel file A
    Country: FRA
    Group: B2 Var_4 Var_5 Var_6
    1990
    1991
    1992
    1993
    1994








    This is how the worksheets should look like after they were merged:
    ID year Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
    FRA-B2 1990
    FRA-B2 1991
    FRA-B2 1992
    FRA-B2 1993
    FRA-B2 1994
    In the above format I could append the various Excel files. I hope my description is specific enough. I very much appreciate your support.

    Thank you. Tom.
    Last edited by Tom Konzack; 09 Jan 2019, 09:28.

  • #2
    Tom, I don't have anything to test this on that is remotely like your situation, so there likely are a number of bugs in what follows, but this should give you a basic start. And, of course, I might have misunderstood your intent. Finally, I have assumed that your files relatively cleanly follow the format you describe above, so if they don't, all bets are off (grin).
    Code:
    local nsheet = 5
    cd YourExcelDirectory // You must replace
    local ExcelFileList: dir "." files "*.xls*" // Assume your files have conventional extension
    local filecount 0
    foreach f of local ExcelFileList {
       // Import and save each sheet from this Excel file
        forval i = 1/`nsheet' {
            import excel "`f'", sheet("Sheet`i'") clear // each of sheets in this file
            // renaming from rows 1 and 2
            gen id = subinstr(A[1], "Country:", "", .) + subinstr(A[2], "Group:", "", .)
            order id
            rename A year
            ds year id, not
            foreach v of varlist `r(varlist)' {
                local rightname = `v'[2]
                rename `v' `rightname'
            }
            drop in 1/2 // don't need these rows
            tempfile sheet`i'
            save `sheet`i''
        }
        // Merge together the sheets for this file.
        use "`sheet1'"
        forval i = 2/`nsheet' {
            merge 1:1 id year using "`sheet`i''"
            drop _merge
        }
        // Make the variables numeric as needed
        ds year id, not
        foreach v of varlist `r(varlist)' {
            capture confirm string variable `v'
            if (_rc == 0) {
                destring `v', replace
        }
        // clean up tempfiles
        forval i = 1/`nsheet' {
            erase "`sheet`i'"
        }
        // Now we have one of the Excel files translated, so count it and save it for later append
        local ++filecount
        tempfile file`filecount'
        save `file`filecount''"
    } // end of importing each Excel file
    //
    // Append all the translated files
    forval i = 1/`filecount' {
        append using "`file`i''"
    }
    Last edited by Mike Lacy; 09 Jan 2019, 11:35.

    Comment


    • #3
      Dear Mike,

      thanks so much for your quick reply, very much appreciated. I will try your code today and come back to you. All the best, Tom.

      Comment


      • #4
        I just noticed one mistake I made: There needs to be a -clear- right before the loop to append all the files at the very end.

        Comment


        • #5
          Dear Mike,
          thanks so much for your hint. I have been trying hard today, but was not successful - I have no real experience with loop and these Excel files are quite difficult to import.

          The five worksheets of each Excel file have different titles and I was replacing forval with foreach giving the titles of the worksheets, because I received an error message ("Worksheet1 not found.").

          Generating variables ID and year work perfectly, but unfortunately the renaming of the other variables with the values from the second row does not work, as Stata gives the error message "Syntax error" (198).
          This code produced error message 198:
          Code:
           foreach v of varlist `r(varlist)' {
                      local rightname = `v'[2]
                      rename `v' `rightname'
          The same outcome is produced by

          Code:
          rename `v' `=`v'[2]' // another quote sign
          However, it is possible to assign to each variable the labels of the values of the second row... Another problem is that many variables in each sheet and across sheets have identical names. Therefore, I was thinking about a way to rename each variable by hand for each of the five sheets. But it only works for the first sheet and ends with the error message "variable B not found" r(111). Obviously this is where Stata stops and does not proceed with the other sheets. Is there maybe a way in a loop to rename the variables of each sheet separately, save each sheet and then merge all sheets together?


          Code:
          cd "My_working_directory"
          local ExcelFileList: dir "." files "*.xls*"
          local filecount 0
          foreach f of local ExcelFileList {
            
              foreach i in "1. Static + Org" "2. Grievances" "3.1. NV Mobi" "3.2. V Mobi" "3.3. Lethality"  {
              display "`i'"
                  
                  import excel "`f'", sheet(`i') clear
                  // renaming from rows 1 and 2
                  gen id = subinstr(A[1], "Country:", "", .) + subinstr(A[2], "Group: ", "-", .)
                  order id
                  rename A year
                  ds year id, not
               
                 // give variable labels
                 foreach var of varlist `r(varlist)' {
                      label variable `var' "`=`var'[2]'"
                      }
                      
                // attempt to rename variables
              /*foreach var in i = "1. Static + Org" {    // I would do this for i = [the other four sheets]
                  rename B Ethnicity
                  rename C PC_01
                  rename D Comments_01
                  rename E Quotes_01
                  rename F Regional
                  rename G PC_02
                  rename H Comments_02
                 .....
                  drop in 1/2 // don't need these rows
                  drop if _n > 25 // don't need these lines
          }
          }
          }
          Do you have an idea how to proceed? Thanks so much for your support. Best, Tom.

          Comment


          • #6
            I need to be sure we are working with the same raw material, so I need an example. Take one of your Excel files, and save the first 10 lines or so of 3 worksheets into a separate CSV file.
            Post each of those here between code delimiters. That will permit me (and you) to make an Excel file so that we are both working on the same thing.

            Comment


            • #7
              Dear Mike,
              thanks for your reply. Below are examples of a few lines of three of the five worksheets. I also attach the CSV file here. Your support is much appreciated. Best, Tom.
              Sample.csv

              Worksheet "1. ABC":
              Code:
               
              Country: AUT 1. ABC
              Group: J5 a. Ethnicity PC Comments Quote(s) b. Region PC Comments Quote(s)
              1990 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1991 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1992 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1993 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1994 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1995 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1996 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1997 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1998 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              1999 1=significant 2=conflicting information • Text Text • Text Text 2=strong/fully 1=high precision • Text Text • Text Text
              Worksheet "2. DEF":
              Code:
               
              Country: AUT 2. DEF
              Group: J5 a. Existence PC Comments Quote(s) b. Religious PC Comments Quote(s)
              1990 0=no reports 4=indirect information • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1991 0=no reports 4=indirect information • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1992 1=yes 1=high precision • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1993 0=no reports 4=indirect information • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1994 0=no reports 4=indirect information • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1995 1=yes 1=high precision • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1996 0=no reports 4=indirect information • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1997 1=yes 1=high precision • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1998 0=no reports 4=indirect information • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1999 1=yes 1=high precision • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              Worksheet "3. GHI":
              Code:
               
              Country: AUT 3. GHI
              Group: J5 3a. Mob-history PC Comments Quote(s) 3b. Mob-calls PC Comments Quote(s)
              1990 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1991 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1992 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1993 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1994 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1995 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1996 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1997 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1998 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text
              1999 0=no 5=other • Text Text • Text Text 0=no 4=indirect information • Text Text • Text Text

              Comment


              • #8
                Sorry I'm jumping in here late.

                I was able to rename variables to values of 1st row (obs==1) or 2nd row using the following:
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str10 A str17(B C D E)
                "Date2"      "Hospital5" "Hospital6"         "Hospital7" "Hospital8"
                "Date"       "Company1"  "Company2"          "Company3"  "Company4" 
                "12/31/1999" "6958.14"   "101.15"            "98.75"     "15.74"    
                " 1/7/2000"  "6780.96"   "101.15"            "107.08"    "16.2"     
                " 1/6/2000"  "6474.92"   "101.15"            "107.08"    "16.25"    
                " 1/5/2000"  "6502.07"   "98.54000000000001" "104.7"     "15.74"    
                " 1/4/2000"  "6586.95"   "101.15"            "103.51"    "15.74"    
                " 1/3/2000"  "6750.76"   "101.15"            "104.7"     "15.74"    
                " 1/18/2000" "7072.12"   "101.15"            "99.94"     "15.74"    
                " 1/17/2000" "7258.9"    "101.15"            "101.13"    "16.3"     
                " 1/14/2000" "7173.22"   "110.94"            "103.03"    "15.74"    
                " 1/13/2000" "6955.98"   "110.94"            "102.32"    "14.82"    
                " 1/12/2000" "6912.81"   "103.11"            "103.51"    "14.82"    
                " 1/11/2000" "6891.25"   "101.81"            "104.7"     "14.82"    
                " 1/10/2000" "6925.52"   "102.78"            "105.89"    "16.2"     
                end
                
                foreach var of varlist * {
                    rename `var' `=`var'[1]'
                }
                
                foreach var of varlist * {
                    rename `var' `=`var'[2]'
                }
                
                
                
                . foreach var of varlist * {
                  2.     rename `var' `=`var'[2]'
                  3. }
                
                
                . list in 1/5
                
                     +------------------------------------------------------------+
                     |       Date    Company1    Company2    Company3    Company4 |
                     |------------------------------------------------------------|
                  1. |      Date2   Hospital5   Hospital6   Hospital7   Hospital8 |
                  2. |       Date    Company1    Company2    Company3    Company4 |
                  3. | 12/31/1999     6958.14      101.15       98.75       15.74 |
                  4. |   1/7/2000     6780.96      101.15      107.08        16.2 |
                  5. |   1/6/2000     6474.92      101.15      107.08       16.25 |
                     +------------------------------------------------------------+
                
                
                . do "C:\Users\dfb6\AppData\Local\Temp\STD3af4_000000.tmp"
                
                . foreach var of varlist * {
                  2.     rename `var' `=`var'[1]'
                  3. }
                
                
                . list in 1/5
                
                     +------------------------------------------------------------+
                     |      Date2   Hospital5   Hospital6   Hospital7   Hospital8 |
                     |------------------------------------------------------------|
                  1. |      Date2   Hospital5   Hospital6   Hospital7   Hospital8 |
                  2. |       Date    Company1    Company2    Company3    Company4 |
                  3. | 12/31/1999     6958.14      101.15       98.75       15.74 |
                  4. |   1/7/2000     6780.96      101.15      107.08        16.2 |
                  5. |   1/6/2000     6474.92      101.15      107.08       16.25 |
                     +------------------------------------------------------------+
                I haven't walked through your code to see why it was throwing the error (I often get them when I paste data from Statalist code window into a do file (there must be non-printing unicode characters or something). But thought I would pass it along.

                Hope that helps!

                Comment


                • #9
                  Dear David,

                  thanks for your suggestions. It should work and I don't copy paste - but I still get the syntax error r(198). I guess something must be wrong in the setup prior to this line. I thought it might be caused by the fact that several of my variables have identical names. But even by manually renaming my variables in the Excel sheet before importing doesn't change anything. I'm looking for a way to within each Excel file 1) rename the variables for each work sheet separately, 2) then merging them, 3) then appending all Excel files.
                  Best, Tom.

                  Comment


                  • #10
                    David, thanks for jumping in here. Anyway, Tom, your example doesn't work very well , since it's not a CSV. It doesn't have any commas so it's very hard to tell what's going on in your file as regards where Excel columns are differentiated. It looks like you saved your Excel files using tabs or spaces as delimiters. And your attachment of a purported CSV doesn't work very well, since I and most other people here a file attachment as that presents malware risks. A simpler starting point here would be for just one CSV posted with your exact syntax that produces the r(198) error.

                    Comment


                    • #11
                      Hey all,

                      I have managed to adjust the codes accordingly: the main problem were the variable names, but manually renaming them in a loop worked. I will post the code as soon as I have tested it properly on all my files.

                      Thanks again for your help, in particular Mike with his master code.
                      Best, Tom.

                      Comment

                      Working...
                      X