Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing Excel files using a loop: avoid importing a file multiple times

    Hi,
    I am trying to create a panel data.
    The raw data is multiple excel files (ranging from one to three) for each year, and I am using the following loop commands.
    (Please note that I have manually renamed excel files to "year_filenumber" where filename=1,2,3 representing the number of excel files in the year. Thus, the loop works)

    Code:
    *import .xls files into .dta
    foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015{
        forvalues i=1/3 {
            capture import excel "`year'_`i'", sheet("`year'_`i'") firstrow case(lower) allstring clear
            capture saveold "`year'_`i'", replace
            }
    }
    
    
    *create yearly .dta files
    clear all
    foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015{
        forvalues i=2/3 {
            use "`year'_1", clear
            capture append using "`year'_`i'"
        }
        generate time="`year'"
        saveold "`year'", replace
    }
    
    
    *append yearly .dta files
    clear all
    use 2001
    foreach year in 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015{
        append using `year'
    }
    The problem is that years which have data in only one or two excel files are imported extra times.
    For example, 2005_1 is the only excel file for year 2015. When it gets imported in Stata, I get three .dta files (2015_1, 2015_2, 2015_3) which get appended into one 2015.dta file (leading to duplicates).

    Is there any way to correct this loop (instead of using "duplicates drop" command in the end) ? Thank you.

  • #2
    Hi Arvind:
    I think you have to check whether the 3 sheets exist for each year. Also, you need to generate the variable year in the first loop instead of the second one. Try this:
    foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015{ forvalues i=1/3 { capture import excel "`year'_`i'", sheet("`year'_`i'") firstrow case(lower) allstring clear if _rc==0 { generate time="`year'" capture saveold "`year'_`i'", replace } } } *create yearly .dta files clear all foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015{ forvalues i=2/3 { capture use "`year'_1", clear if _rc==0 { capture append using "`year'_`i'" } } saveold "`year'", replace }
    *append yearly .dta files clear all use 2001 foreach year in 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 { append using `year' } Regards... Alberto

    Comment


    • #3
      Here is the code again in a better format:

      foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 {
      forvalues i=1/3 {
      capture import excel "`year'_`i'", sheet("`year'_`i'") firstrow case(lower) allstring clear
      if _rc==0 {
      generate time="`year'"
      capture saveold "`year'_`i'", replace
      }
      }
      }
      *create yearly .dta files
      clear
      foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 {
      forvalues i=2/3 {
      capture use "`year'_1", clear
      if _rc==0 {
      capture append using "`year'_`i'"
      }
      }
      saveold "`year'", replace
      }
      *append yearly .dta files
      clear
      use 2001
      foreach year in 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 {
      append using `year'
      }

      Comment


      • #4
        There is only one excel worksheet in each excel file. However, there can be multiple excel files for each year. Thus, three sheets do not exist for each year, and some years contain only two or even one excel files/sheets. One way to get around this is to have three loops (based on number of excel files), but i prefer to avoid that approach.

        Your "if _rc==0" approach works great to correctly import all files in one loop. Thank you Alberto !



        Comment


        • #5
          Incidental, but

          Code:
          foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
          can be rewritten

          Code:
          forval year = 2001/2015

          Comment


          • #6
            I think Arvind has a working solution at this point. So I'm going to chime in here to make a more abstract point.

            -capture- is an indispensable Stata command. It is used to deal with exceptional conditions that are expected to arise and require alternative treatment, as above, where certain filenames are expected not to exist. The problem with -capture- is that it is indiscriminate: it gives a pass to anything that goes wrong. Suppose that one of Arvind's files, say 2003_2.xls exists, but is corrupted and can't be read. In the code above, it will be treated as if it doesn't exist and simply passed over without warning. Arvind might never even realize that he is missing part of 2003 data. Or, worse, he may realize it after he as invested a lot of time and effort into analyzing the data and turns up some inexplicably bizarre result at the end. (Or, worse still, a member of the audience discovers a strange result while Arvind is making a presentation!)

            A safer way to use capture is to test whether _rc is in fact the error code of the anticipated problem. In this case, when -import excel- tries to read a non-existent file it returns error code 601. So a safer way to do this would be:
            Code:
            capture import excel using `year'_`i', ...
            if _rc == 0 {
                generate time = year // etc.
               ...
            }
            else if _rc != 601 { // SOME PROBLEM WITH A FILE THAT ACTUALLY EXISTS
                display as error "Unanticipated exception encountered reading `year'_`i'"
                err(_rc)
            }
            This code will trap any error that arose during the reading, interrupt execution, display an error message, and the error code associated with the problem. That way the nasty surprises come at the stage where it easiest to find and fix them.

            That's general advice for the use of capture. But in this situation, there may be an even safer way to do this. Rather than generating all combinations of 2001/2015 with 1/3 and then discarding the ones that don't exist, why not create a local macro that contains all and only the existing combinations. If the filenames that Arvind describes in #1 are the only ones in the working directory that end in .xls, he can do this:

            Code:
            local myfiles: dir "." files "*.xls"
            foreach f of local myfiles {
                import excel using `f' .....
                ...
                ... etc.
            }
            with no use of -capture- at all. If there are other Excel files in the directory, he might be able to capture these particular ones with a different wildcard, such as ????_?.xls.

            Comment


            • #7
              Digression on the theme: caught out in presentations:

              More than once I've heard of this dialogue:

              Could you remind us [veiled sarcasm] which way the gender dummy is defined in your data?

              Mm, can't remember.... Sorry.
              Moral: always name your dummies for the state coded 1.

              (Always call your dummies indicators, but that's another story.)

              Comment


              • #8
                Originally posted by Nick Cox View Post
                Incidental, but

                Code:
                foreach year in 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
                can be rewritten

                Code:
                forval year = 2001/2015
                I understand this point was incidental to the OP's solution, but it's something that's been nagging me a a few months in a couple of related projects. Thanks, Nick!

                Comment

                Working...
                X