Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • importing data from many different excel files

    Hello Statamasters,
    This is my first post! So thank you in advance for your attention.

    I am trying to import data from raw survey instruments, and that involves extracting datapoints from the same cells in many different (around 300) excel sheets divided in 21 excel files. Since all of the sheets have the same structure, the challenge is to write a stata code to loop over all those sheets, extract the same information, and produce a dataset with 300 observations, one for each sheet.

    I wrote already the code for each sheet. It goes like this:

    Code:
    import excel using "/Users/henrique/excelfile1.xlsx", sheet(sheet1) cellrange(R167:R167)
    rename R q128
    label var q128 "Você tem energia elétrica?"
    save "survey.dta", replace
    clear
    
    import excel using "/Users/henrique/excelfile1.xlsx", sheet(sheet1) cellrange(R166:R166)
    rename R q127
    label var q127 "Qual nível de Bolsa Família você recebe?"
    merge 1:1 _n using "survey.dta"
    drop _merge
    save "survey.dta", replace
    clear
    I repeated this routine on and on to get stata to produce a single observation with 128 variables out of each excel sheet (the data points are scattered all over the spreadsheet so I had to extract cell for cell instead of using some other efficient method).

    The idea I had so far was to create another dataset with one variable for the full file paths for all the spreadsheets and variables containing the sheets in each different file, and run a levelsof macro to connect this dataset with the .do file described above. Nevertheless, my attempts at this idea are not bearing fruits. Any different ideas?

    Thank you very much.

    Henrique
    Last edited by Henrique Barbosa; 03 Aug 2015, 20:34.

  • #2
    So to clarify, you need to:
    • Loop over the 21 files
    • Inside each file, loop over the sheets. You need one cell range in each sheet.
    It's not clear to me whether the cell ranges are the same across the sheets (or across the files). First you write:

    involves extracting datapoints from the same cells in many different (around 300) excel sheets
    but then you write

    (the data points are scattered all over the spreadsheet so I had to extract cell for cell instead of using some other efficient method).
    In any case, you may not need to create an index file. Say all your excel files are in one folder. You can use
    Code:
    glo files: dir .files "*.xls"
    to get a list of the excel files in that folder. (see help extended_fcn)

    You can also get a list of the sheets on an excel file, according to this post:
    http://www.stata.com/statalist/archive/2012-09/msg00396.html

    With these lists in a hand you could write a loop that goes over the lists.
    If the cell range is the same across files and sheets you're done. If the cell range differs across files but not across sheets, you'll need to specify it for each step of the file loop.
    If the cell range differs across both files and sheets, then you may need an index file for the cell ranges, as you originally thought.








    Jorge Eduardo Pérez Pérez
    www.jorgeperezperez.com

    Comment


    • #3
      You may also be interested in xls2dta (SSC), that basically implements loops you have in mind (see this thread for some examples). If I find the time, I will post a sketch to tackle the problem - as far as I understand it.

      Best
      Daniel

      Comment


      • #4
        Ok. I am not completely clear on the layout and, thus, not sure whether writing the loop(s) from scratch might be the better approach here. If you explain (much) more how exactly the sheets look like and how exactly you want the final dataset to look like, we might be able to give better advice.

        Anyway, I will, for now, take your example code seriously, considering it a complete description of the problem at hand. This means, I assume

        1. The cellrange is R166:R167 in all sheets. This means you extract two values, both in the same column, both in successive rows.
        2. The values in these two cells are to be put into variables q127 and q128. This means they are to be in two columns instead of two rows.

        I additionally assume that

        3. We have all Excel files located in one folder, say myxlsfolder.
        4. We want to import all sheets within a given Excel file.

        Let me start by rewriting your code for each sheet.

        Code:
        generate byte i = 1
        generate byte j = (_n + 126)
        reshape q , i(i) j(j)
        drop i
        The idea here is, obviously, to reshape the data instead of merging the different pieces together. I will save this in mydofile.do.

        I will next use xls2dta to convert all the sheets to Stata datasets, extracting only the cells that are needed

        Code:
        xls2dta , allsheets : import excel q = R using myxlsfolder , cellrange(R166:R167)
        Note that instead of renaming the variables later, I have renamed them while importing. Now I run mydofile.do on each dataset

        Code:
        xls2dta : do mydofile.do
        This gets us to the point of your example.

        Without more details on how these datasets are to be combined, I cannot say much more.

        Best
        Daniel

        Comment


        • #5
          Thank you all for the great advice!

          Comment

          Working...
          X