Looping over cell-range with import excel

Adriano Mariani

Join Date: Apr 2020

Posts: 13
#1

Looping over cell-range with import excel

05 Apr 2020, 03:34

Dear All,

This is my first post on the forum.

My task is to import into stata a number of excel files, clean them, merge them and make them ready for regression analysis.
For each file, I need to retrieve data from 3 separate specific tables (out of many tables in the file) and merge them together.

However, I have the problem that the cell-range changes, both the starting and ending cell, for each excel file. The coloumn's should be constant from table to table or at least within each table so I can already write them in the respective cell-ranges, but the starting and ending row are unknown and change for each table.

The common characteristic of where the starting cell is for the 3 tables across the excel files is 5 rows below : "Table 1" ,"Table 4", "Table M1". And they all end on the closest row to the start of the range with the phrase "Average of forecasts"

What the code should look like is something like this:

1) local directory over the files in the folder
2) for each file, import it into excel, read the rows and (maybe define a local) with the specific row for each of the 3 tables, given the common criteria listed above (i have tried with substr)
3) import excel for each of the cellranges separately (as for each table I need to manipulate it separately and then merge them together)

local myfilelist : dir "/Users/adrianomariani/Desktop/Research Assistant scheme/Forecasts excel" files "*.xlsx"

foreach filename of local myfilelist {

import excel using `filename', clear

gen start1=_n+5 if substr(A,1,7)=="Table 1"
gen end1=_n-1 if substr(A,1,7)=="Average" & _n<(start1+80) // i am using start2+80 as there are many averages, and I want the first one after 'start1', similarly for the others
gen start2=_n+5 if substr(A,1,7)=="Table 4"
gen end2=_n-1 if substr(A,1,7)=="Average" & _n<(start2+80)
gen start3=_n+5 if substr(A,1,8)=="Table M1"
gen end3=_n-1 if substr(A,1,7)=="Average" & _n<(start3+80)

local table1s dis(start1)
local table1e dis(min(end1))
local table4s dis(start2)
local table4e dis(min(end2))
local tableM1s dis(start3)
local tableM1e dis(min(end3))

import excel using `filename', cellrange(A`table1s':D`table1e') clear

......

import excel using `filename', cellrange(A`table4s':D`table4e') clear

.....

import excel using `filename', cellrange(A`tableM1s':H`tableM1e') clear //the coloumns 'A,H' in this case and 'A,D' in the other two are constant, just the rows are of interest

.....

}

When I try to run this (even without the loop over cell-files and hence on just one file) it returns the problem that cellrange is out of range, so there must be a problem with how I have defined it.

Thanks in advance for the help
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35754
#2

06 Apr 2020, 01:54

Sorry, but what is dis() ? I am unable to find any such function.
Comment
Adriano Mariani

Join Date: Apr 2020

Posts: 13
#3

06 Apr 2020, 04:40

Dear Nick,

Thanks for your reply.

Probably I have used it in the wrong context, but I meant the command 'display', as for each local defined I would like to have the number of starting/ ending row.
I realized i probably also treated these locals as loops, (as then I called them in cell-range via the `tableM1e')
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#4

06 Apr 2020, 05:35

I see what you intend there. There are three problems ensuing from that. First, assignments like

Code:

local table1s dis(start1)

are perfectly legal but all they do is assign the literal text dis(start1) to the local in question. So Stata won't catch that this is a bad idea.

Second, it will bite you as presumably Stata can't make sense of a command like

Code:

import excel using `filename', cellrange(Adis(start1):Ddis(min(end1)) clear

even with a filename given.

Third, even an assignment such as

Code:

local table1s = start1

won't help unless what you really want is

Code:

local table1s = start1[1]

as the first observation is always that used by default.

An extra problem is that min() requires a minimum of two arguments any way, so the Stata function min() applied to a single variable, even if it numeric, is illegal. Also, min() acts rowwise, not columnwise, and isn't a way to extract the minimum of a variable. (I don't think this is especially well explained in the help.)

My guess is that your code should perhaps start more like this:

Code:

local myfilelist : dir "/Users/adrianomariani/Desktop/Research Assistant scheme/Forecasts excel" files "*.xlsx" foreach filename of local myfilelist { import excel using `filename', clear gen start1 = _n+5 if substr(A,1,7)=="Table 1" gen end1 = _n-1 if substr(A,1,7)=="Average" & _n<(start1+80) * I am using start2+80 as there are many averages, and I want the first one after 'start1', similarly for the others gen start2=_n+5 if substr(A,1,7)=="Table 4" gen end2=_n-1 if substr(A,1,7)=="Average" & _n<(start2+80) gen start3=_n+5 if substr(A,1,8)=="Table M1" gen end3=_n-1 if substr(A,1,7)=="Average" & _n<(start3+80) list start1-end3 if start1 < . | end1 < . | start2 < . | end2 < . | start3 < . | end3 < . su start1, meanonly local table1s = r(min) su end1, meanonly local table1e = r(min)

Code:

Last edited by Nick Cox; 06 Apr 2020, 06:14.
Comment
Adriano Mariani

Join Date: Apr 2020

Posts: 13
#5

06 Apr 2020, 08:12

Dear Nick

Thanks for the precious help.

I have attempted running the suggested code, however it still returns the same problem as before that the cell-range is invalid.
Trying to understand the problem, I have run the 'list' command you suggested and attached the results, this showed that for the start, the observation is the only one throughout the dataset thus to define the local, following your suggestions, I tried to use the command:

local table1s = start1[1]

Which should technically work, so I tried to define also end1-end3 so that there was only one observation, defined the local as above for start1 and tried running the import excel, however I still have the same problem of the invalid cell-range.

Looking at the list output attached, for the respective starts of each table it is the only observation, and this is the one i need. Instead, for the respective end's, the row number I need is the first one following the respective start row. (ie for table 1: 69-108, table 2: 242-282 etc.). So i was thinking of dropping/replacing the values of the respective end's as missing, so that the first observation for end1-end3 is the row number I need, and then I would apply the same method as above for start1, but for all the others:

local table1s = start1[1]

However, it seems the real problem, is the passing of the respective local's through the cell-range option. Any suggestions?

Thanks again.

Attached Files

Last edited by Adriano Mariani; 06 Apr 2020, 08:20.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#6

06 Apr 2020, 08:28

I don't think I can help further. I can't see what your Excel files look like or use your information to imagine them. It also seems that you didn't use my code but rewrote your own. That's fine by me but I don't see that anyone can help if you won't show what code you used.
Comment
Adriano Mariani

Join Date: Apr 2020

Posts: 13
#7

06 Apr 2020, 09:12

Dear Nick,

In the first phrase, I explained that following your code to the letter, it did not work as it returned the same problem.
Trying to understand why it was returning this problem, I played around with the code and tried running it by defining the local in a different way (all else equal), however, even though in both cases the local is defined well, the passing of the value in the local to the cell-range option seems to be the problem:

import excel using `filename', cellrange(A`table1s':D`table1e') clear

(the code before this is yours from above)

Given it is the same code but with the only modification of the local, I didn't think it needed to be re-included, also given in the end I found yours to be more efficient and thus changed back to it.
If you have a minute, I will attach an example of one of the excel files if it helps.

Thanks again for your help, note i meant absolutely no disrespect in my previous post, quite the contrary, I found your initial response very useful.
Attached Files

Forecomp_December_2019.xlsx (70.7 KB, 1 view)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#8

06 Apr 2020, 09:33

I don't take offence and still less do I think that you were being disrespectful, I have given some time to this and I don't think I can help more; that's all.
Comment
Adriano Mariani

Join Date: Apr 2020

Posts: 13
#9

06 Apr 2020, 09:39

Ok thanks, appreciate your time
Comment

John DSouza

Join Date: Apr 2020
Posts: 7

#10

07 Apr 2020, 11:25

Hello Adriano,

I'm not sure I understand your problem exactly. But I don't think the idea of generating additional variables is the way to go. Something using locals like this might work. It assumes 'Table 1', 'Table 4' and 'Table M1' occur once in column A of each of your files (and if they occur more than once the first occurence is the important one). That might be an incorrect assumption; in that case you might want to make some changes.

Code:

local myfilelist : dir "DIRECTORY_NAME" files "*.xlsx"

foreach f of local myfilelist {
 clear
 import excel using "DIRECTORY_NAME/`f'"
 gen id = _n    // Row number

 foreach tbl in 1 4 M1 {
  local len = length("`tbl'") + 6
  summ id if substr(A,1,`len') == "Table `tbl'"
  assert r(N) > 0.   // Check the table name does occur
  local start = r(min) + 5 // Row of first occurence + 5
         // (Assumes first occurence is the important one)
  
  summ id if substr(A,1,7) == "Average" & id > `start'
  assert r(N) > 0   // Check "Average" does occur after the table name
  local end = r(min) - 1  // Row of first occurence of "Average" - 1
  preserve
   keep in `start'/`end'
   drop id
   save "`f'-Table `tbl'.dta"  // Filename followed by "-Table ??"
  restore
 }
}
// Compare a few of the saved Stata files to the original Excels
// Now do whatever you wanted to do with the tables

Comment

Adriano Mariani

Join Date: Apr 2020

Posts: 13
#11

08 Apr 2020, 07:45

Dear John,

thank you very much, the code works perfectly well .
I was probably looking in the wrong direction by trying to pass the row number on to the cell-range option.
Thanks again!
Comment

Announcement

Looping over cell-range with import excel

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment