Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Looping over cell-range with import excel

    Dear All,

    This is my first post on the forum.

    My task is to import into stata a number of excel files, clean them, merge them and make them ready for regression analysis.
    For each file, I need to retrieve data from 3 separate specific tables (out of many tables in the file) and merge them together.

    However, I have the problem that the cell-range changes, both the starting and ending cell, for each excel file. The coloumn's should be constant from table to table or at least within each table so I can already write them in the respective cell-ranges, but the starting and ending row are unknown and change for each table.

    The common characteristic of where the starting cell is for the 3 tables across the excel files is 5 rows below : "Table 1" ,"Table 4", "Table M1". And they all end on the closest row to the start of the range with the phrase "Average of forecasts"

    What the code should look like is something like this:

    1) local directory over the files in the folder
    2) for each file, import it into excel, read the rows and (maybe define a local) with the specific row for each of the 3 tables, given the common criteria listed above (i have tried with substr)
    3) import excel for each of the cellranges separately (as for each table I need to manipulate it separately and then merge them together)


    local myfilelist : dir "/Users/adrianomariani/Desktop/Research Assistant scheme/Forecasts excel" files "*.xlsx"

    foreach filename of local myfilelist {

    import excel using `filename', clear

    gen start1=_n+5 if substr(A,1,7)=="Table 1"
    gen end1=_n-1 if substr(A,1,7)=="Average" & _n<(start1+80) // i am using start2+80 as there are many averages, and I want the first one after 'start1', similarly for the others
    gen start2=_n+5 if substr(A,1,7)=="Table 4"
    gen end2=_n-1 if substr(A,1,7)=="Average" & _n<(start2+80)
    gen start3=_n+5 if substr(A,1,8)=="Table M1"
    gen end3=_n-1 if substr(A,1,7)=="Average" & _n<(start3+80)

    local table1s dis(start1)
    local table1e dis(min(end1))
    local table4s dis(start2)
    local table4e dis(min(end2))
    local tableM1s dis(start3)
    local tableM1e dis(min(end3))

    import excel using `filename', cellrange(A`table1s':D`table1e') clear

    ......

    import excel using `filename', cellrange(A`table4s':D`table4e') clear

    .....


    import excel using `filename', cellrange(A`tableM1s':H`tableM1e') clear //the coloumns 'A,H' in this case and 'A,D' in the other two are constant, just the rows are of interest

    .....

    }

    When I try to run this (even without the loop over cell-files and hence on just one file) it returns the problem that cellrange is out of range, so there must be a problem with how I have defined it.

    Thanks in advance for the help


  • #2
    Sorry, but what is dis() ? I am unable to find any such function.

    Comment


    • #3
      Dear Nick,

      Thanks for your reply.

      Probably I have used it in the wrong context, but I meant the command 'display', as for each local defined I would like to have the number of starting/ ending row.
      I realized i probably also treated these locals as loops, (as then I called them in cell-range via the `tableM1e')

      Comment


      • #4
        I see what you intend there. There are three problems ensuing from that. First, assignments like

        Code:
        local table1s dis(start1)
        are perfectly legal but all they do is assign the literal text dis(start1) to the local in question. So Stata won't catch that this is a bad idea.

        Second, it will bite you as presumably Stata can't make sense of a command like

        Code:
        import excel using `filename', cellrange(Adis(start1):Ddis(min(end1)) clear
        even with a filename given.

        Third, even an assignment such as

        Code:
        local table1s = start1
        won't help unless what you really want is

        Code:
        local table1s = start1[1]
        as the first observation is always that used by default.

        An extra problem is that min() requires a minimum of two arguments any way, so the Stata function min() applied to a single variable, even if it numeric, is illegal. Also, min() acts rowwise, not columnwise, and isn't a way to extract the minimum of a variable. (I don't think this is especially well explained in the help.)

        My guess is that your code should perhaps start more like this:

        Code:
        local myfilelist : dir "/Users/adrianomariani/Desktop/Research Assistant scheme/Forecasts excel" files "*.xlsx"
        
        foreach filename of local myfilelist {
        
        import excel using `filename', clear
        
        gen start1 = _n+5 if substr(A,1,7)=="Table 1"
        gen end1 = _n-1 if substr(A,1,7)=="Average" & _n<(start1+80)
        * I am using start2+80 as there are many averages, and I want the first one after 'start1', similarly for the others
        gen start2=_n+5 if substr(A,1,7)=="Table 4"
        gen end2=_n-1 if substr(A,1,7)=="Average" & _n<(start2+80)
        gen start3=_n+5 if substr(A,1,8)=="Table M1"
        gen end3=_n-1 if substr(A,1,7)=="Average" & _n<(start3+80)
        
        list start1-end3 if start1 < . | end1 < . | start2 < . | end2 < . | start3 < . | end3 < .
        
        su start1, meanonly
        local table1s = r(min)
        
        su end1, meanonly
        local table1e = r(min)
        Code:
        
        
        Last edited by Nick Cox; 06 Apr 2020, 06:14.

        Comment


        • #5
          Dear Nick

          Thanks for the precious help.

          I have attempted running the suggested code, however it still returns the same problem as before that the cell-range is invalid.
          Trying to understand the problem, I have run the 'list' command you suggested and attached the results, this showed that for the start, the observation is the only one throughout the dataset thus to define the local, following your suggestions, I tried to use the command:

          local table1s = start1[1]

          Which should technically work, so I tried to define also end1-end3 so that there was only one observation, defined the local as above for start1 and tried running the import excel, however I still have the same problem of the invalid cell-range.

          Looking at the list output attached, for the respective starts of each table it is the only observation, and this is the one i need. Instead, for the respective end's, the row number I need is the first one following the respective start row. (ie for table 1: 69-108, table 2: 242-282 etc.). So i was thinking of dropping/replacing the values of the respective end's as missing, so that the first observation for end1-end3 is the row number I need, and then I would apply the same method as above for start1, but for all the others:

          local table1s = start1[1]

          However, it seems the real problem, is the passing of the respective local's through the cell-range option. Any suggestions?

          Thanks again.

          Attached Files
          Last edited by Adriano Mariani; 06 Apr 2020, 08:20.

          Comment


          • #6
            I don't think I can help further. I can't see what your Excel files look like or use your information to imagine them. It also seems that you didn't use my code but rewrote your own. That's fine by me but I don't see that anyone can help if you won't show what code you used.

            Comment


            • #7
              Dear Nick,

              In the first phrase, I explained that following your code to the letter, it did not work as it returned the same problem.
              Trying to understand why it was returning this problem, I played around with the code and tried running it by defining the local in a different way (all else equal), however, even though in both cases the local is defined well, the passing of the value in the local to the cell-range option seems to be the problem:

              import excel using `filename', cellrange(A`table1s':D`table1e') clear

              (the code before this is yours from above)

              Given it is the same code but with the only modification of the local, I didn't think it needed to be re-included, also given in the end I found yours to be more efficient and thus changed back to it.
              If you have a minute, I will attach an example of one of the excel files if it helps.

              Thanks again for your help, note i meant absolutely no disrespect in my previous post, quite the contrary, I found your initial response very useful.
              Attached Files

              Comment


              • #8
                I don't take offence and still less do I think that you were being disrespectful, I have given some time to this and I don't think I can help more; that's all.

                Comment


                • #9
                  Ok thanks, appreciate your time

                  Comment


                  • #10
                    Hello Adriano,

                    I'm not sure I understand your problem exactly. But I don't think the idea of generating additional variables is the way to go. Something using locals like this might work. It assumes 'Table 1', 'Table 4' and 'Table M1' occur once in column A of each of your files (and if they occur more than once the first occurence is the important one). That might be an incorrect assumption; in that case you might want to make some changes.


                    Code:
                    local myfilelist : dir "DIRECTORY_NAME" files "*.xlsx"
                    
                    foreach f of local myfilelist {
                     clear
                     import excel using "DIRECTORY_NAME/`f'"
                     gen id = _n    // Row number
                    
                     foreach tbl in 1 4 M1 {
                      local len = length("`tbl'") + 6
                      summ id if substr(A,1,`len') == "Table `tbl'"
                      assert r(N) > 0.   // Check the table name does occur
                      local start = r(min) + 5 // Row of first occurence + 5
                             // (Assumes first occurence is the important one)
                      
                      summ id if substr(A,1,7) == "Average" & id > `start'
                      assert r(N) > 0   // Check "Average" does occur after the table name
                      local end = r(min) - 1  // Row of first occurence of "Average" - 1
                      preserve
                       keep in `start'/`end'
                       drop id
                       save "`f'-Table `tbl'.dta"  // Filename followed by "-Table ??"
                      restore
                     }
                    }
                    // Compare a few of the saved Stata files to the original Excels
                    // Now do whatever you wanted to do with the tables

                    Comment


                    • #11
                      Dear John,

                      thank you very much, the code works perfectly well .
                      I was probably looking in the wrong direction by trying to pass the row number on to the cell-range option.
                      Thanks again!

                      Comment

                      Working...
                      X