Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • use a scalar variable in a loop: "forval i = 3/N {"

    I have a problem where I want to import several sheets of an Excel file into Stata.

    My code looks like this.

    First, I just describe the excel file to store the number of worksheets, as in r(N_worksheet). Then, I open the first relevant sheet (named Sheet2), and save it as a .dta-file, in order to then loop through sheets 3 to N, appending them to the .dta file one after another, saving, and continuing the loop. I append by closing the "main file" that I got from the first sheet (Sheet2), namely d_datastream2.dta, then clearing and importing the respective next sheet from the excel file (3-N), saving it as d_datastream`i'.dta, closing it, opening the main file and appending using the just saved d_datastream`i'.dta file. In the end, I delete the additional .dta-files that were created in the process, and rename d_datastream2.dta to d_datastream.dta for further use.

    So before the loop, I have the first sheet, and in each iteration of the loop, I add data from an additional sheet.

    The Excelfile is named d_returnretrieve1.xlsm, the Sheets are just named Sheet2, ..., Sheet9

    Code:
    clear
    
    *start with importing and saving the first sheet, and save number of worksheets:
    import excel using d_returnretrieve1.xlsm, sheet("Sheet2") firstrow
    scalar N = r(N_worksheet)
    
    save d_datastream2.dta, replace
    
    display N
    
    *import sheets 3 to end (number of sheets):
    forval i=3/N {
    
    *import sheet i and save as dummyfile
    import excel using d_returnretrieve1.xlsm, sheet("Sheet`i'") firstrow clear
    save d_datastream`i'.dta, replace
    
    *close sheet i, open sheet 2 (main file)
    use d_datastream2.dta, clear
    
    *append sheet i from dummyfile behind sheet 2
    append using d_datastream`i'.dta
    
    *save new sheet 2
    save d_datastream2.dta, replace
    
    }
    
    save d_datastream.dta, replace
    
    *delete the above dummyfiles
    forval i = 2/9 {
    erase d_datastream`i'.dta
    }
    The problem is the following:
    It gives me "invalid syntax" when running this snip. If I change the line
    Code:
    forval i=3/N {
    to
    Code:
    forval i=3/9 {
    then it works flawlessly.

    So I note that for some reason, the scalar variable N is not saved throughout the loop, or not accessible. However, the alternative,
    Code:
    *import sheets 3 to end (number of sheets):
    forval i=3/N {
    
    *import sheet i and save as dummyfile
    import excel using d_returnretrieve1.xlsm, sheet("Sheet`i'") firstrow clear
    save d_datastream`i'.dta, replace
    
    *close sheet i, open sheet 2
    use d_datastream2.dta, clear
    
    *append sheet i from dummyfile behind sheet 2
    append using d_datastream`i'.dta
    
    *save new main file
    save d_datastream2.dta, replace
    
    *save number of worksheets again
    clear
    import excel using d_returnretrieve1.xlsm, describe
    scalar N = r(N_worksheet)
    
    }
    i.e., trying to re-define N within the loop, gives the same error.

    However, the following code, where I replace N with 9 in the loop specs:
    Code:
    forval i=3/9 {
    
    *import sheet i and save as dummyfile
    import excel using d_returnretrieve1.xlsm, sheet("Sheet`i'") firstrow clear
    save d_datastream`i'.dta, replace
    
    *close sheet i, open sheet 2
    use d_datastream2.dta, clear
    
    *append sheet i from dummyfile behind sheet 2
    append using d_datastream`i'.dta
    
    *save new sheet 2
    save d_datastream2.dta, replace
    
    *check number of worksheets again
    display N
    
    }
    correctly displays the N in every iteration of the loop.

    What could be the problem?
    Last edited by Max Piper; 26 Aug 2016, 07:05.

  • #2
    It's the job -- indeed the entire purpose and rationale -- of display to calculate what it can and then display the result.

    forval is not so designed. But you can insist on a calculation by Stata even before it passes the baton to forval.


    Code:
    forval i = 3/`=N' {
    In fact you would be well advised to spell out that it's a scalar

    Code:
    forval i = 3/`=scalar(N)' {
    The reason is that scalars and variables share the same namespace. If there is a variable called N (or one that unambiguously abbreviates as such, unless you have disabled variable name abbreviation) then Stata will choose that interpretation for N (and to boot look in the first observation to determine the value to be used).

    Comment


    • #3
      that seems like a useful direction, but I experimented with
      Code:
      forval i=3/`=scalar(N)' {
      forval i=3/scalar(N) {
      forval i=3/`scalar(N)' {
      forval i=3/=scalar(N) {
      and received the same error throughout.

      Comment


      • #4
        The first is what I suggested and it works for me.

        Code:
        . scalar N = 7
        
        . forval j = 1/`=scalar(N)' {
          2. di `j'
          3. }
        1
        2
        3
        4
        5
        6
        7
        Complete, reproducible example please. The scalar name must be exactly as specified. Stata won't abbreviate scalar names. Is Stata finding some quite different problem?

        Comment


        • #5
          My comment here is tangential. You may come to regret the erasure of the d_datastream*.dta files at the end.

          In most real-world data situations, series of data files such as you are working with often contain inconsistencies in coding or naming of variables, or quirks that cause a variable to be imported as a string in one data set but numeric in another. The mass appending of these files (or mass merging as the case may be) may result in a very messy data set that defies clean-up and analysis because the fixes that are needed for some of the data will mess up the rest or will be prevented from running due to their inapplicability to the rest. I think a better practice is to first import each file separately to Stata, then work with each file to verify that they all use the same scheme of variable names (and don't use the same variable name to mean different things in different files), labeling, data storage types, encoding of missing values, etc. When inconsistencies are found, make the necessary modifications in these separate files. After that is done, a short simple loop will combine them into a single file via- append- (or -merge-).

          It may be that you have unusually high quality data and these problems will not arise. But my experience is that that is quite exceptional. And even if this is your situation, you lose nothing by following my suggestion above.

          Comment


          • #6
            Nick Cox:

            this is very interesting. I have narrowed it down to this, in a new and concise MWE:
            Code:
            import excel using d_returnretrieve1.xlsm, sheet("Sheet2") firstrow clear
            *scalar N = r(N_worksheet)
            scalar N = 7
            
            forval j = 1/`=scalar(N)' {
              di `j'
              }
            the above works, where I hard-code N = 7.

            however,
            Code:
            import excel using d_returnretrieve1.xlsm, sheet("Sheet2") firstrow clear
            scalar N = r(N_worksheet)
            
            forval j = 1/`=scalar(N)' {
              di `j'
              }
            this does not work. and my last hope does not work either:

            Code:
            import excel using d_returnretrieve1.xlsm, sheet("Sheet2") firstrow clear
            scalar M = r(N_worksheet)
            scalar N = M
            
            forval j = 1/`=scalar(N)' {
              di `j'
              }
            How could this be?

            Clyde Schechter:
            I understand your point. I totally agree, especially because it is easier to clean data in Stata than in Excel's VBA, I think.
            In this case, the data is completely identical. It comes from one large request in the Datastream's Excel add-in, divided into several smaller requests because the large one took too much time at once to complete, and I did not have such flexible time. Each request was fitted into a new sheet.

            Therefore, I can guarantee that all these sheets have identical variables and variable types. Otherwise, I would follow your path.

            Furthermore and that is the main reason, I only just created the files in the very same process. The original Excel-file is not deleted, and I did no work other than import, save, close.

            Best,
            Max
            Last edited by Max Piper; 26 Aug 2016, 09:14.

            Comment


            • #7
              I don't have your spreadsheet file to check anything here. What I suggest that you display N just before the loop.

              Code:
               
              di scalar(N)

              Comment


              • #8
                I figured it out.

                it is necessary to first run
                Code:
                import excel using d_returnretrieve1.xlsm, describe
                
                scalar N= r(N_worksheet)
                display r(N_worksheet)
                only the describe option allows to use r(N_worksheet), importing directly will not make r(N_worksheet) exist as required.

                Comment


                • #9
                  Added in edit: crossed with post #8. I too missed the clue in help import excel
                  Stored results

                  import excel filename, describe stores the following in r():
                  Here is a reproducible example - substituting sysuse and describe for import excel so that it does not require access to your workbook - of the second example from post #6, but in this case it works rather than fails. Can you try it on your setup and tell us what the results are?

                  Code:
                  sysuse auto, clear
                  quietly describe
                  return list
                  scalar N = r(N)
                  return clear
                  forval j = 70/`=scalar(N)' {
                    di `j'
                    }
                  Code:
                  . sysuse auto, clear
                  (1978 Automobile Data)
                  
                  . quietly describe
                  
                  . return list
                  
                  scalars:
                              r(changed) =  0
                                r(width) =  43
                                    r(k) =  12
                                    r(N) =  74
                  
                  . scalar N = r(N)
                  
                  . return clear
                  
                  . forval j = 70/`=scalar(N)' {
                    2.   di `j'
                    3.   }
                  70
                  71
                  72
                  73
                  74
                  
                  .
                  Last edited by William Lisowski; 26 Aug 2016, 09:27.

                  Comment


                  • #10
                    Thanks for closure. In short, if a scalar has a value of missing, then this loop "does not work" (meaning precisely, there is an invalid syntax message). forvalues can't construct a finite range if one bound is missing.

                    Comment


                    • #11
                      In the first example in Post #1, the code includes display N before the forvalue loop. Presumably this displayed a blank line, but that was unfortunately overlooked. So when Nick provided a solution for what turned out to be the second problem in the code, the unseen initial problem went unnoticed and we focused our attention on the wrong section of the code.

                      Comment


                      • #12
                        If a (numeric) scalar has missing as value, you should see a display of dot (stop, period). But the same point [sic] applies: it's easy to miss if you are not looking out for it.

                        Comment


                        • #13
                          Yes, the first occurence of display N should have given just "."

                          However, while I was trying everything out, I did not notice that clear does not actually delete scalars. Therefore, the code
                          Code:
                          clear
                          
                          *start with importing and saving the first sheet, and save number of worksheets:
                          import excel using d_returnretrieve1.xlsm, sheet("Sheet2") firstrow
                          scalar N = r(N_worksheet)
                          
                          save d_datastream2.dta, replace
                          
                          display N
                          will actually display the scalar that was last defined before this section of code, despite the command clear, which I found a bit surprising and did not help the troubleshooting. This way, manually setting the value of Nto circle the problem in one instance of the troubleshooting process will actually affect the subsequent running of any other code snippets, such that display N in the above code actually may have displayed the integer I wanted to have, prompting me to search the problem somewhere else.

                          help clear tells me that clear is equivalent to drop _all, but not including scalar drop _all. So that's that.

                          Thank you all for the help!

                          Comment


                          • #14
                            Thanks for the followup. Your explanation, however, doesn't demonstrate what you hope. If r(N_worksheet) does not exist, then the scalar N = r(N_worksheet) command would have set N to missing, as the code below shows. Whatever you did that caused N to display, it must have been more complicated than what you showed - perhaps you inadvertently omitted the scalar N = r(N_worksheet) command.
                            Code:
                            . scalar N = 42
                            
                            . clear
                            
                            . display N
                            42
                            
                            . sysuse auto, clear
                            (1978 Automobile Data)
                            
                            . // quietly describe
                            . return list
                            
                            macros:
                                             r(fn) : "/Applications/Stata/ado/base/a/auto.dta"
                            
                            . scalar N = r(N)
                            
                            . display N
                            .
                            
                            .

                            Comment

                            Working...
                            X