  • Thrown for a loop: Invalid Syntax with forval

    I am using Stata 15.1 on macOS Monterey 12.6.
    My goal is to produce a heat map with one axis "date" (from 1Oct2022 to 31Oct2022), the second axis "hour" (ranging from 0 to 23), and the intensity of each grid square based on the number of events that occurred on that date and hour.
    My dataset has nearly 600,000 observations. Variables include id, startdate (the date in October 2022 on which the event occurred), and starthour (the hour the event started, using a 24 hour clock). Here is a random sample of 25 observations:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id startdate starthour)
     5374 22921 11
    12099 22923 16
    11817 22925 19
     4131 22926 11
     5277 22927 10
     7038 22927 10
    13864 22928 17
     1912 22929  7
     2539 22929  8
     4620 22929  9
    10553 22930 18
     1228 22933  7
    11889 22936 20
     9889 22937 17
    17155 22939 18
    11620 22940 14
    20202 22940 17
    14980 22941 16
    19812 22941 18
     4989 22945 11
     1925 22946  8
     9086 22947 14
    14252 22948 22
     2298 22949  8
     2432 22949  9
    end
    format %td startdate
    In order to produce the heat map, I need to generate a variable named events: the number of events that occurred on a given date and at a given hour. I was able to visualize the concept using multiple histograms:
    Code:
    histogram starthour, discrete frequency by(startdate)
    The X axis is the start hour (0-23) and the Y axis the frequency of events. As you can see, the distribution of events varies from day to day. I could stop here, but I think a heat map makes for a more interesting visualization.
    [Image: histo.png]


    I next generated individual tables that show the number of events that occurred at a specified hour for every day in October.
    Code:
    by startdate, sort : tabstat starthour if starthour==0, statistics( count )
    An excerpt for starthour 0:
    ----------------------------------------------------------------------------------------------------------------
    -> startdate = 01oct2022

    variable | N
    -------------+----------
    starthour | 540
    ------------------------

    ----------------------------------------------------------------------------------------------------------------
    -> startdate = 02oct2022

    variable | N
    -------------+----------
    starthour | 626
    ------------------------

    ----------------------------------------------------------------------------------------------------------------
    -> startdate = 03oct2022

    variable | N
    -------------+----------
    starthour | 167
    ------------------------

    ----------------------------------------------------------------------------------------------------------------
    -> startdate = 04oct2022

    variable | N
    -------------+----------
    starthour | 117
    ------------------------

    ----------------------------------------------------------------------------------------------------------------
    -> startdate = 05oct2022

    variable | N
    -------------+----------
    starthour | 157
    ------------------------


    I could repeat this 24 times up to:
    Code:
    by startdate, sort : tabstat starthour if starthour==23, statistics( count )
    and then add these counts to the variable events "by hand". But this would be tedious, as there are 24*31=744 date-hour combinations.
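
    For reference, here is a minimal sketch of one way to attach the per-cell count to every observation without typing 24 commands. It uses four rows taken from the -dataex- sample above; the variable name tag is an illustrative choice, not part of the original data.

```stata
* Toy data: four rows from the -dataex- sample above
clear
input float(id startdate starthour)
5277 22927 10
7038 22927 10
1912 22929  7
2539 22929  8
end
format %td startdate

* Within each date-hour cell, _N is the number of observations in that cell
bysort startdate starthour: gen long events = _N

* Flag one observation per cell so that each count is listed only once
egen byte tag = tag(startdate starthour)
list startdate starthour events if tag, noobs
```

    This keeps all original observations, so it suits cases where the count is needed alongside the raw data rather than in a reduced dataset.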

    I then tried a forval loop based on @nickcox: Stata tip 51: Events in intervals
    https://journals.sagepub.com/doi/pdf...867X0700700312

    Code:
    gen events = .
    local N = _N
    
    quietly forvalues i = 1/`N' {
                 count if inrange(starthour, 0, 24) & /// 
                 id == id[`i']                      & ///  
                 inrange(startdate[`i'] - startdate, 1, 31) 
        replace events = r(N) in `i'
    }
    This is the result:
    quietly forvalues i = 1/`N' {
    invalid syntax
    r(198);

    At this point, I am stumped.
    Help/advice will be much appreciated.
    Rich

  • #2
    Code:
    contract startdate starthour
    gets you a dataset with up to 24 x 31 observations, and there is an option to get zeros too. That is the basis for your heat map. No loops.
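
    A minimal sketch of this on toy data, assuming the default name _freq for the count variable. The zero option adds combinations that never occur, but only among the dates and hours that appear somewhere in the data, so a date or hour with no events at all would still be absent.

```stata
* Toy data: four rows from the -dataex- sample in #1
clear
input float(id startdate starthour)
5277 22927 10
7038 22927 10
1912 22929  7
2539 22929  8
end
format %td startdate

* One row per date-hour combination; -zero- keeps unobserved combinations with _freq = 0
contract startdate starthour, zero
list, sepby(startdate) noobs
```

    With two distinct dates and three distinct hours, the result has six rows, three of them with a zero count.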

    PS The problem with your last block of code is likely that you are running it line by line from the do-file editor, so that the definition of the local macro is not visible to the line that uses it. You may have access to https://journals.sagepub.com/doi/10....36867X20931028 which explains.



    • #3
      Problem solved!!
      I am so grateful. I'm embarrassed to share how many hours I spent trying to solve this. Your solution is simple, elegant, and perfect. I will indeed review your linked Stata tip on local macros.
      Many thanks,
      Rich



      • #4
        I have to say I'm pretty confused by your post here. First, the code that is giving you a syntax error runs with no error messages at all on my setup. (Version 17, Windows 10). My best guess is that you are trying to run the code line by line or in chunks. Code that uses local macros usually breaks when you try to do that. Supposing that you run the line -local N = _N- separately from the -forvalues ...- command, what happens is that after the -local N = _N- command finishes, the local macro N which was just defined goes away. So when the -forvalues- command tries to run, it does not know what you mean by `N', and so it interprets it as an empty string. Thus your forvalues command is seen by Stata as -forvalues i = 1/ {-, which is, indeed, a syntax error. When you are working with local macros, all code from the point where the macros are defined to their last use must be run together in one fell swoop. If you interrupt the code, the local macros go out of scope.
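
        A small illustration of the scoping point, as a sketch on a made-up five-observation dataset. The whole block is meant to be selected and run together, or saved in a do-file and run with -do-.

```stata
* Run ALL of these lines in one go; do not execute them one at a time
clear
set obs 5

* The local macro N exists only for the duration of this run
local N = _N
display "N is `N'"

* Because the definition above is part of the same run, `N' expands to 5 here
forvalues i = 1/`N' {
    display "iteration `i'"
}
```

        If the -local- line is run on its own first and the loop afterwards, `N' is empty by the time the loop starts, which reproduces the r(198) error.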

        Second, that loop does not do the equivalent of your series of -tabstat- commands in any case. Instead, it calculates a variable that, in each observation, gives the number of observations with the same id whose startdate is between 1 and 31 days before the current startdate. Moreover, the -count- command contains a superfluous -inrange(starthour, 0, 24)- condition: starthour always lies between 0 and 23, so the condition is true for every observation.

        So I'm not sure what you are trying to do. If what you want is a better way to do what the series of -tabstat- commands would do, it is this:
        Code:
        collapse (count) events = id, by(startdate starthour)
        If, on the other hand, you are trying to get, for each observation, the number of observations in the data set having the same id and a startdate that is between 1 and 31 days before that observation's startdate, I would do that with:

        Code:
        gen byte one = 1
        rangestat (count) wanted = one, by(id) interval(startdate -31 -1)
        -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.
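
        A sketch of how this behaves on made-up data (the ids and dates below are invented for illustration). The interval bounds are given as low then high, so the window runs from 31 days before to 1 day before each observation's startdate.

```stata
* Requires -rangestat- from SSC: ssc install rangestat
* Toy data, invented for illustration
clear
input float(id startdate)
1 22919
1 22925
1 22949
2 22930
end
format %td startdate

gen byte one = 1

* Count same-id observations dated 1 to 31 days before the current observation
rangestat (count) wanted = one, by(id) interval(startdate -31 -1)
list, sepby(id) noobs
```

        For id 1, the observation on 22949 has two earlier observations within the window (24 and 30 days before) and the one on 22925 has one. Observations with nothing in their window should come back as missing rather than 0, so a -replace wanted = 0 if missing(wanted)- may be wanted afterwards.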

        Added: Crossed with #2 and #3. The -contract- command in #2 is, in this instance, equivalent to the -collapse- approach shown here.
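
        The equivalence can be checked with a sketch like the following, on made-up data. It assumes id is never missing, since -collapse (count)- counts nonmissing values of id while -contract- counts observations. The tempfile name counts is an arbitrary choice.

```stata
* Toy data, invented for illustration
clear
input float(id startdate starthour)
1 22919 0
2 22919 0
3 22919 1
4 22920 0
end
format %td startdate

* Approach from #2: -contract- stores the cell counts in _freq
preserve
contract startdate starthour
tempfile counts
save `counts'
restore

* Approach from #4: -collapse- stores the cell counts in events
collapse (count) events = id, by(startdate starthour)

* Every date-hour cell must appear in both results, with identical counts
merge 1:1 startdate starthour using `counts', assert(match) nogenerate
assert events == _freq
```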



        • #5
          Thank you Clyde for your detailed response. I will have to review the use of macros and -forvalues-. I also see that my original code does not produce the variable I desired. The -collapse- and -contract- approaches work well.

          I am uploading a preliminary heatmap (using -heatplot- package by Ben Jann)
          Code:
          ssc install heatplot
          [Image: heatmap.png]
          Here is the code:
          Code:
          heatplot _freq i.starthour startdate, yscale(noline) ylabel(, nogrid labsize(*0.7)) ///
           xlabel(`x1' `x2', labsize(*0.6) angle(horizontal) format(%tdDD-mon-yy) nogrid) ///
           color(plasma, reverse)
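
          The command above refers to the local macros x1 and x2, whose definitions are not shown. A self-contained sketch, under the assumption that they were meant to hold the first and last dates to label on the x axis, and using synthetic counts as a stand-in for the contracted dataset:

```stata
* Requires heatplot and its dependencies from SSC:
*   ssc install heatplot
*   ssc install palettes
*   ssc install colrspace

* Synthetic stand-in for the contracted data: 31 days x 24 hours (illustrative only)
clear
set obs 744
gen startdate = td(01oct2022) + floor((_n - 1) / 24)
gen starthour = mod(_n - 1, 24)
gen _freq = round(300 - 200 * cos(2 * _pi * starthour / 24))
format %td startdate

* ASSUMPTION: x1 and x2 are the first and last dates to label on the x axis
local x1 = td(01oct2022)
local x2 = td(31oct2022)

heatplot _freq i.starthour startdate, yscale(noline) ylabel(, nogrid labsize(*0.7)) ///
    xlabel(`x1' `x2', labsize(*0.6) angle(horizontal) format(%tdDD-mon-yy) nogrid) ///
    color(plasma, reverse)
```

          As with any code using local macros, the two -local- lines and the -heatplot- command have to be run together. If x1 and x2 are empty, the xlabel() option falls back to default tick positions.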
          Many thanks again for the insights.
          Rich



          • #6
            It would be interesting to compare that heat plot with a family of curves plotted against time of day -- possibly as model fits with some sine and cosine terms to smooth out some noise. If that is cryptic, https://www.stata-journal.com/articl...article=st0116 says much more.
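
            A minimal sketch of what a fit with sine and cosine terms could look like, using synthetic hourly counts rather than the real data. The variable names events, sin1, cos1, and fitted are illustrative, and a single 24-hour harmonic fitted with -regress- is only the simplest possible choice. Further harmonics, or a count model such as -poisson-, might suit real data better.

```stata
* Synthetic hourly counts with one 24-hour cycle (illustrative only)
clear
set obs 24
gen starthour = _n - 1
gen double events = 300 - 200 * cos(2 * _pi * starthour / 24)

* First harmonic: one sine and one cosine term with a 24-hour period
gen double sin1 = sin(2 * _pi * starthour / 24)
gen double cos1 = cos(2 * _pi * starthour / 24)

* Fit the smooth curve and store the predicted values
regress events sin1 cos1
predict double fitted

* Compare observed counts with the fitted curve
twoway (scatter events starthour) (line fitted starthour, sort), ///
    xlabel(0(6)24) ylabel(, angle(h))
```

            With the contracted dataset, events would be replaced by _freq, and the fit could be run separately by day or with day-specific terms to get a family of curves.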



            • #7
              Still trying to figure out the trigonometric regression. But the paper Nick referenced led me to this paper, Speaking Stata: Graphs for all seasons, and to an introduction to -cycleplot- and -sliceplot-.
              Code:
              ssc install cycleplot
              ssc install sliceplot
              With the contracted dataset I used to generate the heat plot, I created a cycleplot:
              Code:
              cycleplot _freq starthour startdate, length(24) ylabel( , angle(h))
              [Image: cycleplot.png]


              And a sliceplot:
              Code:
              sliceplot line _freq starthour, slices(3) ylabel( , angle(h))
              [Image: sliceplot.png]



              I'll keep working on the sine/cosine smoothing. Any hints would be appreciated.
              Thanks
              Rich



              • #8
                I suspect you need c(L) in the sliceplot.
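
                To show what c(L) does, here is a sketch using plain -twoway line- on synthetic data rather than -sliceplot- itself. c(L) is short for connect(L), which joins successive points only while the x variable is increasing, so no line is drawn back from hour 23 of one day to hour 0 of the next.

```stata
* Synthetic data: two days of hourly counts (illustrative only)
clear
set obs 48
gen day = ceil(_n / 24)
gen starthour = mod(_n - 1, 24)
gen double events = 300 - 200 * cos(2 * _pi * starthour / 24) + 50 * day

* Keep each day's hours together and in ascending order
sort day starthour

* connect(L) breaks the line wherever starthour stops increasing
twoway line events starthour, connect(L) ylabel(, angle(h))
```

                Without connect(L), the same command draws a spurious segment across the plot at each change of day.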

                Some years on, I found a way to avoid sliceplot altogether, at least in some cases. See Section 3 in https://journals.sagepub.com/doi/epu...36867X20976341

