Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Individual time span reference per observation

    Dear community,

    I have a dataset with about 3 million observations. Each observation has a time id, at which it was observed, and a position.

    The variables have these formats:
    time_id = {1, 2, 3, ... , 286, 287, 288}. Each id covers a time slot of 5 min, so 12 ids are 60 min.
    position = {1, 2, 3, ... , 19, 20, 21}

    I want to create a variable that gives me the highest position value of all observations in the past hour for each observation.

    So basically, the command I wanna use is: "for observations betwenn time_id and time_id - 12": egen lowestposition = max(position).

    For example if the observation has the time_id 250, I would like to get the highest position value of all observation with time_id 238, 239, 240, ... 249, 250. Then for an observation with time_id 251, I would like to get the highest position value of all observation with time_id 239, 240, 241, ... 250, 251.

    Can you help me with finding a command for that?


    Thank you so much!
    Carina

  • #2
    Per the FAQ, please show us a sample of your data using -dataex-. Describing a data set clearly without an example is quite difficult.

    Comment


    • #3
      Following on Mike's post #2, does your data include some sort of ID, so that what you want is not

      the highest position value of all observations in the past hour for each observation
      but instead

      the highest position value of all observations in the past hour for each observation with the same ID
      As Mike suggested, you should take a few minutes to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

      Comment


      • #4
        Sure, here is a data example:

        nput float idarticle byte position long time_landing_only_numeric str8 time_landing_only
        265 9 123 "10:10:00"
        265 16 138 "11:25:00"
        265 16 145 "12:00:00"
        265 16 133 "11:00:00"
        265 20 165 "13:40:00"
        265 16 144 "11:55:00"
        265 10 122 "10:05:00"
        265 18 153 "12:40:00"
        265 9 118 "09:45:00"
        265 13 131 "10:50:00"
        265 10 112 "09:15:00"
        265 11 128 "10:35:00"
        265 17 148 "12:15:00"
        265 10 116 "09:35:00"
        265 9 121 "10:00:00"
        265 16 158 "13:05:00"
        265 18 155 "12:50:00"
        265 11 125 "10:20:00"
        265 10 114 "09:25:00"
        265 17 151 "12:30:00"
        265 18 160 "13:15:00"
        265 10 113 "09:20:00"
        265 16 143 "11:50:00"
        265 18 154 "12:45:00"
        265 10 111 "09:10:00"
        265 9 119 "09:50:00"
        265 10 115 "09:30:00"
        265 18 156 "12:55:00"
        265 17 150 "12:25:00"
        265 16 139 "11:30:00"
        265 9 117 "09:40:00"
        265 19 162 "13:25:00"
        265 11 126 "10:25:00"
        265 17 147 "12:10:00"
        265 9 120 "09:55:00"
        265 17 149 "12:20:00"
        265 8 110 "09:05:00"
        265 16 140 "11:35:00"
        265 16 157 "13:00:00"
        265 11 129 "10:40:00"
        265 16 134 "11:05:00"
        265 19 161 "13:20:00"
        265 8 109 "09:00:00"
        265 13 130 "10:45:00"
        265 11 127 "10:30:00"
        265 13 132 "10:55:00"
        265 18 159 "13:10:00"
        265 16 137 "11:20:00"
        265 16 136 "11:15:00"
        265 18 152 "12:35:00"
        265 16 135 "11:10:00"
        265 16 142 "11:45:00"
        265 9 124 "10:15:00"
        265 16 146 "12:05:00"
        265 16 141 "11:40:00"

        The first number is the article ID, so yes, to be more specific, I am looking for the highest value with the same article ID, but I would have simply but "bysort idarticle:" at the beginning of the command. By article ID, I mean that each ID is a news article from a news websites that was observed at different times, in different positions (the more relevant the topic, the higher up the position). This example only contains observations that belong to the same ID.

        The second number is the position. The 3rd number is the time id and the 4th number simply the time correspomding to the time id. So, the 4th number is not really of interest, it's just to understand the time id.

        I use the Stata version 15.1. It's my first time posting in this forum, so I hope I've given all the information required by this forum now. If not, I'm happy to share more information

        Comment


        • #5
          Thank you for the more complete example.

          I would have simply but "bysort idarticle
          Not every command in Stata can be preceded with by: so it is best not to assume that you will receive advice that will allow you to do something you haven't asked to be able to do, and it can lead the person trying to help you to waste time on a solution that does not do what you need.

          Comment


          • #6
            Say that you want to find the min, max, and mean of position for idarticle 265 using observations that fall within a time window that starts at time 161-12 and ends at time 161 (note that there are 13 time periods within such a window). You could get there with:
            Code:
            sum position if inrange(time_landing_only_numeric, 161-12, 161) & idarticle == 265
            You can use rangestat (from SSC) to repeat this calculation for all observations. You can install it by typing the following into Stata's Command window:
            Code:
            ssc install rangestat
            Your observations are not ordered by time (within groups of idarticle) so it's a bit hard to visualize your data. I took the liberty of changing your data example to create time gaps, jumbled the order, and added a single observation with a different idarticle.

            Here's an example of how to calculate the individual case above and an overall solution using rangestat:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float idarticle byte position long time_landing_only_numeric str8 time_landing_only
            265 10 116 "09:35:00"
            265 19 162 "13:25:00"
            265 16 158 "13:05:00"
            265 10 122 "10:05:00"
            265 17 147 "12:10:00"
            265 10 111 "09:10:00"
            265 13 130 "10:45:00"
            265  8 110 "09:05:00"
            265  9 124 "10:15:00"
            265 18 159 "13:10:00"
            265  8 109 "09:00:00"
            265  9 121 "10:00:00"
            265 16 144 "11:55:00"
            265 16 146 "12:05:00"
            265 16 157 "13:00:00"
            265 11 125 "10:20:00"
            265 16 134 "11:05:00"
            265 18 152 "12:35:00"
            265 17 151 "12:30:00"
            999 33 161 "13:20:00"
            265 19 161 "13:20:00"
            265 20 165 "13:40:00"
            265 17 150 "12:25:00"
            265 16 145 "12:00:00"
            265  9 117 "09:40:00"
            265 11 126 "10:25:00"
            265 16 133 "11:00:00"
            265 18 156 "12:55:00"
            265 18 160 "13:15:00"
            end
            
            sum position if inrange(time_landing_only_numeric, 161-12, 161) & idarticle == 265
            
            rangestat (max) pmax=position (min) pmin=position (mean) pmean=position, ///
                interval(time_landing_only_numeric -12 0) by(idarticle)
            
            sort idarticle time_landing_only_numeric
            list
            and the results (with the individual case highlighted in bold in the observation list):
            Code:
            . sum position if inrange(time_landing_only_numeric, 161-12, 161) & idarticle == 265
            
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
                position |          9    17.44444    1.013794         16         19
            
            . 
            . rangestat (max) pmax=position (min) pmin=position (mean) pmean=position, ///
            >         interval(time_landing_only_numeric -12 0) by(idarticle)
            
            . 
            . sort idarticle time_landing_only_numeric
            
            . list
            
                 +---------------------------------------------------------------------+
                 | idarti~e   position   time_l~c   time_l~y   pmax   pmin       pmean |
                 |---------------------------------------------------------------------|
              1. |      265          8        109   09:00:00      8      8           8 |
              2. |      265          8        110   09:05:00      8      8           8 |
              3. |      265         10        111   09:10:00     10      8   8.6666667 |
              4. |      265         10        116   09:35:00     10      8           9 |
              5. |      265          9        117   09:40:00     10      8           9 |
                 |---------------------------------------------------------------------|
              6. |      265          9        121   10:00:00     10      8           9 |
              7. |      265         10        122   10:05:00     10      8   9.3333333 |
              8. |      265          9        124   10:15:00     10      9         9.4 |
              9. |      265         11        125   10:20:00     11      9   9.6666667 |
             10. |      265         11        126   10:25:00     11      9   9.8571429 |
                 |---------------------------------------------------------------------|
             11. |      265         13        130   10:45:00     13      9        10.5 |
             12. |      265         16        133   11:00:00     16      9   11.285714 |
             13. |      265         16        134   11:05:00     16      9   12.285714 |
             14. |      265         16        144   11:55:00     16     16          16 |
             15. |      265         16        145   12:00:00     16     16          16 |
                 |---------------------------------------------------------------------|
             16. |      265         16        146   12:05:00     16     16          16 |
             17. |      265         17        147   12:10:00     17     16       16.25 |
             18. |      265         17        150   12:25:00     17     16        16.4 |
             19. |      265         17        151   12:30:00     17     16        16.5 |
             20. |      265         18        152   12:35:00     18     16   16.714286 |
                 |---------------------------------------------------------------------|
             21. |      265         18        156   12:55:00     18     16      16.875 |
             22. |      265         16        157   13:00:00     18     16      16.875 |
             23. |      265         16        158   13:05:00     18     16      16.875 |
             24. |      265         18        159   13:10:00     18     16      17.125 |
             25. |      265         18        160   13:15:00     18     16       17.25 |
                 |---------------------------------------------------------------------|
             26. |      265         19        161   13:20:00     19     16   17.444444 |
             27. |      265         19        162   13:25:00     19     16        17.6 |
             28. |      265         20        165   13:40:00     20     16          18 |
             29. |      999         33        161   13:20:00     33     33          33 |
                 +---------------------------------------------------------------------+
            
            .

            Comment


            • #7
              It appears that your variable time_landing_only_numeric (I assume this corresponds to the time_id you described in post #1) represents the number of 5-minute increments between midnight (0:00) and time_landing_only, which is a string variable that appears to give hours and minutes.

              Sorting your example data, the article appears first at 09:00 (109) and last at 13:14 (165).

              I'm concerned about what happens when an article crosses a date boundary. Suppose an article appeared first at 23:30 (283). You tell us that time_id runs from 1 to 288. If this article is observed in the next day, what does its data look like? Does the time_landing_only_numeric loop back to zero? How do you know you're on a new day? The hour before time_landing_only_numeric 6 starts at time_landing_only_numeric 283 of the previous day.

              How would you address these issues?

              An alternative approach: If I were creating this dataset, rather than create this artificial counter of five-minute-increment-since-midnight, I would use Stata a Stata Internal Format datetime value to store the full date and time of observation, and then look at times between the time of the observation and that time minus one hour.

              If the previous paragraph isn't clear, then perhaps you are new to working with dates and times in Stata. Stata's date and time variables are complicated and there is a lot to learn. If you have not already read the very detailed Chapter 24 (Working with dates and times) of the Stata User's Guide PDF, do so now. If you have, it's time for a refresher. After that, the help datetime documentation will usually be enough to point the way. You can't remember everything; even the most experienced users end up referring to the help datetime documentation or back to the manual for details. But at least you will get a good understanding of the basics and the underlying principles. An investment of time that will be amply repaid.

              All Stata manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.

              Added in edit: This crossed with Robert Picard's post. The rangestat command is the tool I would recommend, certainly for using actual time values. But before using rangestat with your data as it now is, and as Robert demonstrated, you need to address the issues I raise at the top of this post to be sure the results will be correct for all your potential data.
              Last edited by William Lisowski; 27 Jan 2019, 11:41.

              Comment

              Working...
              X