  • Help obtaining number of previous appearances of unique ID within designated timeframe

    Hello all,


    I have a "date" variable and an "ID" variable for a number of observations. I want to generate a new variable ("desired variable"), the value of which represents the number of previous occurrences of the corresponding "ID" within exactly one year prior to the date of that observation. An example of what I want to get is shown in the table below where I have manually calculated the values of the "desired variable" to demonstrate what I wish to output.
    ID   Date         Desired variable
    1    01/08/2003   0
    1    05/09/2003   1
    1    06/03/2005   0
    2    13/05/2010   0
    2    01/05/2015   0
    3    03/09/2015   1
    4    01/01/1999   0
    4    01/02/1999   1
    4    05/08/1999   2
    4    10/10/1999   3
    I would be very grateful for your help with this.

  • #2
    Please use dataex to give example data. As explained at FAQ Advice #12, date variables are ambiguous otherwise.
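To illustrate the ambiguity: 01/08/2003 in #1 could be read as 1 August or as 8 January. The following is only a sketch, assuming the dates are held as day/month/year strings in a hypothetical variable called datestr (that name is not from the thread). It converts them to Stata daily dates and then calls dataex so the example can be posted unambiguously.

```stata
* Sketch only: datestr is a hypothetical string variable;
* the three rows are the first three rows of the table in #1
clear
input int ID str10 datestr
1 "01/08/2003"
1 "05/09/2003"
1 "06/03/2005"
end

* daily() with the mask "DMY" reads each string as day/month/year
generate opdate = daily(datestr, "DMY")
format opdate %td

* dataex writes the data as input code that others can copy and run
dataex ID opdate
```

With the "MDY" mask instead, the first date would become 8 January 2003, which is why a dataex listing removes the guesswork.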



    • #3
      Hi Nick,

      Apologies, hope this works:

      Code:
      * Example generated by -dataex-.
      clear
      input int ID float opdate
      1 19876
      1 19884
      1 19092
      1 19359
      1 20906
      139 25617
      139 25649
      141 26049
      197 19424
      197 19493
      end
      format %d opdate



      • #4
        Thanks for the example. There is small print over what to do with leap years, but rangestat from SSC can help. Naturally, replace missings with 0 if needed.

        Code:
        * Example generated by -dataex-.
        clear
        input int ID float opdate
        1 19876
        1 19884
        1 19092
        1 19359
        1 20906
        139 25617
        139 25649
        141 26049
        197 19424
        197 19493
        end
        format %d opdate
        
        rangestat (count) wanted=opdate, int(opdate -365 -1) by(ID)
        
        list 
        
             +--------------------------+
             |  ID      opdate   wanted |
             |--------------------------|
          1. |   1   02jun2014        . |
          2. |   1   10jun2014        1 |
          3. |   1   09apr2012        . |
          4. |   1   01jan2013        1 |
          5. |   1   28mar2017        . |
             |--------------------------|
          6. | 139   19feb2030        . |
          7. | 139   23mar2030        1 |
          8. | 141   27apr2031        . |
          9. | 197   07mar2013        . |
         10. | 197   15may2013        1 |
             +--------------------------+
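The remark about replacing missings with 0 can be sketched as follows, assuming rangestat is installed from SSC and using the first five observations of the example data. rangestat returns missing when no observation falls inside the window, and for a count of previous appearances that corresponds to zero.

```stata
* Requires rangestat from SSC: ssc install rangestat
clear
input int ID float opdate
1 19876
1 19884
1 19092
1 19359
1 20906
end
format %td opdate

* Count observations of the same ID dated 1 to 365 days before each opdate
rangestat (count) wanted=opdate, interval(opdate -365 -1) by(ID)

* No observation in the window gives missing; recode that to zero
replace wanted = 0 if missing(wanted)

list
```

Note that the window is a fixed 365 days, so when the preceding twelve months include 29 February it is one day shorter than a calendar year. That is the leap-year small print.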



        • #5
          Thanks Nick that works extremely well.
          Best wishes
          Markos



          • #6
            Nick Cox, apologies for continuing the conversation on an older thread but I am trying to troubleshoot something in my code. Is there any chance that this line of code as demonstrated above
            Code:
             
             rangestat (count) wanted=opdate, int(opdate -365 -1) by(ID)
            has any random/pseudorandom number generation in the process?

            I have a very large piece of code and I have noticed that every time I run the entire section I get a slightly different value at the end (I know this is vague), and so I am just trying to figure out whether any lines of my code are perhaps generating some pseudorandomness that is not reproduced the same each time I run the code.

            Many thanks



            • #7
              Nothing random there.



              • #8
                I have a very large piece of code and I have noticed that every time I run the entire section I get a slightly different value at the end (I know this is vague), and so I am just trying to figure out whether any lines of my code are perhaps generating some pseudorandomness that is not reproduced the same each time I run the code.
                In a situation like this, the most common cause is some line of code whose results depend on the sort order of the data, where the data have been sorted in an indeterminate way. For example, if you have a command like -by v1 (v2), sort: gen last_value_of_x = x[_N]-, the result will evidently depend on which observation comes last within each group defined by v1 once the data are sorted on v2 within v1. Now, if v1 and v2, between them, uniquely identify observations in the data, then for a given value of v1, the result will be the one and only value of x associated with the largest value of v2 that occurs with the given value of v1. But if there are multiple observations with the same values of v1 and v2, and if they have different values of x, then the command does not specify the sort order within that group. When Stata is asked to sort the data and the sorting variables do not specify a unique result, Stata randomizes the sort order within the constraints of the variables specified. That is, in this example, the data will be correctly sorted on v1 and on v2 within v1, but within a batch of observations having the same values of v1 and v2, the sort order is randomized. This means that when you rerun the code, you can get different results each time.
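A small illustration of the tie problem described above, using made-up toy data (the values are hypothetical; only the variable names v1, v2 and x come from the post). Adding x to the sort list is just one way to break the tie here; whether it is the right tie-breaker in real data is a substantive decision.

```stata
* Toy data: the last two observations tie on v1 and v2 but differ on x
clear
input byte v1 byte v2 float x
1 1 10
1 2 20
1 2 30
end

* Indeterminate: either tied observation may be sorted last,
* so this can come out as 20 or as 30
bysort v1 (v2): generate last_value_of_x = x[_N]

* Determinate: x breaks the tie, so the last observation
* within v1 is always the one with x == 30
bysort v1 (v2 x): generate last_value_fixed = x[_N]

list
```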

                While there are other possible causes of indeterminate results, this indeterminate sort problem is the most common culprit. The second most common is probably failure to set the random number seed before beginning to use random functions.
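The second cause can be sketched in a few lines. The seed value 12345 is arbitrary and chosen only for illustration; the point is that resetting the same seed reproduces the same draws.

```stata
clear
set obs 5

* Set the seed before the first call to a random-number function
set seed 12345
generate u1 = runiform()

* Resetting the same seed reproduces exactly the same sequence
set seed 12345
generate u2 = runiform()

* Passes silently because the two sets of draws are identical
assert u1 == u2
```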



                • #9
                  Clyde Schechter I definitely have a lot of those commands, and what you say makes sense. What would be the recommended way of solving this? Choosing another variable v3 (or multiple other variables) that is not really relevant to sort on as well, so that the sorting is as "controlled" as possible?



                  • #10
                    I definitely have a lot of those commands, and what you say makes sense. What would be the recommended way of solving this? Choosing another variable v3 (or multiple other variables) that is not really relevant to sort on as well, so that the sorting is as "controlled" as possible?
                    Well, in most situations there really should be a highly relevant v3 (or multiple other variables). If there isn't, the problem is not that your program is a flawed implementation of your computational transformation; it is that the computational transformation you are implementing is inherently non-deterministic. Putting that in simpler terms, if there is no relevant way to disambiguate the sort order of -by v1 (v2), sort:...=-, you are implying that any way of disambiguating the sort order is fine. But if different sort orders produce different results, then it must follow logically that all of those results are equally valid. In other words, the process you are modeling with your code is inherently non-deterministic: there is no unique right result.

                    More likely, there is (are) a relevant variable(s). It is possible, however, that those variables are not in your data set and you need to augment your data with them in order to resolve the problem. Or they may be in there already and you just haven't recognized their relevance.

                    As a practical approach to chasing down these errors, every time you have an explicit sort command (or -bysort-, or -by ..., sort:-) I would insert before it a new command -isid same_list_of_variables_as_the_sort-. This will at least identify which of your sorts are indeterminate, and then you can figure out what to do for them.
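A sketch of that check on made-up toy data (the values are hypothetical). In a real do-file isid would normally be left uncaptured so that the run halts at the first indeterminate sort; capture is used here only so the demonstration keeps running. The duplicates tag step is an addition to the advice above, offered as one way to find the offending observations.

```stata
* Toy data: observations 2 and 3 tie on v1 and v2
clear
input byte v1 byte v2 float x
1 1 10
1 2 20
1 2 30
2 1 40
end

* isid exits with an error when v1 and v2 do not uniquely identify
* observations; capture stores the return code in _rc instead of halting
capture isid v1 v2
display "isid return code: " _rc

* Flag the observations that share their v1 and v2 values with others
duplicates tag v1 v2, generate(dup)
list if dup > 0
```

A nonzero return code means a sort on v1 v2 would leave ties, so any -by v1 (v2), sort:- calculation that depends on order within those ties is not reproducible.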

