Help obtaining number of previous appearances of unique ID within designated timeframe

Markos Valsamis

Join Date: Jan 2022

Posts: 43
#1

01 Oct 2022, 04:26

Hello all,

I have a "date" variable and an "ID" variable for a number of observations. I want to generate a new variable ("desired variable"), the value of which represents the number of previous occurrences of the corresponding "ID" within exactly one year prior to the date of that observation. An example of what I want to get is shown in the table below where I have manually calculated the values of the "desired variable" to demonstrate what I wish to output.

ID    Date          Desired variable
1     01/08/2003    0
1     05/09/2003    1
1     06/03/2005    0
2     13/05/2010    0
2     01/05/2015    0
3     03/09/2015    1
4     01/01/1999    0
4     01/02/1999    1
4     05/08/1999    2
4     10/10/1999    3

I would be very grateful for your help with this.
Nick Cox

Join Date: Mar 2014

Posts: 35664
#2

01 Oct 2022, 04:35

Please use dataex to give example data. As explained at FAQ Advice #12, date variables are ambiguous otherwise.

Markos Valsamis

Join Date: Jan 2022
Posts: 43

01 Oct 2022, 04:52

Hi Nick,

Apologies, hope this works:

Code:

* Example generated by -dataex-.
clear
input int ID float opdate
1 19876
1 19884
1 19092
1 19359
1 20906
139 25617
139 25649
141 26049
197 19424
197 19493
end
format %d opdate


Nick Cox

Join Date: Mar 2014
Posts: 35664

01 Oct 2022, 05:05

Thanks for the example. There is some small print about what to do with leap years, but rangestat from SSC can help. Naturally, replace missings with 0 if needed.

Code:

* Example generated by -dataex-.
clear
input int ID float opdate
1 19876
1 19884
1 19092
1 19359
1 20906
139 25617
139 25649
141 26049
197 19424
197 19493
end
format %d opdate

rangestat (count) wanted=opdate, int(opdate -365 -1) by(ID)

list 

     +--------------------------+
     |  ID      opdate   wanted |
     |--------------------------|
  1. |   1   02jun2014        . |
  2. |   1   10jun2014        1 |
  3. |   1   09apr2012        . |
  4. |   1   01jan2013        1 |
  5. |   1   28mar2017        . |
     |--------------------------|
  6. | 139   19feb2030        . |
  7. | 139   23mar2030        1 |
  8. | 141   27apr2031        . |
  9. | 197   07mar2013        . |
 10. | 197   15may2013        1 |
     +--------------------------+
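
The two caveats mentioned above (missings and leap years) can be sketched as follows. This is only an illustration under stated assumptions: it assumes rangestat is installed from SSC, and the variable names low, high and wanted_cal are hypothetical, not from the thread. The "same calendar day one year earlier" rule is one possible reading of the leap-year small print, not the only one.

```stata
* Assumes rangestat is installed: ssc install rangestat
clear
input int ID float opdate
1 19876
1 19884
1 19092
1 19359
1 20906
139 25617
139 25649
141 26049
197 19424
197 19493
end
format %td opdate

* Fixed 365-day window, as in the post above
rangestat (count) wanted = opdate, interval(opdate -365 -1) by(ID)
* rangestat returns missing when no observation falls in the window
replace wanted = 0 if missing(wanted)

* Hypothetical calendar-year variant: lower bound is the same day one year earlier
gen long low = mdy(month(opdate), day(opdate), year(opdate) - 1)
* 29 Feb has no counterpart in the previous year; fall back to 28 Feb (an assumption)
replace low = mdy(2, 28, year(opdate) - 1) if missing(low) & !missing(opdate)
gen long high = opdate - 1
rangestat (count) wanted_cal = opdate, interval(opdate low high) by(ID)
replace wanted_cal = 0 if missing(wanted_cal)

list, sepby(ID)
```

When the bounds are given as variables rather than numbers, rangestat treats them as the interval limits themselves rather than as offsets from opdate, which is why low and high are built as dates here.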


Markos Valsamis

Join Date: Jan 2022

Posts: 43
#5

01 Oct 2022, 05:21

Thanks Nick that works extremely well.
Best wishes
Markos
Markos Valsamis

Join Date: Jan 2022

Posts: 43
#6

26 Dec 2022, 12:06

Nick Cox, apologies for continuing the conversation on an older thread but I am trying to troubleshoot something in my code. Is there any chance that this line of code as demonstrated above

Code:

rangestat (count) wanted=opdate, int(opdate -365 -1) by(ID)

has any random/pseudorandom number generation in the process?

I have a very large piece of code and I have noticed that every time I run the entire section I get a slightly different value at the end (I know this is vague), and so I am just trying to figure out whether any lines of my code are perhaps generating some pseudorandomness that is not reproduced the same each time I run the code.

Many thanks
Nick Cox

Join Date: Mar 2014

Posts: 35664
#7

26 Dec 2022, 12:31

Nothing random there.
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#8

26 Dec 2022, 13:44

"I have a very large piece of code and I have noticed that every time I run the entire section I get a slightly different value at the end (I know this is vague), and so I am just trying to figure out whether any lines of my code are perhaps generating some pseudorandomness that is not reproduced the same each time I run the code."

In a situation like this, the most common cause is some line of code that produces results that depend on the sort order of the data, when the data have been sorted in an indeterminate way. For example, if you have a command like -by v1 (v2), sort: gen last_value_of_x = x[_N]-, the result depends on which observation ends up last within each group defined by v1 once the data are sorted on v2 within v1. Now, if v1 and v2, between them, uniquely identify observations in the data, then for a given value of v1, the result will be the one and only value of x associated with the largest value of v2 that occurs with the given value of v1. But if there are multiple observations with the same values of v1 and v2, and if they have different values of x, then the command does not specify the sort order within that group. When Stata is asked to sort the data and the sorting variables do not determine a unique order, Stata randomizes the sort order within the constraints of the variables specified. That is, in this example, the data will be correctly sorted on v1 and on v2 within v1, but within a batch of observations having the same values of v1 and v2, the sort order is randomized. This means that when you rerun the code, you can get different results each time.

While there are other possible causes of indeterminate results, this indeterminate sort problem is the most common culprit. The second most common is probably failure to set the random number seed before beginning to use random functions.
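
To make the indeterminate sort concrete, here is a minimal sketch with made-up data. The values and the variable name last_x_fixed are hypothetical; using x itself as the tie-breaker is only an assumption for illustration, and in real data the tie-breaker should be whatever variable is substantively relevant.

```stata
clear
input v1 v2 x
1 1 10
1 2 20
1 2 30
2 1 40
end

* v1 and v2 do not uniquely identify observations: two rows share v1 == 1, v2 == 2
duplicates report v1 v2

* Whether 20 or 30 ends up last within v1 == 1 is not determined by this sort,
* so last_value_of_x can differ from run to run
by v1 (v2), sort: gen last_value_of_x = x[_N]

* Adding a tie-breaker to the sort (here x, purely for illustration)
* makes the order, and hence the result, reproducible
by v1 (v2 x), sort: gen last_x_fixed = x[_N]

list, sepby(v1)
```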
Markos Valsamis

Join Date: Jan 2022

Posts: 43
#9

26 Dec 2022, 14:04

Clyde Schechter I definitely have a lot of those commands, and what you say makes sense. What would be the recommended way of solving this? Choosing another variable v3 (or multiple other variables) that is not really relevant to sort on too, so that the sorting can be as "controlled" as possible?
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#10

26 Dec 2022, 16:48

"I definitely have a lot of those commands, and what you say makes sense. What would be the recommended way of solving this? Choosing another variable v3 (or multiple other variables) that is not really relevant to sort on too, so that the sorting can be as "controlled" as possible?"

Well, in most situations there really should be a highly relevant v3 (or multiple other variables). If there isn't, the problem is not that your program is a flawed implementation of your computational transformation; it is that the transformation you are implementing is inherently non-deterministic. Putting that in simpler terms, if there is no relevant way to disambiguate the sort order of -by v1 (v2), sort: ...-, you are implying that any way of disambiguating the sort order is fine. But if different sort orders produce different results, then it must follow logically that all of those results are equally valid. In other words, the process you are modeling with your code is inherently non-deterministic: there is no unique right result.

More likely, there is (are) a relevant variable(s). It is possible, however, that those variables are not in your data set and you need to augment your data with them in order to resolve the problem. Or they may be in there already and you just haven't recognized their relevance.

As a practical approach to chasing down these errors, before every explicit -sort- command (or -bysort-, or -by ..., sort:-) I would insert a new command -isid same_list_of_variables_as_the_sort-. This will at least identify which of your sorts are indeterminate, and then you can figure out what to do about them.
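
A minimal sketch of that check, using made-up data; the variable name n_ties and the choice of x as the extra sort variable are hypothetical. It wraps -isid- in -capture- so that the do-file reports the offending observations instead of stopping at the first failure.

```stata
clear
input v1 v2 x
1 1 10
1 2 20
1 2 30
2 1 40
end

* isid fails when the listed variables do not uniquely identify observations;
* capture lets us inspect the return code instead of halting
capture isid v1 v2
if _rc != 0 {
    display as error "sort on v1 v2 is indeterminate"
    * flag and show the tied observations
    duplicates tag v1 v2, generate(n_ties)
    list if n_ties > 0
}

* After choosing a tie-breaker (x here, only for illustration), check again
capture isid v1 v2 x
display "return code with tie-breaker: " _rc
sort v1 v2 x
```

Once the ties are resolved, a plain -isid- (without -capture-) in front of each sort works as a guard: the do-file stops with an error if the sort ever becomes indeterminate again.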
