Hi Statalists,
I have a question about identifying the longest consecutive sequence in my data without using tsspell. I cannot use tsspell because of the ran_key variable, which is randomly assigned to each observation based on several characteristics in the database (it can be the same for some observations and different for others). When I attempted to use xtset or tsset, I got an error saying that time_id is not unique for each observation.
The objective of this exercise is to determine the following values for the periods when a person is enrolled in the program (status = 1):
1. The time_id at the start of each run.
2. The duration of each run.
3. The longest streak (the longest run) for each person.
For instance, Person 1 is an ideal case, starting from time_id = 1 to 36. This individual would have the time_id of the first run as 1, the duration of the first run as 36, and the longest streak as 36.
Person 2 has two consecutive runs starting from time_id = 1 to 8, and time_id = 10 to 12. The duration of the first run is 8, and the second run is 3 (the longest is 8).
Person 4 has two consecutive runs as well. The first run goes from time_id = 1 to 6, and the second goes from time_id = 8 to 12. The durations of these runs are 6 and 5 (the longest is 6).
Person 5 has three consecutive runs, from time_id = 1 to 2 (duration = 2), time_id = 17 to 23 (duration = 7), and time_id = 31 to 36 (duration = 6). The longest run is 7.
The data (using dataex) and my latest approach are below. What I have tried so far (without using tsspell) is to define the gap as the current time_id minus its lag. With this I can only flag the gap between runs; I could not calculate the duration of each run or the longest streak. Even worse, I could not capture the gap when it is more than one period (as for Person 5). I would appreciate any help or ideas to solve this problem. Please note that this is a small cut of the original data, which are huge, so it would be great to solve this without reshaping to wide format.
Thank you!
Kob
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long person_id str4(ran_key) time_id status
1 96D7 1 1
1 8755 1 1
1 6ADF 1 1
1 96D7 2 1
1 8755 3 1
1 96D7 3 1
1 96D7 4 1
1 6ADF 4 1
1 8755 4 1
1 8755 5 1
1 6ADF 5 1
1 96D7 6 1
1 6ADF 7 1
1 8755 8 1
1 96D7 8 1
1 6ADF 8 1
1 6ADF 9 1
1 8755 10 1
1 96D7 11 1
1 8755 12 1
1 6ADF 12 1
1 96D7 13 1
1 96D7 14 1
1 8755 14 1
1 6ADF 14 1
1 8755 15 1
1 96D7 15 1
1 8755 16 1
1 96D7 17 1
1 6ADF 17 1
1 6ADF 18 1
1 8755 18 1
1 6ADF 19 1
1 8755 19 1
1 6ADF 20 1
1 8755 21 1
1 6ADF 21 1
1 96D7 22 1
1 6ADF 22 1
1 8755 22 1
1 96D7 23 1
1 6ADF 23 1
1 8755 23 1
1 6ADF 24 1
1 8755 25 1
1 96D7 26 1
1 8755 26 1
1 8755 27 1
1 96D7 28 1
1 6ADF 29 1
1 96D7 29 1
1 96D7 30 1
1 8755 30 1
1 96D7 31 1
1 6ADF 32 1
1 96D7 32 1
1 8755 33 1
1 6ADF 33 1
1 6ADF 34 1
1 96D7 34 1
1 8755 35 1
1 96D7 36 1
2 6C47 1 1
2 6C47 2 1
2 6C47 3 1
2 6C47 4 1
2 6C47 5 1
2 6C47 6 1
2 6C47 7 1
2 6C47 8 1
2 8200 9 0
2 6C47 10 1
2 8200 10 0
2 8200 11 0
2 6C47 11 1
2 6C47 12 1
2 8200 12 0
4 8700 1 1
4 3B70 1 1
4 8700 2 1
4 8700 3 1
4 8700 4 1
4 8700 5 1
4 8700 6 1
4 8700 8 1
4 8700 9 1
4 8700 10 1
4 8700 11 1
4 8700 12 1
4 10F7 15 0
4 10F7 16 0
4 10F7 17 0
4 10F7 18 0
4 10F7 19 0
4 10F7 20 0
4 10F7 21 0
4 10F7 22 0
4 10F7 23 0
4 10F7 24 0
4 10F7 25 0
4 10F7 26 0
4 10F7 27 0
4 3351 28 0
4 3351 29 0
4 3351 30 0
4 3351 31 0
4 3351 32 0
4 3351 33 0
4 3351 34 0
4 3351 35 0
4 3351 36 0
5 B315 1 1
5 1210 1 1
5 B315 2 1
5 1210 2 1
5 1210 17 1
5 B315 18 1
5 1210 18 1
5 1210 19 1
5 B315 19 1
5 1210 20 1
5 B315 20 1
5 B315 21 1
5 1210 21 1
5 1210 22 1
5 B315 22 1
5 1210 23 1
5 B315 23 1
5 1210 31 1
5 B315 31 1
5 1210 32 1
5 B315 32 1
5 1210 33 1
5 B315 33 1
5 B315 34 1
5 1210 34 1
5 B315 35 1
5 1210 35 1
5 1210 36 1
5 B315 36 1
end

* Difference between the current time_id and the previous one, within person and status
gen run1 = .
bysort person_id status (time_id): replace run1 = cond(_n == 1, 1, time_id - time_id[_n-1]) if status != 0

* Flag the first observation after a gap (difference greater than 1)
gen gap = .
replace gap = 1 if run1 != . & run1 > 1

* carryforward is a community-contributed command (ssc install carryforward)
bysort person_id (time_id): carryforward gap, gen(run2)
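One possible direction, given only as a hedged sketch: because time_id repeats within a person, the lag comparison can be made on one flagged row per person_id and time_id instead of on every row, without tsset, tsspell, or reshaping. The example below is self-contained and uses only the Person 2 rows from the data above. The variable names enrolled, first, pick, start, run_id, run_start, run_length, and longest are hypothetical and introduced here for illustration; it also assumes a person counts as enrolled at a time_id if any row at that time_id has status = 1.

```stata
* Sketch only: enrolled, first, pick, start, run_id, run_start, run_length
* and longest are hypothetical names, not variables from the original data.
clear
input long person_id str4 ran_key time_id status
2 6C47 1 1
2 6C47 2 1
2 6C47 3 1
2 6C47 4 1
2 6C47 5 1
2 6C47 6 1
2 6C47 7 1
2 6C47 8 1
2 8200 9 0
2 6C47 10 1
2 8200 10 0
2 8200 11 0
2 6C47 11 1
2 6C47 12 1
2 8200 12 0
end

* Assumption: enrolled at a time_id if any row there has status == 1
bysort person_id time_id: egen byte enrolled = max(status == 1)

* Flag one row per person_id and time_id, so repeated time_ids count once
egen byte first = tag(person_id time_id)
gen byte pick = first & enrolled

* Among flagged rows, a run starts at the first row or after a gap in time_id
bysort person_id pick (time_id): gen byte start = pick & (_n == 1 | time_id - time_id[_n-1] > 1)
bysort person_id pick (time_id): gen long run_id = sum(start) if pick

* Start and duration of each run, then the longest run per person
bysort person_id run_id (time_id): gen run_start = time_id[1] if pick
bysort person_id run_id: gen run_length = _N if pick
bysort person_id: egen longest = max(run_length)

list person_id run_id run_start run_length longest if start, noobs
```

For Person 2 this should list two runs, starting at time_id 1 and 10 with durations 8 and 3, and a longest run of 8, which matches the description above. The run-level results are stored only on the flagged rows; the data stay in long format and no observations are dropped.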