
  • Find longest consecutive run without using tsspell

    Hi Statalists,

    I have a question about identifying the longest consecutive sequence in my data without using tsspell. The reason for avoiding tsspell is the ran_key variable, which is randomly assigned to each observation based on several characteristics in the database (it can be the same for some observations and different for others). I attempted to use xtset or tsset, but I got an error saying that time_id is not unique within each person.

    The objective of this exercise is to determine the following values (when a person is enrolled in the program, i.e., status = 1):
    1. The time_id of the start of each run.
    2. The duration of each run.
    3. The longest streak of runs for each person.

    For instance, Person 1 is an ideal case, starting from time_id = 1 to 36. This individual would have the time_id of the first run as 1, the duration of the first run as 36, and the longest streak as 36.
    Person 2 has two consecutive runs starting from time_id = 1 to 8, and time_id = 10 to 12. The duration of the first run is 8, and the second run is 3 (the longest is 8).
    Person 4 has two consecutive runs as well. The first run goes from time_id = 1 to 6, and the second goes from time_id 8 to 12. The duration for these runs are 6 and 4 (the longest is 6).
    Person 5 has three consecutive runs, starting from time_id = 1 to 2 (duration = 2), time_id = 17 to 23 (duration = 7), and time_id = 31 to 36 (duration = 6). The longest run is 7.

    The data (using dataex) and my latest approach are below. What I have tried so far (without tsspell) is to define the gap by calculating the current time_id minus its lag. I can only identify the gap between runs; I could not calculate the longest streak for each run, and, even worse, I could not capture gaps of more than one period (as for Person 5). I would appreciate any help or ideas for solving this problem. Please note that the original data behind this cut-down version are huge, so it would be great to solve this without reshaping to wide format.

    Thank you!
    Kob

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long person_id str4(ran_key) time_id status
    1    96D7    1    1
    1    8755    1    1
    1    6ADF    1    1
    1    96D7    2    1
    1    8755    3    1
    1    96D7    3    1
    1    96D7    4    1
    1    6ADF    4    1
    1    8755    4    1
    1    8755    5    1
    1    6ADF    5    1
    1    96D7    6    1
    1    6ADF    7    1
    1    8755    8    1
    1    96D7    8    1
    1    6ADF    8    1
    1    6ADF    9    1
    1    8755    10    1
    1    96D7    11    1
    1    8755    12    1
    1    6ADF    12    1
    1    96D7    13    1
    1    96D7    14    1
    1    8755    14    1
    1    6ADF    14    1
    1    8755    15    1
    1    96D7    15    1
    1    8755    16    1
    1    96D7    17    1
    1    6ADF    17    1
    1    6ADF    18    1
    1    8755    18    1
    1    6ADF    19    1
    1    8755    19    1
    1    6ADF    20    1
    1    8755    21    1
    1    6ADF    21    1
    1    96D7    22    1
    1    6ADF    22    1
    1    8755    22    1
    1    96D7    23    1
    1    6ADF    23    1
    1    8755    23    1
    1    6ADF    24    1
    1    8755    25    1
    1    96D7    26    1
    1    8755    26    1
    1    8755    27    1
    1    96D7    28    1
    1    6ADF    29    1
    1    96D7    29    1
    1    96D7    30    1
    1    8755    30    1
    1    96D7    31    1
    1    6ADF    32    1
    1    96D7    32    1
    1    8755    33    1
    1    6ADF    33    1
    1    6ADF    34    1
    1    96D7    34    1
    1    8755    35    1
    1    96D7    36    1
    2    6C47    1    1
    2    6C47    2    1
    2    6C47    3    1
    2    6C47    4    1
    2    6C47    5    1
    2    6C47    6    1
    2    6C47    7    1
    2    6C47    8    1
    2    8200    9    0
    2    6C47    10    1
    2    8200    10    0
    2    8200    11    0
    2    6C47    11    1
    2    6C47    12    1
    2    8200    12    0
    4    8700    1    1
    4    3B70    1    1
    4    8700    2    1
    4    8700    3    1
    4    8700    4    1
    4    8700    5    1
    4    8700    6    1
    4    8700    8    1
    4    8700    9    1
    4    8700    10    1
    4    8700    11    1
    4    8700    12    1
    4    10F7    15    0
    4    10F7    16    0
    4    10F7    17    0
    4    10F7    18    0
    4    10F7    19    0
    4    10F7    20    0
    4    10F7    21    0
    4    10F7    22    0
    4    10F7    23    0
    4    10F7    24    0
    4    10F7    25    0
    4    10F7    26    0
    4    10F7    27    0
    4    3351    28    0
    4    3351    29    0
    4    3351    30    0
    4    3351    31    0
    4    3351    32    0
    4    3351    33    0
    4    3351    34    0
    4    3351    35    0
    4    3351    36    0
    5    B315    1    1
    5    1210    1    1
    5    B315    2    1
    5    1210    2    1
    5    1210    17    1
    5    B315    18    1
    5    1210    18    1
    5    1210    19    1
    5    B315    19    1
    5    1210    20    1
    5    B315    20    1
    5    B315    21    1
    5    1210    21    1
    5    1210    22    1
    5    B315    22    1
    5    1210    23    1
    5    B315    23    1
    5    1210    31    1
    5    B315    31    1
    5    1210    32    1
    5    B315    32    1
    5    1210    33    1
    5    B315    33    1
    5    B315    34    1
    5    1210    34    1
    5    B315    35    1
    5    1210    35    1
    5    1210    36    1
    5    B315    36    1
    end
    
    gen run1 = .
    bysort person_id status (time_id): replace run1 = cond(_n == 1, 1, time_id - time_id[_n-1]) if status != 0
    
    gen gap = .
    replace gap = 1 if run1 != . & run1 > 1
    bysort person_id (time_id): carryforward gap, gen(run2)

  • #2
    I don't know what you mean by "longest streak of runs," but the following code will get you the first two things you want:

    Code:
    gen `c(obs_t)' obs_no = _n    // MARK THE CURRENT SORT ORDER SINCE ran_key IS RANDOM
    
    by person_id (obs_no), sort: gen run = sum(status != status[_n-1] | (time_id > time_id[_n-1] + 1))
    
    by person_id (run obs_no), sort: egen start_first_run = min(cond(status, time_id, .))
    
    by person_id run (obs_no): gen duration = time_id[_N] - time_id[1] + 1
    by person_id run (obs_no): replace duration = . if !status[1] //ONLY COUNT RUNS WITH status == 1
    Added:
    On a guess that by "longest streak of runs" you meant the duration of the person's longest run, you can get that by following the above code with:
    Code:
    by person_id (run obs_no): egen longest_duration = max(duration)
    Last edited by Clyde Schechter; 09 Dec 2023, 19:25.



    • #3
      Hi Dr. Clyde Schechter,

      Thank you so much for your suggestion and help. Your guess is correct. By "longest streak", I refer to the maximum duration among all the runs.

      I have a couple of follow-up questions:
      1. Regarding the use of `c(obs_t)' after gen: I reread the documentation for gen, but I could not find any relevant information about this. Could you please elaborate on the purpose of this specific code?
      2. I followed your code and tried to determine the number of runs for each person. The current code does not give me this information accurately, since the total count mixes runs with status = 0 and status = 1. For instance, the count of runs for person 2 should be 2 instead of 5. Do you have any further suggestions on how to obtain this count correctly?

      I appreciate any suggestions or thoughts you may have.

      Best regards,
      Kob



      • #4
        You won't find `c(obs_t)' in the documentation for -gen-. You'll find it in the documentation for -creturn-. But here's what it is. If we wrote -gen obs_no = _n-, by default -gen- creates all new variables as float storage type. But this could be a serious problem: a float storage type can only hold about 7 decimal digits worth of information. If your data set contains on the order of 10 million observations or more, then you will no longer get distinct values for obs_no in every observation--there are just not enough bits to do that. So for large data sets you need to use either a long or a double. You might "play it safe" by always using double, as that has the greatest number of digits available. But then in smaller data sets you are wasting a lot of memory to do that. This is where `c(obs_t)' comes in. `c(obs_t)' is set by Stata and it will be the smallest data storage type that is large enough to hold distinct numbers for every observation in your data set. When your data set grows or shrinks as you work with it, Stata updates `c(obs_t)' accordingly. So -gen `c(obs_t)' obs_no = _n- is a safe, and also efficient, way to create an observation-number variable. It's always just large enough for the job, but no larger.
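
        To make this concrete, here is a minimal sketch (my addition, not from the original posts) that simply inspects `c(obs_t)' before using it, on throwaway data:

        Code:
        * Sketch: c(obs_t) is the smallest storage type able to hold a
        * distinct value for every observation in the current data set
        clear                             // WARNING: drops any data in memory
        set obs 1000
        display "`c(obs_t)'"              // a small type suffices for 1,000 obs
        gen `c(obs_t)' obs_no = _n        // Clyde's idiom: type matched to _N
        describe obs_no                   // confirms the storage type chosen
        The point of the idiom is that you never have to guess: hardcoding -double- always works but wastes memory on small data sets, while `c(obs_t)' adapts automatically as the data set grows or shrinks.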

        As for a count of the number of runs, had you asked for it in #1, I would have supplied it in #2. You can get it with the following code, run after the code shown in #2:
        Code:
        by person_id run (obs_no): gen run_count = _n == 1 & status == 1
        by person_id (run obs_no): replace run_count = sum(run_count)
        by person_id (run obs_no): replace run_count = run_count[_N]
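
        As a quick way to eyeball the results, here is a sketch (my addition) that lists one summary row per person; it assumes the variables obs_no, run, start_first_run, longest_duration, and run_count created in #2 and above:

        Code:
        * Sketch: show each person's summary values once
        bysort person_id (run obs_no): gen byte last = _n == _N
        list person_id start_first_run longest_duration run_count if last, noobs
        drop last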



        • #5
          Dr.Clyde Good,

          Thank you so much for taking the time to clarify these details for me. This information is new and invaluable to my work. Despite having used Stata for some time, I was not aware of this potential problem with big data, and your insight is useful in this regard.

          Regarding the follow-up question, I am sorry for any confusion that I caused in #1. I realize that I did not articulate my requirement clearly; it was only upon running your code that I recognized the need for this specific information. Although I attempted to modify the code to get it, my effort was unsuccessful. Your solution helped me understand the logic behind using _n and _N to achieve the desired results. Once again, I really appreciate your help and advice.

          Actually, I have an additional question and would appreciate your general advice on the matter. When tackling this issue, do you recommend approaching it as you demonstrated, or would you suggest modifying the data to make it compatible with the tsspell package? Personally, I tend to adjust the code to better suit my database, as I am concerned about potential errors that may arise if I were to collapse the data for compatibility with the tsspell package. I recognize that it could have been easier to collapse everything to the individual level (by disregarding the ran_key column). However, there is a risk of accidentally dropping some observations without proper acknowledgement. I would appreciate your general thoughts on this.

          Sincerely,
          Kob
          Last edited by Papungkorn Kitcharoenkarnkul; 10 Dec 2023, 01:34.



          • #6
            When tackling this issue, do you recommend approaching it as you demonstrated, or would you suggest modifying the data to make it compatible with the tsspell package? Personally, I tend to adjust the code to better suit my database, as I am concerned about potential errors that may arise if I were to collapse the data for compatibility with the tsspell package. I recognize that it could have been easier to collapse everything to the individual level (by disregarding the ran_key column). However, there is a risk of accidentally dropping some observations without proper acknowledgement. I would appreciate your general thoughts on this.
            It is true that each time you modify the data set, you run the risk of doing it incorrectly and introducing errors. But that has to be traded off against the risk of making a coding error in doing the various run calculations yourself vs relying on -tsspell-. Myself, I do not use -tsspell-. There's nothing wrong with it, but I was already quite comfortable with and proficient at doing these spell calculations myself when Nick Cox introduced -tsspell-, or at least when I first became aware of -tsspell-'s existence. So I would keep the data as they are and use the native Stata commands.

            But your situation may be different. You chose to post the problem here, so I infer you were not confident in your ability to solve it yourself. -tsspell- has no magic; it can only solve spell problems that can also be solved with native Stata commands. But it abstracts the spell problem in a convenient way so that you just have to specify some key parameters of the spell problem, and it "figures out" the rest for you. It greatly reduces the risk of getting the spell calculations wrong. And modifying a data set by using -collapse- is pretty simple and the risk of doing it wrong is, in general, quite low. So the tradeoff would appear to lean in the other direction for you.

            All of that said, it isn't entirely clear to me how you could use -collapse- with this data to get something that -tsspell- could handle correctly. The problem is that you not only have repeated observations on person_id and time_id (which by itself can easily be handled), but some of those repeat observations differ on the value of status, which is a key variable in defining the start and end of a run in your case. Perhaps if I gave it longer and harder thought, I could see a way to -collapse- this data to something that is -xtset-able and still preserve the important information conveyed by variable status. But at the moment, I'm not sure it's even possible.



            • #7
              Hi Dr. Clyde Schechter,

              I just realized that I tagged the wrong person in my previous comment, and I apologize for this confusion.

              Thank you for sharing your thoughts on strategies for approaching the data. Although I have been using Stata for some time, my experience primarily involves using pre-existing packages from experts like Nick Cox. In a recent attempt, I looked into the source code of tsspell and tried to follow along, but I got confused because of my limited understanding. The available materials on the internet (such as the Stata documentation or instructional videos on YouTube) focus more on command usage than on Stata's native constructs like _n and _N for data manipulation. Could you please recommend additional resources, materials, or exercises that I can study for a better understanding of native Stata? I would appreciate any suggestions you may have.

              Regarding data collapsing, I share a similar view to yours. At first, I thought about using duplicates before collapsing the data. However, as you mentioned, there is a risk of losing some observations, possibly without noticing it. I find this question a bit challenging, which is why I posted it here, and I am grateful for all of your responses.

              Thank you for your time and expertise.

              Sincerely,
              Kob



              • #8
                Well, I think the "required reading" for anybody who wants to gain proficiency in Stata is in the PDF documentation that comes with your Stata installation. Specifically, one should read both the [GS] Getting Started and [U] User's Guide volumes in their entirety. They provide a broad overview, and many of the details as well.

                To move beyond those basics, there are several approaches. For the adventurous, you can just try things out and discover how they work by experimenting with them. For most people, though, that won't be an efficient way to learn. There are a number of books available in the StataCorp bookstore (and they can also be bought from the usual bookstores). I hesitate to recommend any one because they differ in their learning approaches and the fields on which they draw for worked examples. I think it is best to peruse a few and see if you can find one that looks suitable. Another approach is StataCorp's NetCourses. This was actually how I got up to speed initially, and I thought they were excellent. Between them and what I learned on Statalist, I grew my knowledge of Stata over time.



                • #9
                  Perhaps I should add something as the author of tsspell (from SSC). It was a successor to spell (also from SSC, written with Richard Goldstein), which now seems to be used much less. tsspell grew out of noticing that the structure provided by tsset (or xtset with a time variable as well as an identifier) was often matched by data, and was at any rate a standard Stata framework for the analysis of spells in time.

                  I started to write up the principles of identifying spells using that approach in a paper for the Stata Journal. Originally I thought it would be fairly short, but proceeding slowly but surely through the discussion, I found I had written a paper of moderate length without even discussing tsspell -- so I turned it in. The intention was perhaps to write a sequel specifically about tsspell, but in any case its help file was written as a discussion intended to give users enough explanation. The sequel was never written and now doesn't seem much missed.

                  Naturally, more can be said about spells, and it was and remains the case that how to identify spells -- principally by working from a statement of what defines a spell to identifying when each spell starts and stops -- could easily be coded up independently, as Clyde Schechter did.

                  Anyway, the paper https://journals.sagepub.com/doi/pdf...867X0700700209 explains the principles as I see them.

                  This self-contained example shows that you don't need tsset or tsspell. You don't even need regularly spaced times, or even time at all, so long as you have some kind of position along a single dimension. (As a geographer, I am accustomed to series of observations down slopes, down rivers, across beaches, across country, from city centres outwards, up into the atmosphere, down into the ground, and so on.) You would need something more complicated (indeed, more rules) if each time or position could occur more than once. What you do need, at least with the way I do it, is some fluency with subscripting observations, the use of by:, and some standard Stata functions.

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input float(time whatever)
                   1 1
                   2 1
                   3 1
                   4 1
                   5 0
                   6 0
                   7 0
                   8 1
                   9 1
                  10 1
                  11 0
                  12 0
                  13 1
                  14 1
                  15 1
                  end
                  
                  sort time
                  gen _seq = 1 if whatever == 1 & whatever[_n-1] != 1
                  gen _spell = cond(whatever != 1, 0, sum(_seq))
                  replace _seq = cond(whatever != 1, 0, _seq[_n-1] + 1) if _seq != 1
                  bysort _spell (time) : gen _end = _n == _N & _spell > 0
                  sort time
                  
                  list, sepby(_spell)
                  
                       +---------------------------------------+
                       | time   whatever   _seq   _spell  _end |
                       |---------------------------------------|
                    1. |    1          1      1        1     0 |
                    2. |    2          1      2        1     0 |
                    3. |    3          1      3        1     0 |
                    4. |    4          1      4        1     1 |
                       |---------------------------------------|
                    5. |    5          0      0        0     0 |
                    6. |    6          0      0        0     0 |
                    7. |    7          0      0        0     0 |
                       |---------------------------------------|
                    8. |    8          1      1        2     0 |
                    9. |    9          1      2        2     0 |
                   10. |   10          1      3        2     1 |
                       |---------------------------------------|
                   11. |   11          0      0        0     0 |
                   12. |   12          0      0        0     0 |
                       |---------------------------------------|
                   13. |   13          1      1        3     0 |
                   14. |   14          1      2        3     0 |
                   15. |   15          1      3        3     1 |
                       +---------------------------------------+
                  Let's go through that more slowly.

                  1. The problem is to identify spells of 1 among other possibilities, here 0.

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input float(time whatever)
                   1 1
                   2 1
                   3 1
                   4 1
                   5 0
                   6 0
                   7 0
                   8 1
                   9 1
                  10 1
                  11 0
                  12 0
                  13 1
                  14 1
                  15 1
                  end
                  2. The data have to be in time (position) order!

                  Code:
                  sort time
                  3. A spell starts (sequence variable 1) if this observation contains 1 and the previous observation doesn't. Notice that this works fine at the start of the data, as
                  Code:
                  whatever[0]
                  is evaluated as missing.

                  Code:
                  gen _seq = 1 if whatever == 1 & whatever[_n-1] != 1
                  4. We count spells by coding 0 if the observations don't contain 1 and by keeping track of the cumulative or running sum of starts.
                  (Every spell must start just once. A spell could be a single observation, but that doesn't bite so long as we count starts.) The cumulative sum will ignore missings.

                  Code:
                  gen _spell = cond(whatever != 1, 0, sum(_seq))
                  5. We count within spells by bumping up the previous counter if we're in a spell. Otherwise the counter is set to 0.

                  Code:
                  replace _seq = cond(whatever == 1,  _seq[_n-1] + 1, 0) if _seq != 1
                  6. Evidently the start of a spell is coded by _seq being 1. It can be helpful to have a marker at the end of a spell too.

                  Code:
                  bysort _spell (time) : gen _end = _n == _N & _spell > 0
                  We're done.

                  Code:
                  sort time
                  
                  list, sepby(_spell)
                  
                       +---------------------------------------+
                       | time   whatever   _seq   _spell  _end |
                       |---------------------------------------|
                    1. |    1          1      1        1     0 |
                    2. |    2          1      2        1     0 |
                    3. |    3          1      3        1     0 |
                    4. |    4          1      4        1     1 |
                       |---------------------------------------|
                    5. |    5          0      0        0     0 |
                    6. |    6          0      0        0     0 |
                    7. |    7          0      0        0     0 |
                       |---------------------------------------|
                    8. |    8          1      1        2     0 |
                    9. |    9          1      2        2     0 |
                   10. |   10          1      3        2     1 |
                       |---------------------------------------|
                   11. |   11          0      0        0     0 |
                   12. |   12          0      0        0     0 |
                       |---------------------------------------|
                   13. |   13          1      1        3     0 |
                   14. |   14          1      2        3     0 |
                   15. |   15          1      3        3     1 |
                       +---------------------------------------+
                  Now it may seem that we've just solved a toy problem, but the solution to this one is the key to many other problems. Naturally you can choose other variable names.

                  * Any simple definition of a spell is reducible to some condition being true (1) within a spell and false (0) outside it. (Exceptions?)

                  * We want to do this separately for each panel, or for each year, or whatever? by: provides the machinery.
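
                  For example, a sketch (my addition) of the same spell logic wrapped in by:, assuming a hypothetical panel identifier id alongside time and whatever:

                  Code:
                  * Sketch: spell identification run separately within each panel
                  bysort id (time): gen _seq = 1 if whatever == 1 & whatever[_n-1] != 1
                  by id: gen _spell = cond(whatever != 1, 0, sum(_seq))
                  by id: replace _seq = cond(whatever == 1, _seq[_n-1] + 1, 0) if _seq != 1
                  bysort id _spell (time): gen _end = _n == _N & _spell > 0
                  Under by:, subscripts are interpreted within group, so whatever[_n-1] is missing at each panel's first observation, exactly as needed for the start-of-spell test.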

                  * We want the length of the longest spell?

                  Code:
                  su _seq, meanonly
                  di r(max)
                  OR with panel or group data identified by id

                  Code:
                  egen max = max(_seq), by(id)
                  * Many other summaries yield to some egen function, such as the number of spells, the mean length of spell, summaries of some other variable for each spell, and so forth.
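
                  Two such summaries as a sketch (my addition; it assumes the _seq, _spell, and _end variables above and a group identifier id):

                  Code:
                  * Number of spells per group: spells are numbered 1, 2, ... within id
                  egen n_spells = max(_spell), by(id)
                  
                  * Mean spell length per group: a spell's length is its _seq at _end
                  egen mean_len = mean(cond(_end, _seq, .)), by(id)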

                  * We allow short breaks within spells? That is a repeated calculation.

                  Spells of standard kind; spells that are short breaks; composite calculation where a spell is either standard or a short break.
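
                  One hedged sketch of the composite calculation (my addition; it assumes consecutive integer times as in the toy example, and deliberately does not bridge breaks touching either end of the series):

                  Code:
                  * Sketch: bridge interior breaks of length <= 2, then rerun the
                  * spell logic of the main example on w2 instead of whatever
                  sort time
                  gen byte z = whatever == 0
                  gen byte zfirst = z == 1 & z[_n-1] != 1
                  gen zrun = cond(z, sum(zfirst), 0)
                  su time, meanonly
                  local tmin = r(min)
                  local tmax = r(max)
                  bysort zrun (time): gen byte bridge = zrun > 0 & _N <= 2 & ///
                      time[1] > `tmin' & time[_N] < `tmax'   // short, interior break
                  sort time
                  gen byte w2 = cond(bridge, 1, whatever)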
                  Last edited by Nick Cox; 12 Dec 2023, 04:50.
