Calculating distance between two times, with dichotomous variables as indicators

Jonathan Horowitz

Join Date: Apr 2015
Posts: 102

09 Apr 2021, 15:52

This probably has a really straightforward answer, but I can't seem to wrap my brain around this, so I'm hoping someone can help me. It's clearly related to event history type data, but it's not exactly the same thing as what we normally see.

Imagine the following dataset, where "1" for startdate1 means you started a book in month 1, and a "1" for enddate3 means you ended a book in month 3. But the twist is that it's not easily separated into discrete books (and for various reasons too strange to mention here, it might not make sense anyway for the actual task at hand).

Code:

startdate1  startdate2  startdate3  startdate4  startdate5  startdate6  enddate1  enddate2  enddate3  enddate4  enddate5  enddate6
1           0           1           0           0           0           0         1         0         0         0         1
0           1           0           0           0           0           0         0         0         0         0         1
0           0           1           0           0           0           0         0         0         0         1         0
0           1           0           0           0           0           0         0         0         0         0         0
1           0           0           1           0           0           0         0         1         0         1         0

Is there an automated way to count the distance between the time someone starts reading a book and ends reading a book? So for the first observation, I would count the distance between startdate1 and enddate2, and then from startdate3 to enddate6 (listed in two different variables). But then for observation #2, I would only fill in the first variable (from startdate2 to enddate6).

Thanks for your time. I greatly appreciate it.

Tags: data, event history, loop, panel data

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

09 Apr 2021, 18:56

This would not be hard if the data were in long layout instead of wide. Then you would drop the observations that are zeroes and just subtract startdate from enddate.
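
As a rough illustration of the long-layout idea, here is a minimal sketch. It assumes an identifier variable named pubid (not present in the table in #1) and a toy dataset in which each person has exactly one start and one end; neither assumption is guaranteed to hold in the real data.

```stata
* Toy wide data: one row per person, 0/1 indicators per month
* (pubid and the three-month window are assumptions for illustration)
clear
input byte(pubid startdate1 startdate2 startdate3 enddate1 enddate2 enddate3)
1 1 0 0 0 0 1
2 0 1 0 0 0 1
end

* One row per person-month, with startdate and enddate as 0/1 indicators
reshape long startdate enddate, i(pubid) j(month)

* Keep only the months in which something happened
keep if startdate | enddate

* With one start and one end per person, the duration is the month of
* the end minus the month of the start
bysort pubid: egen start_month = max(cond(startdate, month, .))
bysort pubid: egen end_month = max(cond(enddate, month, .))
gen duration = end_month - start_month
```

This only works while each person contributes a single start and a single end; with several of each, the pairing question raised in the next paragraph has to be settled first.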

But here's a question. Suppose somebody starts a book in month 1, then starts another in month 2, ends one in month 3 and ends the other in month 4. You can't tell if the first book was the one ended in month 3 or month 4. So you don't really know the durations of either book.

If you need more specific help, post back showing the data in a usable form. That is to say, use the -dataex- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Jonathan Horowitz

Join Date: Apr 2015
Posts: 102

10 Apr 2021, 10:58

Interesting--I like this -dataex- package.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(startdate1 startdate2 startdate3 startdate4 startdate5 startdate6 enddate1 enddate2 enddate3 enddate4 enddate5 enddate6) float pubid
1 0 1 0 0 0 0 1 0 0 0 1 1
0 1 0 0 0 0 0 0 0 0 0 1 2
0 0 1 0 0 0 0 0 0 0 1 0 3
0 1 0 0 0 0 0 0 0 0 0 0 4
1 0 0 1 0 0 0 0 1 0 1 0 5
end

Or, if you prefer, in long format:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float pubid byte(month startdate enddate)
1 1 1 0
1 2 0 1
1 3 1 0
1 4 0 0
1 5 0 0
1 6 0 1
2 1 0 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 0
2 6 0 1
3 1 0 0
3 2 0 0
3 3 1 0
3 4 0 0
3 5 0 1
3 6 0 0
4 1 0 0
4 2 1 0
4 3 0 0
4 4 0 0
4 5 0 0
4 6 0 0
5 1 1 0
5 2 0 0
5 3 0 1
5 4 1 0
5 5 0 1
5 6 0 0
end

I'm curious to see what you would do with this. Like I said, I'm sure there's an easy or pre-made solution to this but for some reason I can't wrap my brain around it.
Re: Your question--yes, this toy example does raise some questions. It's specific to the example here (I've had to change some details of the sample from the original analysis for confidentiality reasons, but the mechanics are the same).

Thanks again.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30095

10 Apr 2021, 12:39

So, starting from the long layout:

Code:

//  NOW GO DOUBLE LONG
reshape long @date, i(pubid month) j(event) string
rename date occurred
keep if occurred
drop occurred
gsort pubid month -event


by pubid (month): assert event == cond(mod(_n, 2), "start", "end")
by pubid (month): gen duration = month[_n+1]-month if event == "start"

Now, if you need to reconnect these results to the original data, you can keep just those observations with event == "start" and then -merge- back to your original data.

Comment

Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#5

10 Apr 2021, 16:58

This is good, and I would not have thought to do it that way. Thanks!

Sounds like if for some reason something is undesirable (e.g., negative number, which would indicate a book was taken out before the study period) you could probably simply drop those cases (duration<0, in the example before) and re-run it again?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#6

10 Apr 2021, 18:31

"Sounds like if for some reason something is undesirable (e.g., negative number, which would indicate a book was taken out before the study period) you could probably simply drop those cases (duration<0, in the example before) and re-run it again?"

That shouldn't happen. If you have an example where it does, please post it here. It shouldn't be possible because when you calculate the duration, the data are sorted by month within pubid, so the difference between the next value of month and the current one can only be 0 or positive.
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#7

11 Apr 2021, 09:33

I haven't used assert before, but it looks like if the data violate the condition it just stops with an error rather than sorting. I tweaked the data to see what would happen if it were fed some more difficult and complicated information, and I got an "assertion is false" error message.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float pubid byte(month startdate enddate)
1 1 1 0
1 2 0 1
1 3 1 0
1 4 0 0
1 5 0 0
1 6 0 1
2 1 1 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 0
2 6 0 1
3 1 0 0
3 2 1 0
3 3 1 0
3 4 0 0
3 5 0 1
3 6 0 0
4 1 0 0
4 2 1 0
4 3 0 0
4 4 0 0
4 5 0 0
4 6 0 0
5 1 1 0
5 2 1 0
5 3 0 1
5 4 1 0
5 5 0 1
5 6 0 0
end

Thanks again for your help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#8

11 Apr 2021, 12:03

OK, so you have stumbled on my, so far unanswered, "here's a question" remark in #2. In this data, pubid 2 has a start in month 1, and then, without ending that, has a start in month 2. Then we get an end in month 6--but nothing in the data tells us which of the starts ended, nor have you suggested how you want to handle the unended other start (whichever one that happens to be.) In other words, the data are such that what you have asked for originally is undefined. That's why I put that -assert- in there: to verify that the data would always provide an unambiguous answer to the question before attempting to proceed.
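
To see which people trip the check instead of stopping at the first failure, something along these lines may help. It is only a sketch: the toy data are made up, and is_end, out_of_turn and ambiguous are hypothetical helper names, not part of the code in #4.

```stata
* Toy "double long" data: one row per event (made-up values)
clear
input float pubid byte month str5 event
1 1 "start"
1 2 "end"
2 1 "start"
2 2 "start"
2 6 "end"
end

* Within a month, put a start ahead of an end (same tie-break as
* -gsort pubid month -event-)
gen byte is_end = event == "end"
bysort pubid (month is_end): gen byte out_of_turn = event != cond(mod(_n, 2), "start", "end")

* Flag every person with at least one event out of turn, then list them
bysort pubid: egen byte ambiguous = max(out_of_turn)
list pubid month event if ambiguous, sepby(pubid)
```

Here pubid 2 is flagged because its second event is a start where an end was expected, which is the same condition the -assert- tests.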

So either these are data errors, or, more likely, you need to clarify your request so that the code can then be modified to handle these situations.
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#9

11 Apr 2021, 12:41

There are two issues going on. Using our example from before, the first issue is the case of someone returning a book and checking out another one in the same month. So, hypothetically, if someone returns a book on the 20th of the month and picks up another one on the 21st, those get coded as the same month. Logically, there should be a way to make it always pick up endings that happen after the start month. My sense is that this is the easier of the two problems to fix, possibly by changing the sorting slightly.

The second problem is things that happen outside of the study window (censoring). Someone checks out a book two months before the study begins, and returns it in the second month of the study. Now we have a case where the "first" event we see is actually an end; in an analogous situation, if someone starts a new book in the last month of the study we would never see the ending. This is the problem I have no idea how to handle. What I'd want to do is drop the situation where we have an ending that happens before a start (or where there is a start but the end is unknown at the conclusion of the study).

The hope is to find the first instance of a start, then count to the first subsequent ending; then find the second instance of a start, then count to the next subsequent ending; and to continue until we've gone through the data. The way I had been planning to handle this was very different than this, but it wasn't very good coding and I kept running into these problems in various forms.

Thanks again. I am also wondering if there is a better way of framing this example so that it's not as confusing. Say, the beginning of cohabitation to the end of cohabitation?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#10

11 Apr 2021, 13:06

"find the first instance of a start, then count to the first subsequent ending; then find the second instance of a start, then count to the next subsequent ending; and to continue until we've gone through the data."

Code:

reshape long @date, i(pubid month) j(event) string
rename date occurred
keep if occurred

// ELIMINATE ANY ENDS THAT OCCUR BEFORE THE FIRST STARTS
by pubid (month), sort: gen n_starts = sum(event == "start")
drop if n_starts == 0
drop n_starts

// SEPARATE STARTS FROM ENDS
preserve
keep if event == "end"
by pubid (month), sort: gen chron_order = _n
rename month end_month
drop occurred event
tempfile endings
save `endings'
restore

keep if event == "start"
by pubid (month), sort: gen chron_order = _n
rename month start_month
drop occurred event
merge 1:1 pubid chron_order using `endings'
gen duration = end_month - start_month

"I am also wondering if there is a better way of framing this example so that it's not as confusing. Say, the beginning of cohabitation to the end of cohabitation?"

Well, I think that would be a worse way of framing the problem. OK, there is such a thing as polyamory, but it's not common, whereas clearly an important problem to resolve in this data is when there have been multiple starts before an end.

I would also say that the biggest problem you have with this data is the inability to identify which end corresponds to which start. You have now resolved it by declaring it to be a FIFO process--which is fine for some things, but almost certainly not for library books. (It is certainly not the case for my own library usage!) But the reality is that you don't know, so that even where there is no censoring, the durations being calculated may be fictitious.

I don't know what your context is here. I initially thought, based on the framing in #1, that it was about library books. But I gather from #9 that it may well have been something else. But the context may well suggest solutions to both the censoring and the inability to link ends to particular starts. If there is truly no way to identify which end goes with which start, I would caution you about doing any kind of duration related analysis here. The results could be misleading. Instead you might want to do some analysis of the number of "books still checked out" in each month, or perhaps some other approach.
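
For the "still checked out" count, a minimal sketch in the long layout might look like this. The single person below is made up to include overlapping books, and n_open is a hypothetical variable name.

```stata
* Toy long data: one made-up person with two overlapping books
clear
input float pubid byte(month startdate enddate)
1 1 1 0
1 2 1 0
1 3 0 1
1 4 0 0
1 5 0 1
1 6 0 0
end

* Running count of starts minus running count of ends, within person
bysort pubid (month): gen n_open = sum(startdate) - sum(enddate)
list, sepby(pubid)
```

This count does not require linking any end to a particular start. A negative value would point to an end whose start fell before the observation window.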

Last edited by Clyde Schechter; 11 Apr 2021, 13:09.
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#11

11 Apr 2021, 13:19

Yes--I came up with that example because I need a more general solution for a few different types of data but I'm starting to realize that this will only work when I am looking for the distance between one specific event that only happens once (e.g., birth) and the first instance of something else (e.g., saying "mama"). For everything else, I need to figure out an entirely different solution, and perhaps with additional data. I'll post back if I can think of a better way of framing it and I still haven't figured things out.
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#12

11 Apr 2021, 15:18

Okay, I think I have a better example, one that is generalizable but also more specific.

Let's say you want to find out how long it generally takes to get married after cohabitation (or, more specifically, asking: in cases where people get married after cohabiting, how long does it take?). The reason why I'm framing it this way is because this way, you can repeat this for the distance between any number of different events; you could repeat this for start date of cohabitation and end date of cohabitation, or start date of cohabitation and first vacation together after cohabitation, or start date of cohabitation and first date of ordering a takeout dinner, etc.

The problem here is that there are plenty of times where someone has multiple cohabitation "starts" without a "marriage" because they break up and move out instead of getting married. So we need to find a way to filter out the cases where the second event never happens, because we'll tackle those separately.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(idvar month) str7 event
1 73 "cohab"
1 82 "married"
2 38 "cohab"
2 120 "cohab"
3 21 "cohab"
3 110 "married"
3 134 "cohab"
4 39 "cohab"
4 62 "cohab"
5 58 "cohab"
5 107 "married"
6 22 "cohab"
6 35 "married"
7 4 "cohab"
7 90 "cohab"
8 46 "cohab"
8 98 "married"
9 56 "cohab"
9 92 "married"
9 139 "cohab"
10 17 "cohab"
10 99 "married"
11 15 "cohab"
11 34 "married"
11 61 "cohab"
12 51 "cohab"
12 95 "married"
13 50 "cohab"
13 84 "married"
14 1 "cohab"
14 85 "cohab"
end
label values idvar vlR0000100

Thanks again for all your help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#13

11 Apr 2021, 15:44

In this context, I think it is pretty reasonable to assume that for any given marriage, the cohabitation that is linked to it is the latest one that precedes it or occurs in the same month. I suppose it is possible for A to cohabit with B, then cohabit with C, and then marry B--but this is going to be pretty rare. In the absence of direct information linking cohabitations to marriages, I think this assumption will not lead to many misclassifications.

Code:

// SEPARATE THE MARRIAGE AND COHABITATION EVENTS
preserve
keep if event == "cohab"
drop event
rename month cohab_month
tempfile cohabs
save `cohabs'
restore

drop if event == "cohab"
drop event
rename month marriage_month

// FOR EACH MARRIAGE SELECT THE LATEST PRECEDING COHABITATION
rangejoin cohab_month . marriage_month using `cohabs', by(idvar)
by idvar marriage_month (cohab_month), sort: keep if _n == _N
gen latency = marriage_month - cohab_month

// AND NOW BRING BACK THE COHABITATIONS THAT NEVER LED TO A MARRIAGE
merge 1:1 idvar cohab_month using `cohabs'

-rangejoin- is written by Robert Picard and is available from SSC. To run it you must also install -rangestat-, written by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
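
For reference, both packages can be installed from within Stata (this needs an internet connection):

```stata
* Install -rangestat- first, since -rangejoin- depends on it
ssc install rangestat
ssc install rangejoin
```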

The final -merge- is just done to restore all of the cohabitations that never led to a marriage: if you are doing an analysis of time to marriage, you will need to include observations where the answer is "never."
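
If installing -rangejoin- is not an option, the same "latest preceding cohabitation" rule can probably be approximated with built-in commands by carrying the last cohabitation month forward. This is a sketch of a different technique, not the code above: it uses the first few rows of the example data, and is_married and last_cohab are hypothetical names.

```stata
* First few rows of the example data from #12
clear
input float(idvar month) str7 event
1 73 "cohab"
1 82 "married"
2 38 "cohab"
2 120 "cohab"
3 21 "cohab"
3 110 "married"
3 134 "cohab"
end

* Within a month, sort a cohabitation ahead of a marriage
gen byte is_married = event == "married"
sort idvar month is_married

* Carry the most recent cohabitation month forward within person
by idvar: gen last_cohab = month if event == "cohab"
by idvar: replace last_cohab = last_cohab[_n-1] if missing(last_cohab)

* Latency is filled in only on marriage rows that follow a cohabitation
gen latency = month - last_cohab if event == "married"
```

Unlike the -rangejoin- version, this keeps one row per event, so cohabitations that never led to a marriage stay in the data with a missing latency and no final -merge- is needed.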
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#14

11 Apr 2021, 16:04

Thanks--this is exactly what I was looking for. I think I can use this as a template and adapt it for everything else.
Comment
