Identifying two consecutive values according to time from start date

Elke Wynberg Gompertz

04 Nov 2021, 13:58

Hi Statalist,

My aim is to prepare my data for survival analysis, by defining my 'time to event' variable. In this question I hope you can help me code when the event takes place and how I can determine the time variable.

General overview of the data: I have a prospective cohort study of patients who report a start date of symptoms, after which they fill in monthly questionnaires on these symptoms. I have created monthly variables of the total number of symptoms reported.

The event I want to identify is: two consecutive surveys where 0 symptoms are reported (i.e. recovery).
The time variable should be: the midpoint between the start of symptoms date and the date the first survey was filled in at which no symptoms were reported.

Here is an example of the dataset up to 6 months of follow-up, where "pid" = patient ID, "EZD" = illness onset, "symp_totalnumber`x'" = total no. of symptoms at monthly survey, "surveydate`x'" = date at which survey is filled in.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 pid float(ezd symp_totalnumber2 surveydate2 symp_totalnumber3 surveydate3 symp_totalnumber4 surveydate4 symp_totalnumber5 surveydate5 symp_totalnumber6 surveydate6)
"AMCVIS0001" 22004 .     . .     . .     . .     . .     .
"AMCVIS0002" 22012 .     . .     . .     . .     . .     .
"AMCVIS0003" 22042 .     . .     . .     . .     . .     .
"AMCVIS0004" 22011 .     . 1 22137 3 22181 0 22181 .     .
"AMCVIS0005" 21948 .     . .     . 7 22137 6 22185 .     .
"AMCVIS0006" 22016 .     . 7 22112 8 22151 5 22187 8 22224
"AMCVIS0007" 22002 .     . .     . 0 22141 0 22173 .     .
"GGDVIS0002" 22044 3 22111 1 22165 0 22165 1 22201 3 22230
"GGDVIS0003" 22037 0 22117 1 22150 2 22178 2 22207 0 22236
"GGDVIS0004" 22044 .     . 0 22132 0 22160 0 22188 0 22216
"GGDVIS0005" 22048 .     . 0 22132 0 22160 0 22188 0 22216
"GGDVIS0006" 22047 .     . .     . .     . .     . .     .
"GGDVIS0007" 22042 .     . 8 22184 .     . .     . 2 22248
"VIS0019"    21970 .     . .     . .     . .     . 4 22158
"VIS0108"    22135 0 22193 1 22231 1 22262 0 22293 2 22319
"VIS0116"    22164 5 22242 3 22259 5 22286 2 22314 5 22347
"VIS0159"    22303 0 22376 1 22401 0 22447 0 22467 0 22498
"VIS0196"    22060 .     . .     . .     . .     . .     .
"VIS0216"    22084 4 22145 5 22181 4 22205 4 22235 3 22305
"VIS0218"    22038 .     . 5 22126 5 22153 6 22181 4 22208
"VIS0338"    22223 1 22291 2 22319 1 22346 .     . 4 22409
"VIS0385"    22190 .     . 2 22332 .     . .     . 0 22418
"VIS0523"    22211 4 22268 1 22292 1 22321 1 22347 0 22374
"VIS0540"    22072 1 22140 1 22172 3 22201 3 22233 2 22267
"VIS0553"    21995 .     . 0 22112 .     . .     . .     .
end
format %-tdDD_Mon_YY ezd
format %-tdDD_Mon_YY surveydate2
format %-tdDD_Mon_YY surveydate3
format %-tdDD_Mon_YY surveydate4
format %-tdDD_Mon_YY surveydate5
format %-tdDD_Mon_YY surveydate6

I have puzzled over creating these variables in complicated manual ways, but I'm hoping there is a more elegant solution.

Look forward to hearing your suggestions.

Many thanks,

Elke

Clyde Schechter

#2

04 Nov 2021, 15:41

There are a few unclear areas in your explanation of the problem. I will first outline those which I have perceived, and what I have assumed about them, and then show code predicated on my assumptions.

the date the first survey was filled in at which no symptoms were reported

On its face this has a meaning that contradicts the context in which it occurs. I'm assuming you really mean the date of the first of two consecutive surveys at which no symptoms were reported.

the midpoint...

The midpoint between Monday and Wednesday is Tuesday. But what is the midpoint between Monday and Thursday? The point is that this calculation does not always fall on one particular date; it can give a date + 1/2 day. Do you want to round to the earlier or the later possibility? I will assume the earlier.
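
To make the half-day issue concrete, here is a small sketch using the onset date (22002) and the first zero-symptom survey date (22141) of pid AMCVIS0007 from the example data. Stata stores daily dates as integer day counts, so the two rounding choices differ by exactly one day.

```stata
* Values taken from pid AMCVIS0007 in the -dataex- example
display 0.5*(22002 + 22141)          // 22071.5: not a whole day
display floor(0.5*(22002 + 22141))   // 22071: round to the earlier date
display ceil(0.5*(22002 + 22141))    // 22072: round to the later date
```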

the start of symptoms date

Is that the same as the variable ezd in your data? I assume yes.

So, as with so many things in Stata, this problem appears hard because your data are in wide layout. It is quite simple once you go to long layout.

Code:

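* Go from one row per patient to one row per patient-survey
reshape long symp_totalnumber surveydate, i(pid) j(survey_num)
* Running count of surveys that start a pair of consecutive zero-symptom surveys
by pid (surveydate), sort: gen n_remissions = sum(symp_totalnumber == 0 ///
    & symp_totalnumber[_n+1] == 0)
* Earliest survey date at which that count reaches 1
by pid: egen first_remission_date = min(cond(n_remissions == 1, surveydate, .))
* Midpoint between onset and first remission, rounded down to a whole day
gen wanted = floor(0.5*(first_remission_date + ezd))
format wanted %-tdDD_Mon_YY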
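
One thing that may be worth checking: because the code sorts on surveydate, unanswered surveys (missing dates) sort to the end, and surveys sharing a date have no fixed order between them. If "consecutive" is meant as adjacent survey numbers with both surveys answered, a variant along the following lines might be closer to the intent. This is only a sketch under that assumption; pair_start, first_remission_alt and wanted_alt are hypothetical names that do not appear in the original post.

```stata
* Assumes the data are already in long layout (after the -reshape long-),
* so every pid has one row per survey_num, including unanswered surveys.
* pair_start, first_remission_alt and wanted_alt are hypothetical names.
by pid (survey_num), sort: gen byte pair_start = symp_totalnumber == 0 ///
    & symp_totalnumber[_n+1] == 0
by pid: egen first_remission_alt = min(cond(pair_start, surveydate, .))
gen wanted_alt = floor(0.5*(first_remission_alt + ezd))
format first_remission_alt wanted_alt %-tdDD_Mon_YY
* Rows listed here are patients where the two definitions disagree
list pid first_remission_date first_remission_alt ///
    if first_remission_date != first_remission_alt
```

If the final -list- shows nothing, the two definitions agree for every patient and the choice does not matter for this dataset.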

If you have a compelling reason to return to wide layout, you can do that at this point if you drop the n_remissions variable and then -reshape wide-. But don't do so unless you have a compelling reason. It is highly likely that everything else you want to do with this data will be easier, or even only possible, in long layout. There are only a handful of things that are best done with wide data in Stata.
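
For completeness, a minimal sketch of that return trip, assuming the long-layout code has been run and that n_remissions is the only remaining variable that varies across surveys within a patient:

```stata
* Only if wide layout is really needed.
* -reshape wide- requires every variable not being reshaped to be constant
* within pid, so drop n_remissions (and any other survey-level helper
* variables you may have created) first.
drop n_remissions
reshape wide symp_totalnumber surveydate, i(pid) j(survey_num)
```

The patient-level variables first_remission_date and wanted are constant within pid, so they carry over to the wide data unchanged.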

Thank you for using -dataex- on your very first post.

Added: You should never contemplate doing any data analysis manually, unless you are just playing with the data for fun. For any purpose where others will be asked to take your results seriously, you must have a complete record of everything done to the data. Manual calculations do not leave an audit trail. And, worse, they are error-prone. Not to mention tedious. The Stata programming language is Turing complete: anything you could, in principle, do by hand, can be done in Stata. It might not be easy, but it is always possible, and it is always best practice to have all your calculations done by a computer following clearly documented and preserved code.

Last edited by Clyde Schechter; 04 Nov 2021, 15:48.