Two observations per ID, want to keep the one closest to a certain number

Vilma Antonov

Join Date: Aug 2022

Posts: 47
#1

20 Oct 2022, 14:13

Hi!

I have a large dataset with follow-up observations for patients following a surgical event. The patients are followed up annually; in reality, however, this means annually-ish. We have therefore decided that patients who were seen from day 548 up to (but not including) day 914 are counted as annual follow-up year two (centered on 365.25 x 2 = 730.5 days), using this code:

Code:

. generate daysfromintervention = datediff(INTERVENTION_DATE, FOLLOWUP_DATE, "day")
. generate followuptime = .
. replace followuptime = 1 if (daysfromintervention>=274 & daysfromintervention<548)
. replace followuptime = 2 if (daysfromintervention>=548 & daysfromintervention<914)
. replace followuptime = 3 if (daysfromintervention>=914 & daysfromintervention<1279)
* and then individually, creating different datasets for each yearly followup
. keep if followuptime==1
. duplicates tag TRR_ID_CODE, generate(tags)
. save as xxxx

All great in theory. In practice, however, there are patients who have had two follow-ups within the same time period. We would like to keep the one closest to the true annual number of days: if a patient had two follow-ups during the time period for year 1, e.g. on day 320 and day 547, we keep the one on day 320.

Is there any way you could write a script that does this? I'm unfortunately not skilled enough but I have realized that it's not manageable to do manually for 200,000 observations...
Eternally grateful!
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

20 Oct 2022, 14:25

Code:

generate daysfromintervention = datediff(INTERVENTION_DATE, FOLLOWUP_DATE, "day")
generate followuptime = .
replace followuptime = 1 if (daysfromintervention>=274 & daysfromintervention<548)
replace followuptime = 2 if (daysfromintervention>=548 & daysfromintervention<914)
replace followuptime = 3 if (daysfromintervention>=914 & daysfromintervention<1279)
gen delta = abs(daysfromintervention - followuptime*365.25)
by patient_id (delta), sort: keep if _n == 1

Replace patient_id in the last command by the actual name of the variable that identifies individual patients.
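Note that the last command keeps a single observation per patient, so it is meant to be run after the data have been restricted to one follow-up year (for example, after -keep if followuptime==1-). If all follow-up years are still in the same dataset, a variant along the following lines might be used in place of the last two commands. This is only a sketch under stated assumptions: it drops visits that fall outside every window, uses the placeholder name patient_id, and breaks exact ties in favor of the earlier visit, which is an arbitrary choice. In the year 1 example from the question, day 320 gives delta = 45.25 and day 547 gives delta = 181.75, so the day 320 visit is the one kept.

```stata
* sketch: keep the closest visit per patient within each follow-up year
* (patient_id is a placeholder for the actual patient identifier)
drop if missing(followuptime)          // visits outside every window
gen delta = abs(daysfromintervention - followuptime*365.25)

* sort by distance to the target day, then by day to break ties (assumption)
by patient_id followuptime (delta daysfromintervention), sort: keep if _n == 1

* confirm there is now exactly one observation per patient and year
isid patient_id followuptime
```

The -isid- line stops with an error if any patient still has more than one observation in a follow-up year, which gives a quick check that the selection did what was intended.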

In the future, when asking for help with code, always show example data, and please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
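For this question, something like the following would probably have been enough. The variable names are taken from the code in the first post; the choice of the first 20 observations is only an illustration, and a selection that includes a few patients with two visits in the same window would be more useful.

```stata
* post the output of this command between code delimiters
dataex TRR_ID_CODE INTERVENTION_DATE FOLLOWUP_DATE in 1/20
```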

"I have realized that it's not manageable to do manually for 200,000 observations..."

Well, you are fortunate that this is the case. Manually changing data is an unacceptable data practice, even if it is only a single keystroke change to one observation. Unless you are just playing around for fun, your data management and analysis should have a complete audit trail--manually changing the data fails to meet this important requirement for data integrity. All your data management should be done in do-files that use the -log- command to maintain a complete record of what was done. Not only is this important for those who will use and rely on your results to have confidence in what you have done, if you need to return to this project some months from now, you may well have forgotten details of what was done, and the log files will be the only reliable, permanent record you can turn to to refresh your own memory on what happened. Never make manual changes to data!
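A minimal skeleton for such a do-file might look like the sketch below. All file names are hypothetical, and -version 17- is an assumption based on the use of datediff(); adjust both to your own setup.

```stata
* followup_select.do  (hypothetical file name)
version 17                                 // assumption: Stata 17 or later
capture log close                          // close any log left open earlier
log using followup_select, replace text    // hypothetical log file name

use original_data, clear                   // hypothetical input dataset

* ... all data management commands go here, none done by hand ...

save followup_selected, replace            // hypothetical output dataset
log close
```

Because the original dataset is never overwritten, re-running the do-file from the top reproduces the result and the log records exactly what was done.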
Vilma Antonov

Join Date: Aug 2022

Posts: 47
#3

20 Oct 2022, 14:40

Hi Clyde,

Thank you for your quick answer! I tried it and it does seem to work. Thank you so much!

As for the other part, it's a great point. I made that comment jokingly to point out its absurdity and would never do that, hence reaching out for help. But it is always important to consider reproducibility when working with your data.

I sincerely appreciate your help!