By ID_number, keep registry-based (baseline=0) test dates obtained 1 year after original (baseline=1) study’s test date

Kevin Marks

Join Date: Jun 2021

Posts: 24
#1


24 Oct 2021, 14:41

Really need help! With the expert guidance of a statistician, I combined a baseline data set with three registry-based data sets (using the append command). Now, my goal is to keep only the registry-based HbA1c test dates (spanning from 01jan2008 to 31dec2020) that were performed over 1 year (364.25 days) after each participant's original HbA1c value, which was collected from each original study participant on a date between 01oct2008 and 01apr2010.

Before merging, the original study’s data set was labeled as baseline = 1 and had the following variables as columns in long format: ID_number, the HbA1c test date (status_date), the HbA1c value (HbA1c_mmolmol), the year the HbA1c test was done (y_status_date), the absolute # of days between each participant's birthday and the date of their HbA1c test (new_diff_days), baseline (labeled as 1 for the original study) and some string variables. Before merging / appending the data sets, the registry-based data was labeled as baseline = 0, and it has all of the same variable names and labels as the original study, without any string variables. My question = What code should I use to keep only the registry-based (baseline == 0) HbA1c values and test dates that were done 1 year (364.25 days) after each participant's baseline study's HbA1c test date?

[CODE]
* Example generated by -dataex-. For more info, type help dataex
clear
input double(ID_number status_date HbA1c_mmolmol) float(y_status_date new_diff_days baseline)
000000000 17875 61 2008 343 0
111111111 17979 61 2009 81 1
222222222 18071 66 2009 173 0
333333333 18281 64. 2017 18 0
444444444 18788 62 2019 160 0
555555555 19025 59 2009 32 1
end
[/CODE]

The above dataex example is a dummy data set. It is not my real data.

Last edited by Kevin Marks; 24 Oct 2021, 15:26.
Tags: data management, panel data, Time Series
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

24 Oct 2021, 14:53

What code should I use to only keep the registry-based (baseline == 0) HbA1c values and test dates that were done 1 year (364.25 days) after each participant's baseline study's HbA1c test date?

Not sure what you mean by this. There will hardly be any participants for whom there was a registry HbA1c done exactly one year from the baseline--so I really doubt this is what you want. So maybe you mean done at least one year after the baseline date? Or at most one year after the baseline? Or maybe there is some window of acceptability around one year, like within 90 days of the one-year anniversary or something like that? Please clarify what you actually want here.
Kevin Marks

Join Date: Jun 2021

Posts: 24
#3

24 Oct 2021, 15:22

Sorry that I was not more clear. I need to only keep the registry-based (baseline=0) HbA1c test dates done at least one year after the original (baseline=1) study's date. E.g., If a participant in the original study (which occurred from 01oct2008 to 01apr2010) had their HbA1c test done on 05jul2009, then I need to keep all of the follow-up / registry-based HbA1c values that were collected from 05jul2010 to 31dec2020. If another participant in the original study had his/her HbA1c test done on 25dec2009, then, for that individual participant (ID_number), I would need to keep all registry-based HbA1c values from 25dec2010 to 31dec2020. The registry-based HbA1c test dates range from 2008 to 2020. Does that make things more clear?

Last edited by Kevin Marks; 24 Oct 2021, 15:27.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

24 Oct 2021, 15:40

OK. First, I had to modify your example data because in what you showed, each ID had only a single observation, so there was nothing to work with. The code below assumes that, as in the modified example data, every ID has one and only one observation that is marked as baseline. (Your question is ill-posed if that is not the case, since it would not be possible to define what "baseline" date to compare to.) Also, I have never seen a year defined as 364.25 days before; 365.25 is common, as is just 365. Here, however, I have used Stata's -datediff_frac()- function, which also takes into account calendar idiosyncrasies and considers October 24, 2021 to be 1 year after October 24, 2020 (where no leap day intervenes) and also treats October 24, 2020 as 1 year after October 24, 2019 (where there is an intervening leap day).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(ID_number status_date HbA1c_mmolmol) float(y_status_date new_diff_days baseline)
111111111 17875 61 2008 343 1
111111111 17979 61 2009 81 0
222222222 18071 66 2009 173 1
222222222 18281 64. 2017 18 0
222222222 18788 62 2019 160 0
222222222 19025 59 2009 32 0
end

by ID_number (baseline), sort: assert baseline == (_n == _N)
by ID_number (baseline): keep if _n == _N | ///
    datediff_frac(status_date[_N], status_date, "y") >= 1

Note that if the assumption that each ID has one and only one baseline observation is not met in your real data, this code will break at the -assert- command. In that case you either have to find the errors in your data and fix them, or consider some different approach that can be meaningfully applied to the more irregular data you have.
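As a quick check of the calendar behavior described above, here is a minimal sketch that displays -datediff_frac()- for the two pairs of dates mentioned, plus one pair that falls a day short of the anniversary. It assumes a Stata version that provides -datediff_frac()-; the specific dates are only illustrations.

```stata
* Sketch only: requires a Stata version that provides datediff_frac()

* No leap day between the two dates
display datediff_frac(mdy(10, 24, 2020), mdy(10, 24, 2021), "y")

* A leap day (29feb2020) falls between the two dates
display datediff_frac(mdy(10, 24, 2019), mdy(10, 24, 2020), "y")

* One day short of the anniversary: the result is below 1,
* so an observation like this would fail the >= 1 test
display datediff_frac(mdy(10, 24, 2020), mdy(10, 23, 2021), "y")
```

Per the description above, the first two should both come out as exactly 1 year, which is why the -keep- condition uses >= 1 rather than a fixed number of days.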
Kevin Marks

Join Date: Jun 2021

Posts: 24
#5

25 Oct 2021, 02:44

Dear Clyde, I am so incredibly thankful for your help. Here is what just happened...

by ID_number (baseline), sort: assert baseline == (_n == _N)
*assertion is false
*r(9);

sort baseline

by ID_number (baseline): keep if _n == _N | datediff_frac(status_date[_N], status_date, "y") >= 1
*(5,925 observations deleted)
*Then, I methodically reviewed the data before and after that last command. The problem appears to have been solved!
**It keeps registry-based HbA1c "status_date" if done 1 year after original study's "status_date"

Of course, I still need to review the final version of my merged data set with a statistician. However, I want to let you know that you absolutely made my day! From my perspective, you are the unsung hero of the internet for PhD students who use Stata!
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

25 Oct 2021, 16:51

No, no, no. The problem is not solved; it has been swept under the rug. You just have wrong results and are unaware of, or ignoring, the problem.

That -assert- command was put there for an important reason. In order to know whether or not a registry A1c was dated a year or more after "the baseline date," there has to be a clearly identified baseline date. The error response to that assert command tells us that there are some people in your data set who either have no baseline observations, or have two (or more) baseline observations. In either of those cases it is meaningless to speak of a year after "the baseline date" because no such date exists. For those people you have incorrect results.

Never ignore error messages. I know you are eager to get on with your work and get to exciting results. But don't be in a hurry to get the wrong answers. You need to investigate the cases that do not have a baseline observation or have more than one, and then deal with that. It is easy enough to find them. Go back to the data before you ran the code in #5. Then run:

Code:

by ID_number (status_date), sort: egen int bcount = total(baseline == 1)
browse if bcount != 1

Then you have to figure out why these situations arose, and fix this problem. The results from the later code are not valid until the condition of one and only one baseline per participant is met.
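To gauge how widespread the problem is before browsing individual cases, it may help to count how many IDs have zero, one, or several baseline observations. The following is a minimal sketch under the same variable names; one_per_id is a hypothetical helper variable, and bcount is recreated here so the snippet can run on its own.

```stata
* Sketch: tabulate the number of baseline observations per ID
capture drop bcount                      // in case bcount already exists
by ID_number, sort: egen int bcount = total(baseline == 1)

* one_per_id (hypothetical name) flags exactly one observation per ID,
* so each participant is counted once in the table
egen byte one_per_id = tag(ID_number)
tabulate bcount if one_per_id
```

A row for bcount == 0 would point to IDs that appear only in the registry files, while rows for 2 or more would point to duplicated baseline records.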

Last edited by Clyde Schechter; 25 Oct 2021, 17:03.
Kevin Marks

Join Date: Jun 2021

Posts: 24
#7

26 Oct 2021, 07:59

Thank you Clyde for your advice. I have used the code you provided to study or diagnose the problem. Will discuss the "fix" (if needed) at a Zoom meeting with a statistician in a few days.
Kevin Marks

Join Date: Jun 2021

Posts: 24
#8

29 Oct 2021, 04:27

Diagnosis = First, I briefly reviewed the dofile and data with a statistician. The topic was my unresolved r(9) error. Why doesn't every registry-based (baseline==0) status_date have a baseline status_date? Together we reviewed what happened with 6 participants before and after the coding that you had recommended. We noticed that the code worked perfectly for half (3 out of 6) of the cases. We noted that there were 734 contradictions among the 20,824 observations in my entire data set. I.e., there were 734 ID_numbers from the registry-based data files that were not included in the original (baseline==1) study group that we are working with for this particular study.

Treatment = I need to write code that basically does this...

By ID_number: only drop registry-based status_date observations (where baseline==0) when there is no corresponding 2009 original study or baseline status_date (baseline==1) with the same ID_number

Any suggestions?

Last edited by Kevin Marks; 29 Oct 2021, 05:24.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#9

29 Oct 2021, 08:59

Can you clarify what you mean by "no corresponding 2009 original study?" Does 2009 refer to the year of the status_date variable, or to the year in the variable y_status_date? (They are frequently different in the example data.)
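If it turns out that the year does not matter, and the goal is simply to drop registry observations for any ID_number that has no baseline observation at all, a minimal sketch might look like the following. This is an assumption about what is wanted, not a confirmed answer; nbase is a hypothetical variable name, and the -assert- will still stop the code if any ID has more than one baseline observation.

```stata
* Sketch, assuming only the existence of a baseline observation matters
* nbase (hypothetical name) = number of baseline observations per ID
by ID_number, sort: egen int nbase = total(baseline == 1)

* Drop IDs that appear only in the registry files
drop if nbase == 0

* Stops with r(9) if any remaining ID has 2 or more baseline observations
assert nbase == 1

* Keep the baseline observation and registry tests at least 1 year later
* (the /// continuation requires running this from a do-file)
by ID_number (baseline), sort: keep if _n == _N | ///
    datediff_frac(status_date[_N], status_date, "y") >= 1
```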