Different results with bysort command: longitudinal data

Adrian Shaba

Join Date: Dec 2018

Posts: 21
#1


02 Feb 2019, 08:18

Hello Statalisters,

I am trying to select a subset from a large clinical dataset based on a set of criteria. Just from the start of my dofile I
keep getting different results. My commands for one particular criterion are as below:

Note: lab_date : date when a glucose test was done
food_start_date : date when clients started a particular food combination

Code:

****** Checking for and dropping duplicates
sort id_client food food_start_date lab_date
quietly by id_client food food_start_date lab_date: gen dup = cond(_N==1,0,_n)
br if dup>0
drop if dup>1
bysort id_client lab_date: gen visit=_n
* based on selection criteria:
*** keeping only patients who started "FO" containing food between 01/01/14 to 31/12/14
bysort id_client visit: gen condition1 = 1 if strpos(food, "FO") > 0 ///
    & inrange(food_start_date,td(01jan2014),td(31dec2014))
* Capturing entire follow-up period (using lab_dates as reference) of each observation on
by id_client: mipolate condition1 visit , gen(condition2) forward
keep if condition2==1

The number of observations kept by this last step varies between 60610 and 60622 every time I run this part of the do-file.

Any reason why this is the case?

I know adding "stable" to the -bysort- command will only mask the problem.

Any suggestions?

Thanks so much for your assistance.

Regards

Adrian
Clyde Schechter

Join Date: Apr 2014

Posts: 30172
#2

02 Feb 2019, 11:32

Code:

sort id_client food food_start_date lab_date
quietly by id_client food food_start_date lab_date: gen dup = cond(_N==1,0,_n)
br if dup>0
drop if dup>1
bysort id_client lab_date: gen visit=_n

This code guarantees that, at this point, the combination of id_client, food, food_start_date, and lab_date uniquely identifies observations. However, when you then do -bysort id_client lab_date: gen visit = _n-, the possibility remains that there are multiple observations for a combination of id_client and lab_date (which could differ by food or food_start_date). Stata sorts those observations within each group in an arbitrary and irreproducible order. Consequently, the variable visit will be different from one occasion to the next when you run this code. From there, interpolation based on visit will also give irreproducible results.
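One way to check whether this applies to your data is sketched below. It assumes only the variable names from your code; -duplicates report- and -isid- are standard Stata commands.

Code:

* How many observations share the same id_client and lab_date?
duplicates report id_client lab_date
* isid exits with an error if id_client and lab_date do not uniquely identify observations
capture noisily isid id_client lab_date

If -duplicates report- shows any surplus observations, the order within those groups, and therefore visit, can change from run to run.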

I'm not entirely sure what you are trying to do here, but I'm going to speculate. I think that what you actually want is for all observations with the same id_client and lab_date to have the same value of visit. For that to happen you would have to do this:

Code:

by id_client lab_date, sort: gen visit = (_n == 1)
by id_client (lab_date): replace visit = sum(visit) // DO NOT RE-SORT

This code will make the variable visit consistent from one run of the code to the next.
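To illustrate, here is a small made-up example (the data values are hypothetical, and lab_date is entered as a plain number for brevity):

Code:

clear
input id_client lab_date str2 food
1 100 "FO"
1 100 "AB"
1 200 "FO"
2 150 "AB"
end
* mark the first observation of each client-date combination
by id_client lab_date, sort: gen visit = (_n == 1)
* running count of distinct dates within each client
by id_client (lab_date): replace visit = sum(visit)
list, sepby(id_client)

Both observations for client 1 on date 100 get visit 1, the observation on date 200 gets visit 2, and numbering restarts at 1 for client 2, regardless of how Stata orders the two tied observations.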

But that does not completely solve your problem (and I may be mistaken in my belief that this is what you want anyway), because you are trying to fill in variable condition2 from condition1 using interpolation. Since id_client and visit do not uniquely identify observations, the order in which values of condition1 appear within groups of id_client and visit is still indeterminate and will differ from one run of the code to the next. This means that the interpolation results will also be irreproducible.
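If the goal is simply to flag every observation from the first qualifying visit onward, one possible way to make that step reproducible is to collapse the flag to the visit level before carrying it forward. This is only a sketch under that assumption, not a drop-in equivalent of -mipolate-: any_fo is a hypothetical variable name, the two lines would take the place of the -mipolate- line (which already creates condition2), and visit is assumed to be constant within each id_client and lab_date.

Code:

* 1 if any observation at this client-visit has condition1 == 1, else 0
by id_client visit, sort: egen byte any_fo = max(condition1 == 1)
* running sum within client: positive from the first flagged visit onward
by id_client (visit): gen byte condition2 = sum(any_fo) > 0

Because any_fo is the same for every observation within a visit, the order of tied observations no longer affects condition2.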

Anyway, those are two places at which you are creating irreproducible results with your code. As I don't understand what you are trying to accomplish overall, I can't really advise you how to fix the problems (though my suggestion for the first one might be part of the solution.)
Adrian Shaba

Join Date: Dec 2018

Posts: 21
#3

02 Feb 2019, 16:07

Thanks so much Clyde for all your input. Your advice always gives me time to reflect on what I am doing.

Below I provide clarity on the commands and my goal:

Please note that my data is a bit complex in the sense that I have clients who are on different food items before 2014.

My first step, which is typical with longitudinal clinical data, is to clean the duplicate captured observations. I generate a dummy variable to identify duplicate observations and drop them:

Code:

sort id_client food food_start_date lab_date
quietly by id_client food food_start_date lab_date: gen dup = cond(_N==1,0,_n)
br if dup>0
drop if dup>1

I am mainly interested in those clients (panels) that started taking food combinations that included the food item "FO" in the year 2014 only. For this, I generate a dummy variable "condition1" which captures these observations as 1.

Code:

*** keeping only patients who started "FO" containing food between 01/01/14 to 31/12/14
bysort id_client visit: gen condition1 = 1 if strpos(food, "FO") > 0 ///
    & inrange(food_start_date,td(01jan2014),td(31dec2014))

I generated the variable "visit" to represent the visit number for each id_client, ranging from 1 to n, based on the chronological order of lab_date. I am using it as my time reference variable ("yvar") in the interpolation in the next step. The lab_date variable will later be used to calculate the time spent in the study for each client based on the individual lab results (which will determine failure or censoring). Furthermore, the generated variable "visit" is to be used in the forward interpolation of the variable "condition1" to generate the variable "condition2". Condition2 provides a guide for dropping any observations before 2014 within each panel, which won't be part of this longitudinal analysis (these are captured as missing in condition2). Again, my assumption is that all panels that fall outside this criterion (not commencing the food item "FO" in 2014) will also be dropped.

Code:

by id_client: mipolate condition1 visit , gen(condition2) forward
keep if condition2==1

I hope this provides more clarity on what I am trying to achieve.

Thanks so much for your continued support.
Clyde Schechter

Join Date: Apr 2014

Posts: 30172
#4

02 Feb 2019, 18:15

Code:

sort id_client food food_start_date lab_date
quietly by id_client food food_start_date lab_date: gen dup = cond(_N==1,0,_n)
br if dup>0
drop if dup>1

does not eliminate duplicate observations. It eliminates "extra" observations that agree on the variables id_client, food, food_start_date, and lab_date. Those observations might disagree on other variables, but the code will eliminate them anyway, retaining one that is selected arbitrarily and irreproducibly. This is, in general, not a good thing to do, as information is being discarded in an arbitrary and unpredictable way. If you are sure that observations that agree on those four variables always agree on everything else, or if, for your purposes, the values of the other variables will never matter later in the analysis, then what you have done is OK.
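A rough way to find out which situation you are in, run on the data before anything is dropped (a sketch; dup_keys and dup_all are hypothetical variable names):

Code:

* number of other observations sharing the same four key variables
duplicates tag id_client food food_start_date lab_date, gen(dup_keys)
* number of other observations identical on every variable
duplicates tag, gen(dup_all)
* observations that match on the keys but differ somewhere else
count if dup_keys > dup_all

A count of zero means that observations agreeing on the four variables also agree on everything else.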

If you want to eliminate truly duplicate observations (i.e. agree on all variables, not just those four), the simplest way to do that is:

Code:

duplicates drop

If eliminating truly duplicate observations is not sufficient, because you can have observations that agree on just those four variables but disagree on some others, and if those others are not relevant for your analysis, you can do this:

Code:

ds id_client food food_start_date lab_date, not
drop `r(varlist)'
duplicates drop

This will leave you with a single observation for every instantiated combination of id_client food food_start_date and lab_date, and will eliminate all other variables in the data set. (If you need those other variables, then stop here because there is something wrong with your data and your approach: you can't do anything reproducible if you need those other variables on the one hand, but need to select a single observation for each combination of id_client food food_start_date and lab_date on the other hand. In that situation you must figure out how to resolve the differences among those observations and select one to retain on a systematic, rather than a random basis. Otherwise, your analysis becomes indeterminate and irreproducible from this point on.)

Next, although I do not think it is causing you trouble in this instance, it is usually not a good idea to generate variables that encode a condition as 1 for the condition and missing elsewhere. 1/0 coding is usually better in Stata, and is what I use in the code shown below.
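For instance, the flag from #3 could be written with 1/0 coding like this (a sketch reusing the variable names from your code):

Code:

gen byte condition1 = strpos(food, "FO") > 0 ///
    & inrange(food_start_date, td(01jan2014), td(31dec2014))

Observations with a missing food_start_date get 0 here, because inrange() returns 0 when its first argument is missing.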

I don't actually follow the logic in the rest of your code and, at least in places, I don't think it does what you say. Let me seize on one thing you said to propose a way forward. "I am mainly interested in those clients (panels) that started taking food combinations that included the food item "FO" in the year 2014 only." It sounds like you want to keep all of the observations for such clients, not just the observations from 2014 with an included FO item, and eliminate all observations for any other id_client's. The simplest way to do that is:

Code:

by id_client, sort: egen keeper = max(strpos(food, "FO") & year(food_start_date) == 2014)
keep if keeper

And to generate a visit number based on the lab date, as suggested in #2, I would do:

Code:

by id_client lab_date, sort: gen visit = (_n == 1)
by id_client (lab_date): replace visit = sum(visit) // DO NOT RE-SORT

This will assign the same visit number to all observations of a given id_client with the same date.