joining two datasets but stata only picks the one with the closest date and the one with most information (least missing)

Rose Matthews

Join Date: Aug 2023

Posts: 154
#1

joining two datasets but stata only picks the one with the closest date and the one with most information (least missing)

14 Oct 2023, 15:45

I'm just data cleaning my datasets. I have one 'pristine' dataset and another dataset 'outcomes' which I'm working on -- SEE DATASETS BELOW
I would like to join my outcomes dataset to my pristine dataset.

The problem is my outcomes dataset has several rows of information for each id and I would like to:
1. Append the information from outcomes dataset to the pristine dataset using their commonly shared unique ids + gender
2. The aim is to keep information for each unique Id in 1 row - so outcomes dataset needs some work

Therefore My thoughts were

STEP 1:
1. to first keep the rows in the outcome dataset with the most information, therefore calculating per the no of most missing (ok) and drop the ones with the most missing --> here I got stuck

Code:

egen value9=anycount(scar), value(9) //note missing for this variable coded as 9 egen nmissing=rowmiss(pain) // missing for this variable coded as . gen totalmissing=value+nmissing ////for each id, drop the ones with the most missing ***Here I got stuck, not sure what to use for interval - can anyone help rangestat (max) totalmissing, int(intrval . -1) by(id)

Step 2:

Join the pristine dataset to the outcomes dataset and keeping the row which have the surgdate variable closest to 'opdate' and have the same id, id2 and gender (this confirms its the same person)

Step 2:

Code:

//Joining the datasets -- this joins the datasets but adds rows of data which I don't want //Load pristine dataset *** joinby id id2 gender using "/Users/rosematthews/Desktop/outcomesdata.dta" ///if they're not the same in both datasets = the information is dropped

Can anyone help with regards to the next steps? where I then want to keep the row which has the surgdate closest to operationdate but of course must have the same id, id2 and gneder (although this was done in this step ***

Here are the datasets if perhaps I didn't make myself clear... APpreciate your help from all the experts who kindly give us their feedback. - thank you

***OUTCOMES DATASET****

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float id str2 id2 float(pain scar surgdate) byte value9 float(nmissing totalmissing genderout) 12 "1A" 1 1 22555 0 0 0 1 12 "1A" . 9 . 1 1 2 1 13 "2B" 1 9 11962 1 0 1 2 13 "2B" 1 9 11962 1 0 1 2 13 "2B" . 1 . 0 1 1 2 14 "7Z" 1 9 13549 1 0 1 2 end format %td surgdate

**PRISTINE DATASET***

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float id str2 id2 float(operationdate gender) 12 "1A" 22555 1 13 "2B" 11962 2 14 "7Z" 13549 2 end format %td operationdate
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#2

14 Oct 2023, 17:13

This is a somewhat unusual way to proceed. You are applying two different, potentially orthogonal, criteria in the two steps. So, doing them in the order you propose, you could end up with one observation from the outcomes data set that has lots of information but is very far away temporally from the observation it matches to in the pristine data set. If you reverse the order of application, you could end up pairing with an observation that matches the operationdate and surgdate exactly but has no information about scar and pain at all. I suppose it all depends on how you plan to use and interpret this data, but just think through the implications of doing this before you proceed. My intuition is that for these kinds of variables, in nearly all circumstances I would be more concerned with a good temporal match on the data than with the amount of missing information. But it's your call.

Meanwhile, here's how I would do your two-step approach.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float id str2 id2 float(pain scar surgdate genderout) 12 "1A" 1 1 22555 1 12 "1A" . 9 . 1 13 "2B" 1 9 11962 2 13 "2B" 1 9 11962 2 13 "2B" . 1 . 2 14 "7Z" 1 9 13549 2 end format %td surgdate tempfile outcomes save `outcomes' * Example generated by -dataex-. For more info, type help dataex clear input float id str2 id2 float(operationdate gender) 12 "1A" 22555 1 13 "2B" 11962 2 14 "7Z" 13549 2 end format %td operationdate tempfile pristine save `pristine' // STEP 1 use `outcomes', clear mvdecode scar, mv(9) rename genderout gender egen mcount = rowmiss(pain scar) by id id2 gender (mcount), sort: keep if mcount == mcount[1] // RETAIN MOST INFORMATION drop mcount // STEP 2 merge m:1 id id2 gender using `pristine' gen delta = abs(surgdate-operationdate) by id id2 gender (delta), sort: keep if _n == 1 drop delta

Comments:
I think the reason you are confused about what -interval()- to use with -rangestat- is because this isn't a -rangestat- problem. You just want to count the number of missing values among the outcome variables and choose the least. Since there could be multiple observations which are tied for the least amount of missing values, I keep them all at this point.

Don't work with magic number missing values in Stata. It's a recipe for incorrect results. Getting rid of those should be one of the very first steps in data cleaning.

In the second step, it is, again, possible that there will be two or more observations that are tied for temporal proximity. This code breaks any such ties randomly and irreproducibly. It would be better to have a rule for breaking the ties. First, consider that ties can result two ways: there can be two observations with the same surgdate, or their can be two observations with different surgdates, one of which is before and the other after operationdate, but they are the same number of days away from operationdate. In the latter case, it would be best to make a rule that systematically picks the one that comes before, or systematically picks the one that comes after. Your call. In the first situation where both observations have the same surgdate, you might want to base that on the values of some other variables in your full data set provided doing it this way isn't going to bias the analysis.
Comment
Rose Matthews

Join Date: Aug 2023

Posts: 154
#3

15 Oct 2023, 00:36

Dear Clyde,

Thanks for this. I had a think about what you said regarding dropping the data with the most missing. It's true. This would be incorrect for me to do.

Problem: Trying to find out which outcomes from the outcome dataset are at 6 months, 1 year, 5 years and 10 years.
And for the preop op period - Keep the one with the least days

As you can see from my dataset 'outcomes' - something I failed to add in my previous post( see below for dataset) is I have 4 date variables and not all are necessarily correct. Some have a preop date that is coded as after the surgery- which would be incorrect. Here are my thoughts, appreciate if you could help where I marked as a problematic issue

- The operation date from 'Pristine' dataset is the correct surgery date from which I need to calculate the outcomes of 6m, 1 y, 5y, 10y.

Outcome variables:
-Pain & Scar are part of a score which I need to use MI to calculate the total score but I can only do this when I have finaised my dataset due to the requirement of my outcomes in OLS (as seen in my another post)
- OKS - total score - this outcome score is all complete (mean imputation complete)

//In your pristine dataset just keep ID ID2 and Gender and operation date (Operation date in prisitne is most correct)

//using outcomes dataset

Code:

merge m:1 id id2 gender using "/Users/rosematthews/Desktop/pristinedata.dta"

//Problem: Try Find which Proms are at 6 months, 1 year, 5 years, 10 years - coded as 1
//Rule 1 - preopdates must be < surgerydate, if both are < surgerydate keep the one with the least days from surgery date without losing information

//Step 1

//Finding the correct preop_outcomedate - this is calculated in days

Code:

gen preop_correct1 = abs(operationdate-preop_outcome) gen preop_correct2 = abs(operationdate-preop_received1) //Finding the correct postop_outcomedate - days gen postop_correct1 = abs(operationdate-postop_outcome) gen postop_correct2 = abs(operationdate-postop_received2) gen proms_preop = 1 if preop_outcome or preop_received1 < operationdate

//Problem: Keep the one with the least preop days //need help here - keep the one with the least days from surgery date without losing information
**becareful may lose information as different rows may have different information at different stages: 6 months/1 year / 5 years / 10 years **

//Step 2 - Finding the 6 months proms - accepting if the date is within 6 months - 8 months from surgerydate

Code:

gen proms_6m= 1 if postop_correct1 >= 180 & postop_correct <=240 gen proms_6m= 1 if postop_correct2 >= 180 & postop_correct <=240

// Repeat for 1 year /5 years / 10 years

//Now you have different lines for each id if the outcomes are at 6 months/ 1 years / 5 years / 10 years - can save different datasets for 6 months/1 year/5 y/ 10 years to have one line for each id

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float id str2 id2 float(pain scar surgdate) byte value9 float(nmissing totalmissing gender) str24 var14 float(preop_outcome postop_outcome preop_received1 preop_received2) 12 "1A" 1 1 22555 0 0 0 1 "all correct" 22554 22565 22554 22565 12 "1A" . 9 . 1 1 2 1 "received correct preop" 22556 22566 22563 22565 13 "2B" 1 9 11962 1 0 1 2 "received post op correct" 11963 11901 11932 11973 13 "2B" 1 9 11962 1 0 1 2 "received all correct" 11932 11934 . . 13 "2B" . 1 . 0 1 1 2 "" 11963 11901 11933 11973 14 "7Z" 1 9 13549 1 0 1 2 "" 13550 13549 13548 13551 end format %td surgdate format %td preop_outcome format %td postop_outcome format %td preop_received1 format %td preop_received2

**Pristine dataset remains the same as post1
Comment

Rose Matthews

Join Date: Aug 2023
Posts: 154

15 Oct 2023, 05:34

hello just to say I've sorted the problem out - for anyone else here is the code, for, trying to find the PROMS at 6months etc (see my post 5)

However I still have this problem:

I would like to keep the row of information for each ID which has the

preop_date or recpreop_date CLOSEST to the surgery date

These are my steps up to now - pls note a cleaner dataset and smaller

Code:

//Select which is the correct pre_opdate (from the received vs preopdate) by finding the one closest to the operationdate


gen preop_correct1 = abs(operationdate-preop_date)

gen preop_correct2 = abs(operationdate-recpreop_date)



//Clyde - suggested this code - for when I wanted just one variable of dates. But this time I’m trying to decide between two (preop_correct1 and preop_correct2) 

//Clyde's code 
gen delta = abs(surgdate-operationdate)

by id id2 gender (delta), sort: keep if _n == 1

drop delta

I don’t know how to modify the code for 2 variables?

Dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex

clear

input float id str2 id2 float(pain scar surgdate oks gender preop_date recpreop_date postop_date recpostop_date operationdate) byte(_merge duplicatescores)

12 "1A" 1 1 22555 40 1 22533 22289 22770 22771 22555 3 0

13 "2B" 1 9 11962  1 2 11933 11933 11963 11782 11962 3 0

13 "2B" 1 9 11962 48 2 11932 11951 11963 11782 11962 3 1

13 "2B" 1 9 11962 48 2 11932     .     .     .     . . 1

end

format %td surgdate

format %td preop_date

format %td recpreop_date

format %td postop_date

format %td recpostop_date

format %td operationdate

label values _merge _merge

label def _merge 3 "Matched (3)", modify

Comment

Rose Matthews

Join Date: Aug 2023
Posts: 154

15 Oct 2023, 05:35

//Select which are the 6/12 proms

//Find out which Proms have been recorded at 6 months or 1 year - days

Code:

gen postop_6m = abs(operationdate-postop_date)
gen postop_6mrec = abs(operationdate-recpostop_date)

gen postop_1y = abs(operationdate-postop_date)
gen postop_1yrec = abs(operationdate-recpostop_date)

//6 months = 180 days - 8 months = 6months 240 days
gen final6m= 1 if postop_6m >= 180 & postop_6m  <= 240
gen final6mrec= 1 if postop_6mrec >= 180 & postop_6mrec  <= 240


//1 year + 2 months limit 
gen final1y= 1 if postop_1y >= 365 & postop_1y  <= 425
gen final_1yrec= 1 if postop_1yrec >= 365 & postop_1yrec  <= 425


//Save 6 months dataset
keep if final6m == 1 | final6mrec ==1 


//WORK FORM there

Comment

Rose Matthews

Join Date: Aug 2023

Posts: 154
#6

15 Oct 2023, 09:14

Code:

by id id2 gender (mcount), sort: keep if mcount == mcount[1] // RETAIN MOST INFORMATION

Is there a way how to work with -min- command? What if I have a variable with ID 16 that has mcount number as 2 (rather than 1). This wouldn't be captured with the above command. As I want to find for each ID keep the the ID with the LEAST missing, this could be any number from 1-3. By any chance, do you have any other recommendations, if any of course?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#7

15 Oct 2023, 12:27

Is there a way how to work with -min- command? What if I have a variable with ID 16 that has mcount number as 2 (rather than 1). This wouldn't be captured with the above command. As I want to find for each ID keep the the ID with the LEAST missing, this could be any number from 1-3. By any chance, do you have any other recommendations, if any of course?

I think you don't understand what that code does. Which also means that you don't understand how -by- works with subscripts, so you are lacking a fundamental Stata tool that you need for working with data sets like the ones you have. So let's review that.

In the absence of -by-, a subscript refers to which observation of the data set is to be used. So, in the absence of -by-, mcount[1] refers to the value of variable mcount in the first observation. It has nothing to do with mcount == 1. mcount[5] refers to the value of mcount in the 5th observation. And _N has a special meaning: mcount[_N] refers to the value of mcount in the last observation. The value of the subscript (the number inside the square brackets) designates the location of the data set, not its value.

In a command with -by- it is a bit more complicated. If we have -by varlist1 (varlist2): ... mcount[1]...-, first recall that the data must first be sorted on varlist1 and varlist2. -by- then executes the command that follows the colon separately for each group of observations defined by common values of varlist1. Now, in the presence of -by-, the values of subscripts are interpreted relative to the group of observations. So when we have -by id id2 gender (mcount): ...mcount[1]...- the data are first sorted in order of id, id2, gender, and mcount. The command following the colon is executed separately for each group of observations defined by the same combination of values of id id2 and gender (but not mcount). Now, within each such group of observations, the data are already sorted on mcount, meaning that the first observation in the group has the smallest value of mcount, the second one has the next smallest value (or might be tied with the first), etc. So in this context mcount[1] means the first observation in the group, which, because the data were presorted on mcount within id id2 and gender, means the observation with the smallest value of mcount. In other words, this command does exactly what you asked for. To be a little more specific, this particular command is -keep if mcount == mcount[1]-. So this means keep all observations whose value of mcount is equal to the smallest value of mcount in the group defined by id id2 and gender. In other words, if there are observations that are all tied for smallest value of mcount in the group, all of them are kept, and everything else is dropped.
Comment

Announcement