I'm just data cleaning my datasets. I have one 'pristine' dataset and another dataset 'outcomes' which I'm working on -- SEE DATASETS BELOW
I would like to join my outcomes dataset to my pristine dataset.
The problem is my outcomes dataset has several rows of information for each id and I would like to:
1. Append the information from outcomes dataset to the pristine dataset using their commonly shared unique ids + gender
2. The aim is to keep information for each unique Id in 1 row - so outcomes dataset needs some work
Therefore My thoughts were
STEP 1:
1. to first keep the rows in the outcome dataset with the most information, therefore calculating per the no of most missing (ok) and drop the ones with the most missing --> here I got stuck
Step 2:
Join the pristine dataset to the outcomes dataset and keeping the row which have the surgdate variable closest to 'opdate' and have the same id, id2 and gender (this confirms its the same person)
Step 2:
Can anyone help with regards to the next steps? where I then want to keep the row which has the surgdate closest to operationdate but of course must have the same id, id2 and gneder (although this was done in this step ***
Here are the datasets if perhaps I didn't make myself clear... APpreciate your help from all the experts who kindly give us their feedback. - thank you
***OUTCOMES DATASET****
**PRISTINE DATASET***
I would like to join my outcomes dataset to my pristine dataset.
The problem is my outcomes dataset has several rows of information for each id and I would like to:
1. Append the information from outcomes dataset to the pristine dataset using their commonly shared unique ids + gender
2. The aim is to keep information for each unique Id in 1 row - so outcomes dataset needs some work
Therefore My thoughts were
STEP 1:
1. to first keep the rows in the outcome dataset with the most information, therefore calculating per the no of most missing (ok) and drop the ones with the most missing --> here I got stuck
Code:
egen value9=anycount(scar), value(9) //note missing for this variable coded as 9
egen nmissing=rowmiss(pain) // missing for this variable coded as .
gen totalmissing=value+nmissing
////for each id, drop the ones with the most missing ***Here I got stuck, not sure what to use for interval - can anyone help
rangestat (max) totalmissing, int(intrval . -1) by(id)
Join the pristine dataset to the outcomes dataset and keeping the row which have the surgdate variable closest to 'opdate' and have the same id, id2 and gender (this confirms its the same person)
Step 2:
Code:
//Joining the datasets -- this joins the datasets but adds rows of data which I don't want //Load pristine dataset *** joinby id id2 gender using "/Users/rosematthews/Desktop/outcomesdata.dta" ///if they're not the same in both datasets = the information is dropped
Here are the datasets if perhaps I didn't make myself clear... APpreciate your help from all the experts who kindly give us their feedback. - thank you
***OUTCOMES DATASET****
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float id str2 id2 float(pain scar surgdate) byte value9 float(nmissing totalmissing genderout) 12 "1A" 1 1 22555 0 0 0 1 12 "1A" . 9 . 1 1 2 1 13 "2B" 1 9 11962 1 0 1 2 13 "2B" 1 9 11962 1 0 1 2 13 "2B" . 1 . 0 1 1 2 14 "7Z" 1 9 13549 1 0 1 2 end format %td surgdate
**PRISTINE DATASET***
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float id str2 id2 float(operationdate gender) 12 "1A" 22555 1 13 "2B" 11962 2 14 "7Z" 13549 2 end format %td operationdate
Comment