Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • joining two datasets but stata only picks the one with the closest date and the one with most information (least missing)

    I'm just data cleaning my datasets. I have one 'pristine' dataset and another dataset 'outcomes' which I'm working on -- SEE DATASETS BELOW
    I would like to join my outcomes dataset to my pristine dataset.

    The problem is my outcomes dataset has several rows of information for each id and I would like to:
    1. Append the information from outcomes dataset to the pristine dataset using their commonly shared unique ids + gender
    2. The aim is to keep information for each unique Id in 1 row - so outcomes dataset needs some work

    Therefore My thoughts were

    STEP 1:
    1. to first keep the rows in the outcome dataset with the most information, therefore calculating per the no of most missing (ok) and drop the ones with the most missing --> here I got stuck

    Code:
    egen value9=anycount(scar), value(9) //note missing for this variable coded as 9
    egen nmissing=rowmiss(pain) // missing for this variable coded as .
    gen totalmissing=value+nmissing
    
    
    ////for each id, drop the ones with the most missing ***Here I got stuck, not sure what to use for interval - can anyone help
    
    rangestat (max) totalmissing, int(intrval . -1) by(id)
    Step 2:

    Join the pristine dataset to the outcomes dataset and keeping the row which have the surgdate variable closest to 'opdate' and have the same id, id2 and gender (this confirms its the same person)

    Step 2:

    Code:
    //Joining the datasets -- this joins the datasets but adds rows of data which I don't want
    //Load pristine dataset ***
    joinby id id2 gender using "/Users/rosematthews/Desktop/outcomesdata.dta" ///if they're not the same in both datasets = the information is dropped
    Can anyone help with regards to the next steps? where I then want to keep the row which has the surgdate closest to operationdate but of course must have the same id, id2 and gneder (although this was done in this step ***

    Here are the datasets if perhaps I didn't make myself clear... APpreciate your help from all the experts who kindly give us their feedback. - thank you

    ***OUTCOMES DATASET****
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id str2 id2 float(pain scar surgdate) byte value9 float(nmissing totalmissing genderout)
    12 "1A" 1 1 22555 0 0 0 1
    12 "1A" . 9     . 1 1 2 1
    13 "2B" 1 9 11962 1 0 1 2
    13 "2B" 1 9 11962 1 0 1 2
    13 "2B" . 1     . 0 1 1 2
    14 "7Z" 1 9 13549 1 0 1 2
    end
    format %td surgdate

    **PRISTINE DATASET***

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id str2 id2 float(operationdate gender)
    12 "1A" 22555 1
    13 "2B" 11962 2
    14 "7Z" 13549 2
    end
    format %td operationdate

  • #2
    This is a somewhat unusual way to proceed. You are applying two different, potentially orthogonal, criteria in the two steps. So, doing them in the order you propose, you could end up with one observation from the outcomes data set that has lots of information but is very far away temporally from the observation it matches to in the pristine data set. If you reverse the order of application, you could end up pairing with an observation that matches the operationdate and surgdate exactly but has no information about scar and pain at all. I suppose it all depends on how you plan to use and interpret this data, but just think through the implications of doing this before you proceed. My intuition is that for these kinds of variables, in nearly all circumstances I would be more concerned with a good temporal match on the data than with the amount of missing information. But it's your call.

    Meanwhile, here's how I would do your two-step approach.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id str2 id2 float(pain scar surgdate genderout)
    12 "1A" 1 1 22555 1
    12 "1A" . 9     . 1
    13 "2B" 1 9 11962 2
    13 "2B" 1 9 11962 2
    13 "2B" . 1     . 2
    14 "7Z" 1 9 13549 2
    end
    format %td surgdate
    tempfile outcomes
    save `outcomes'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id str2 id2 float(operationdate gender)
    12 "1A" 22555 1
    13 "2B" 11962 2
    14 "7Z" 13549 2
    end
    format %td operationdate
    tempfile pristine
    save `pristine'
    
    // STEP 1
    use `outcomes', clear
    mvdecode scar, mv(9)
    rename genderout gender
    egen mcount = rowmiss(pain scar)
    by id id2 gender (mcount), sort: keep if mcount == mcount[1]    // RETAIN MOST INFORMATION
    drop mcount
    
    // STEP 2
    merge m:1 id id2 gender using `pristine'
    gen delta = abs(surgdate-operationdate)
    by id id2 gender (delta), sort: keep if _n == 1
    drop delta
    Comments:
    1. I think the reason you are confused about what -interval()- to use with -rangestat- is because this isn't a -rangestat- problem. You just want to count the number of missing values among the outcome variables and choose the least. Since there could be multiple observations which are tied for the least amount of missing values, I keep them all at this point.
    2. Don't work with magic number missing values in Stata. It's a recipe for incorrect results. Getting rid of those should be one of the very first steps in data cleaning.
    3. In the second step, it is, again, possible that there will be two or more observations that are tied for temporal proximity. This code breaks any such ties randomly and irreproducibly. It would be better to have a rule for breaking the ties. First, consider that ties can result two ways: there can be two observations with the same surgdate, or their can be two observations with different surgdates, one of which is before and the other after operationdate, but they are the same number of days away from operationdate. In the latter case, it would be best to make a rule that systematically picks the one that comes before, or systematically picks the one that comes after. Your call. In the first situation where both observations have the same surgdate, you might want to base that on the values of some other variables in your full data set provided doing it this way isn't going to bias the analysis.

    Comment


    • #3
      Dear Clyde,

      Thanks for this. I had a think about what you said regarding dropping the data with the most missing. It's true. This would be incorrect for me to do.


      Problem: Trying to find out which outcomes from the outcome dataset are at 6 months, 1 year, 5 years and 10 years.
      And for the preop op period - Keep the one with the least days

      As you can see from my dataset 'outcomes' - something I failed to add in my previous post( see below for dataset) is I have 4 date variables and not all are necessarily correct. Some have a preop date that is coded as after the surgery- which would be incorrect. Here are my thoughts, appreciate if you could help where I marked as a problematic issue



      - The operation date from 'Pristine' dataset is the correct surgery date from which I need to calculate the outcomes of 6m, 1 y, 5y, 10y.

      Outcome variables:
      -Pain & Scar are part of a score which I need to use MI to calculate the total score but I can only do this when I have finaised my dataset due to the requirement of my outcomes in OLS (as seen in my another post)
      - OKS - total score - this outcome score is all complete (mean imputation complete)



      //In your pristine dataset just keep ID ID2 and Gender and operation date (Operation date in prisitne is most correct)

      //using outcomes dataset

      Code:
      merge m:1 id id2 gender using "/Users/rosematthews/Desktop/pristinedata.dta"
      //Problem: Try Find which Proms are at 6 months, 1 year, 5 years, 10 years - coded as 1
      //Rule 1 - preopdates must be < surgerydate, if both are < surgerydate keep the one with the least days from surgery date without losing information


      //Step 1

      //Finding the correct preop_outcomedate - this is calculated in days

      Code:
      gen preop_correct1 = abs(operationdate-preop_outcome)
      gen preop_correct2 = abs(operationdate-preop_received1)
      
      //Finding the correct postop_outcomedate - days
      gen postop_correct1 = abs(operationdate-postop_outcome)
      gen postop_correct2 = abs(operationdate-postop_received2)
      
      
      gen proms_preop = 1 if preop_outcome or preop_received1 < operationdate
      //Problem: Keep the one with the least preop days //need help here - keep the one with the least days from surgery date without losing information
      **becareful may lose information as different rows may have different information at different stages: 6 months/1 year / 5 years / 10 years **


      //Step 2 - Finding the 6 months proms - accepting if the date is within 6 months - 8 months from surgerydate

      Code:
      gen proms_6m= 1 if postop_correct1 >= 180  & postop_correct <=240
      gen proms_6m= 1 if postop_correct2 >= 180  & postop_correct <=240
      // Repeat for 1 year /5 years / 10 years

      //Now you have different lines for each id if the outcomes are at 6 months/ 1 years / 5 years / 10 years - can save different datasets for 6 months/1 year/5 y/ 10 years to have one line for each id



      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float id str2 id2 float(pain scar surgdate) byte value9 float(nmissing totalmissing gender) str24 var14 float(preop_outcome postop_outcome preop_received1 preop_received2)
      12 "1A" 1 1 22555 0 0 0 1 "all correct"              22554 22565 22554 22565
      12 "1A" . 9     . 1 1 2 1 "received correct preop"   22556 22566 22563 22565
      13 "2B" 1 9 11962 1 0 1 2 "received post op correct" 11963 11901 11932 11973
      13 "2B" 1 9 11962 1 0 1 2 "received all correct"     11932 11934     .     .
      13 "2B" . 1     . 0 1 1 2 ""                         11963 11901 11933 11973
      14 "7Z" 1 9 13549 1 0 1 2 ""                         13550 13549 13548 13551
      end
      format %td surgdate
      format %td preop_outcome
      format %td postop_outcome
      format %td preop_received1
      format %td preop_received2
      **Pristine dataset remains the same as post1

      Comment


      • #4
        hello just to say I've sorted the problem out - for anyone else here is the code, for, trying to find the PROMS at 6months etc (see my post 5)

        However I still have this problem:

        I would like to keep the row of information for each ID which has the

        preop_date or recpreop_date CLOSEST to the surgery date




        These are my steps up to now - pls note a cleaner dataset and smaller


        Code:
        //Select which is the correct pre_opdate (from the received vs preopdate) by finding the one closest to the operationdate
        
        
        gen preop_correct1 = abs(operationdate-preop_date)
        
        gen preop_correct2 = abs(operationdate-recpreop_date)
        
        
        
        //Clyde - suggested this code - for when I wanted just one variable of dates. But this time I’m trying to decide between two (preop_correct1 and preop_correct2) 
        
        //Clyde's code 
        gen delta = abs(surgdate-operationdate)
        
        by id id2 gender (delta), sort: keep if _n == 1
        
        drop delta

        I don’t know how to modify the code for 2 variables?






        Dataset:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        
        clear
        
        input float id str2 id2 float(pain scar surgdate oks gender preop_date recpreop_date postop_date recpostop_date operationdate) byte(_merge duplicatescores)
        
        12 "1A" 1 1 22555 40 1 22533 22289 22770 22771 22555 3 0
        
        13 "2B" 1 9 11962  1 2 11933 11933 11963 11782 11962 3 0
        
        13 "2B" 1 9 11962 48 2 11932 11951 11963 11782 11962 3 1
        
        13 "2B" 1 9 11962 48 2 11932     .     .     .     . . 1
        
        end
        
        format %td surgdate
        
        format %td preop_date
        
        format %td recpreop_date
        
        format %td postop_date
        
        format %td recpostop_date
        
        format %td operationdate
        
        label values _merge _merge
        
        label def _merge 3 "Matched (3)", modify


        Comment


        • #5
          //Select which are the 6/12 proms

          //Find out which Proms have been recorded at 6 months or 1 year - days

          Code:
          gen postop_6m = abs(operationdate-postop_date)
          gen postop_6mrec = abs(operationdate-recpostop_date)
          
          gen postop_1y = abs(operationdate-postop_date)
          gen postop_1yrec = abs(operationdate-recpostop_date)
          
          //6 months = 180 days - 8 months = 6months 240 days
          gen final6m= 1 if postop_6m >= 180 & postop_6m  <= 240
          gen final6mrec= 1 if postop_6mrec >= 180 & postop_6mrec  <= 240
          
          
          //1 year + 2 months limit 
          gen final1y= 1 if postop_1y >= 365 & postop_1y  <= 425
          gen final_1yrec= 1 if postop_1yrec >= 365 & postop_1yrec  <= 425
          
          
          //Save 6 months dataset
          keep if final6m == 1 | final6mrec ==1 
          
          
          //WORK FORM there

          Comment


          • #6

            Code:
             by id id2 gender (mcount), sort: keep if mcount == mcount[1]    // RETAIN MOST INFORMATION
            Is there a way how to work with -min- command? What if I have a variable with ID 16 that has mcount number as 2 (rather than 1). This wouldn't be captured with the above command. As I want to find for each ID keep the the ID with the LEAST missing, this could be any number from 1-3. By any chance, do you have any other recommendations, if any of course?

            Comment


            • #7
              Is there a way how to work with -min- command? What if I have a variable with ID 16 that has mcount number as 2 (rather than 1). This wouldn't be captured with the above command. As I want to find for each ID keep the the ID with the LEAST missing, this could be any number from 1-3. By any chance, do you have any other recommendations, if any of course?
              I think you don't understand what that code does. Which also means that you don't understand how -by- works with subscripts, so you are lacking a fundamental Stata tool that you need for working with data sets like the ones you have. So let's review that.

              In the absence of -by-, a subscript refers to which observation of the data set is to be used. So, in the absence of -by-, mcount[1] refers to the value of variable mcount in the first observation. It has nothing to do with mcount == 1. mcount[5] refers to the value of mcount in the 5th observation. And _N has a special meaning: mcount[_N] refers to the value of mcount in the last observation. The value of the subscript (the number inside the square brackets) designates the location of the data set, not its value.

              In a command with -by- it is a bit more complicated. If we have -by varlist1 (varlist2): ... mcount[1]...-, first recall that the data must first be sorted on varlist1 and varlist2. -by- then executes the command that follows the colon separately for each group of observations defined by common values of varlist1. Now, in the presence of -by-, the values of subscripts are interpreted relative to the group of observations. So when we have -by id id2 gender (mcount): ...mcount[1]...- the data are first sorted in order of id, id2, gender, and mcount. The command following the colon is executed separately for each group of observations defined by the same combination of values of id id2 and gender (but not mcount). Now, within each such group of observations, the data are already sorted on mcount, meaning that the first observation in the group has the smallest value of mcount, the second one has the next smallest value (or might be tied with the first), etc. So in this context mcount[1] means the first observation in the group, which, because the data were presorted on mcount within id id2 and gender, means the observation with the smallest value of mcount. In other words, this command does exactly what you asked for. To be a little more specific, this particular command is -keep if mcount == mcount[1]-. So this means keep all observations whose value of mcount is equal to the smallest value of mcount in the group defined by id id2 and gender. In other words, if there are observations that are all tied for smallest value of mcount in the group, all of them are kept, and everything else is dropped.

              Comment

              Working...
              X