  • Discarding observations which do not match reference

    I have a dataset of patients, each with a number of samples collected on different days. One of the samples from each patient is labelled as the reference sample with which the others are to be compared.
    But as you can see from the example, some of the samples are not from the same location as the patient's reference sample, and relevant samples must be from the same location.
    I want to discard the irrelevant samples from the dataset. But how can I do that? I imagine some bysort: egen.. procedure, perhaps using _n and _N, but I can't figure out how. Thanks for any help!

    Example dataset
    Code:
    patient_id      sample_no.        location      reference_sample
        #1                   1                  1                  0
        #1                   2                  2                  0
        #1                   3                  2                  1
        #1                   4                  1                  0
        #2                   1                  2                  0
        #2                   2                  2                  1
        #2                   3                  1                  0
        #3                   1                  1                  0
        #3                   2                  1                  1
        #3                   3                  1                  0
        #3                   4                  2                  0
        #4                   1                  2                  1
        #4                   2                  1                  0
        #4                   3                  2                  0
    Last edited by Hans Rahr; 06 Oct 2023, 14:21.

  • #2
    Use of _n and _N, yes. But, as it happens, no -egen- required.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str2 patient_id byte(sample_no location reference_sample)
    "#1" 1 1 0
    "#1" 2 2 0
    "#1" 3 2 1
    "#1" 4 1 0
    "#2" 1 2 0
    "#2" 2 2 1
    "#2" 3 1 0
    "#3" 1 1 0
    "#3" 2 1 1
    "#3" 3 1 0
    "#3" 4 2 0
    "#4" 1 2 1
    "#4" 2 1 0
    "#4" 3 2 0
    end
    
    //    VERIFY EXACTLY ONE REFERENCE SAMPLE PER PATIENT
    by patient_id (reference_sample), sort: assert (reference_sample) == (_n == _N)
    //    DROP OBSERVATIONS WITH LOCATION OTHER THAN THAT OF REFERENCE SAMPLE
    by patient_id (reference_sample): drop if location != location[_N]
    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
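For comparison, the -egen- route that the original question had in mind might look roughly like the sketch below. It is not part of the answer above. The helper variable name ref_location is made up for illustration, and the sketch assumes exactly one reference sample per patient, so the -assert- check from #2 is still worth running first.

```stata
* Rebuild the example data from post #1
clear
input str2 patient_id byte(sample_no location reference_sample)
"#1" 1 1 0
"#1" 2 2 0
"#1" 3 2 1
"#1" 4 1 0
"#2" 1 2 0
"#2" 2 2 1
"#2" 3 1 0
"#3" 1 1 0
"#3" 2 1 1
"#3" 3 1 0
"#3" 4 2 0
"#4" 1 2 1
"#4" 2 1 0
"#4" 3 2 0
end

* ref_location is a hypothetical helper variable, not in the original data.
* cond() returns the location on the reference row and missing elsewhere;
* max() ignores missing values, so every row of a patient receives the
* location of that patient's reference sample.
bysort patient_id: egen ref_location = max(cond(reference_sample == 1, location, .))

* Keep only samples taken at the reference location
drop if location != ref_location
drop ref_location

list, sepby(patient_id)
```

If a patient had no reference sample, ref_location would be missing and all of that patient's observations with a nonmissing location would be dropped. With several reference samples at different locations, max() would silently pick the larger location code.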


    • #3
      Thank you very much, Clyde, for your swift reply! Your solution is simple and slick and works fine. I am particularly impressed with the assert (reference_sample)==(_n==_N) part with its "cumulated" equal signs. I didn't know you could do that!
      Have a nice weekend
      Hans
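The "cumulated" equal signs work because Stata evaluates a logical expression to 1 (true) or 0 (false), so the result of _n == _N can itself be compared with a 0/1 variable. The sketch below spells this out on a small made-up dataset; the variable is_last is a hypothetical name used only for illustration.

```stata
* A logical expression is just a number: 1 if true, 0 if false
display (3 == 3)
display (3 == 4)

* Toy data: one patient, one reference sample (values made up for illustration)
clear
input str2 patient_id byte reference_sample
"#1" 0
"#1" 1
"#1" 0
end

* Within each patient, sort so that reference_sample == 1 comes last,
* then flag the last observation. is_last is a hypothetical helper variable.
bysort patient_id (reference_sample): generate byte is_last = (_n == _N)

list, sepby(patient_id)

* The check from #2, written in two steps: reference_sample must be 1 on the
* last row and 0 on every other row, i.e. exactly one reference per patient.
assert reference_sample == is_last
```

With two reference samples for the same patient, the second-to-last row would have reference_sample equal to 1 but is_last equal to 0, and the -assert- would fail.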
