Discarding observations which do not match reference

Hans Rahr

Join Date: Oct 2019
Posts: 30


06 Oct 2023, 14:06

I have a dataset of patients, each with a number of samples collected on different days. One of the samples from each patient is labelled as the reference sample with which the others are to be compared.
As the example shows, some samples were not taken from the same location as the patient's reference sample, and only samples from the same location as the reference sample are relevant.
I want to discard the irrelevant samples from the dataset. But how can I do that? I imagine some bysort: egen.. procedure, perhaps using _n and _N, but I can't figure out how. Thanks for any help!

Example dataset

Code:

patient_id      sample_no.        location      reference_sample
    #1                   1                  1                  0
    #1                   2                  2                  0
    #1                   3                  2                  1
    #1                   4                  1                  0
    #2                   1                  2                  0
    #2                   2                  2                  1
    #2                   3                  1                  0
    #3                   1                  1                  0
    #3                   2                  1                  1
    #3                   3                  1                  0
    #3                   4                  2                  0
    #4                   1                  2                  1
    #4                   2                  1                  0
    #4                   3                  2                  0

Last edited by Hans Rahr; 06 Oct 2023, 14:21.


Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

06 Oct 2023, 14:43

Use of _n and _N, yes. But, as it happens, no -egen- required.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str2 patient_id byte(sample_no location reference_sample)
"#1" 1 1 0
"#1" 2 2 0
"#1" 3 2 1
"#1" 4 1 0
"#2" 1 2 0
"#2" 2 2 1
"#2" 3 1 0
"#3" 1 1 0
"#3" 2 1 1
"#3" 3 1 0
"#3" 4 2 0
"#4" 1 2 1
"#4" 2 1 0
"#4" 3 2 0
end

// VERIFY EXACTLY ONE REFERENCE SAMPLE PER PATIENT
by patient_id (reference_sample), sort: assert (reference_sample) == (_n == _N)

// DROP OBSERVATIONS WITH LOCATION OTHER THAN THAT OF REFERENCE SAMPLE
by patient_id (reference_sample): drop if location != location[_N]
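
As a hedged variant of the code above: to inspect which samples would be discarded before dropping anything, the reference location can be copied into a new variable and each sample flagged. The variable names ref_location and keep_sample are hypothetical. The sketch assumes the -assert- above has passed, so that the reference sample sorts last within each patient.

```stata
* Sketch only: ref_location and keep_sample are hypothetical names.
* Assumes exactly one reference sample per patient (checked by the assert above).
by patient_id (reference_sample), sort: generate ref_location = location[_N]

* 1 if the sample shares the reference sample's location, 0 otherwise
generate byte keep_sample = (location == ref_location)

* Inspect the flags patient by patient before discarding anything
list patient_id sample_no location reference_sample keep_sample, sepby(patient_id)

* Once the flags look right, discard the irrelevant samples
drop if !keep_sample
```

Flagging first costs two extra variables but leaves the data intact until the result has been checked.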

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Hans Rahr

Join Date: Oct 2019

Posts: 30
#3

07 Oct 2023, 05:40

Thank you very much, Clyde, for your swift reply! Your solution is simple and slick and works fine. I am particularly impressed with the assert (reference_sample)==(_n==_N) part with its "cumulated" equal signs. I didn't know you could do that!
Have a nice weekend
Hans
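
For other readers surprised by the same construct, a minimal illustration of why the "cumulated" equal signs work: in Stata a logical expression evaluates to 1 when true and 0 when false, so (_n == _N) is itself a number that can be compared with reference_sample. The literal values below are made up purely for illustration.

```stata
* A true comparison evaluates to 1, a false one to 0
display 3 == 3
display 2 == 3

* The outer == therefore compares one 0/1 value with another
display 1 == (3 == 3)
display 0 == (3 == 3)
```

Within each patient, sorted on reference_sample, the assertion can only hold if the last observation has reference_sample equal to 1 and every other observation has 0. This assumes reference_sample is coded 0/1.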