Identifying individuals across different datasets

Ameer Lambrias

Join Date: Oct 2023

Posts: 2
#1

25 Oct 2023, 21:12

Hi everyone,

I am working with medical record data and have two datasets, demographics.dta and encounters.dta. demographics.dta contains 598 records, one per individual. In contrast, encounters.dta contains 905 records for 143 individuals, each of whom had between 1 and 31 encounters with a specific clinic. I want to generate an indicator variable in demographics.dta that identifies the 143 individuals who attended the clinic out of the initial sample of 598.

I have one common identifying variable across the two datasets, mrn (medical record number). In demographics.dta, each mrn appears only once, as seen in the first 10 records below:

Code:

mrn
571246
662848
680940
774017
774256
774719
774877
775572
775700
776944

But in encounters.dta, the same mrn appears multiple times e.g.:

Code:

mrn
776944
776944
776944
776944
776944
776944
875626
875626
875626
875626

I know that all the individuals I want to identify are included in demographics.dta. Is there a way to flag them without manually scrolling through and matching by mrn? I have tried to merge the datasets, but this just adds more records to demographics.dta, and I am unsure how to proceed from here. Any assistance would be greatly appreciated.
Clyde Schechter

Join Date: Apr 2014

Posts: 30153
#2

25 Oct 2023, 22:07

Code:

use mrn using encounters, clear
keep mrn
duplicates drop
merge 1:1 mrn using demographics, assert(match using)
gen byte in_encounters = (_merge == 3)
drop _merge

Note: If the -merge- command gives you an error message saying that not all the observations are matched or from using, that will be because, contrary to your expectation, there are mrn's that occur in the encounters file but not anywhere in the demographics file.
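
For anyone who wants to try the approach without the real files, here is a minimal sketch on toy data. The mrn values are borrowed from the listings in #1, but which of them fall in which dataset is an assumption made purely for illustration, and the tempfile named demographics stands in for demographics.dta.

```stata
* Toy stand-in for demographics.dta: one row per person (illustrative values)
clear
input long mrn
571246
662848
776944
875626
end
tempfile demographics
save `demographics'

* Toy stand-in for encounters.dta: repeated mrn values
clear
input long mrn
776944
776944
776944
875626
875626
end

* Reduce the encounters data to one row per mrn
duplicates drop

* Every remaining row must match a demographics row, or -merge- halts
merge 1:1 mrn using `demographics', assert(match using)
gen byte in_encounters = (_merge == 3)
drop _merge

sort mrn
list, noobs
```

In this toy run, in_encounters is 1 for 776944 and 875626 and 0 for the other two mrn values, and the result has one row per person, as in demographics.dta.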
Ameer Lambrias

Join Date: Oct 2023

Posts: 2
#3

25 Oct 2023, 22:32

Thanks for your help Clyde! Your legendary status on these forums is well deserved.

As you anticipated, there was one mrn which was copied over from the encounters dataset. However, curiously it has a value of 0 for the in_encounters variable.

Last edited by Ameer Lambrias; 25 Oct 2023, 22:35.
Clyde Schechter

Join Date: Apr 2014

Posts: 30153
#4

26 Oct 2023, 08:52

Well, you "cheated." The -merge- command halted execution with an error message. You chose to go ahead and run the rest of the code, which you're not supposed to do. The reason Stata halts execution with error messages is to inform you that something is wrong--if you run the rest of the code in the face of that, there is no assurance that the subsequent results will be correct, or even meaningful.

The reason that observation got the code in_encounters = 0 is clear. The command that creates in_encounters defines it to be 1 when _merge == 3 and 0 otherwise. The rogue observation in the encounters data had _merge = 1 (present only in the master data, which here is the encounters file), not 3. (If you are not familiar with how the -merge- command creates the _merge variable, read -help merge-.) But it doesn't matter whether that observation had a "correct" value for in_encounters: that observation, under the terms of your problem, is not supposed to exist in the first place. That's why the code asks Stata to stop and give you that error message: it found an observation that should not exist and is not classifiable within the terms of your problem. The wise course of action is to investigate why that observation is even there. Perhaps the mrn is incorrect. Perhaps you are using an outdated version of the demographics file. Perhaps it is something else. But something is wrong that makes it impossible to correctly proceed with the original plan.
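
One way to start that investigation is to rerun the merge without assert() and list the mrn values that are found only in the encounters data. This is only a sketch on toy data: the mrn 999999 is a hypothetical rogue value invented for illustration, and the tempfile named demographics stands in for demographics.dta.

```stata
* Toy stand-in for demographics.dta (illustrative values)
clear
input long mrn
571246
776944
end
tempfile demographics
save `demographics'

* Toy stand-in for encounters.dta; 999999 is a hypothetical rogue mrn
clear
input long mrn
776944
776944
999999
end
duplicates drop

* No assert() here, so -merge- does not halt on unmatched mrn values
merge 1:1 mrn using `demographics'

* _merge == 1: only in master (encounters)
* _merge == 2: only in using (demographics)
* _merge == 3: in both
tabulate _merge
list mrn if _merge == 1, noobs
```

The rows with _merge == 1 are the ones to chase down in the source records before going back to the version with assert(match using).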

Last edited by Clyde Schechter; 26 Oct 2023, 08:55.