  • Identifying individuals across different datasets

    Hi everyone,

    I am working with medical record data and have two datasets, demographics.dta and encounters.dta. demographics.dta contains 598 records, one for each individual. In contrast, encounters.dta contains 905 records for 143 individuals, each of whom had between 1 and 31 encounters with a specific clinic. I want to generate an indicator variable in demographics.dta that identifies the 143 individuals who attended the clinic out of the initial sample of 598.

    I have one common identifying variable across the two datasets, mrn (medical record number). In demographics.dta, each mrn only appears once as seen in the first 10 records below:
    Code:
    mrn
    571246
    662848
    680940
    774017
    774256
    774719
    774877
    775572
    775700
    776944
    But in encounters.dta, the same mrn appears multiple times e.g.:
    Code:
    mrn
    776944
    776944
    776944
    776944
    776944
    776944
    875626
    875626
    875626
    875626
    I know that all the individuals I want to identify are included in demographics.dta. Is there a way to flag them without manually scrolling through and matching by mrn? I have tried to merge the datasets, but this just adds more records to demographics.dta, and I am unsure how to proceed from here. Any assistance would be greatly appreciated.

  • #2
    Code:
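    * Load only mrn from the encounters file and reduce it to one row per individual
    use mrn using encounters, clear
    keep mrn
    duplicates drop
    * Each mrn must be matched or come from demographics only; anything else halts with an error
    merge 1:1 mrn using demographics, assert(match using)
    * 1 if the individual appears in encounters, 0 otherwise
    gen byte in_encounters = (_merge == 3)
    drop _merge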
    Note: If the -merge- command gives you an error message saying that not all the observations are matched or from using, that will be because, contrary to your expectation, there are mrns in the encounters file that do not appear anywhere in the demographics file.
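
    As a hedged follow-up check, and assuming the code above ran without error and its result is still in memory, the counts stated in the original post (598 individuals, 143 of them with encounters) can be turned into assertions. This is only a sketch: the numbers are taken from the question, not from the data, so a failed assertion means the data differ from that description.
    Code:
    * Expected totals come from the original post, not from the files themselves
    assert _N == 598
    count if in_encounters == 1
    assert r(N) == 143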


    • #3
      Thanks for your help Clyde! Your legendary status on these forums is well deserved.

      As you anticipated, there was one mrn which was copied over from the encounters dataset. However, curiously it has a value of 0 for the in_encounters variable.
      Last edited by Ameer Lambrias; 25 Oct 2023, 22:35.


      • #4
        Well, you "cheated." The -merge- command halted execution with an error message. You chose to go ahead and run the rest of the code, which you're not supposed to do. The reason Stata halts execution with error messages is to inform you that something is wrong--if you run the rest of the code in the face of that, there is no assurance that the subsequent results will be correct, or even meaningful.

        The reason that observation got the code in_encounters = 0 is clear. The command that creates in_encounters defines it to be 1 when _merge == 3 and 0 otherwise. The rogue observation in the encounters data had _merge = 1 (it appears only in the master data, which here is the encounters file), not 3. (If you are not familiar with how the -merge- command creates the _merge variable, read -help merge-.) But it doesn't matter whether that observation had a "correct" value for in_encounters: that observation, under the terms of your problem, is not supposed to exist in the first place. That's why the code asks Stata to stop and give you that error message: it found an observation that should not exist and is not classifiable within the terms of your problem. The wise course of action is to investigate why that observation is even there. Perhaps the mrn is incorrect. Perhaps you are using an outdated version of the demographics file. Perhaps it is something else. But something is wrong that makes it impossible to correctly proceed with the original plan.
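
        One possible way to start that investigation is sketched below. It reuses the file and variable names from #2 and simply omits the assert() option, so the merge runs to completion and the unmatched mrn can be inspected. Because the encounters file is the master data here, an mrn that is missing from demographics has _merge == 1. Treat this as an illustration, not the only way to do it.
        Code:
        * Sketch only: file and variable names follow #2
        use mrn using encounters, clear
        duplicates drop
        * No assert() here, so the merge completes and every _merge value can be inspected
        merge 1:1 mrn using demographics
        tabulate _merge
        * _merge == 1 marks an mrn found in encounters (master) but not in demographics (using)
        list mrn if _merge == 1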
        Last edited by Clyde Schechter; 26 Oct 2023, 08:55.
