  • How do I determine if a datafile has overlapping entries in another file?

    Hello Everyone,

    I have inherited several datasets which may or may not have overlapping entries. In general the variables are the same. One issue I am running into when trying to compare the datasets is that the number of entries is not the same. For example, I might have the file blue_A with 300 entries and blue_B with 90 entries. Using cfvars (SSC), I've checked that the variables are indeed the same (well, the 3 identifying variable names are exactly the same, but the rest have A_ or B_ appended to them, which is a whole different issue). Before I append the datasets, I want to check whether any of the id variable values are the same.

    I guess what I am asking is: how do I determine if any of the 90 things (rooms, in this case) from the file blue_B are also included in the file blue_A? I have about 19 of these "pairs" of files that I have to go through. If the id variable were numeric, I could make some plots or do some math to see if any entries are included in both, but the rooms are identified by strings ("c1-b3-x6"), so that doesn't work. Ideally these would be separate datasets for case A and case B with zero overlap, but I know that is not the case (some rooms have been measured for both case A and case B).

    I hope my question is clear. Nothing I've tried so far works, and I feel like I've wasted a ton of time on something that should be simple.

    Thanks in advance!
    Cara

  • #2
    Are the string identifiers the same across datasets blue_A and blue_B? If so, appending the datasets and then running duplicates on the identifier variable to search for duplicate observations should do the job. Otherwise, you could run duplicates on the whole set of variables of interest, but this assumes that two genuinely different entries never share identical values on all of those variables.
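
    A minimal sketch of the append-then-duplicates approach, assuming the identifier is a single string variable. The variable names room_id, A_temp, B_temp, source, and dup, and the toy values, are hypothetical and only stand in for the real blue_A and blue_B files:

    ```stata
    * Toy stand-in for blue_A (hypothetical names and values)
    clear
    input str10 room_id float A_temp
    "c1-b3-x6" 21.5
    "c1-b3-x7" 20.1
    "c2-b1-x1" 19.8
    end
    tempfile blue_A
    save `blue_A'

    * Toy stand-in for blue_B; one room_id also appears in blue_A
    clear
    input str10 room_id float B_temp
    "c1-b3-x6" 18.2
    "c9-b9-x9" 22.0
    end
    tempfile blue_B
    save `blue_B'

    * Append, recording which file each observation came from (0 = A, 1 = B)
    use `blue_A', clear
    append using `blue_B', generate(source)

    * Summarise and flag identifiers that occur more than once
    duplicates report room_id
    duplicates tag room_id, generate(dup)

    * Show the overlapping rooms side by side
    sort room_id source
    list room_id source A_temp B_temp if dup > 0, sepby(room_id)
    ```

    Working on temporary copies keeps the original files untouched, and the source marker shows whether a repeated identifier comes from both files or from within one of them.
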
    Jorge Eduardo Pérez Pérez
    www.jorgeperezperez.com


    • #3
      Thanks Jorge! I'm sorry I didn't say thank you sooner - I had internet problems and then was travelling.

      So, YES, that actually worked in the end. I copied the original datasets to create "play" datasets, appended those, and then checked for duplicates using various lists of the variables to see where the duplicates were and why they were duplicated in the first place. I am looking at lists of rooms, and some (not all) were measured twice in different seasons, so there were duplicate rooms with all variables the same except for the measurements (different temperatures, for example). I needed a single dataset with one room per observation, so I had to comb through everything carefully, add variables for second measurements, etc.
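
      For anyone with a similar cleanup, one possible way to get one room per observation is to number the repeat measurements and reshape to wide. This is only a sketch under assumptions: the variable names room_id, season, temp, and visit and the toy values are hypothetical, and it is not necessarily the procedure used on the real data:

      ```stata
      * Toy appended data: one room measured in two seasons (hypothetical values)
      clear
      input str10 room_id str6 season float temp
      "c1-b3-x6" "summer" 24.0
      "c1-b3-x6" "winter" 18.5
      "c2-b1-x1" "summer" 23.1
      end

      * Confirm which rooms appear more than once
      duplicates report room_id

      * Number the measurements within each room, ordered by season
      bysort room_id (season): generate visit = _n

      * One row per room; measurements become temp1, temp2, season1, season2
      reshape wide temp season, i(room_id) j(visit)

      * Stops with an error if room_id is still not unique
      isid room_id
      list
      ```

      reshape wide requires every variable not listed as a stub to be constant within room_id, so any other variable that differs between visits has to be added to the stub list first.
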

      Thanks again!
      Cara
