Strategy to merge two dataset

Marco Errico

Join Date: Apr 2020

Posts: 187
#1

Strategy to merge two dataset

31 Jul 2024, 10:53

Hi everyone,

I would love to receive your advice regarding a strategy to match two datasets.
Specifically I have these sets of datasets:
Dataset A: Contains observations and information about establishments on a yearly basis, with a variable identifier "X_id" (identifying establishments). It's a panel dataset, but there are multiple instances of "X_id" within each year (i.e., duplicates).

Dataset B: Contains observations and information (different from Dataset A) about establishments on a yearly basis, with a variable identifier "Y_id" (identifying establishments). It's also a panel dataset with multiple instances of "Y_id" within each year (i.e., duplicates).

My final goal is to merge datasets A and B. Since each dataset has variables that identify establishment names, I've done the following:
Extracted unique identifiers associated with establishments from each dataset.

Performed a fuzzy match on establishment names.

Now I have Dataset C, which is a linking table with two variables: "X_id" associated with each "Y_id".

Given that my final goal is to merge A and B, I'm trying to determine the best way to leverage this linking table. If I perform a 1:m merge from C using A, I'm struggling to find a way to finally merge with B, as I'll have repeated identifiers in the A+C dataset.

Any suggestions?
Thank you!
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#2

31 Jul 2024, 11:21

If the duplicates in Dataset B are indeed duplicates, carrying no useful additional information, you should drop them from the data sets before merging. Beyond that, it's hard to advise you without seeing some kind of example data from each of your files.
Comment
George Ford

Join Date: Aug 2014

Posts: 3136
#3

31 Jul 2024, 12:21

Are firms are only identified by X_id or Y_id? What is the nature of the duplicate observations? Pure duplicates, or something else?
Comment

Announcement

Strategy to merge two dataset

Comment

Comment