Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Strategy to merge two dataset

    Hi everyone,

    I would love to receive your advice regarding a strategy to match two datasets.
    Specifically I have these sets of datasets:
    • Dataset A: Contains observations and information about establishments on a yearly basis, with a variable identifier "X_id" (identifying establishments). It's a panel dataset, but there are multiple instances of "X_id" within each year (i.e., duplicates).
    • Dataset B: Contains observations and information (different from Dataset A) about establishments on a yearly basis, with a variable identifier "Y_id" (identifying establishments). It's also a panel dataset with multiple instances of "Y_id" within each year (i.e., duplicates).
    My final goal is to merge datasets A and B. Since each dataset has variables that identify establishment names, I've done the following:
    • Extracted unique identifiers associated with establishments from each dataset.
    • Performed a fuzzy match on establishment names.
    Now I have Dataset C, which is a linking table with two variables: "X_id" associated with each "Y_id".

    Given that my final goal is to merge A and B, I'm trying to determine the best way to leverage this linking table. If I perform a 1:m merge from C using A, I'm struggling to find a way to finally merge with B, as I'll have repeated identifiers in the A+C dataset.

    Any suggestions?
    Thank you!

  • #2
    If the duplicates in Dataset B are indeed duplicates, carrying no useful additional information, you should drop them from the data sets before merging. Beyond that, it's hard to advise you without seeing some kind of example data from each of your files.

    Comment


    • #3
      Are firms are only identified by X_id or Y_id? What is the nature of the duplicate observations? Pure duplicates, or something else?

      Comment

      Working...
      X