comparing multiple variables from two datasets without ID

Marius Kaltenbach

Join Date: Apr 2017

Posts: 31
#1

comparing multiple variables from two datasets without ID

11 Oct 2019, 06:06

Dear Stata friends,

Following Scenario:
I have to Datasets with four variables each: Schoolname, Address, City and Postalcode.
The first dataset is from 2016, the second one from 2020
Both have around 3500 Observations, but not exactly the same number, also not in the same order. There is no common identical ID.

Task:
I would like to compare the datasets. Which datapoints from which school has changed within the four years, which data is still the same.

What have I tried:
My first though was to use cfout.

Code:

cfout SCHOOLNAME ADDRESS CITY PCODE using "xxxxx", saving(diffs) use diffs

But without ID that doesnt work (I only realized later that the ID's in both datasets are not the same...).

A second though was to use a normal merge and repeat it over and over again. Starting with all variables and then looking for less intersections.

Code:

merge 1:1 SCHOOLNAME ADDRESS CITY PCODE using "xxxxx.dta" export excel using "xxxxx.xlsx" if _merge== 3, firstrow(variables) sheet("err_all_merge") nolabel replace

But that seems very hard to analyse at the end. There would be alot of differents Excel sheets and combinations.

Is there another strategy which I am not seeing?

Thanks for your time

Last edited by Marius Kaltenbach; 11 Oct 2019, 06:12.
Tags: None
cammcdermaid

Join Date: May 2014

Posts: 23
#2

11 Oct 2019, 06:53

I'd suggest data cleaning and constucting your own unique ID

When I've worked with data like this, I construct a unique ID using elements that are likely to be stable and, depending on the data quality, are least likely to include errors or variations after doing some basic cleaning (all caps, remove special symbols, trim leading and trailing spaces and remove internal spaces). Something like the first 5 characters of the school name concatenated with the postal code. You might have to do some match iterations to find out what combination works best and there may be some you have to manually set. I would avoid using addresses because there are so many variations possible. 14 MAIN ST N; 14 MAIN STREET N; 14 MAIN ST NORTH; 14 MAIN ST etc. While you can clean these if you really need them, it can be a pain.

When I work with data like this, creating a unique identifier is usually easiest. In some cases I've had to do the iterative match technique you describe where I'm using less and less data to make matches. When I do that, I generally also loop in a confidence flag that lets me know which matches may be the least robust (e.g. confident=2 if I'm sort of confident, confident=3 if less so etc.)
1 like
Comment
Marius Kaltenbach

Join Date: Apr 2017

Posts: 31
#3

13 Oct 2019, 07:42

Thank you for your answer. My problem with creating an unique ID is, that I will not be able to create the same ID for the same case.
The idea with the confidence Variable is good. I might use that.
Comment

Announcement

comparing multiple variables from two datasets without ID

Comment

Comment