Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • comparing multiple variables from two datasets without ID

    Dear Stata friends,

    Following Scenario:
    I have to Datasets with four variables each: Schoolname, Address, City and Postalcode.
    The first dataset is from 2016, the second one from 2020
    Both have around 3500 Observations, but not exactly the same number, also not in the same order. There is no common identical ID.

    Task:
    I would like to compare the datasets. Which datapoints from which school has changed within the four years, which data is still the same.

    What have I tried:
    My first though was to use cfout.
    Code:
    cfout SCHOOLNAME ADDRESS CITY PCODE using "xxxxx", saving(diffs)
    use diffs
    But without ID that doesnt work (I only realized later that the ID's in both datasets are not the same...).

    A second though was to use a normal merge and repeat it over and over again. Starting with all variables and then looking for less intersections.

    Code:
    merge 1:1 SCHOOLNAME ADDRESS CITY PCODE using "xxxxx.dta"
    export excel using "xxxxx.xlsx" if _merge== 3, firstrow(variables) sheet("err_all_merge") nolabel replace
    But that seems very hard to analyse at the end. There would be alot of differents Excel sheets and combinations.

    Is there another strategy which I am not seeing?

    Thanks for your time
    Last edited by Marius Kaltenbach; 11 Oct 2019, 06:12.

  • #2
    I'd suggest data cleaning and constucting your own unique ID

    When I've worked with data like this, I construct a unique ID using elements that are likely to be stable and, depending on the data quality, are least likely to include errors or variations after doing some basic cleaning (all caps, remove special symbols, trim leading and trailing spaces and remove internal spaces). Something like the first 5 characters of the school name concatenated with the postal code. You might have to do some match iterations to find out what combination works best and there may be some you have to manually set. I would avoid using addresses because there are so many variations possible. 14 MAIN ST N; 14 MAIN STREET N; 14 MAIN ST NORTH; 14 MAIN ST etc. While you can clean these if you really need them, it can be a pain.

    When I work with data like this, creating a unique identifier is usually easiest. In some cases I've had to do the iterative match technique you describe where I'm using less and less data to make matches. When I do that, I generally also loop in a confidence flag that lets me know which matches may be the least robust (e.g. confident=2 if I'm sort of confident, confident=3 if less so etc.)

    Comment


    • #3
      Thank you for your answer. My problem with creating an unique ID is, that I will not be able to create the same ID for the same case.
      The idea with the confidence Variable is good. I might use that.


      Comment

      Working...
      X