Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge with variables which can vary a little bite

    Hi,
    I'm looking how to merge 2 files with common variables which can vary (but only as i define the variation)

    Ex: the birth weight can be accepted with a variation of 100g or an age with a variation of 1 year.

    In advance, Thank you for your help

  • #2
    One approach would be to merge the datasets "unconditionally", i.e. generate a dummy with value 1 in both, and then use joinby to "merge" on that dummy:

    in both datasets:
    gen dummy = 1

    in file1.dta:
    joinby dummy using file2.dta

    Now you have all possible combinations of observations between the datasets (the result has n*m observations - might be problematic if both files have many observations to begin with. I have no clue how this solutions performs with large files).


    Then check the difference between the variables you want the merge to be conditional on (say age1 is age in the first, age2 age in the second dataset).
    Now you drop all combinations where the absolute value of that difference exceeds your tolerance limit:

    drop if abs(age1-age2)>1


    /Edit:

    Oh, I should add that you should make sure that there are no identically named variables in the datasets, as otherwise the command will just keep the the one from file1. You can use ren before creating the dummy in file2 to ensure this doesn't happen, for example:

    ren * *_2

    (then go on as above)
    Last edited by David Poensgen; 11 Sep 2014, 05:57.

    Comment


    • #3
      I'm not sure this calls for a joinby. If the age or birth weight can vary a bit, then presumably they refer to the same person. And presumably the datasets contains identifiers variables that uniquely identify each person. If that's the case, then all that is needed is to rename variables and merge both version. You can then do variable-to-variable comparisons. There are some user-written programs on SSC that will handle the merge but a simple example can show how to do this from first principles:

      Code:
      sysuse auto, clear
      
      * make sure you have a variable that uniquely idenfies observations
      isid make, sort
      
      * make a few changes
      replace price = price + _n if mod(_n,2)
      replace trunk = trunk + _n if mod(_n,2)
      
      * change the number of observations
      drop if price > 10000
      
      * rename variables that may be common in both datasets (help rename group)
      rename price-foreign =_
      
      * save the mock data with differences
      tempfile auto2
      save "`auto2'"
      
      * go back to where we started and compare both datasets
      sysuse auto, clear
      merge 1:1 make using "`auto2'"
      
      * Do a variable-to-variable comparison and decide what to do
      list price price_ if abs(price - price_) / price > .01 & _merge == 3
      list trunk trunk_ if abs(trunk - trunk_) / trunk > 5 & _merge == 3

      Comment


      • #4
        Note also that rather than using joinby with a dummy variable set to 1, you could instead use cross

        Comment

        Working...
        X