
  • How do I perform a fuzzy match between datasets using numeric values?

    Is there a way to efficiently perform a "fuzzy" merge between two large datasets? Emphasis on large datasets (see below). For example, I have two datasets like these:
    Code:
    * Dataset 1
    clear
    input str3 mortgage_id long(balance zip_code) byte open_date
    "AA1" 200000 12345 1
    "AA2" 150000 68593 4
    "AA3"  98000 39825 2
    "AA4" 176400 57924 6
    end
    Code:
    * Dataset 2
    clear
    input str3 mortgage_id long(balance zip_code) byte open_date double interest_rate
    "XX1" 200000 12345 1  7.6
    "YY1" 151000 68593 3  8.1
    "YY2" 149000 68593 4  9.2
    "YY3" 345600 20886 7  4.3
    "ZZ1" 175900 57924 6 11.1
    "ZZ2" 176900 57924 6  5.5
    end
    and I want to consider the following criteria, in order:
    1. A mortgage in the first dataset is considered a match to a mortgage in the second dataset if their balances are within $1,000 of each other. In my example, mortgage AA1 matches precisely one mortgage in the second dataset: XX1.
    2. If a mortgage in dataset 1 matches multiple mortgages in dataset 2 based on their balances, e.g. AA2 matches both YY1 and YY2, then the open_date and zip_code should be used to break ties.
    3. After that, stop, whether or not ties remain. In this example, AA4 matches both ZZ1 and ZZ2 (their zip_code and open_date are identical), but that's OK: consider AA4 matched to both of these mortgages.
    Since the data are numeric, -reclink- doesn't apply here, and since I'm not necessarily estimating treatment effects, I don't think the propensity score matching of -teffects- will do this. Unfortunately, the datasets I'm working with are large enough (hundreds of millions of observations) that simply computing the cross product of both datasets and calculating the matches that way is completely infeasible.

    Does Stata have any support for this at all? Or is Stata simply not the tool for data work like this?



  • #2
    The first step would be very slow if you do it naively, because you would have to compute N^2 comparisons (i.e., a trillion comparisons if each dataset has a million observations).

    However, you can borrow from the GIS literature and do something like an R-tree. The idea is that you round the balances down into $1,000 buckets, so if mortgage i has a balance of 135,500, it belongs in bucket #135 and you only need to check buckets #134, #135 and #136 of the second dataset for possible matches, comparing only the mortgages in those buckets. That makes the whole thing much faster.

    Now, there is no built-in way to do this in Stata. You could code it in Mata with associative arrays, but it would be a lot of work. You could also work in Python, R, or PostGIS/SQLite with the R*Tree module, etc., but that would also be a lot of work.

    Nonetheless, if you are smart with the algorithm you should be able to solve it fast, even with millions of observations.
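    For what it's worth, here is a rough sketch of the bucket idea in plain Stata using -joinby- on the bucket number. It assumes the two example datasets from #1 have been saved as dataset1.dta and dataset2.dta (the file names are placeholders); it only illustrates the bucketing and is not a tuned Mata implementation:
    Code:
    * bucket the second dataset into $1,000 bins
    use "dataset2.dta", clear
    gen long bucket = floor(balance/1000)
    rename (mortgage_id balance zip_code open_date) (mortgage_id2 balance2 zip_code2 open_date2)
    save "dataset2_bucketed.dta", replace
    
    * for each master observation, try its own bucket and the two neighboring buckets
    tempfile pairs
    clear
    save `pairs', emptyok
    
    foreach offset of numlist -1 0 1 {
        use "dataset1.dta", clear
        gen long bucket = floor(balance/1000) + `offset'
        joinby bucket using "dataset2_bucketed.dta", unmatched(none)
        keep if abs(balance - balance2) <= 1000
        append using `pairs'
        save `pairs', replace
    }
    use `pairs', clear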



    • #3
      Check out -rangejoin- at SSC.



      • #4
        You might also see if you can find a way to make -calipmatch- (SSC) do what you're looking for; apparently it's optimized for large datasets(?).
        __________________________________________________
        Assistant Professor, Department of Biostatistics and Epidemiology
        School of Public Health and Health Sciences
        University of Massachusetts- Amherst



        • #5
          You might also want to look at the new R package fastLink with the gammaKpar() function. See https://cran.r-project.org/web/packa...ink/index.html and https://github.com/kosukeimai/fastLink. R can be called from within Stata. I have not tested fastLink but it might be relevant to you.



          • #6
            On second look, it seems that the new R package fastLink, as is, will not work for this, because the gammaKpar() function determines a binary agree/disagree, not partial agreement. Perhaps submit a feature request to the authors?



            • #7
              I agree with Clyde's recommendation to use rangejoin for this problem. rangejoin is very fast and very efficient in terms of memory management. Here's a fully worked-out example based on my understanding of the problem in #1:

              Code:
              * Dataset 2
              clear
              input str3 mortgage_id long(balance zip_code) byte open_date double interest_rate
              "XX1" 200000 12345 1  7.6
              "YY1" 151000 68593 3  8.1
              "YY2" 149000 68593 4  9.2
              "YY3" 345600 20886 7  4.3
              "ZZ1" 175900 57924 6 11.1
              "ZZ2" 176900 57924 6  5.5
              end
              save "dataset2.dta", replace
              
              * Dataset 1
              clear
              input str3 mortgage_id long(balance zip_code) byte open_date
              "AA1" 200000 12345 1
              "AA2" 150000 68593 4
              "AA3"  98000 39825 2
              "AA4" 176400 57924 6
              end
              
              rangejoin balance -1000 1000 using "dataset2.dta"
              
              gen nmatch = (zip_code == zip_code_U) + (open_date == open_date_U)
              bysort mortgage_id (nmatch mortgage_id): drop if nmatch < nmatch[_N] & nmatch[_N] > 0
              list, sepby(mortgage_id)
              and the results:
              Code:
              . list, sepby(mortgage_id)
              
                   +----------------------------------------------------------------------------------------------------------+
                   | mortga~d   balance   zip_code   open_d~e   mortga~U   balanc~U   zip_co~U   open_d~U   intere~e   nmatch |
                   |----------------------------------------------------------------------------------------------------------|
                1. |      AA1    200000      12345          1        XX1     200000      12345          1        7.6        2 |
                   |----------------------------------------------------------------------------------------------------------|
                2. |      AA2    150000      68593          4        YY2     149000      68593          4        9.2        2 |
                   |----------------------------------------------------------------------------------------------------------|
                3. |      AA3     98000      39825          2                     .          .          .          .        0 |
                   |----------------------------------------------------------------------------------------------------------|
                4. |      AA4    176400      57924          6        ZZ1     175900      57924          6       11.1        2 |
                5. |      AA4    176400      57924          6        ZZ2     176900      57924          6        5.5        2 |
                   +----------------------------------------------------------------------------------------------------------+



              • #8
                Originally posted by Robert Picard
                I agree with Clyde's recommendation to use rangejoin for this problem. rangejoin is very fast and very efficient in terms of memory management.
                I attempted to use -rangejoin-, but unfortunately it isn't suitable for my needs. I tried to fuzzy merge a 2.5 GB dataset with roughly 100 million obs with a much smaller dataset of roughly a hundred obs, and, before running out of memory and crashing with an r(909) error, -rangejoin- consumed all 32 GB of RAM on the workstation and another 23 GB of swap space. This happened regardless of which datasets I designated master and using.

                -rangejoin- might work for small datasets, but I don't think it's going to work for me.



                • #9
                  I too would be interested in knowing the practical dataset limits of -rangejoin-. Is rounding the balances to the nearest $2,000, followed by a deterministic match, a possible alternative? For example, if the balance $151,000 rounds to $150,000 then it is a match, but if it rounds to $152,000 then it is a non-match?



                  • #10
                    Michael Anbar You might want to contact Robert Picard about this. I have used -rangejoin- with both the master and using data sets being about 5GB, with no memory problems at all, and surprisingly fast results. I wonder if there is something particular about your data sets, or the amount of cross-matches produced that is causing the problem. Anyway, my guess is that Robert Picard would want to know about it, if not for an immediate fix, perhaps for a later version.

                    Anders Alexandersson -rangejoin- doesn't do any rounding or binning. Instead it matches according to a specified range. So, for example, if you run

                    Code:
                    rangejoin balance -2000 2000 using...
                    then $150,000 will match with anything between $148,000 and $152,000. If you want the allowable range to vary by observation, you can do that by creating variables for the lower and upper limits and specifying those instead of numbers in the -rangejoin- command.

                    And -rangejoin- can simultaneously impose exact matching on other variables, using its -by()- option.
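                    For instance, here is a minimal sketch combining per-observation bounds with exact matching on zip_code. It assumes the example datasets from #1 are saved as dataset1.dta and dataset2.dta, and it assumes that variable bounds are taken as absolute limits (as they are for rangestat in #13); check the rangejoin help file for the exact behavior:
                    Code:
                    use "dataset1.dta", clear
                    * illustrative bound variables: each observation allows +/- $2,000
                    gen long low  = balance - 2000
                    gen long high = balance + 2000
                    rangejoin balance low high using "dataset2.dta", by(zip_code)
                    list, sepby(mortgage_id)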



                    • #11
                      There's no problem with rangejoin and large datasets. It is vastly more efficient than any of the alternatives. When performing a join, you form all pairwise combinations within the specified range, so with 100 million observations in the master, even if each observation matches only 10 observations from the using, you still have to hold 1 billion observations in memory. All you need to do is run the join in parts.
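                      A rough sketch of doing it in parts is below. The chunk size and file names are placeholders, and it assumes the large master is saved as dataset1.dta and the small using as dataset2.dta, with the variables from #1:
                      Code:
                      local chunksize = 1000000
                      use "dataset1.dta", clear
                      gen long part = ceil(_n/`chunksize')
                      summarize part, meanonly
                      local nparts = r(max)
                      save "dataset1_parts.dta", replace
                      
                      tempfile results
                      clear
                      save `results', emptyok
                      
                      forvalues p = 1/`nparts' {
                          use if part == `p' using "dataset1_parts.dta", clear
                          rangejoin balance -1000 1000 using "dataset2.dta"
                          * prune to the preferred matches within each chunk, as in #7
                          gen nmatch = (zip_code == zip_code_U) + (open_date == open_date_U)
                          bysort mortgage_id (nmatch): drop if nmatch < nmatch[_N] & nmatch[_N] > 0
                          append using `results'
                          save `results', replace
                      }
                      use `results', clear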



                      • #12
                        Clyde Schechter Thank you. Yes, I know. What I meant instead is that if rangejoin were too slow, then rounding in Stata (using a command other than rangejoin) followed by merge might be an option. If merge itself is too slow, as has been suggested, then maybe the user-written fmerge, which is part of ftools (SSC), or R's fastLink package (with exact matching on the variable balance) would work.
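                        A rough sketch of that rounding idea follows; the $2,000 bin width is illustrative, and note that plain binning misses pairs whose balances straddle a bin boundary (which the neighbor-bucket trick in #2 avoids). Because several mortgages can share a bin, -joinby- is used here rather than -merge-:
                        Code:
                        * assumes the example datasets from #1 saved as dataset1.dta / dataset2.dta
                        use "dataset2.dta", clear
                        gen long balance_bin = round(balance, 2000)
                        keep mortgage_id balance_bin interest_rate
                        rename mortgage_id mortgage_id2
                        save "dataset2_binned.dta", replace
                        
                        use "dataset1.dta", clear
                        gen long balance_bin = round(balance, 2000)
                        joinby balance_bin using "dataset2_binned.dta", unmatched(master)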

                        Robert Picard just replied, again recommending rangejoin but doing it in parts.



                        • #13
                          Let me add that a lot depends on the structure of the data and on how many matches each observation is likely to find within the requested +/-1000 range. Say the mortgage balances are evenly distributed between 0 and 500,000 in both datasets; then each observation from the master is likely to match about 1/250 of the observations in the using (a $2,000 window out of a $500,000 range). So even if you sort the master by balance and split the 100 million observations into 1,000 smaller files, each with 100,000 observations, each of these may well match 400 observations from the using, which would blow up the data to 40 million observations (and twice as wide, because the variables come from both datasets). Only then would you be able to prune down to the desired match(es) by comparing zip and date.

                          On the off chance that the objective here is to find an interest rate from peer mortgages, something like the mean of the interest rate from mortgages in the +/-1000 range, let me suggest an approach using rangestat. Rather than forming all pairwise combinations of observations within a +/-1000 range, you simply need to append the data with interest rates to the master (from the data examples in #1, all variables are the same except that the second dataset contains an interest rate). Then it's just a matter of finding, for each observation without an interest rate, the mean interest rate from mortgages that are within +/-1000 of the value of balance for the current observation. It makes sense to prefer rates from mortgages in the same zip code with the same date. If none are found, you can then look for interest rates within the same zip or on the same date. And if none are found, you can then look across the board at all mortgages that are within range.

                          To test this, I have defined the following datasets, each with 1 million observations:
                          Code:
                          clear
                          set seed 123312
                          set obs 1000000
                          gen long mortgage_id = _n
                          gen long balance = runiformint(1000,500000)
                          gen long zip_code = runiformint(1001,99950)
                          gen byte open_date = runiformint(1,500)
                          gen interest_rate = runiform(2,20)
                          save "big_data2.dta", replace
                          
                          clear
                          set obs 1000000
                          gen long mortgage_id = _n
                          gen long balance = runiformint(1000,500000)
                          gen long zip_code = runiformint(1001,99950)
                          gen byte open_date = runiformint(1,500)
                          save "big_data1.dta", replace
                          With rangestat, no calculations are done if the interval is invalid (i.e., the lower bound is higher than the upper bound). Since we do not need to calculate the mean interest rate for observations that already have an interest rate, we mark out those observations using an invalid interval. Similarly, we do not want observations with no interest rate to be counted among the observations in range, so we give them a large negative balance; this way they will never fall within another observation's interval bounds.

                          With rangestat, you can also request that the observations in range be found within by-groups. So the first pass finds mortgages with a similar balance within the same zip_code and open_date. You then mark out observations that found one or more matches by applying the invalid interval bounds to them. You can now run rangestat again on the observations with valid intervals, this time grouping on zip_code only. You repeat with groups of open_date. Any remaining observation is then matched without regard to zip or date:
                          Code:
                          clear all
                          use "big_data1.dta"
                          append using "big_data2.dta"
                          
                          * define valid bounds only when the interest rate is missing
                          gen low = cond(mi(interest_rate), balance-1000, 0)
                          gen high = cond(mi(interest_rate), balance+1000, -1)
                          
                          * use an out-of-range balance when the interest rate is missing
                          gen bal2use = cond(mi(interest_rate), -9999, balance)
                          
                          * find average interest rate in groups with the same zip_code and open_date
                          timer on 1
                          rangestat (mean) mrate1 = interest_rate (count) nrate1 = interest_rate, ///
                              interval(bal2use low high) by(zip_code open_date)
                          timer off 1
                          
                          * replace with an invalid interval if the initial match was successful
                          replace low = 0 if !mi(nrate1)
                          replace high = -1 if !mi(nrate1)
                          
                          * find average interest rate in groups with the same zip_code
                          timer on 2
                          rangestat (mean) mrate2 = interest_rate (count) nrate2 = interest_rate, ///
                              interval(bal2use low high) by(zip_code)
                          timer off 2
                          
                          * continue to mark out matches using an invalid interval
                          replace low = 0 if !mi(nrate2)
                          replace high = -1 if !mi(nrate2)
                          
                          * find average interest rate in groups with the same open_date
                          timer on 3
                          rangestat (mean) mrate3 = interest_rate (count) nrate3 = interest_rate, ///
                              interval(bal2use low high) by(open_date)
                          timer off 3
                          
                          * continue to mark out matches using an invalid interval
                          replace low = 0 if !mi(nrate3)
                          replace high = -1 if !mi(nrate3)
                          
                          * find average interest rate for observations that have not matched so far
                          timer on 4
                          rangestat (mean) mrate4 = interest_rate (count) nrate4 = interest_rate, ///
                              interval(bal2use low high)
                          timer off 4
                          
                          timer list
                          
                          count if !mi(nrate1)
                          count if !mi(nrate2)
                          count if !mi(nrate3)
                          count if !mi(nrate4)
                          Here are the results I get:
                          Code:
                          . timer list
                             1:      9.47 /        1 =       9.4730
                             2:      8.19 /        1 =       8.1860
                             3:     47.26 /        1 =      47.2590
                             4:     14.67 /        1 =      14.6740
                          
                          . 
                          . count if !mi(nrate1)
                            25,654
                          
                          . count if !mi(nrate2)
                            14,204
                          
                          . count if !mi(nrate3)
                            960,080
                          
                          . count if !mi(nrate4)
                            62
                          Note that with rangestat, execution times are linear, so if you double the size of the dataset, the run will take twice as long. As used here, if both datasets have 2 million observations, the execution time should just double.



                          • #14
                            I guess I was getting tired last night and did not notice that I forgot to change the storage type of the open_date variable in the two datasets when I increased the range to 500; a byte can only hold integers up to 100, so the larger dates were stored as missing. This created a large number of missing dates (which explains why the third match took longer to run). Here's the code I intended to use:

                            Code:
                            clear
                            set seed 123312
                            set obs 1000000
                            gen long mortgage_id = _n
                            gen long balance = runiformint(1000,500000)
                            gen long zip_code = runiformint(1001,99950)
                            gen int open_date = runiformint(1,500)
                            gen interest_rate = runiform(2,20)
                            save "big_data2.dta", replace
                            
                            clear
                            set obs 1000000
                            gen long mortgage_id = _n
                            gen long balance = runiformint(1000,500000)
                            gen long zip_code = runiformint(1001,99950)
                            gen int open_date = runiformint(1,500)
                            save "big_data1.dta", replace
                            and here are the updated results using the same code:
                            Code:
                            . timer list
                               1:     14.03 /        1 =      14.0280
                               2:      8.03 /        1 =       8.0330
                               3:     12.11 /        1 =      12.1120
                               4:     14.27 /        1 =      14.2690
                            
                            . 
                            . count if !mi(nrate1)
                              99
                            
                            . count if !mi(nrate2)
                              39,759
                            
                            . count if !mi(nrate3)
                              959,828
                            
                            . count if !mi(nrate4)
                              314
