
  • Merging strings with subset criterion

    Hello everyone,

    I have two separate data files that I plan to merge using the id column.

    However, a challenge arises due to the different formats in which the ids are reported in each data file. To illustrate:

    Data1: HFW542643LD43
    Data2: AM-54264301A1

    In this case, despite the varied characters, these two ids should be matched because they share the common number 542643.

    I cannot simply standardize the ids by removing all non-numeric characters and extracting the first 6 digits: the digits that identify observations may appear at different positions in the two datasets. For instance:

    Data1: GD01542643MK
    Data2: 000000542643

    Additionally, the unique number crucial for the match may vary in length in some cases, but it is always at least 5 digits long.

    Hence, I am exploring whether it's possible to match strings based on a subset criterion: an algorithm that matches an observation in Data1 to one in Data2 if a subset of the id in Data2 (at least 5 digits long) is contained in the id in Data1.
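    In pseudocode terms, the rule I have in mind looks something like this Python sketch (is_match is just an illustrative name):

```python
import re

def is_match(id1: str, id2: str, min_len: int = 5) -> bool:
    """True if either id contains a digit run of length >= min_len
    that appears as a substring of the other id."""
    for a, b in ((id1, id2), (id2, id1)):
        for run in re.findall(r"\d+", a):
            run = run.lstrip("0") or "0"  # so 000000542643 still matches
            if len(run) >= min_len and run in b:
                return True
    return False

print(is_match("HFW542643LD43", "AM-54264301A1"))  # True: shared 542643
```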

    Do you have any suggestions on how to approach this problem? Your insights and tips would be highly appreciated.

    Thanks

  • #2
    As long as your conditions do not imply truncation of digits for both strings to be matched, e.g., saying that "1235678" should be matched with "1235679" because the two share the same first 6 digits, then you can do the following (truncation for one is fine):

    Code:
    clear
    input str50 id2
    "AM-54264301A1"
    "000000542643"
    "HLM12356T"
    "MZ1000000000R"
    end
    tempfile ids2
    save `ids2'
    
    
    clear
    input str50 id1
    "HFW542643LD43"
    "HF-1235660L4"
    "GD01542643MK"
    "AB100201H2"
    end
    
    cross using `ids2'
    gen nid1= string(real(ustrregexra(id1, ".*[^\d](\d+)[^\d].*", "$1")), "%17.0f")
    gen nid2= string(real(ustrregexra(id2, ".*[^\d](\d+)[^\d].*", "$1")), "%17.0f")
    gen match= (ustrregexm(nid1, nid2)|ustrregexm(nid2, nid1)) & length(nid1)>=5 & length(nid2)>=5
    keep if match
    Res.:

    Code:
    . l
    
         +------------------------------------------------------------+
         |           id1             id2      nid1       nid2   match |
         |------------------------------------------------------------|
      1. | HFW542643LD43   AM-54264301A1    542643   54264301       1 |
      2. | HFW542643LD43    000000542643    542643     542643       1 |
      3. |  HF-1235660L4       HLM12356T   1235660      12356       1 |
      4. |  GD01542643MK    000000542643   1542643     542643       1 |
         +------------------------------------------------------------+
    
    .
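    The truncation caveat above can be seen directly with plain substring checks (a minimal Python illustration, separate from the Stata solution):

```python
# One-sided truncation is fine: the shorter extracted number is a
# substring of the longer one, so the containment test above fires.
assert "12356" in "1235660"

# Two-sided truncation is not handled: these share the first six
# digits, but neither contains the other, so they will not match.
assert "1235678" not in "1235679"
assert "1235679" not in "1235678"
```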
    Last edited by Andrew Musau; 19 Nov 2023, 04:07.

    Comment


    • #3
      Originally posted by Andrew Musau View Post
      As long as your conditions do not imply truncation of digits for both strings to be matched, then you can do the following (truncation for one is fine): [...]
      Hello Andrew,

      Thanks a lot for your input. This should work well for me - the idea is exactly what I was looking for.
      However, my datasets are quite large: using -cross- goes well over the observation limit in Stata. I tried looping over subsets of each dataset, but that takes too much time and sometimes leads to an I/O error.
      Do you have any suggestions on how to do this without -cross-?

      Comment


      • #4
        You can address the memory issues, but the entire process will still be time-consuming. You need to compare each observation of the second variable with all observations of the first variable, which is what the cross command does. To offer code suggestions, please provide details on the following:

        1. What version and flavor of Stata do you have?
        2. How many observations do you have?

        Comment


        • #5
          Originally posted by Andrew Musau View Post
          You can address the memory issues, but the entire process will still be time-consuming. You need to compare each observation of the second variable with all observations of the first variable, which is what the cross command does. To offer code suggestions, please provide details on the following:

          1. What version and flavor of Stata do you have?
          2. How many observations do you have?
          1. I am using Stata 17 with a single core.
          2. I have almost 3 million observations in each dataset.

          Comment


          • #6
            Stata flavors are MP, SE and BE; see https://www.stata.com/products/which...-right-for-me/.

            With \(3\text{ million}\) observations, you will be doing \(3\text{ million}\;\times\;3\text{ million}\;=\;9\text{ trillion}\) comparisons. You could put the first variable in a new frame, then pick each observation of the second variable, store it in a local, and generate a second variable in the new frame holding the content of this local. Run #2 and save/append the matches in another frame. If a single iteration of the loop takes 1 second, you need about 3 million seconds, or approximately 35 days, which means you will be done sometime in early February.

            I am sure that with Stata MP, cross with a lower number of observations will be more efficient. You have to determine the optimal number of observations, however. If this is, say, 10,000, you have to create 300 datasets from splitting the first variable and then successively loop over these datasets using 10,000 observations of the second variable, doing this 300 times. See this thread for an example of implementing a single cross using several crosses: https://www.statalist.org/forums/for...-same-variable.


            ADDED IN EDIT: From the link above, it appears that StataCorp has moved on from using the term "flavor" to "edition" when describing the various packages of Stata available for purchase.
            Last edited by Andrew Musau; 30 Dec 2023, 01:42.
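            The splitting strategy in #6 can be sketched in Python (illustrative names only; a block size of 1 here stands in for the 10,000-observation chunks, and contains_key stands in for the matching rule in #2):

```python
import re

def contains_key(id1, id2, min_len=5):
    # Stand-in for the #2 rule: a digit run of at least min_len in
    # one id appears as a substring of the other (leading zeros dropped).
    for a, b in ((id1, id2), (id2, id1)):
        for run in re.findall(r"\d+", a):
            run = run.lstrip("0") or "0"
            if len(run) >= min_len and run in b:
                return True
    return False

def blocks(seq, size):
    # Yield consecutive chunks, mimicking the split datasets in #6.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def blockwise_matches(ids1, ids2, size):
    # Compare block-by-block so only size*size pairs are in play at a
    # time, instead of one len(ids1)*len(ids2) cross.
    out = []
    for b1 in blocks(ids1, size):
        for b2 in blocks(ids2, size):
            out += [(x, y) for x in b1 for y in b2 if contains_key(x, y)]
    return out

pairs = blockwise_matches(
    ["HFW542643LD43", "GD01542643MK"],
    ["AM-54264301A1", "000000542643"],
    size=1,
)
```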

            Comment


            • #7
              I decided to try implementing this algorithm in Python, where I'm a bit more comfortable. Essentially, I compare each id in dataset 1 to every id in dataset 2, then check the size of the largest common substring. If it is at least 5 characters long, I assume it is a matching id. If that doesn't work for you, one could also do the matching operation with regular expressions like Andrew recommends in #2.

              I use a couple of optimizations. If I find a match in dataset 2 for a given id in dataset 1, I don't continue looking through dataset 2 for a match. I break out of the inner loop and continue with the next id in dataset #1. Likewise, once an id in dataset 2 is matched with an id in dataset 1, I throw the dataset 2 id out and don't consider it as a possible match for subsequent iterations of the loop.

              I haven't let this algorithm run to completion on the full 30 million (n by m) data, but I have estimated an average rate per 10,000 first ids (not unique comparisons) of about a minute and 20 seconds on my M1 MacBook Air. I use that to calculate a conservative upper bound on the runtime of around 67 hours. I think the runtime should be much lower than that because I expect the algorithm to speed up as it progresses, but 67 hours seems like a reasonable worst-case scenario in Python, which we should expect to be roughly comparable to Mata in terms of speed. Of course, other optimization strategies may be available, and if you wrote this in, say, C or C++, you could get something orders of magnitude faster. That said, hopefully this helps you get started as an alternative to Andrew's recommendations in #6.

              Code:
              import numpy as np
              from numpy import random
              import string
              from difflib import SequenceMatcher
              from time import time
              
              # Randomly generate two id columns for testing purposes here.
              observations = 30000000
              print("start randomly generating data...")
              print("randomly generate ids...")
              # randint's upper bound is exclusive, so use 1000000 to include 999999
              ids = list(random.randint(100000, 1000000, size=observations))
              id_str = np.array(list(map(str, ids)))
              
              print("randomly generate prefix 1...")
              prefix_1_arr = random.choice(list(string.ascii_letters + string.digits), size=(observations, 5))
              prefix_1 = np.array(["".join(row) for row in prefix_1_arr])
              
              print("randomly generate prefix 2...")
              prefix_2_arr = random.choice(list(string.ascii_letters + string.digits), size=(observations, 5))
              prefix_2 = np.array(["".join(row) for row in prefix_2_arr])
              
              print("randomly generate postfix 1...")
              postfix_1_arr = random.choice(list(string.ascii_letters + string.digits), size=(observations, 5))
              postfix_1 = np.array(["".join(row) for row in postfix_1_arr])
              
              print("randomly generate postfix 2...")
              postfix_2_arr = random.choice(list(string.ascii_letters + string.digits), size=(observations, 5))
              postfix_2 = np.array(["".join(row) for row in postfix_2_arr])
              
              print("combine id 1...")
              id_1 = np.char.add(prefix_1, id_str)
              id_1 = np.char.add(id_1, postfix_1)
              
              print("combine id 2...")
              id_2 = np.char.add(prefix_2, id_str)
              id_2 = np.char.add(id_2, postfix_2)
              print("Done generating columns.")
              
              # Matching algorithm starts here.
              print("Start matching algorithm!")
              start_time = time()
              result = list()
              id_2_list = list(id_2)
              for first_index in range(len(id_1)):
                  if first_index % 10000 == 0:
                      print(f"Continuing with index {first_index}")
                  first_id = id_1[first_index]
                  for second_index in range(len(id_2_list)):
                      second_id = id_2_list[second_index]
                      # find_longest_match() with no arguments requires Python >= 3.9
                      match = SequenceMatcher(None, first_id, second_id).find_longest_match()
                      if match.size >= 5:  # common substring of at least 5 characters
                          result.append({"id_1": first_id, "id_2": second_id})
                          del id_2_list[second_index]
                          break
              end_time = time()
              print("Matching algorithm done!")
              print("Algorithm runtime was %f" % (end_time - start_time))
              Last edited by Daniel Schaefer; 30 Dec 2023, 16:37.

              Comment


              • #8
                Additionally, the unique number
                Cem Ozdemir Are the real id values unique in each (or one) of the files, contrary to your example data? If so, identifying the id (by regex) and then -merge- may be a first step.
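                If the numeric key is unique and reported in full in both files, extract-then-merge reduces to an exact join on the extracted number, which is linear rather than quadratic. A stdlib Python sketch (hypothetical data; the digit-run extraction mirrors #2):

```python
import re

data1 = ["HFW542643LD43", "GD01542643MK"]
data2 = ["AM-54264301A1", "000000542643"]

def key(s):
    # First digit run of length >= 5, leading zeros dropped
    # (as real() does in the code in #2).
    m = re.search(r"\d{5,}", s)
    return m.group().lstrip("0") if m else None

index2 = {key(s): s for s in data2}            # hash join on the key
merged = [(s, index2[key(s)]) for s in data1 if key(s) in index2]
# Note: this only pairs ids whose extracted numbers are identical, so a
# truncated key like 54264301 vs 542643 is missed; that is the case
# where the substring approach in #2 is still needed.
```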

                Comment
