Hello everyone,
I have two separate data files that I plan to merge using the id column.
However, a challenge arises due to the different formats in which the ids are reported in each data file. To illustrate:
Data1: HFW542643LD43 Data2: AM-54264301A1
In this case, despite the varied characters, these two ids should be matched because they share the common number 542643.
I cannot simply standardize the approach by removing all characters and extracting the first 6 numbers. The issue is that the 6 numbers identifying observations across both datasets may differ in their position. For instance:
Data1: GD01542643MK Data2: 000000542643
Additionally, the unique number crucial for the match may vary in length in some cases, but it is always at least 5 characters long.
Hence, I am exploring whether it's possible to match strings based on a subset criterion. I am seeking an algorithm that matches observations in Data1 to Data2 if a subset of the id (at least 5 characters long) in Data2 is contained in the id in Data1.
Do you have any suggestions on how to approach this problem? Your insights and tips would be highly appreciated.
Thanks
I have two separate data files that I plan to merge using the id column.
However, a challenge arises due to the different formats in which the ids are reported in each data file. To illustrate:
Data1: HFW542643LD43 Data2: AM-54264301A1
In this case, despite the varied characters, these two ids should be matched because they share the common number 542643.
I cannot simply standardize the approach by removing all characters and extracting the first 6 numbers. The issue is that the 6 numbers identifying observations across both datasets may differ in their position. For instance:
Data1: GD01542643MK Data2: 000000542643
Additionally, the unique number crucial for the match may vary in length in some cases, but it is always at least 5 characters long.
Hence, I am exploring whether it's possible to match strings based on a subset criterion. I am seeking an algorithm that matches observations in Data1 to Data2 if a subset of the id (at least 5 characters long) in Data2 is contained in the id in Data1.
Do you have any suggestions on how to approach this problem? Your insights and tips would be highly appreciated.
Thanks
Comment