Find if a string is a substring of another

Arianna Vivoli

Join Date: Nov 2021

Posts: 1
#1

10 Nov 2021, 08:51

Hi everybody,

I have a string variable called 'Name' that contains firm names; each firm can appear once or several times in the dataset.
The names were entered manually, so the same firm may be recorded in similar but slightly different ways. What I am trying to do is harmonize all the names that correspond to a single firm.
It would be extremely helpful to have code that recognizes when one string is a substring of another and, in that case, replaces the longer name with the shorter one nested inside it.

For example, if I have

Name
Zara
Zara
Zara
Zara Espana
Zara Home

I would like to write code that replaces the entries 'Zara Espana' and 'Zara Home' with the shorter version 'Zara'.
Is there a way to do that?

Thanks in advance!
Arianna
Fei Wang

Join Date: Oct 2021

Posts: 726
#2

10 Nov 2021, 09:26

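Arianna, if the name variable has a consistent structure of "short name" + a blank space + "redundant strings", then the code below will extract the short name from the original string. By default, split breaks each value at blanks and stores the pieces in new variables Name1, Name2, and so on, so a one-word short name ends up in Name1.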

Code:

split Name
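
As a minimal sketch of what this does on the example data from the question (the variable ShortName is a hypothetical name chosen for illustration, and this assumes the short name is always the first word):

```stata
* recreate the example data from post #1
clear
input str20 Name
"Zara"
"Zara"
"Zara"
"Zara Espana"
"Zara Home"
end

* split at blanks (the default); this creates Name1, Name2, ...
split Name

* keep the first word as the harmonized name
generate ShortName = Name1
list Name ShortName
```

Here every observation should end up with ShortName equal to "Zara", because "Zara" is the first word of each entry.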

If the names are arbitrarily structured, then the matching rule becomes ambiguous. For example, if the last observation is "Home Zara" and there is another brand called "Home", then Stata cannot tell whether "Home Zara" belongs to "Zara" or to "Home".
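
If the short name can sit anywhere in the string, one possible approach is to check each distinct name against every observation with strpos() and keep the shortest match. This is only a sketch under assumptions: Harmonized is a hypothetical new variable, the names contain no double quotes, and the number of distinct names is small enough for levelsof to hold in a local macro.

```stata
* start from the original name
generate Harmonized = Name

* collect the distinct values of Name in a local macro
levelsof Name, local(names)

* for each distinct name, replace any longer entry that contains it;
* the strict inequality keeps the shortest match found so far
foreach n of local names {
    replace Harmonized = `"`n'"' if strpos(Name, `"`n'"') > 0 & strlen(`"`n'"') < strlen(Harmonized)
}
```

This does not resolve the ambiguity described above: with "Home Zara", "Home" and "Zara" in the data, the two candidates have the same length, so whichever levelsof lists first is kept. It also matches anywhere in the string, so "Zara" would be picked up inside an unrelated name that happens to contain those letters. The results should be checked by hand before overwriting Name.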

Last edited by Fei Wang; 10 Nov 2021, 09:30.