Testing to see if the value of categorical variable x for string m contains the value of variable y for string m.

Daniel Lane

Join Date: Sep 2020

Posts: 1
#1

08 Sep 2020, 18:49

Hello Statalist,

Please forgive me if this is a stupid question; I am new to Stata. I am working with a dataset that has one variable listing the names of businesses and another listing the names of the towns in which the businesses are established. I want to test the business names for whether or not they contain the name of their respective towns. For example, if the name of the business is "Corinth Lumber Co." and the business is in Corinth, I want that to come back positive. Furthermore, I would like to create a binary variable on the same dataset to house the results, with 1 for positive and 0 for negative. Any advice for how I would go about doing this? I really don't know where to start.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17707

09 Sep 2020, 01:59

Daniel:
welcome to this forum.
Do you mean something along the following lines?:

Code:

. set obs 1
number of observations (_N) was 0, now 1

. g firm="Corinth Lumber Co"

. g town="Corinth"

. split firm , p()
variables created as string:
firm1  firm2  firm3

. g counter=1 if firm1==town

. drop firm2 firm3

. list

     +-------------------------------------------------+
     |              firm      town     firm1   counter |
     |-------------------------------------------------|
  1. | Corinth Lumber Co   Corinth   Corinth         1 |
     +-------------------------------------------------+

.
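Note that counter in the listing above is 1 or missing, not 1 or 0, and it only checks the first word of the firm name. A minimal sketch of a 0/1 variant, assuming the same firm and town variables (the name first_is_town is made up here for illustration):

```stata
clear
set obs 1
g firm = "Corinth Lumber Co"
g town = "Corinth"

* split on blanks (the default), creating firm1, firm2, firm3
split firm

* a logical expression evaluates to 1 when true and 0 when false
g byte first_is_town = (firm1 == town)
list firm town first_is_town
```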

Last edited by Carlo Lazzaro; 09 Sep 2020, 02:05.

Kind regards,
Carlo
(Stata 19.0)

Nick Cox

Join Date: Mar 2014
Posts: 35677

09 Sep 2020, 02:31

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 firm str8 place
"Corinth Coffee Cafe" "Corinth" 
"Lloyds of London"    "London"  
"Not informative"     "Wherever"
end

gen wanted = strpos(firm, place) > 0

list

     +-----------------------------------------+
     |                firm      place   wanted |
     |-----------------------------------------|
  1. | Corinth Coffee Cafe    Corinth        1 |
  2. |    Lloyds of London     London        1 |
  3. |     Not informative   Wherever        0 |
     +-----------------------------------------+

.

Warnings: The match has to be exact. "LONDON" is not matched by "London". Use e.g. lower() to standardise if need be.
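A minimal sketch of the lower() standardisation, assuming the firm and place variables from the example above; the data lines and the name wanted_ci are invented here for illustration. lower() only changes ASCII letters, so ustrlower() may be the better choice if names contain accented characters.

```stata
clear
input str19 firm str8 place
"Corinth Coffee Cafe" "CORINTH"
"Lloyds of London"    "london"
"Not informative"     "Wherever"
end

* compare lower-cased copies so that case differences do not block a match
gen wanted_ci = strpos(lower(firm), lower(place)) > 0
list
```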

Watch out for "London" being a match for "Londonderry". @Carlo Lazzaro's approach based on words is a good idea for that, but will fall over for "Los Angeles" and other multi-word placenames.

Watch out for this too:

Code:

. di strpos("frog", "")
1

strpos() will always find an empty string in a larger string.
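One way to guard against that, sketched under the assumption that an empty place should count as "not found" (that choice is mine, not something stated in the question):

```stata
clear
input str19 firm str8 place
"Corinth Coffee Cafe" "Corinth"
"Gnxl"                ""
end

* require a non-empty place before trusting the strpos() result
gen wanted = (place != "") & (strpos(firm, place) > 0)
list
```

If missing would be more appropriate than 0 for those observations, `gen wanted = strpos(firm, place) > 0 if place != ""` is an alternative.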

William Lisowski

Join Date: Dec 2014
Posts: 10150

09 Sep 2020, 07:49

Stata's Unicode regular expression matching provides another approach. It can ignore case differences and reject matches that are only part of a longer word, but, as with strpos(), it will always match an empty string within a larger string. The key to success is that "\b" matches a "word boundary", which means what you intuitively believe it to mean.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 firm str11 place
"Corinth Coffee Cafe"           "Corinth" 
"Lloyds of London"              "London"  
"Londonderry Fashion"           "London"  
"Anaheim, Azusa, and Cucamonga" "AZUSA"   
"The Los Angeles Times"         "Los Angeles"
"Not informative"               "Wherever"
"Gnxl"                          ""
end
generate found = ustrregexm(firm,"\b"+place+"\b",1)
list, clean noobs

Code:

. list, clean noobs

                             firm         place   found  
              Corinth Coffee Cafe       Corinth       1  
                 Lloyds of London        London       1  
              Londonderry Fashion        London       0  
    Anaheim, Azusa, and Cucamonga         AZUSA       1  
            The Los Angeles Times   Los Angeles       1  
                  Not informative      Wherever       0  
                             Gnxl                     1
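Two possible refinements, offered as a sketch rather than a tested recipe. First, the empty place in the last row can be excluded explicitly. Second, a place name containing regular expression metacharacters (for example the "." in "St. Louis") is treated as a pattern; ICU's \Q...\E quoting should make it literal, assuming Stata passes that syntax through to the engine unchanged. The data lines below are invented for illustration.

```stata
clear
input str29 firm str11 place
"St. Louis Brewing" "St. Louis"
"Lloyds of London"  "London"
"Gnxl"              ""
end

* \Q...\E asks the regex engine to treat the town name literally
* place != "" stops an empty town from counting as found
generate found = (place != "") & ustrregexm(firm, "\b\Q" + place + "\E\b", 1)
list, clean noobs
```

The "\b" anchors still assume the place name starts and ends with a letter or digit; a name ending in punctuation would need different handling.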

To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.
