Find digit(s) in string

Pratap Pundir

Join Date: Oct 2018

Posts: 143
#1

26 Nov 2018, 16:07

I realize that there have been hundreds of questions about extracting digits or numbers from strings. This question is somewhat different.

I am working with World Bank data and trying to get rid of the arbitrary groupings they include alongside real countries (e.g. "High income" countries). The approach that makes sense to me is to drop the observations for which

Code:

iso2code == "" | strmatch(iso2code,"X*") | strmatch(countryname,"*income*") | strmatch(iso2code,"%?") | strmatch(iso2code,"?%") | strmatch(iso2code,"%%")

No real country has an ISO 2 code that is missing, starts with "X", or contains a digit (nor a name that contains the word "income"), whereas some of the World Bank's groupings do have such iso2codes.

Of course, in the above I have totally made up

Code:

strmatch(iso2code,"%?") | strmatch(iso2code,"?%") | strmatch(iso2code,"%%")

to illustrate what I want to do. Here % stands for any digit, so I want to tell Stata to drop the observation if the iso2code string contains a digit.

A bit stuck on how to do that...

Last edited by Pratap Pundir; 26 Nov 2018, 16:10.

Thank you for your help!

Stata SE/17.0, Windows 10 Enterprise

David Benson

Join Date: Oct 2018
Posts: 489

26 Nov 2018, 17:04

There is probably a more elegant way to solve this with regular expressions (for example regexm(), regexr(), or moss from SSC), but a simple loop over the digits 0 to 9 also works:

Code:

gen to_drop = 0
* Flag any iso2code that contains a digit 0-9
forvalues i = 0/9 {
    replace to_drop = 1 if strpos(iso2code, "`i'") > 0
}

* Remove any extra spaces at the beginning or end
replace iso2code = trim(iso2code)
* Flag codes that start with capital "X" (lowercase "x" is not flagged)
replace to_drop = 1 if strpos(iso2code, "X") == 1

* Once you've confirmed this does what you want
drop if to_drop == 1
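
One possible way to do the "confirm" step before dropping anything is to tabulate the flag and list the flagged codes. This is only a sketch: the toy iso2code values below are made up for illustration and are not taken from the actual World Bank file.

```stata
* Toy data; these codes are illustrative assumptions, not real World Bank rows
clear
input str3 iso2code
"US"
"1W"
"XD"
"Z4"
"IN"
end

* Same flagging logic as above
gen to_drop = 0
forvalues i = 0/9 {
    replace to_drop = 1 if strpos(iso2code, "`i'") > 0
}
replace to_drop = 1 if strpos(iso2code, "X") == 1

* Inspect what would be dropped before running -drop-
tabulate to_drop
list iso2code if to_drop == 1, noobs
```

If the list contains only groupings and no real countries, the drop should be safe to run.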



Pratap Pundir

Join Date: Oct 2018

Posts: 143
#3

26 Nov 2018, 17:33

Nice! Thanks!


David Benson

Join Date: Oct 2018
Posts: 489

26 Nov 2018, 17:57

So I figured out how to do it with regex. You'll want to use regexm(); the "m" stands for "match." It is a function (not a command) that returns 1 if the string matches the regular expression and 0 if it does not, so you can use it to create an indicator variable.

See also:

Regular Expressions in Stata
REGULAR EXPRESSIONS IN STATA at Stata Hacks

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str24 iso2code
"high countries"          
"low countries"          
"middle coutrnies"        
"This has a 10"          
"This has 5"              
"X45"                    
"45X"                    
"x45"                    
" x45 (has leading space)"
end

Code:

gen to_drop  = regexm( iso2code , "[0-9]")  // set to 1 if iso2code contains a digit
gen to_drop2 = regexm( iso2code , "^X")  // set to 1 if has capital X at beginning
gen to_drop3 = regexm( iso2code , "X")   // set to 1 if capital X anywhere in string
gen to_drop4 = regexm( iso2code , "[xX]")  // set to 1 if lowercase or uppercase X anywhere in string
gen to_drop5 = regexm( iso2code , "^[xX]") // set to 1 if first character is lower- or uppercase X.

. list, noobs

  +--------------------------------------------------------------------------------+
  |                 iso2code   to_drop   to_drop2   to_drop3   to_drop4   to_drop5 |
  |--------------------------------------------------------------------------------|
  |           high countries         0          0          0          0          0 |
  |            low countries         0          0          0          0          0 |
  |         middle coutrnies         0          0          0          0          0 |
  |            This has a 10         1          0          0          0          0 |
  |               This has 5         1          0          0          0          0 |
  |--------------------------------------------------------------------------------|
  |                      X45         1          1          1          1          1 |
  |                      45X         1          0          1          1          0 |
  |                      x45         1          0          0          1          1 |
  |  x45 (has leading space)         1          0          0          1          0 |
  +--------------------------------------------------------------------------------+
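
Putting the pieces together for the original question, a minimal sketch might look like the following. The toy rows, the variable name is_aggregate, and the use of lower() to make the "income" check case-insensitive are my own assumptions for illustration; check the flagged rows against your real data before dropping.

```stata
* Toy data; rows are made up for illustration, not the actual World Bank file
clear
input str2 iso2code str30 countryname
"US" "United States"
"XD" "High income"
"Z4" "East Asia & Pacific"
""   "Not classified"
"1W" "World"
"IN" "India"
end

* Flag groupings: blank code, code starting with X, code containing a digit,
* or a name containing "income" (is_aggregate is a hypothetical name)
gen byte is_aggregate = missing(iso2code)          ///
    | regexm(iso2code, "^X")                       ///
    | regexm(iso2code, "[0-9]")                    ///
    | (strpos(lower(countryname), "income") > 0)

list, noobs

* Drop only after checking the flagged rows
drop if is_aggregate
```

Note that a grouping whose code is two letters not starting with "X" and whose name lacks "income" would not be caught by these rules.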

Last edited by David Benson; 26 Nov 2018, 17:59.
