
  • Find digit(s) in string

    I realize that there have been hundreds of questions about extracting digits or numbers from strings. This question is somewhat different.

    I am looking at World Bank data and trying to get rid of the arbitrary regions they force into the data (e.g. High income countries). The approach I feel makes sense is to remove the observations for which
    Code:
    iso2code == "" | strmatch(iso2code,"X*") | strmatch(countryname,"*income*") | strmatch(iso2code,"%?") | strmatch(iso2code,"?%") | strmatch(iso2code,"%%")
    No real country has an ISO 2 code that is missing, starts with "X", or contains a digit (nor a name that contains the word "income"), while some of the World Bank's groupings do have iso2codes like that.

    Of course, in the above I have totally made up
    Code:
    strmatch(iso2code,"%?") | strmatch(iso2code,"?%") | strmatch(iso2code,"%%")
    to illustrate what I want to do; here % stands for a digit. So I want to tell Stata to drop the observation if the iso2code string contains a digit.

    A bit stuck on how to do that...
    Last edited by Pratap Pundir; 26 Nov 2018, 16:10.
    Thank you for your help!

    Stata SE/17.0, Windows 10 Enterprise

  • #2
    There is probably a more elegant way to solve this with regular expressions (for example regexm, regexr, or moss from SSC).

    Code:
    gen to_drop = 0
    forvalues i = 0/9  {
    replace to_drop = 1 if strpos(iso2code, "`i'") > 0
    }
    
    * This only sets to_drop to 1 if iso2code starts with "X". Lowercase "x" is skipped.
    replace iso2code = trim(iso2code)  // removes any extra spaces at beginning or end
    replace to_drop = 1 if strpos(iso2code, "X") == 1
    
    * Once you've confirmed this does what you want
    drop if to_drop==1
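
If lowercase codes should be caught as well, one possible variation on the strpos() check above is to normalize case with upper() first. This is only a sketch: the toy data below is hypothetical, and the variable name simply follows the question.

```stata
* Hypothetical toy data, for illustration only
clear
input str5 iso2code
"XD"
"xd"
" XD "
"US"
end

* trim() removes leading/trailing spaces, upper() folds "x" to "X",
* and strpos() == 1 is true only when "X" is the first character
gen byte to_drop = strpos(upper(trim(iso2code)), "X") == 1

list, noobs
```

Here the first three observations should be flagged and "US" should not.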

    You might also check out these other posts:



    • #3
      Nice! Thanks!


      • #4
        So I figured out how to do it with regex. You'll want to use regexm; the "m" stands for "match." The function returns 1 if the expression is found in the string and 0 if it is not, so you can use it to create an indicator variable.

        See also:
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str24 iso2code
        "high countries"          
        "low countries"          
        "middle coutrnies"        
        "This has a 10"          
        "This has 5"              
        "X45"                    
        "45X"                    
        "x45"                    
        " x45 (has leading space)"
        end

        Code:
        gen to_drop  = regexm( iso2code , "[0-9]")  // set to 1 if iso2code contains a digit
        gen to_drop2 = regexm( iso2code , "^X")  // set to 1 if has capital X at beginning
        gen to_drop3 = regexm( iso2code , "X")   // set to 1 if capital X anywhere in string
        gen to_drop4 = regexm( iso2code , "[xX]")  // set to 1 if lowercase or uppercase X anywhere in string
        gen to_drop5 = regexm( iso2code , "^[xX]") // set to 1 if first character is lower- or uppercase X.
        
        . list, noobs
        
          +--------------------------------------------------------------------------------+
          |                 iso2code   to_drop   to_drop2   to_drop3   to_drop4   to_drop5 |
          |--------------------------------------------------------------------------------|
          |           high countries         0          0          0          0          0 |
          |            low countries         0          0          0          0          0 |
          |         middle coutrnies         0          0          0          0          0 |
          |            This has a 10         1          0          0          0          0 |
          |               This has 5         1          0          0          0          0 |
          |--------------------------------------------------------------------------------|
          |                      X45         1          1          1          1          1 |
          |                      45X         1          0          1          1          0 |
          |                      x45         1          0          0          1          1 |
          |  x45 (has leading space)         1          0          0          1          0 |
          +--------------------------------------------------------------------------------+
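
Putting the pieces together for the original question, the made-up strmatch() digit patterns could be replaced with regexm(). The following is a minimal sketch rather than a tested solution for the real World Bank file: the toy observations are illustrative, and the variable names iso2code and countryname are taken from the question.

```stata
* Illustrative toy data; not the actual World Bank extract
clear
input str3 iso2code str30 countryname
"US" "United States"
"XD" "High income"
"1W" "World"
"Z4" "East Asia & Pacific"
""   "Not classified"
"IN" "India"
end

* Flag aggregates: missing code, leading "X", any digit in the code,
* or "income" in the name (/// continues the line in a do-file)
gen byte to_drop = iso2code == "" | regexm(iso2code, "^X") ///
    | regexm(iso2code, "[0-9]") | strmatch(countryname, "*income*")

list, noobs

* Once the flags look right
drop if to_drop
```

With this toy data only "US" and "IN" should survive the drop. Keeping the flag in a separate variable first, as in the posts above, makes it easy to inspect what will be removed before dropping anything.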
        Last edited by David Benson; 26 Nov 2018, 17:59.
