Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matching using regular expressions -- Want to match all subsequent strings of numbers, commas, and decimals

    Hi all,

    I have a very messy variable from a survey I'm trying to clean up that consists of written-in amounts of money. The problem is that nearly all the respondents decided to write in the amounts of money in their own respective formats, so I have a collection of entries that consist of various combinations of $, USD, and the word "dollars." To do any work with this variable, I need to extract all the combinations of digits, commas and decimals that lie in a row, thus extracting the numbers of the amounts. I've used regular expressions before, but not for anything this messy -- is there easy way to do this?

    Thanks,
    Ben

  • #2
    You can use -destring- for this, but if you'd prefer regular expressions, I've got an example for that below as well:
    Code:
    clear
    set obs 100
    
    **make some fake data**
    g amount = cond(rbinomial(1,.3)==1, "$", "") + ///
        string(round(runiform()*100, .01)) + " " + ///
        cond(int(runiform()*4)==0, "USD", cond(int(runiform()*4)==1, "Dollars", cond(int(runiform()*4)==2, "Currency", cond(int(runiform()*4)==3, "Cents",""))))    
    ta amount
    
    
    **create numeric version
        *1. destring
         destring amount, g(amount2) ignore(`" $,`c(alpha)'`c(ALPHA)'"')
        *2. REGEXM
        g result = regexs(1) + regexs(2) + regexs(3) ///
            if regexm(amount, `"([0-9]*)(\.?)([0-9]+)"')
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

    Comment


    • #3
      Not tested but could potentially use something like:

      Code:
       
      replace var = ustrregexra(var, "[^\d,\.]", "", 1)
      destring var, replace
      Using the caret inside the square brackets will cause it to interpret the character classes as negated (e.g., find everything that isn't a digit, comma, or period) and then replace them with a null string/empty string. It might require a bit of touch up, but with the newest version of Stata there are much better regular expression capabilities. If there are general positions for the characters you could also use subexpressions in case you wanted to standardize periods/commas (in case working with values from multiple countries where periods and commas are used slightly differently).

      Comment

      Working...
      X