Matching using regular expressions -- Want to match all subsequent strings of numbers, commas, and decimals

Ben Thompson

Join Date: May 2014

Posts: 12
#1

Matching using regular expressions -- Want to match all subsequent strings of numbers, commas, and decimals

28 Mar 2016, 14:41

Hi all,

I have a very messy variable from a survey I'm trying to clean up that consists of written-in amounts of money. The problem is that nearly all the respondents decided to write in the amounts of money in their own respective formats, so I have a collection of entries that consist of various combinations of $, USD, and the word "dollars." To do any work with this variable, I need to extract all the combinations of digits, commas and decimals that lie in a row, thus extracting the numbers of the amounts. I've used regular expressions before, but not for anything this messy -- is there easy way to do this?

Thanks,
Ben
Tags: None

eric_a_booth

Join Date: Apr 2014
Posts: 292

28 Mar 2016, 15:08

You can use -destring- for this, but if you'd prefer regular expressions, I've got an example for that below as well:

Code:

clear
set obs 100

**make some fake data**
g amount = cond(rbinomial(1,.3)==1, "$", "") + ///
    string(round(runiform()*100, .01)) + " " + ///
    cond(int(runiform()*4)==0, "USD", cond(int(runiform()*4)==1, "Dollars", cond(int(runiform()*4)==2, "Currency", cond(int(runiform()*4)==3, "Cents",""))))    
ta amount


**create numeric version
    *1. destring
     destring amount, g(amount2) ignore(`" $,`c(alpha)'`c(ALPHA)'"')
    *2. REGEXM
    g result = regexs(1) + regexs(2) + regexs(3) ///
        if regexm(amount, `"([0-9]*)(\.?)([0-9]+)"')

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

Comment

wbuchanan

Join Date: Mar 2014

Posts: 1362
#3

29 Mar 2016, 03:20

Not tested but could potentially use something like:

Code:

replace var = ustrregexra(var, "[^\d,\.]", "", 1) destring var, replace

Using the caret inside the square brackets will cause it to interpret the character classes as negated (e.g., find everything that isn't a digit, comma, or period) and then replace them with a null string/empty string. It might require a bit of touch up, but with the newest version of Stata there are much better regular expression capabilities. If there are general positions for the characters you could also use subexpressions in case you wanted to standardize periods/commas (in case working with values from multiple countries where periods and commas are used slightly differently).
Comment

Announcement

Matching using regular expressions -- Want to match all subsequent strings of numbers, commas, and decimals

Comment

Comment