Extracting numbers of variable length from a messy string variable

Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#1

27 Jan 2015, 04:24

Colleagues,

I have a string variable resembling the data generated by the code below. I would like to achieve two things:
1. Create a separate variable containing the figures in brackets
2. Create a separate variable containing the figures at the end of the string

The code reflects the nature of the variable rather well, but to summarise:
- Each observation contains two sets of figures, each 1 to 3 digits long
- The first set of figures is always enclosed in brackets
- The second set of figures is always at the end
- Each observation contains at least one word at the beginning of the string
- On some occasions additional words appear between the figures
- There is no word at the end of the observation
- There may be blank spaces at the beginning or end of the observation

Naturally, I will be grateful for any suggestions on how to tame this variable.

Code:

/* === SampleString Data === */
clear
input str27 problemvar
"abcXYZ (90) 135 "
"def (130) comment 20 "
"mbnuiegh (1) koj 130 "
"wshli (786) kojepj (11) "
"oujiopwe kojkl we (09) 787 "
" ecfh (11) comment 90 "
end

Last edited by Konrad Zdeb; 27 Jan 2015, 04:46. Reason: Typo.

Kind regards,
Konrad
Version: Stata/IC 13.1
Tags: data, import, string, syntax

Nick Cox

Join Date: Mar 2014
Posts: 35724

27 Jan 2015, 04:42

I know many people will think "regular expressions" here, which is one reason to suggest something more prosaic as an alternative.

Code:

 
clear

input str27 problemvar     
"abcXYZ (90) 135            "
"def (130) comment 20       "
"mbnuiegh (1) koj 130       "
"wshli (786) kojepj (11)    "
"oujiopwe kojkl we (09) 787 "
" ecfh (11) comment 90       "
end

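local v problemvar 
* p1 and p2 hold expressions, not numbers: they are substituted as text below
local p2 strpos(`v', ")") - 1 
local p1 strpos(`v', "(") + 1

* last word of the string, with any brackets removed
gen swanted2 = word(`v', -1)
replace swanted2 = subinstr(swanted2, "(", "", .) 
replace swanted2 = subinstr(swanted2, ")", "", .) 
gen wanted2 = real(swanted2) 

* after substitution the length reads strpos(")") - 1 - strpos("(") + 1 - 1,
* i.e. the number of characters strictly between the brackets
gen wanted1 = real(substr(`v', `p1', `p2' - `p1' - 1)) 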

list wanted1 wanted2, sep(0)  

     +-------------------+
     | wanted1   wanted2 |
     |-------------------|
  1. |      90       135 |
  2. |     130        20 |
  3. |       1       130 |
  4. |     786        11 |
  5. |       9       787 |
  6. |      11        90 |
     +-------------------+
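
For comparison, the regular-expression route alluded to above can be sketched with Stata's built-in regexm() and regexs(), which need no extra installation. This is only a minimal sketch under the assumptions stated in the original question (one bracketed number first, one number at the very end, possibly followed by a closing bracket and blanks); the variable names rx1 and rx2 are hypothetical and not part of the original post.

```stata
clear
input str27 problemvar
"abcXYZ (90) 135            "
"def (130) comment 20       "
"mbnuiegh (1) koj 130       "
"wshli (786) kojepj (11)    "
"oujiopwe kojkl we (09) 787 "
" ecfh (11) comment 90       "
end

* first number: digits enclosed in brackets (first bracketed group found)
gen rx1 = real(regexs(1)) if regexm(problemvar, "[(]([0-9]+)[)]")

* second number: digits at the end of the string,
* allowing an optional closing bracket and trailing blanks
gen rx2 = real(regexs(1)) if regexm(problemvar, "([0-9]+)[)]?[ ]*$")

list rx1 rx2, sep(0)
```

The brackets are written as character classes, [(] and [)], so that they are read as literal characters rather than as grouping; only the inner ([0-9]+) group is captured and returned by regexs(1).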


Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#3

27 Jan 2015, 04:45

Thing of beauty!

Kind regards,
Konrad
Version: Stata/IC 13.1
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

27 Jan 2015, 04:50

Here's another way to do it. (Regular expression fans are right too.) moss is from SSC.

Code:

moss problemvar, regex match("([0-9]+)")

Konrad Zdeb

Join Date: Apr 2014
Posts: 496

27 Jan 2015, 04:57

Definitely more straightforward, but the results are slightly messier:

Code:

. /* === Example String Data === */
. clear

. input str27 problemvar     

                      problemvar
  1. "abcXYZ (90) 135            "
  2. "def (130) comment 20       "
  3. "mbnuiegh (1) koj 130       "
  4. "wshli (786) kojepj (11)    "
  5. "oujiopwe kojkl we (09) 787 "
  6. "ecfh (11) comment 90       "
  7. end

. preserve

.
. // First solution
. local v problemvar

. local p2 strpos(`v', ")") - 1

. local p1 strpos(`v', "(") + 1

.
. gen swanted2 = word(`v', -1)

. replace swanted2 = subinstr(swanted2, "(", "", .)
(1 real change made)

. replace swanted2 = subinstr(swanted2, ")", "", .)
(1 real change made)

. gen wanted2 = real(swanted2)

.
. gen wanted1 = real(substr(`v', `p1', `p2' - `p1' - 1))

.
. list wanted1 wanted2, sep(0)

     +-------------------+
     | wanted1   wanted2 |
     |-------------------|
  1. |      90       135 |
  2. |     130        20 |
  3. |       1       130 |
  4. |     786        11 |
  5. |       9       787 |
  6. |      11        90 |
     +-------------------+

.
. // Second solution
. restore

. moss problemvar, regex match("([0-9]+)")

. list

     +--------------------------------------------------------------------------+
     |                  problemvar   _count   _match1   _pos1   _match2   _pos2 |
     |--------------------------------------------------------------------------|
  1. | abcXYZ (90) 135                    2        90       9       135      13 |
  2. | def (130) comment 20               2       130       6        20      19 |
  3. | mbnuiegh (1) koj 130               2         1      11       130      18 |
  4. | wshli (786) kojepj (11)            2       786       8        11      21 |
  5. | oujiopwe kojkl we (09) 787         2        09      20       787      24 |
     |--------------------------------------------------------------------------|
  6. | ecfh (11) comment 90               2        11       7        90      19 |
     +--------------------------------------------------------------------------+

.
.
end of do-file

Kind regards,
Konrad
Version: Stata/IC 13.1


Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

27 Jan 2015, 05:06

In your problem it seems there are always two wanted numbers. In many other problems things are (much) more complicated. The extra variables _count, _pos1 and so on are there for such problems, as when (simple example) one might want to home in on observations for which _count is not 2. That's why moss produces extra variables, which you can always ignore or drop.
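
As a rough illustration of that workflow, the sketch below runs moss, checks for observations that do not contain exactly two numbers, and then keeps only numeric copies of the two matches. It assumes moss has been installed from SSC; the names wanted1 and wanted2 simply mirror the earlier solution, and the cleanup steps are one possible choice rather than anything moss requires.

```stata
* moss is user-written; install it once from SSC if needed
* ssc install moss

clear
input str27 problemvar
"abcXYZ (90) 135            "
"def (130) comment 20       "
"mbnuiegh (1) koj 130       "
"wshli (786) kojepj (11)    "
"oujiopwe kojkl we (09) 787 "
" ecfh (11) comment 90       "
end

moss problemvar, regex match("([0-9]+)")

* home in on observations that do not contain exactly two numbers
list problemvar _count if _count != 2

* the matches are strings: make numeric copies, then drop the bookkeeping variables
destring _match1 _match2, generate(wanted1 wanted2)
drop _count _pos* _match*

list wanted1 wanted2, sep(0)
```

With the example data the first list shows nothing, because every observation has exactly two matches.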