Searching for numbers in a string of numbers

Paul Hofman

Join Date: Jan 2015

Posts: 5
#1

Searching for numbers in a string of numbers

11 Dec 2015, 10:30

I have a string variable (numberlist) that contains a list of numbers with spaces in between them (e.g. "41 54 1 46 54 23"). I want to see if a different variable in my dataset (id) is in this list of numbers. The obvious approach is using strpos:

gen match=1 if strpos(numberlist, string(id))

The problem with this is is that it matches 1 to 41. Which makes sense, 1 is indeed in 41, but it is not the behaviour I am looking for.

I suppose I could simply -split- the numberlist variable, and then loop over all the resulting variables. This is a bit cumbersome though, and I'm afraid it might balloon the size of the dataset (at 35 MB it is already rather large). A more elegant solution would be preferable.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35734
#2

11 Dec 2015, 10:47

Something like this would avoid using split directly. From your example id is numeric.

Code:

gen match = 0 quietly forval j = 1/20 { replace match = 1 if real(word(numberlist, `j')) == id }

Your value of 20, so to speak, could be based on a guess. Setting it a little too high would be about best. Or you could get Stata to count, looking for the maximum:

Code:

gen wc = wordcount(id) su wc
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#3

11 Dec 2015, 11:07

A cool trick I use often when matching whole words is to pad the strings with spaces. Something like:

Code:

clear set obs 100 gen id = _n gen numberlist = "41 54 1 46 54 23" gen match = strpos(" " + numberlist + " ", " " + string(id) + " ") list if match
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35734
#4

11 Dec 2015, 11:11

Robert's solution is better than mine, I think.
Comment
Paul Hofman

Join Date: Jan 2015

Posts: 5
#5

12 Dec 2015, 06:15

Thank you so much Nick and Robert. I'd played with padding with spaces, but never thought to enter them directly in the strpos command. A very elegant solution.
Comment
Catherine Perry

Join Date: Jul 2014

Posts: 19
#6

27 Jun 2016, 06:32

Hi All, i realise this post is old and maybe i should start another new post but my query relates to this. I have two string variables of height and weight values, (some respondents reported the units of measurement within the answer). However I want to search for a particular type of answer given across multiple respondents, i want to count how frequent and to what ID number this type of answers belong to 6.68. I tried using the syntax attached but none of them work.
Attached Files

string_query.do (968 Bytes, 1 view)
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35734

27 Jun 2016, 07:08

Catherine: it's much, much better not to use attachments, but to show your code explicitly.

See particularly http://www.statalist.org/forums/help#stata that makes this point.

This is how to present your do-file:

Code:

list KEY if strpos(m15, "#.##")
*nothing returns for me*


strmatch m15 ("6.32", "?.??")
/*unrecognized command:  strmatch
r(199);
*/

strmatch ("6.32", "?.??")
/*unrecognized command:  strmatch
r(199);
*/

strmatch(s1,s2)
/*unrecognized command:  strmatch
r(199);
*/


strmatch("17.4","1??4")
/*unrecognized command:  strmatch
r(199);
*/


gen decimals = regexs(1) if regexm(m15, "[?.??]")
/*invalid number, outside of allowed range
invalid number, outside of allowed range
invalid number, outside of allowed rang
...*/

g one = 1 if strmatch(m15, "*?.??*")
*(12954 missing values generated)*

gen decimals = regexs(1) if regexm(m15, "[#.##]")
/*invalid number, outside of allowed range
invalid number, outside of allowed range
invalid number, outside of allowed range
*/

gen match = strpos("#.##")
*invalid syntax
*r(198);

list KEY if strpos(m15, "?.??")
list KEY m15 if strpos(m15, "?.??")
*nothing returns for me*

The problems with your code include

1. strpos() treats string arguments literally. It doesn't support regular expression or any other pattern matching syntax.

2. strmatch() is not a command, as Stata tells you. It's a function and must be used within a command.

3. Regular expression syntax supported is documented at http://www.stata.com/support/faqs/da...r-expressions/ So, for example ?.?? and #.## are not regular expressions for particular kinds of numbers.

Although you show your code attempts you don't show us any data and I don't understand this at all

i want to count how frequent and to what ID number this type of answers belong to 6.68

Perhaps English is not your first language: someone at work may be able to help you with writing. Proof-reading posts, avoiding shortcuts used in texting and punctuating sentences grammatically make posts easier to read.

See moss (SSC) as one way to pull numeric elements out of a variable.

Here's a worked example.

Code:

clear
input str12 height 
"1.23"
"9.87"
"6 ft 1 in" 
"2 m" 
end 

cap ssc inst moss 

moss height, match("([0-9]+\.?[0-9]*)") regex 
list 


     +--------------------------------------------------------+
     |    height   _count   _match1   _pos1   _match2   _pos2 |
     |--------------------------------------------------------|
  1. |      1.23        1      1.23       1                 . |
  2. |      9.87        1      9.87       1                 . |
  3. | 6 ft 1 in        2         6       1         1       6 |
  4. |       2 m        1         2       1                 . |
     +--------------------------------------------------------

If this is far from what you want, we will need to see a sample of your data. Explain in words what you want.

Comment

Catherine Perry

Join Date: Jul 2014

Posts: 19
#8

27 Jun 2016, 08:54

Hi Nick, I am a bit embarrassed as English is my first language, so I apologise for my poor grammar. Thank you for your code [

Code:

moss height, match("([0-9]+\.?[0-9]*)") regex list

][/CODE]

and for telling me how to insert code rather than attaching a do file.

Your mock up of my data is very similar to what I have, different units of measurement mentioned and then just the numeric values with decimal places. I have tried to use dataex below
[216 "27kg" "1"
217 "32.5 kg" "4feet 5 inches"
218 "6 stone" "1m70cm"
219 "6 stone 1 pound" "2 metres"
220 "1" "1"
221 "30 kg" "1m 34cm"
222 "4kg" "1"
223 "1" "1"
224 "1" "1"
225 "5 stone" "1"
226 "5 1/2 stone" "1"
227 "1" "1"
228 "5 stone" "1"
229 "1" "1"
230 "1" "1m 40cm"
231 "1" "1"
232 "1" "141 cm"
233 "1" "1"
234 "1" "147 cm"
235 "4.22" "132"
][/CODE]

The moss code seems to work well enough, however still not 100% of what I was thinking about,

Again sorry for my badly phrased sentence from my last post I will try and rephrase; I want to find out how many respondents gave this type of answer i.e. 5.68. therefore I wanted to search for a number followed by decimal point followed by two decimal places within the string variable because everything won't be exactly 5.68 that's why I tried to use the #.## option. I know you say this is not a regular expression but is there a similar option in Stata.

So in summary I want a tabulation/frequency of how many people responded in this fashion and to identify the corresponding ID number.

Would there be any way to adapt your moss code to just identify those with two decimal places?
[
moss height, match("([0-9]+\.??[0-9]*)") regex list ][/CODE]

or

[
moss height, match("([0-9]+\.?*[0-9]*)") regex list ][/CODE]
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35734
#9

27 Jun 2016, 08:59

Code:

moss height, match("([0-9]\.[0-9][0-9])") regex

as you want one digit followed by a decimal point and two digits.
Comment
Catherine Perry

Join Date: Jul 2014

Posts: 19
#10

27 Jun 2016, 09:15

Hi Nick, thank you, this is a great command sorry I didn't see that I should just put in the [0-9] twice.
Thanks for you patience
Catherine
Comment

Announcement