Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Searching for numbers in a string of numbers

    I have a string variable (numberlist) that contains a list of numbers with spaces in between them (e.g. "41 54 1 46 54 23"). I want to see if a different variable in my dataset (id) is in this list of numbers. The obvious approach is using strpos:

    gen match=1 if strpos(numberlist, string(id))

    The problem with this is is that it matches 1 to 41. Which makes sense, 1 is indeed in 41, but it is not the behaviour I am looking for.

    I suppose I could simply -split- the numberlist variable, and then loop over all the resulting variables. This is a bit cumbersome though, and I'm afraid it might balloon the size of the dataset (at 35 MB it is already rather large). A more elegant solution would be preferable.

  • #2
    Something like this would avoid using split directly. From your example id is numeric.

    Code:
    gen match = 0 
    
    quietly forval j = 1/20 { 
             replace match = 1 if real(word(numberlist, `j')) == id 
    }
    Your value of 20, so to speak, could be based on a guess. Setting it a little too high would be about best. Or you could get Stata to count, looking for the maximum:

    Code:
    gen wc = wordcount(id) 
    su wc

    Comment


    • #3
      A cool trick I use often when matching whole words is to pad the strings with spaces. Something like:

      Code:
      clear
      set obs 100
      gen id = _n
      
      gen numberlist = "41 54 1 46 54 23"
      
      gen match = strpos(" " + numberlist + " ", " " + string(id) + " ")
      
      list if match

      Comment


      • #4
        Robert's solution is better than mine, I think.

        Comment


        • #5
          Thank you so much Nick and Robert. I'd played with padding with spaces, but never thought to enter them directly in the strpos command. A very elegant solution.

          Comment


          • #6
            Hi All, i realise this post is old and maybe i should start another new post but my query relates to this. I have two string variables of height and weight values, (some respondents reported the units of measurement within the answer). However I want to search for a particular type of answer given across multiple respondents, i want to count how frequent and to what ID number this type of answers belong to 6.68. I tried using the syntax attached but none of them work.
            Attached Files

            Comment


            • #7
              Catherine: it's much, much better not to use attachments, but to show your code explicitly.

              See particularly http://www.statalist.org/forums/help#stata that makes this point.

              This is how to present your do-file:

              Code:
              list KEY if strpos(m15, "#.##")
              *nothing returns for me*
              
              
              strmatch m15 ("6.32", "?.??")
              /*unrecognized command:  strmatch
              r(199);
              */
              
              strmatch ("6.32", "?.??")
              /*unrecognized command:  strmatch
              r(199);
              */
              
              strmatch(s1,s2)
              /*unrecognized command:  strmatch
              r(199);
              */
              
              
              strmatch("17.4","1??4")
              /*unrecognized command:  strmatch
              r(199);
              */
              
              
              gen decimals = regexs(1) if regexm(m15, "[?.??]")
              /*invalid number, outside of allowed range
              invalid number, outside of allowed range
              invalid number, outside of allowed rang
              ...*/
              
              g one = 1 if strmatch(m15, "*?.??*")
              *(12954 missing values generated)*
              
              gen decimals = regexs(1) if regexm(m15, "[#.##]")
              /*invalid number, outside of allowed range
              invalid number, outside of allowed range
              invalid number, outside of allowed range
              */
              
              gen match = strpos("#.##")
              *invalid syntax
              *r(198);
              
              list KEY if strpos(m15, "?.??")
              list KEY m15 if strpos(m15, "?.??")
              *nothing returns for me*
              The problems with your code include

              1. strpos() treats string arguments literally. It doesn't support regular expression or any other pattern matching syntax.

              2. strmatch() is not a command, as Stata tells you. It's a function and must be used within a command.

              3. Regular expression syntax supported is documented at http://www.stata.com/support/faqs/da...r-expressions/ So, for example ?.?? and #.## are not regular expressions for particular kinds of numbers.

              Although you show your code attempts you don't show us any data and I don't understand this at all

              i want to count how frequent and to what ID number this type of answers belong to 6.68
              Perhaps English is not your first language: someone at work may be able to help you with writing. Proof-reading posts, avoiding shortcuts used in texting and punctuating sentences grammatically make posts easier to read.

              See moss (SSC) as one way to pull numeric elements out of a variable.

              Here's a worked example.

              Code:
              clear
              input str12 height 
              "1.23"
              "9.87"
              "6 ft 1 in" 
              "2 m" 
              end 
              
              cap ssc inst moss 
              
              moss height, match("([0-9]+\.?[0-9]*)") regex 
              list 
              
              
                   +--------------------------------------------------------+
                   |    height   _count   _match1   _pos1   _match2   _pos2 |
                   |--------------------------------------------------------|
                1. |      1.23        1      1.23       1                 . |
                2. |      9.87        1      9.87       1                 . |
                3. | 6 ft 1 in        2         6       1         1       6 |
                4. |       2 m        1         2       1                 . |
                   +--------------------------------------------------------
              If this is far from what you want, we will need to see a sample of your data. Explain in words what you want.

              Comment


              • #8
                Hi Nick, I am a bit embarrassed as English is my first language, so I apologise for my poor grammar. Thank you for your code [
                Code:
                 
                 moss height, match("([0-9]+\.?[0-9]*)") regex  list
                ][/CODE]

                and for telling me how to insert code rather than attaching a do file.

                Your mock up of my data is very similar to what I have, different units of measurement mentioned and then just the numeric values with decimal places. I have tried to use dataex below
                [216 "27kg" "1"
                217 "32.5 kg" "4feet 5 inches"
                218 "6 stone" "1m70cm"
                219 "6 stone 1 pound" "2 metres"
                220 "1" "1"
                221 "30 kg" "1m 34cm"
                222 "4kg" "1"
                223 "1" "1"
                224 "1" "1"
                225 "5 stone" "1"
                226 "5 1/2 stone" "1"
                227 "1" "1"
                228 "5 stone" "1"
                229 "1" "1"
                230 "1" "1m 40cm"
                231 "1" "1"
                232 "1" "141 cm"
                233 "1" "1"
                234 "1" "147 cm"
                235 "4.22" "132"
                ][/CODE]

                The moss code seems to work well enough, however still not 100% of what I was thinking about,

                Again sorry for my badly phrased sentence from my last post I will try and rephrase; I want to find out how many respondents gave this type of answer i.e. 5.68. therefore I wanted to search for a number followed by decimal point followed by two decimal places within the string variable because everything won't be exactly 5.68 that's why I tried to use the #.## option. I know you say this is not a regular expression but is there a similar option in Stata.

                So in summary I want a tabulation/frequency of how many people responded in this fashion and to identify the corresponding ID number.

                Would there be any way to adapt your moss code to just identify those with two decimal places?
                [
                moss height, match("([0-9]+\.??[0-9]*)") regex list ][/CODE]

                or

                [
                moss height, match("([0-9]+\.?*[0-9]*)") regex list ][/CODE]

                Comment


                • #9
                  Code:
                  moss height, match("([0-9]\.[0-9][0-9])") regex
                  as you want one digit followed by a decimal point and two digits.

                  Comment


                  • #10
                    Hi Nick, thank you, this is a great command sorry I didn't see that I should just put in the [0-9] twice.
                    Thanks for you patience
                    Catherine

                    Comment

                    Working...
                    X