  • regular expressions used to extract number

    Hi, everyone. I have data that mix numbers and letters, as follows:
    "1 reviews 2 hotel reviews 3 helpful votes"
    "5 re 7 hotel 6 helpful vote"
    "11 reviews 2 hotel s 3 helpful votes"

    I want to extract the numbers into three separate columns:

    case1 case2 case3
    1 2 3
    5 7 6
    11 2 3

    I extracted the first column as follows:

    Code:
    clear
    input str60 case
    "1 reviews 2 hotel reviews 3 helpful votes"
    "5 re 7 hotel 6 helpful vote"
    "11 reviews 2 hotel s 3 helpful votes"
    end
    gen zip1 = regexs(0) if regexm(case, "^[0-9]+")


    How can I get the other columns using regular expressions? Any suggestions would be much appreciated.


    Bests,
    wanhaiyou
    Last edited by wanhaiyou; 11 Nov 2014, 23:53.

  • #2
    Check out moss from SSC.

    Code:
    ssc desc moss
    ssc inst moss



    • #3
      Code:
      . clear 
      
      . input str40 report 
      
                                             report
        1. "1 reviews 2 hotel reviews 3 helpful votes"
        2. "5 re 7 hotel 6 helpful vote"
        3. "11 reviews 2 hotel s 3 helpful votes"
        4. end
      
      . moss report, match("([0-9]+)") regex 
      
      . l report _match* 
      
           +------------------------------------------------------------------------+
           |                                   report   _match1   _match2   _match3 |
           |------------------------------------------------------------------------|
        1. | 1 reviews 2 hotel reviews 3 helpful vote         1         2         3 |
        2. |              5 re 7 hotel 6 helpful vote         5         7         6 |
        3. |     11 reviews 2 hotel s 3 helpful votes        11         2         3 |
           +------------------------------------------------------------------------+



      • #4
        Originally posted by Nick Cox
        Check out moss from SSC.

        Code:
        ssc desc moss
        ssc inst moss
        Thank you, Nick.
        The routine works perfectly. The following code runs well.
        Code:
        clear
        input str60 case
        "1 reviews 2 hotel reviews 36 helpful votes"
        "1 re 2 hotel 32 helpful vote"
        "11 reviews 2 hotel s 3 helpful votes"
        end
        
        moss case, match("([0-9]+)") regex
        list  _match1  _match2   _match3
        If I wanted to implement this with regular expressions directly, how could I do it?

        Thanks very much for your input.

        Bests,
        wanhaiyou



        • #5
          You have the regular expression that works right there. If you want to know what moss does, feel free to look at the code.



          • #6
            Originally posted by Nick Cox
            You have the regular expression that works right there. If you want to know what moss does, feel free to look at the code.
            Thanks very much, Nick, I see.

            Bests,
            wanhaiyou



            • #7
              Hi, Nick,
              I tried to understand the moss code written by Robert Picard and you, but the problem still exists.
              Following your suggestion, I looked at the source. The relevant part is as follows:
              Code:
               
                                      if "`regex'" != "" {
                                              tempvar match`j'
                                              gen `match`j'' = regexs(1) if `touse' & ///
                                                      regexm(`copy',`"`match'"')
                                              replace `touse' = 0 if `match`j'' == ""
                                              replace `copy' = regexs(2) if `touse' & ///
                                                      regexm(`copy',`"`match'"')
                                              gen `pos`j'' = `varlen' - ///
                                                      length(`copy') - length(`match`j'') + 1 if `touse'
              However, the following code does not work.
              Code:
              clear
              input str60 case
              "1 reviews 2 hotel reviews 36 helpful votes"
              "1 re 2 hotel 32 helpful vote"
              "11 reviews 2 hotel s 3 helpful votes"
              end
              
              
              gen match1 = regexs(1) if regexm(case,"([0-9]+)")
              gen copy = regexs(2) if regexm(case,"([0-9]+)")
              Any suggestions? Thanks very much.

              Bests,
              wanhaiyou



              • #8
                Hi, Nick,
                One way to implement it is as follows:
                Code:
                clear
                input str60 case
                "1 reviews 2 hotel reviews 36 helpful votes"
                "1 re 2 hotel 32 helpful vote"
                "11 reviews 2 hotel s 3 helpful votes"
                end
                
                
                
                gen str60 y = subinstr(case," ","",.)
                gen nn0 = regexs(0) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                gen nn1 = regexs(1) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                gen nn2 = regexs(2) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                gen nn3 = regexs(3) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                gen nn4 = regexs(4) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                gen nn5 = regexs(5) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                gen nn6 = regexs(6) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                gen nn7 = regexs(7) if regexm(y, "(^([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]+)([0-9]+)([a-zA-Z]*$))")
                Is there a simpler way to do it? Thanks very much for your input.

                Bests, wanhaiyou
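
                On the question of a simpler method: one possibility is a single pattern with three subexpressions, one per number, so the spaces do not have to be removed first. This is only a sketch. It assumes every observation holds exactly three numbers, and the names pat and n1 to n3 are made up for illustration.

                ```stata
                clear
                input str60 case
                "1 reviews 2 hotel reviews 36 helpful votes"
                "1 re 2 hotel 32 helpful vote"
                "11 reviews 2 hotel s 3 helpful votes"
                end

                * digits, then non-digits, repeated; each ([0-9]+) is one subexpression
                local pat "^([0-9]+)[^0-9]+([0-9]+)[^0-9]+([0-9]+)"
                forvalues j = 1/3 {
                    * regexs(`j') returns the text matched by the j-th subexpression
                    gen n`j' = regexs(`j') if regexm(case, "`pat'")
                }

                * convert the extracted strings to numeric variables
                destring n1 n2 n3, replace
                list case n1 n2 n3
                ```

                An observation with fewer than three numbers does not match the pattern at all and ends up missing in n1 to n3, so this form only suits data with a fixed number of values per string.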



                • #9
                  #7

                  If you

                  Code:
                   
                  set trace on 
                  set traced 1
                  before running moss, you will see that the code is more subtle than you have it. I think Robert Picard wrote that part and he might want to comment.

                  #8

                  I don't know what you are trying here.



                  • #10
                    As Nick pointed out, moss provides an easy solution so I don't understand the "but the problem still exist" comment in #7. However, the moss approach reduces to the following when applied to the example posted:

                    Code:
                    clear
                    input str60 case
                    "1 reviews 2 hotel reviews 3 helpful votes"
                    "5 re 7 hotel 6 helpful vote"
                    "11 reviews 24 hotel s 3 helpful votes"
                    end
                    
                    local more 1
                    local i 0
                    while `more' {
                        
                        gen zip`++i' = regexs(1) if regexm(case, "([0-9]+)(.*)")
                        replace case = regexs(2) if regexm(case, "([0-9]+)(.*)")
                        count if !mi(zip`i')
                        local more = r(N)
                        
                    }
                    
                    list zip*
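
                    A variation on the loop above, offered as a sketch and not as what moss itself does: it works on a copy so that case is left intact, drops the empty variable created by the final pass, and converts the results to numeric. The variable name rest is an assumption made for this example.

                    ```stata
                    clear
                    input str60 case
                    "1 reviews 2 hotel reviews 3 helpful votes"
                    "5 re 7 hotel 6 helpful vote"
                    "11 reviews 24 hotel s 3 helpful votes"
                    end

                    * work on a copy so that case is left unchanged
                    gen str60 rest = case
                    local more 1
                    local i 0
                    while `more' {
                        local ++i
                        * first subexpression: next number; second: everything after it
                        gen zip`i' = regexs(1) if regexm(rest, "([0-9]+)(.*)")
                        replace rest = regexs(2) if regexm(rest, "([0-9]+)(.*)")
                        * stop once no observation yields another number
                        count if !mi(zip`i')
                        local more = r(N)
                    }

                    * the last pass found nothing, so its variable is entirely missing
                    drop zip`i' rest
                    destring zip*, replace
                    list case zip*
                    ```

                    The loop does not need to know in advance how many numbers each string contains, which is the main advantage over a fixed pattern.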



                    • #11
                      Originally posted by Nick Cox
                      #7

                      If you

                      Code:
                      set trace on
                      set traced 1
                      before running moss, you will see that the code is more subtle than you have it. I think Robert Picard wrote that part and he might want to comment.

                      #8

                      I don't know what you are trying here.
                      Hi, Nick. I see now. Thanks very much for your help.

                      Bests, wanhaiyou



                      • #12
                        Originally posted by Robert Picard
                        As Nick pointed out, moss provides an easy solution so I don't understand the "but the problem still exist" comment in #7. However, the moss approach reduces to the following when applied to the example posted:

                        Code:
                        clear
                        input str60 case
                        "1 reviews 2 hotel reviews 3 helpful votes"
                        "5 re 7 hotel 6 helpful vote"
                        "11 reviews 24 hotel s 3 helpful votes"
                        end
                        
                        local more 1
                        local i 0
                        while `more' {
                        
                        gen zip`++i' = regexs(1) if regexm(case, "([0-9]+)(.*)")
                        replace case = regexs(2) if regexm(case, "([0-9]+)(.*)")
                        count if !mi(zip`i')
                        local more = r(N)
                        
                        }
                        
                        list zip*
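                        Thanks very much for your kind help. I am sorry for my mistake.
                        I said "but the problem still exist" because at first I had not seen this line of the code: local match `"`match'(.*)"'. It appends a second subexpression to the pattern, which is why regexs(2) works inside moss but not in my attempt in #7.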
                        Code:
                         if "`regex'" != "" {
                                        if !regexm(`"`match'"',"(^\(|[^\\]\().*[^\\]\)") {
                                                dis as err "regex option: " ///
                                                        `"no subexpression in match(`match')"'
                                                exit 198
                                        }
                                        if regexm(`"`match'"',"(^\(|[^\\]\().*[^\\]\(") | ///
                                                regexm(`"`match'"',"(^\(\(|[^\\]\(\()") {
                                                dis as err "regex option: " ///
                                                        `"match(`match') can only contain one subexpression"'
                                                exit 198
                                        }
                                        // add a second subexpression to capture what's left after the match
                                        local match `"`match'(.*)"'
                                }
                        Please forgive my carelessness. Thanks very much for your work.

                        Bests,
                        wanhaiyou
                        Last edited by wanhaiyou; 13 Nov 2014, 18:56.
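
                        With the second subexpression (.*) added to the pattern, the attempt in #7 can be sketched as below. This only illustrates the idea and is not the moss source. The variable match2 is added here to show one further step.

                        ```stata
                        clear
                        input str60 case
                        "1 reviews 2 hotel reviews 36 helpful votes"
                        "1 re 2 hotel 32 helpful vote"
                        "11 reviews 2 hotel s 3 helpful votes"
                        end

                        * first subexpression: the number; second: whatever follows it
                        gen match1 = regexs(1) if regexm(case, "([0-9]+)(.*)")
                        gen copy   = regexs(2) if regexm(case, "([0-9]+)(.*)")

                        * repeating the step on copy picks up the next number
                        gen match2 = regexs(1) if regexm(copy, "([0-9]+)(.*)")
                        list match1 copy match2
                        ```

                        The pattern in #7 had only one pair of parentheses, so there was no second subexpression for regexs(2) to return.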



                        • #13
                          Sorry to resurrect this old thread, but would -moss- allow for matching of multiple instances of the string within the same variable? I have a string with several numeric codes within square brackets (separated by a comma not followed by a space) and would like to create a variable for each of those numeric codes.

                          Alternatively, I tried to substr the string but I'm not able to specify the comma not followed by a space as the parser.

                          Thanks,
                          Manuel



                          • #14
                            Manuel: "multiple instances of the string within the same variable" is precisely what moss is designed to find. I'd give a realistic data example and the moss code you tried; otherwise I don't understand what the difficulty is.



                            • #15
                              Sorry for the late reply. An example of a string from which I would like to extract each individual ICD code into a dedicated variable is the following:

                              [ICD-401.9] Ipertensione essenziale non specificata,[ICD-272.2] Iperlipidemia mista,[ICD-278.02] Sovrappeso,[ICD-427.31] Fibrillazione atriale

                              I can't use commas to subset strings as some text definitions also include commas.
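
                              One way this might be approached with moss, following the syntax used earlier in the thread, is to anchor the pattern on the literal "[ICD-" and "]" so that commas in the text descriptions do not matter. This is a sketch under assumptions: the variable name diag is hypothetical, and the subexpression [0-9.]+ assumes the codes consist of digits and periods only.

                              ```stata
                              clear
                              input str200 diag
                              "[ICD-401.9] Ipertensione essenziale non specificata,[ICD-272.2] Iperlipidemia mista,[ICD-278.02] Sovrappeso,[ICD-427.31] Fibrillazione atriale"
                              end

                              * requires moss from SSC (ssc inst moss)
                              * \[ and \] match literal square brackets; ([0-9.]+) captures the code
                              moss diag, match("\[ICD-([0-9.]+)\]") regex
                              list _match*
                              ```

                              If some codes contain letters, the character class would need widening, for example to [A-Z0-9.]+.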
