
  • Tokenize and trimming

    I need to feed different strings to strpos in a loop to create variables. I found an old question on Statalist about the tokenize command's odd behaviour with the parse character, and I could work around it using the idea given there. In my humble opinion it is still a serious bug, but you may disagree (more heresy, but I would highly recommend the functionality of SAS's scan function to the Stata developers). However, as far as I can see, tokenize trims the leading spaces off the tokens, and that is precisely why I would like to use a parse character other than the blank space. May I ask for your help on how to keep all the blanks in the tokens?

    Code:
    local regexp = "abc|a.b.| ab | ba| cd |.cd "
    tokenize `regexp', parse("|")
    while "`*'" ~= "" {
     disp "*`1'*"
     macro shift
    }
    Thanks indeed!

    Kazi

  • #2
    Welcome to Statalist!

    Perhaps if you described more fully the ultimate problem you are trying to solve - creating variables using strpos - someone here could suggest an approach that capitalizes on Stata's capabilities and avoids the problem you are experiencing with tokenize.



    • #3
      In Mata this is easy

      Code:
      version 12.1
      
      mata :
      
      string rowvector mytokenize(string scalar s)
      {
          transmorphic scalar t
          
          t = tokeninit("", "|")
          tokenset(t, s)
          return(tokengetall(t))
      }
      
      end
      
      mata : mytokenize("abc|a.b.| ab | ba| cd |.cd ")
      and I am sure you can almost as easily replicate the referenced SAS function (although I have never used SAS nor looked into the function more deeply).
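      For instance, here is a minimal sketch of a scan()-like function; the name myscan and my reading of SAS's semantics are assumptions. It returns the n-th token of s, treating the characters in delims as delimiters:

      Code:
      version 12.1
      
      mata :
      
      // hypothetical scan()-like helper: return the n-th token of s,
      // splitting on the characters in delims
      string scalar myscan(string scalar s, real scalar n, string scalar delims)
      {
          transmorphic scalar t
          string rowvector toks
      
          t = tokeninit(delims)
          tokenset(t, s)
          toks = tokengetall(t)
          return(n >= 1 & n <= cols(toks) ? toks[n] : "")
      }
      
      end
      
      mata : myscan("abc|a.b.| ab | ba| cd |.cd ", 3, "|")
      Because delims is passed as tokeninit()'s whitespace characters, the delimiters are dropped while any blanks inside the tokens are preserved.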

      I can see that you probably do not wish to program these details. Tell us more about the problem and someone might invest a little more time for a tailored solution.

      Best
      Daniel



      • #4
        Dear William and Daniel!

        Thank you for your kind and prompt answers! The main goal is to extract numerical identifiers from string variables; the identifiers appear right after certain pieces of text. So I search with strpos for the location of the relevant part, create a substring starting from that position with substr, and look for the digits with regexm. To do this in a loop, I thought it was a good idea to tokenize a string containing the "keywords" and pass each token to strpos as an argument. But the leading blank is trimmed, so I get false matches...

        For example:

        Code:
        gen a = "some random text and the identifier comes ba 1234"
        gen b = strpos(a, " ba")
        gen c = substr(a, b, .)
        gen d = regexs(1) if regexm(c, "([0-9]+)")
        Kazi



        • #5
          Not entirely clear to me.

          How do you know what the relevant part is that you then search for with strpos? Why do you need the substrings, anyway? The example above boils down to

          Code:
          gen a = "some random text and the identifier comes ba 1234"
          generate d = regexs(1) if regexm(a, "([0-9]+)")
          This assumes the identifier is the first and only numeric part in the string variable.

          Maybe you could show an example illustrating the problem more clearly?

          Anyway, you should have a look at the split command and also at moss (SSC).
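          For instance, a quick sketch with moss (assuming it is installed via ssc install moss, and reusing the variable a from the example above); it extracts every run of digits, not just the first one:

          Code:
          * each digit run goes into _match1, _match2, ...,
          * with the number of matches in _count
          moss a, match("([0-9]+)") regex
          list _count _match*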

          Best
          Daniel



          • #6
            Imagine identifiers called POB and ZIP:
            raw string                                                                                            POB     ZIP
            "Company A is located in 1234 London, POB 98765, number of employees 13"                             98765   .
            "It was said that Factory EPOB was fined for 100 dollars, its identifiers are: POB 6543, ZIP 7890"   6543    7890
            "I saw random letters on Statalist: ZPOBZIP, but I wanted to give 3 examples instead"                .       .
            By requiring a leading blank, you can avoid matching a word that merely ends with POB or ZIP by chance (such as EPOB in the second example). With split you can split string variables, but, if I'm not mistaken, you can't slice a string into parts to feed them into local macros.
            Thanks again for your help!

            Kazi



            • #7
              How about extending Daniel's example code to
              Code:
              generate ZIP=regexs(1) if regexm(a," ZIP ([0-9]+)")
              generate POB=regexs(1) if regexm(a," POB ([0-9]+)")
              Or is this too simple? I agree with Daniel: Why bother with the position? All you need is to extract a part of a string following a fixed (regular) expression.

              Regards
              Bela



              • #8
                But this is still just

                Code:
                clear
                inp str244 s
                "Company A is located in 1234 London, POB 98765, number of employees 13"
                "It was said that Factory EPOB was fined for 100 dollars, its identifiers are: POB 6543, ZIP 7890"
                "I saw random letters on Statalist: ZPOBZIP, but I wanted to give 3 examples instead"
                end
                
                list
                
                generate POB = regexs(1) if regexm(s, " POB ([0-9]+)")
                generate ZIP = regexs(1) if regexm(s, " ZIP ([0-9]+)")
                
                list
                Best
                Daniel


                By the way, see dataex (SSC) for the preferred way to show data examples here on Statalist.



                • #9
                  Daniel (Bela) was quicker.



                  • #10
                    Thanks, you're perfectly right that the substr part is redundant (I was keeping it for a visual check). The problem is that I have 6 different keywords to search for. One ugly way to solve it is to copy-paste these lines 6 times. The other, more elegant way, I thought, is to save a string containing all the keywords in a local macro, chop it up using tokenize, and feed the tokens to regexm. But if the leading blank is trimmed from the individual tokens, I also get matches where the keyword is merely the end of some random word.



                    • #11
                      Code:
                      local keywords POB ZIP FOO BAR
                      foreach kw of local keywords {
                          generate `kw' = regexs(1) if regexm(a, " `kw' ([0-9]+)")
                      }
                      Best
                      Daniel



                      • #12
                        Or, making every single regex explicit, something like this:
                        Code:
                        local searchstrings `"pob=" POB ([0-9]+)"|zip=" ZIP ([0-9]+)"|abc=" a([0-9]+)" "'
                        
                        while (!missing(`"`searchstrings'"')) {
                            gettoken entry searchstrings : searchstrings , parse("|") quotes
                            if (`"`entry'"'=="|") continue
                            display as text `"working on search element: {it:`entry'}"'
                            gettoken varname regex : entry , parse("=") quotes
                            local regex=substr(`"`regex'"',2,.)
                            display `"varname to be generated: {it:`varname'}"'
                            display `"regex to be used: |{it:`regex'}|"'
                            generate `varname'=regexs(1) if regexm(stringvar,`regex')
                        }
                        Or in other words: maybe your original plan simply did not work out because of quoting? Also, I would go with -gettoken- instead of -tokenize-, but that may be a matter of personal preference.

                        Regards
                        Bela



                        • #13
                          I suppose Daniel Klein didn't get my point about the problem of the leading and trailing blanks, but based on his solution and a Statalist thread I managed to find my own!

                          Code:
                          local kw `" " POB"  "ZIP "  "  FOO " " BAR ""'
                          foreach k of local kw {
                           disp "*`k'*"
                          }
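                          For completeness, a hypothetical sketch of how these quoted tokens could feed regexm (the generated names id1, id2, ... are purely illustrative):

                          Code:
                          local kw `" " POB " " ZIP ""'
                          local i = 1
                          foreach k of local kw {
                              * `k' keeps its surrounding blanks thanks to the quoting
                              generate id`i' = regexs(1) if regexm(a, "`k'([0-9]+)")
                              local ++i
                          }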
                          I will check gettoken as well, thanks for the hint!

                          Thank you guys!

                          Just as a disclaimer, I maintain that it's a bug in tokenize that it treats the parsing characters as separate tokens and this trimming thing is also rather odd...



                          • #14
                            Just as a disclaimer, I maintain that it's a bug in tokenize that it treats the parsing characters as separate tokens and this trimming thing is also rather odd...
                            From the pedantry corner: I understand the term "bug" to mean that a program does not perform in accordance with its description. But -tokenize- does work exactly as described in the user's manual. So I think the way to describe Kazi's objections is as a possible design defect, not a bug.

                            More substantively, -tokenize- is a very old command. It goes back to at least version 4 and, for all I know, earlier. In those early days, the -syntax- command did not yet exist (or if it did, I was unaware of it), so programmers often had to do a lot of work with -tokenize- to parse the command lines for programs they wrote. For that matter, we didn't have -foreach- back then, and loops were often done by -tokenize-ing a list and then using a -while- structure in conjunction with a counter (sketched below).

                            In those contexts, the design decision to strip blanks but retain other parsing characters was actually quite helpful nearly all the time; putting the blanks into separate tokens, or retaining them as part of the tokens, would have been extremely inconvenient. Today -tokenize- is less used for these older purposes, and perhaps a case can be made for a new command that works more along the lines Kazi would like. Even so, I would oppose changing the behavior of -tokenize- itself, to preserve compatibility for older programs that rely on its current behavior.
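                            A minimal sketch of that older idiom (the list is purely illustrative):

                            Code:
                            * tokenize a list, then walk the numbered macros with a
                            * counter, as was common before -foreach- existed
                            tokenize "apple banana cherry"
                            local i = 1
                            while "``i''" != "" {
                                display "``i''"
                                local i = `i' + 1
                            }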



                            • #15
                              Of course this can be called "rather odd", but Stata always strips off leading whitespace when saving text to a local macro, unless this text is quoted.
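                              A minimal illustration of that stripping behaviour (the local names are arbitrary):

                              Code:
                              local u      hello       // leading blanks are stripped
                              local q `"   hello"'     // compound quotes preserve them
                              display "*`u'*"          // shows *hello*
                              display `"*`q'*"'        // shows *   hello*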

                              If you change your original code to use quoting around each element, I think it does what you originally wanted it to:
                              Code:
                              local regexp `""abc"|"a.b."|" ab "|" ba"|" cd "|".cd ""'
                              tokenize `"`regexp'"', parse("|")
                              while `"`*'"' ~= `""' {
                               disp `"*`1'*"'
                               macro shift
                              }
                              The fact that the pipes in your example are parsed as tokens is documented. The PDF documentation manual states in [P] tokenize:
                              These examples illustrate that the quotes surrounding the string are optional; the space parsing
                              character is not saved in the numbered macros; nonspace parsing characters are saved in the numbered
                              macros together with the tokens being parsed; and more than one parsing character may be specified.
                              So, again, this is documented in the manual. I would not call this a bug. Anyway, with correct quoting in your original example, you could simply parse on spaces and be fine:
                              Code:
                              local regexp `""abc" "a.b." " ab " " ba" " cd " ".cd ""'
                              tokenize `"`regexp'"'
                              while `"`*'"' ~= `""' {
                                  disp `"*`1'*"'
                                  macro shift
                              }
                              Anyway, Daniel Klein's and your -foreach- way of iterating through the elements (with correct quoting!) is likely the most efficient way to achieve what you're after.

                              Regards
                              Bela

