  • Replacing the nth word of a string

    Hi All,

    I am trying to replace the nth word of a string, and I am struggling to find an efficient way of doing it.

    Below are a couple of ugly, inefficient solutions.
    Code:
    clear
    set obs 1
    local mytext = "to be or not to be, that is the question"
    gen text = "`mytext'"
    
    local wordN = wordcount("`mytext'")
    gen newword= ""
    forvalues  i = 1/`wordN' {
        if `i'==7 {
             replace newword= newword + " "+"it"
        }
        else {
           replace newword= newword + " "+ word(text, `i')
        }
     }
     list
    or
    Code:
    clear
    set obs 1
    local mytext = "to be or not to be, that is the question"
    gen text = "`mytext'"
    
    split text, gen(word)
    replace word7 = "it"
    egen newword = concat(word*), punct(" ")
    list

    I was imagining there was a function called subword(text,re) or something more direct and more efficient.

    Thanks

    Adrian


  • #2
    Here is a way using the functions -strpos()-, -substr()- and -word()-. Regular expressions could be used as well.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str40 text
    "to be or not to be, that is the question"
    end
    
    gen wanted= substr(text, 1, strpos(text, word(text, 7))-1) + "replacement " +substr(text, strpos(text, word(text, 8)), .)
    Res.:

    Code:
    . l
    
         +--------------------------------------------------------------------------------------------+
         |                                     text                                            wanted |
         |--------------------------------------------------------------------------------------------|
      1. | to be or not to be, that is the question   to be or not to be, replacement is the question |
         +--------------------------------------------------------------------------------------------+


    • #3
      Code:
      // input some example data
      clear
      input str56   txt
        "Ber. Who's there?"
        "Fran. Nay, answer me. Stand and unfold yourself."
        "Ber. Long live the King!"
        "Fran. Bernardo?"
        "Ber. He."
        "Fran. You come most carefully upon your hour."
        "Ber. 'Tis now struck twelve. Get thee to bed, Francisco."
        "Fran. For this relief much thanks. 'Tis bitter cold,"
        "  And I am sick at heart."
        "Ber. Have you had quiet guard?"
        "Fran. Not a mouse stirring."
        "Ber. Well, good night."
        "  If you do meet Horatio and Marcellus,"
        "  The rivals of my watch, bid them make haste."
       end
       
       // split the sentences into words
       split txt, gen(part)
       
       // say we don't want the third word
       local varl = r(varlist)
       local notwanted = "part3"
       local varl : list varl - notwanted
      
       gen wanted = ""
       foreach var of local varl {
          replace wanted = wanted + " " + `var'
      }
      replace wanted = trim(wanted)
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------


      • #4
        Originally posted by Andrew Musau
        Here is a way using the functions -strpos()-, -substr()- and -word()-. Regular expressions could be used as well.

        Thanks Andrew,
        It's a nice solution, but it's using the word “that” as a positional argument rather than the 7th word.

        Cheers
        A
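        A minimal sketch of the concern raised here, assuming a made-up sentence (not from the thread) in which the 7th word also occurs earlier in the string. Because strpos() returns the position of the first match, the expression from #2 cuts the string at word 2 instead of word 7:

```stata
clear
input str40 text
"be that as it may, is that so"
end

// approach from #2: locate words 7 and 8 by searching for their text
generate wanted = substr(text, 1, strpos(text, word(text, 7)) - 1) + "replacement " + substr(text, strpos(text, word(text, 8)), .)

// word 7 is "that", but strpos() finds the first "that" (word 2),
// so everything from word 2 through word 7 is dropped
list, clean
```

        In this sketch wanted comes out as "be replacement so" rather than "be that as it may, is replacement so".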


        • #5
          Originally posted by Maarten Buis
          Thank you Maarten, that is kind of a cross between my first and second approaches. I’ll give it a go and see how it benchmarks.

          It seems I am not missing anything completely obvious, which is good for my self-esteem but not so good for the speed of my code.

          Thanks

          A


          • #6
            Code:
            . mata
            ------------------------------------------------- mata (type end to exit) ---------------
            : text = "to be or not to be, that is the question"
            
            : tokens(text)
                          1          2          3          4          5          6          7
                +------------------------------------------------------------------------------
              1 |        to         be         or        not         to        be,       that
                +------------------------------------------------------------------------------
                          8          9         10
                 ----------------------------------+
              1          is        the   question  |
                 ----------------------------------+
            
            : TEXT = tokens(text)
            
            : TEXT[,7] = "DIFFERENT"
            
            : TEXT
                           1           2           3           4           5           6
                +-------------------------------------------------------------------------
              1 |         to          be          or         not          to         be,
                +-------------------------------------------------------------------------
                           7           8           9          10
                 -------------------------------------------------+
              1    DIFFERENT          is         the    question  |
                 -------------------------------------------------+
            
            : text = invtokens(TEXT)
            
            : text
              to be or not to be, DIFFERENT is the question


            • #7
              Originally posted by Adrian Sayers

              Thanks Andrew,
              It's a nice solution, but it's using the word “that” as a positional argument rather than the 7th word.
              If you have a list like Maarten's example, then there are some potential issues. Some strings may not have a seventh word. Building on my suggestion in #2, you can do it in 3 lines. But yes, you will get bitten if the 7th word is a substring of an earlier word. I would have given a regex example, but I don't think it can improve on Nick's solution in #6.

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str56 txt
              "Ber. Who's there?"                                      
              "Fran. Nay, answer me. Stand and unfold yourself."        
              "Ber. Long live the King!"                                
              "Fran. Bernardo?"                                        
              "Ber. He."                                                
              "Fran. You come most carefully upon your hour."          
              "Ber. 'Tis now struck twelve. Get thee to bed, Francisco."
              "Fran. For this relief much thanks. 'Tis bitter cold,"    
              "  And I am sick at heart."                              
              "Ber. Have you had quiet guard?"                          
              "Fran. Not a mouse stirring."                            
              "Ber. Well, good night."                                  
              "  If you do meet Horatio and Marcellus,"                
              "  The rivals of my watch, bid them make haste."          
              end
              
              *REPLACE 7TH WORD
              gen start= strpos(txt, word(txt, 8))-1
              g end= strpos(txt, word(txt, 7))-1
              g wanted= cond(!end, txt, trim(itrim(substr(txt, 1,end) + " replacement " + substr(txt, start, .))))
              Res.:

              Code:
              . l txt wanted , notrim
              
                                                                          txt                                                            wanted  
                1.                                          Ber. Who's there?                                                 Ber. Who's there?  
                2.           Fran. Nay, answer me. Stand and unfold yourself.             Fran. Nay, answer me. Stand and replacement yourself.  
                3.                                   Ber. Long live the King!                                          Ber. Long live the King!  
                4.                                            Fran. Bernardo?                                                   Fran. Bernardo?  
                5.                                                   Ber. He.                                                          Ber. He.  
                6.              Fran. You come most carefully upon your hour.              Fran. You come most carefully upon replacement hour.  
                7.   Ber. 'Tis now struck twelve. Get thee to bed, Francisco.   Ber. 'Tis now struck twelve. Get replacement to bed, Francisco.  
                8.       Fran. For this relief much thanks. 'Tis bitter cold,       Fran. For this relief much thanks. replacement bitter cold,  
                9.                                    And I am sick at heart.                                           And I am sick at heart.  
               10.                             Ber. Have you had quiet guard?                                    Ber. Have you had quiet guard?  
               11.                                Fran. Not a mouse stirring.                                       Fran. Not a mouse stirring.  
               12.                                     Ber. Well, good night.                                            Ber. Well, good night.  
               13.                      If you do meet Horatio and Marcellus,                            If you do meet Horatio and replacement  
               14.               The rivals of my watch, bid them make haste.               The rivals of my watch, bid replacement make haste.
              Last edited by Andrew Musau; 17 Dec 2022, 11:24.


              • #8
                Originally posted by Adrian Sayers
                I’ll give it a go see how it benchmarks.
                Since you are benchmarking, I will offer my regex code. The only downside is that you need to specify any punctuation characters in advance (highlighted in blue). Then change the string position (highlighted in red).

                Code:
                clear
                input str56   txt
                  "Ber. Who's there?"
                  "Fran. Nay, answer me. Stand and unfold yourself."
                  "Ber. Long live the King!"
                  "Fran. Bernardo?"
                  "Ber. He."
                  "Fran. You come most carefully upon your hour."
                  "Ber. 'Tis now struck twelve. Get thee to bed, Francisco."
                  "Fran. For this relief much thanks. 'Tis bitter cold,"
                  "  And I am sick at heart."
                  "Ber. Have you had quiet guard?"
                  "Fran. Not a mouse stirring."
                  "Ber. Well, good night."
                  "  If you do meet Horatio and Marcellus,"
                  "  The rivals of my watch, bid them make haste."
                 end
                
                gen part2=ustrregexra(" "+trim(itrim(txt))+" ", "(([\w\.\,\?\']+\s){7})", "") if word(trim(itrim(txt)), 7)!=""
                gen part1=ustrregexra(trim(itrim(txt)), trim(itrim(part2)), "") if word(trim(itrim(txt)), 7)!=""
                gen wanted= trim(itrim(cond(missing(part2), trim(itrim(txt)), ustrregexra(" "+trim(itrim(part1))+" ", "(.*\s)(.*\s$)", "$1")+ "REPLACE" + part2)))

                Res.:

                Code:
                . l txt wanted, notrim
                
                                                                            txt                                                        wanted  
                  1.                                          Ber. Who's there?                                             Ber. Who's there?  
                  2.           Fran. Nay, answer me. Stand and unfold yourself.             Fran. Nay, answer me. Stand and REPLACE yourself.  
                  3.                                   Ber. Long live the King!                                      Ber. Long live the King!  
                  4.                                            Fran. Bernardo?                                               Fran. Bernardo?  
                  5.                                                   Ber. He.                                                      Ber. He.  
                  6.              Fran. You come most carefully upon your hour.              Fran. You come most carefully upon REPLACE hour.  
                  7.   Ber. 'Tis now struck twelve. Get thee to bed, Francisco.   Ber. 'Tis now struck twelve. Get REPLACE to bed, Francisco.  
                  8.       Fran. For this relief much thanks. 'Tis bitter cold,       Fran. For this relief much thanks. REPLACE bitter cold,  
                  9.                                    And I am sick at heart.                                       And I am sick at heart.  
                 10.                             Ber. Have you had quiet guard?                                Ber. Have you had quiet guard?  
                 11.                                Fran. Not a mouse stirring.                                   Fran. Not a mouse stirring.  
                 12.                                     Ber. Well, good night.                                        Ber. Well, good night.  
                 13.                      If you do meet Horatio and Marcellus,                            If you do meet Horatio and REPLACE  
                 14.               The rivals of my watch, bid them make haste.               The rivals of my watch, bid REPLACE make haste.  
                
                .


                • #9
                  The code in #8 has an issue if the word to be replaced ends in a question mark; this results from the 2nd line of the code. The following is more robust and requires \(n \geq 2\). As in #8, add any extra punctuation characters to the part highlighted in blue.

                  Code:
                  clear
                  input str56   txt
                    "Ber. Who's there?"
                    "Fran. Nay, answer me. Stand and unfold yourself."
                    "Ber. Long live the King!"
                    "Fran. Bernardo?"
                    "Ber. He."
                    "Fran. You come most carefully upon your hour."
                    "Ber. 'Tis now struck twelve. Get thee to bed, Francisco."
                    "Fran. For this relief much thanks. 'Tis bitter cold,"
                    "  And I am sick at heart."
                    "Ber. Have you had quiet guard?"
                    "Fran. Not a mouse stirring."
                    "Ber. Well, good night."
                    "  If you do meet Horatio and Marcellus,"
                    "  The rivals of my watch, bid them make haste."
                   end
                  
                  local n=3
                  gen part2=ustrregexra(trim(itrim(txt)), "^(?:[\w\.\?',\!]+\s+){`n'}([^\n\r]+)$", "$1") if word(trim(itrim(txt)), `n'+1)!=""
                  gen part1=subinstr(trim(itrim(txt)), trim(itrim(part2)), "", 1) if word(trim(itrim(txt)), `n')!=""
                  gen wanted= trim(itrim(cond(missing(part1), trim(itrim(txt)), ustrregexra(trim(itrim(part1)),  "(.*)\s(.*)", "$1")+ " REPLACE " + part2)))
                  Res.:

                  Code:
                  . list txt wanted, notrim
                  
                                                                              txt                                                         wanted  
                    1.                                          Ber. Who's there?                                             Ber. Who's REPLACE 
                    2.           Fran. Nay, answer me. Stand and unfold yourself.              Fran. Nay, REPLACE me. Stand and unfold yourself.  
                    3.                                   Ber. Long live the King!                                    Ber. Long REPLACE the King!  
                    4.                                            Fran. Bernardo?                                                Fran. Bernardo?  
                    5.                                                   Ber. He.                                                       Ber. He.  
                    6.              Fran. You come most carefully upon your hour.               Fran. You REPLACE most carefully upon your hour.  
                    7.   Ber. 'Tis now struck twelve. Get thee to bed, Francisco.   Ber. 'Tis REPLACE struck twelve. Get thee to bed, Francisco.  
                    8.       Fran. For this relief much thanks. 'Tis bitter cold,        Fran. For REPLACE relief much thanks. 'Tis bitter cold,  
                    9.                                    And I am sick at heart.                                   And I REPLACE sick at heart.  
                   10.                             Ber. Have you had quiet guard?                             Ber. Have REPLACE had quiet guard?  
                   11.                                Fran. Not a mouse stirring.                              Fran. Not REPLACE mouse stirring.  
                   12.                                     Ber. Well, good night.                                      Ber. Well, REPLACE night.  
                   13.                      If you do meet Horatio and Marcellus,                     If you REPLACE meet Horatio and Marcellus,  
                   14.               The rivals of my watch, bid them make haste.              The rivals REPLACE my watch, bid them make haste.  
                  
                  .
                  Last edited by Andrew Musau; 18 Dec 2022, 04:38.
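                  The question-mark issue mentioned at the top of #9 can be sketched as follows, assuming a made-up string. The 2nd line of #8 passes text taken from the data to ustrregexra() as a pattern, so a "?" in that text is read as a regex quantifier rather than a literal character; #9 uses subinstr() for that step, which treats its second argument as plain text:

```stata
// as a regex, "there?" means "ther" followed by an optional "e",
// so the literal "?" in the string is left behind
display ustrregexra("Who is there? Speak.", "there?", "X")

// subinstr() matches the characters literally, including the "?"
display subinstr("Who is there? Speak.", "there?", "X", 1)
```

                  The first call leaves the question mark in place ("Who is X? Speak."), while the second removes it along with the word ("Who is X Speak.").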


                  • #10
                    Extending #6:
                    Code:
                    clear all
                    timer clear
                    
                    input str56   txt
                      "one"
                      "Ber. Who's there?"
                      "Fran. Nay, answer me. Stand and unfold yourself."
                      "Ber. Long live the King!"
                      "Fran. Bernardo?"
                      "Ber. He."
                      "Fran. You come most carefully upon your hour."
                      "Ber. 'Tis now struck twelve. Get thee to bed, Francisco."
                      "Fran. For this relief much thanks. 'Tis bitter cold,"
                      "  And I am sick at heart."
                      "Ber. Have you had quiet guard?"
                      "Fran. Not a mouse stirring."
                      "Ber. Well, good night."
                      "  If you do meet Horatio and Marcellus,"
                      "  The rivals of my watch, bid them make haste."
                     end
                     
                    replace txt = ustrtrim(itrim(txt))
                    
                    mata :
                    
                    void replaceword(    
                        string scalar varname,
                        real scalar index,
                        string scalar replacement,
                        string scalar newname
                        )
                    
                    {
                        string colvector stvar
                        string rowvector word
                        real scalar i, j, rc_addvar, width
                    
                        width = strtoreal(substr(st_vartype(varname), 4, .)) + ustrlen(replacement)
                    
                        stvar = st_sdata(., varname)
                        
                        if (rc_addvar = st_addvar(width, newname) < 0) {
                                
                                    exit(error(rc_addvar))
                        }
                        
                        for (i = 1; i <= length(stvar); i++) {
                            
                            word = tokens(stvar[i])
                    
                            if (index <= length(word)) {
                              
                                for (j = 1; j <= length(word); j++) {
                                    
                                    if (j == index) {
                                        
                                        word[index] = replacement
                                        break
                                    }
                                }
                            }
                            
                            st_sstore(i, newname, invtokens(word))
                        }    
                    }
                    
                    end
                    
                    forvalues i = 1/20 {
                        
                        mata : replaceword("txt", `i', "REPLACEMENT", "wanted`i'")
                        
                        capture assert wanted`i' == txt // test for i > words
                        
                        if ( _rc == 0 ) {
                            
                            drop wanted`i'
                            continue, break  
                        }
                    }    
                    
                    format %-50s wanted*  
                    list *1 *5 *10 , clean
                    Code:
                           wanted1                                                           wanted5                                                        wanted10                                                  
                      1.   REPLACEMENT                                                       one                                                            one                                                        
                      2.   REPLACEMENT Who's there?                                          Ber. Who's there?                                              Ber. Who's there?                                          
                      3.   REPLACEMENT Nay, answer me. Stand and unfold yourself.            Fran. Nay, answer me. REPLACEMENT and unfold yourself.         Fran. Nay, answer me. Stand and unfold yourself.          
                      4.   REPLACEMENT Long live the King!                                   Ber. Long live the REPLACEMENT                                 Ber. Long live the King!                                  
                      5.   REPLACEMENT Bernardo?                                             Fran. Bernardo?                                                Fran. Bernardo?                                            
                      6.   REPLACEMENT He.                                                   Ber. He.                                                       Ber. He.                                                  
                      7.   REPLACEMENT You come most carefully upon your hour.               Fran. You come most REPLACEMENT upon your hour.                Fran. You come most carefully upon your hour.              
                      8.   REPLACEMENT 'Tis now struck twelve. Get thee to bed, Francisco.   Ber. 'Tis now struck REPLACEMENT Get thee to bed, Francisco.   Ber. 'Tis now struck twelve. Get thee to bed, REPLACEMENT  
                      9.   REPLACEMENT For this relief much thanks. 'Tis bitter cold,        Fran. For this relief REPLACEMENT thanks. 'Tis bitter cold,    Fran. For this relief much thanks. 'Tis bitter cold,      
                     10.   REPLACEMENT I am sick at heart.                                   And I am sick REPLACEMENT heart.                               And I am sick at heart.                                    
                     11.   REPLACEMENT Have you had quiet guard?                             Ber. Have you had REPLACEMENT guard?                           Ber. Have you had quiet guard?                            
                     12.   REPLACEMENT Not a mouse stirring.                                 Fran. Not a mouse REPLACEMENT                                  Fran. Not a mouse stirring.                                
                     13.   REPLACEMENT Well, good night.                                     Ber. Well, good night.                                         Ber. Well, good night.                                    
                     14.   REPLACEMENT you do meet Horatio and Marcellus,                    If you do meet REPLACEMENT and Marcellus,                      If you do meet Horatio and Marcellus,                      
                     15.   REPLACEMENT rivals of my watch, bid them make haste.              The rivals of my REPLACEMENT bid them make haste.              The rivals of my watch, bid them make haste.
                    Last edited by Bjarte Aagnes; 20 Dec 2022, 04:25.


                    • #11
                      Massive thanks to all of you for these excellent solutions.

                      I should have looked to Mata, but it's something I so rarely use.

                      Very much appreciated.

                      Happy Christmas

                      Adrian


                      • #12
                        So, very surprisingly, the results are in, using:

                        set obs 10000000
                        local mytext = "to be or not to be, that is the question"
                        replace the 7th word "that" with "it"

                        Option #1 in post #1: 24.9
                        Option #2 in post #1: 113.2
                        Option #3 in post #10: 74.06
                        Option #4 in post #9: 156.3

                        I think this is quite surprising, as I would have thought the Mata solution would be tons faster.

                        Once again, thanks for the suggestions,
                        and I will place a small feather in the do-file code's cap.

                        Happy Christmas


                        Adrian
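                        A minimal sketch of how a timing like those above might be taken for Option #1, using Stata's timer commands. The observation count is an assumption chosen so the sketch runs quickly (the post reports 10,000,000), and timer slot 1 is an arbitrary choice; this is not necessarily how the figures in #12 were produced:

```stata
clear
set obs 100000                  // assumed size for a quick run
generate text = "to be or not to be, that is the question"

timer clear
timer on 1

// Option #1 from post #1: rebuild the string word by word
local wordN = wordcount("to be or not to be, that is the question")
generate newword = ""
forvalues i = 1/`wordN' {
    if `i' == 7 {
        replace newword = newword + " " + "it"
    }
    else {
        replace newword = newword + " " + word(text, `i')
    }
}

timer off 1
timer list 1                    // elapsed seconds are also left in r(t1)
```

                        Wrapping each candidate in its own timer slot on the same data keeps the comparison like for like.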


                        • #13
                          I leave this here for anyone who comes a-googling.

                          Code:
                          cap prog drop subnthword
                          prog define subnthword, rclass
                              version 15.1
                              syntax varname , GENerate(name) nthword(string asis)
                              quietly {
                                  local nthword = trim(`"`nthword'"')

                                  // sort the nthwords, so replaced in order, large numbers require padding
                                  local maxpos =0
                                  local  nthword_ = `"`nthword'"'
                                  while `"`nthword'"' !="" {
                                      gettoken group nthword : nthword
                                      tokenize "`group'"
                                      if `1'> `maxpos' {
                                          local maxpos = `1'
                                      }
                                  }
                                  local pad =strlen("`maxpos'")
                                  local format "%0`pad'.0f"
                                  local nthword =`"`nthword_'"'

                                  while `"`nthword'"' !="" {
                                      gettoken group nthword : nthword
                                      tokenize "`group'"
                                      local 1 = string(`1', "`format'")
                                      local sortnthword  `"`sortnthword' "`1' `2'""'
                                  }
                                  local nthword : list sort sortnthword
                                  return local wordlist `"`nthword'"'

                                  // Find the word count of the longest string
                                  tempvar wordN
                                  gen `wordN' = wordcount(`varlist')
                                  sum `wordN' , mean
                                  local maxN = `r(max)'
                                  return local Nwords `maxN'

                                  // Start a counter for the number of replacements
                                  local n_wrd_rep = 0

                                  // Do the replacements
                                  gen `generate' = ""
                                  foreach i of numlist 1/`maxN' {
                                      if `"`nthword'"' !="" {
                                          gettoken group: nthword
                                          tokenize "`group'"

                                          if `i'==`1'  {
                                              replace `generate'= `generate' + " " + "`2'"
                                              gettoken group nthword : nthword
                                              local n_wrd_rep = `n_wrd_rep' +1
                                          }
                                          else {
                                              replace  `generate'= `generate' + " "+ word(`varlist', `i')
                                          }
                                      }
                                      else {
                                          replace  `generate'= `generate' + " "+ word(`varlist', `i')
                                      }
                                  }
                                  return local  words_replaced = `n_wrd_rep'
                              }
                          end
                          
                          /*
                          clear
                          set obs 1
                          local mytext = "The quick brown fox jumped over the lazy dog, the end"
                          gen text = "`mytext'"
                          #Example
                          subnthword text , gen(newword) nthword("11 begining" "4 badger" "8 industrious" "9 cat," )
                          
                          return list
                          list in 1
                          
                          */
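
One thing worth noting about the loop above: it starts from an empty string and prepends " " on every pass, so the result begins with a blank, and observations with fewer words than the maximum pick up extra trailing blanks. The following is only a minimal sketch of that pattern and of trimming afterwards; the variable names and the fixed loop limit of 6 are made up for illustration and are not part of the program.

```stata
clear
input str40 text
"to be or not to be"
"to be"
end

* Same accumulation pattern as the replacement loop: " " + word on every pass
gen newword = ""
forvalues i = 1/6 {
    replace newword = newword + " " + word(text, `i')
}

* newword now starts with a blank; strtrim() removes leading and trailing blanks
gen trimmed = strtrim(newword)
list
```

If the blanks matter, a final strtrim() on the generated variable after calling the program would be one way to deal with them.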



                          • #14
                            Your #1 seems to me to reuse the same string so all observations have the same value.

                            #10 is a solution for substituting the k-th word (k = 1, ..., K) in a string variable whose values have varying numbers of words. Below I make a new variant of it that builds a single expression from repeated word() calls. It needs three passes through the data: first wordcount(), then summarize to get `r(max)', the largest number of "words", and finally generate, using the built expression with the replacement in place.
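
To make the idea concrete, here is roughly what the built expression looks like once expanded, written out by hand for a case where the longest value has three words and word 2 is replaced. The variable name txt, the replacement "X" and the toy data are assumptions for illustration only.

```stata
clear
input str20 txt
"a b c"
"d e"
"f"
end

* Hand-written equivalent of the expression assembled word by word:
* each word() call is followed by a blank, and the replacement is only
* inserted where the observation actually has a second word
gen new = ustrtrim(word(txt,1) + char(32) + "X"*(word(txt,2) != "") + char(32) + word(txt,3) + char(32))
list
```

Because word() returns an empty string past the last word, shorter values only collect extra blanks at the end, which ustrtrim() removes.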

                            Below are timings (in seconds) for 1) the Mata solution in #10 vs 2) the new alternative.

                            N = 1.6 million observations, replacing each of words 1–10, repeated 10 times (100 timed runs per method)

                            Code:
                            . timer list
                               1:   1086.18 /      100 =      10.8618
                               2:    362.79 /      100 =       3.6279
                            timing code: timings.do
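
For readers without the attachment, timings of this kind can be collected with Stata's timer command. This is only a minimal sketch of the pattern, not the contents of timings.do; the toy data, the timer number and the timed command are placeholders.

```stata
clear
set obs 1000
gen txt = "to be or not to be that is the question"

timer clear
forvalues rep = 1/10 {
    timer on 1
    gen k = wordcount(txt)   // the command being timed goes here
    timer off 1
    drop k
}

* Reports total seconds, number of on/off cycles, and seconds per cycle
timer list 1
```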

                            Code:
                            local vn txt
                            local index 7
                              
                            gen k = wordcount(`vn') 
                            su k , meanonly // get max number of words
                            
                            forvalues w = 1/`r(max)' { // build expression
                                
                                if ( `w' == 1 ) { 
                                    
                                    local exp  
                                    local concat  
                                }
                                
                                if ( `w' > 1 ) {
                                    
                                    local concat +
                                }
                                
                                if ( `w' == `index' ) { 
                                    
                                    local word = `" "REPLACEMENT"*(word(`vn',`w') != "") "'
                                }
                                
                                else {
                                    
                                    local word word(\`vn',`w')
                                }
                                     
                                loc exp `" `exp' `concat' `word' + char(32) "'  
                            }
                            
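                            // the suffix macro i used in the original line is not defined in this snippet
                            // (it presumably comes from the timing loop), so the variable is named new_wanted
                            gen new_wanted = ustrtrim(`exp')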
