  • Compare last part of two text variables

    Hello. I need to generate a variable that gives the length of the text that two other string variables have in common, counting backward from the last character and including spaces. For example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str10 var1 str13 var2 float ANSWER
    "cadena uno" "la cadena uno" 10
    "dos dos"    "dos dos dos"    7
    "tres dos"   "dos dos dos"    5
    end
    Help please. Thanks

  • #2
    Esteban, I think Ali's solution in the other post is very smart, and it works well on my computer. I revised it a little to answer your question in this post.

    Code:
    gen newvar1 = ustrregexra(var1,"\w| ","$0?") + "$"
    gen wanted = strlen(ustrregexs(0)) if ustrregexm(var2,newvar1)
    drop newvar1
    Code:
    . l
    
         +-------------------------------------+
         |       var1            var2   wanted |
         |-------------------------------------|
      1. | cadena uno   la cadena uno       10 |
      2. |    dos dos     dos dos dos        7 |
      3. |   tres dos     dos dos dos        5 |
         +-------------------------------------+


    • #3
      If it doesn't work on your computer, you can still try my revised, clumsier solution below.

      Code:
      gen var1_len = strlen(var1)
      qui sum var1_len
      local len = r(max)
      gen var2_len = strlen(var2)
      * summarize var2_len so that r(max) refers to var2, not var1
      qui sum var2_len
      local len = min(`len', r(max))
      drop var1_len var2_len
      
      gen v1 = cond(substr(var1,-1,1)==substr(var2,-1,1)&!mi(substr(var1,-1,1))&!mi(substr(var2,-1,1)), 1, 0)
      gen wanted = v1
      
      forvalues i = 2/`len'  {
          gen v2 = cond(substr(var1,-`i',1)==substr(var2,-`i',1)&!mi(substr(var1,-`i',1))&!mi(substr(var2,-`i',1)), 1, 0)
          replace v1 = v1*v2
          replace wanted = wanted + v1
          drop v2
      }
      
      drop v1


      • #4
        The basic problem here is that you are looking at the overlap between two variables, and this can easily be accomplished with

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str10 var1 str13 var2 float ANSWER
        "cadena uno" "la cadena uno" 10
        "dos dos"    "dos dos dos"    7
        "tres dos"   "dos dos dos"    5
        end
        
        gen wanted=length(ustrregexra(var1, "[^"+var2+"]", ""))
        Note that Stata's Unicode regular expression functions were introduced in Stata 14, so previous versions do not support this.

        Result:

        Code:
        . gen wanted=length(ustrregexra(var1, "[^"+var2+"]", ""))
        
        . l
        
             +-----------------------------------------------+
             |       var1            var2   ANSWER   wanted  |
             |-----------------------------------------------|
          1. | cadena uno   la cadena uno       10        10 |
          2. |    dos dos     dos dos dos        7         7 |
          3. |   tres dos     dos dos dos        5         5 |
             +-----------------------------------------------+


        • #5
          Andrew, thanks for the information about when the Unicode regular expression functions were introduced. I think the OP wants to count the number of characters (including spaces) backward from the end until there is a difference. For example, if var1 in the third observation were "tdes dos", wanted should still be 5, but I think your code gives 6.
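
          To make the difference concrete, here is a minimal sketch that runs the expression from #4 next to a backward walk that stops at the first mismatch. The variable names in_var2, still and suffix are hypothetical, and the loop uses the byte-based substr() and strlen(), so it assumes plain ASCII data.

          ```stata
          clear
          input str10 var1 str13 var2
          "tres dos" "dos dos dos"
          "tdes dos" "dos dos dos"
          end

          * Expression from #4: counts every character of var1 that occurs anywhere in var2
          gen in_var2 = length(ustrregexra(var1, "[^" + var2 + "]", ""))

          * Longest length that needs checking: the shorter of the two strings
          gen len = min(strlen(var1), strlen(var2))
          summarize len, meanonly
          local maxlen = r(max)
          drop len

          * Walk backward from the last character; still turns 0 at the first mismatch
          gen byte still = 1
          gen suffix = 0
          forvalues i = 1/`maxlen' {
              replace still = still & (substr(var1, -`i', 1) == substr(var2, -`i', 1)) & (substr(var1, -`i', 1) != "")
              replace suffix = suffix + still
          }
          drop still

          list
          ```

          For "tdes dos" the first expression keeps "ds dos" (6 characters), while the backward walk stops at "e" versus "o" and returns 5.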


          • #6
            Correct, #4 will count all common elements.


            • #7
              Code:
              tempvar len rvar1 rvar2
              gen `len' = max(ustrlen(var1), ustrlen(var2))
              su `len' , meanonly
              
              gen `rvar1' = ustrreverse(var1)
              gen `rvar2' = ustrreverse(var2)
              gen c = .
              
              foreach i of numlist `r(max)'/1 {
                  
                  replace c = `i' if mi(c) & usubstr(`rvar1', 1, `i') == usubstr(`rvar2', 1, `i')    
              }
              Last edited by Bjarte Aagnes; 18 Jul 2022, 11:22.


              • #8
                A variant with fewer passes through the data:
                Code:
                tempvar len1 len2
                
                gen `len1' = ustrlen(var1)
                sum `len1', meanonly
                loc  max1 = r(max)
                gen `len2' = ustrlen(var2)
                sum `len2', meanonly
                loc  minmax = min(`max1', r(max))
                
                forvalues i = 1/`minmax' {
                
                    local exp `exp'`add' (usubstr(var1,-`i',1)==usubstr(var2,-`i',1))
                    local add +
                }
                
                gen byte c = `exp'

                Timings (10 repetitions) on 3 million observations:
                #7: 12.06
                #8: 5.51
                #8: 1.56 (not using Unicode string functions)
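
                A sketch of what a byte-based variant like the one in the last timing might look like, assuming the data are plain ASCII so that substr() and strlen() agree with their Unicode counterparts. Unlike a plain sum of position-wise comparisons, which would keep counting matches that occur after the first mismatch, this version nests the comparisons so that the count stops at the first difference. The nesting and the temporary names len1 and len2 are my assumptions, not necessarily the code that was timed, and for very long strings the nested expression may run into Stata's expression-size limits.

                ```stata
                clear
                input str10 var1 str13 var2 float ANSWER
                "cadena uno" "la cadena uno" 10
                "dos dos"    "dos dos dos"    7
                "tres dos"   "dos dos dos"    5
                end

                * Longest suffix that could possibly match in any observation
                gen len1 = strlen(var1)
                summarize len1, meanonly
                local max1 = r(max)
                gen len2 = strlen(var2)
                summarize len2, meanonly
                local minmax = min(`max1', r(max))
                drop len1 len2

                * Build m1*(1 + m2*(1 + m3*(...))) from the inside out, where m_i is 1
                * if the i-th characters from the end agree and 0 otherwise
                local exp 0
                forvalues i = `minmax'(-1)1 {
                    local exp (substr(var1,-`i',1)==substr(var2,-`i',1) & substr(var1,-`i',1)!="")*(1+`exp')
                }

                * One pass through the data evaluates the whole expression
                gen byte c = `exp'
                list
                ```

                The product form equals m1 + m1*m2 + m1*m2*m3 + ..., so each term contributes only while every earlier comparison has matched.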
