Compare last part of two text variables

Esteban Jara

Join Date: Jun 2020

Posts: 91
#1


18 Jul 2022, 08:29

Hello. I need to generate a variable that gives the length of the text that two other string variables have in common, counting backward from the last character and including spaces. For example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 var1 str13 var2 float ANSWER
"cadena uno" "la cadena uno" 10
"dos dos"    "dos dos dos"    7
"tres dos"   "dos dos dos"    5
end

Help please. Thanks

Fei Wang

Join Date: Oct 2021
Posts: 726
#2

18 Jul 2022, 08:48

Esteban, I think Ali's solution in the other post is very smart, and it works well on my computer. I revised it slightly to answer your question in this post.

Code:

gen newvar1 = ustrregexra(var1,"\w| ","$0?") + "$"
gen wanted = strlen(ustrregexs(0)) if ustrregexm(var2,newvar1)
drop newvar1

Code:

. l

     +-------------------------------------+
     |       var1            var2   wanted |
     |-------------------------------------|
  1. | cadena uno   la cadena uno       10 |
  2. |    dos dos     dos dos dos        7 |
  3. |   tres dos     dos dos dos        5 |
     +-------------------------------------+
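To see what the generated pattern looks like, here is a minimal sketch (the variable names pattern and matched, and the second test row, are my own additions, not from the thread). Each word character or space in var1 gets a "?" appended, so every character becomes optional, and the trailing "$" anchors the match at the end of var2. One consequence, as far as I can tell, is that characters in the middle of var1 can be skipped, so the match is not always a true common ending. Regex metacharacters in var1 are also not escaped.

```stata
clear
input str8 var1 str11 var2
"tres dos" "dos dos dos"
"axb"      "ab"
end

* build the pattern exactly as in the post above
gen pattern = ustrregexra(var1, "\w| ", "$0?") + "$"

* keep the matched text so it can be inspected
gen matched = ustrregexs(0) if ustrregexm(var2, pattern)
gen wanted  = strlen(matched)
list
```

By my reading, the first row gets the pattern t?r?e?s? ?d?o?s?$ and matches "s dos" (length 5), as intended. The second row gets a?x?b?$, which matches all of "ab" (length 2), although the two strings only share the final "b".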


Fei Wang

Join Date: Oct 2021
Posts: 726
#3

18 Jul 2022, 08:53

If it doesn't work on your computer, you can still try my revised, clumsier solution below.

Code:

gen var1_len = strlen(var1)
qui sum var1_len
local len = r(max)
gen var2_len = strlen(var2)
qui sum var2_len
local len = min(`len', r(max))
drop var1_len var2_len

gen v1 = cond(substr(var1,-1,1)==substr(var2,-1,1)&!mi(substr(var1,-1,1))&!mi(substr(var2,-1,1)), 1, 0)
gen wanted = v1

forvalues i = 2/`len'  {
    gen v2 = cond(substr(var1,-`i',1)==substr(var2,-`i',1)&!mi(substr(var1,-`i',1))&!mi(substr(var2,-`i',1)), 1, 0)
    replace v1 = v1*v2
    replace wanted = wanted + v1
    drop v2
}

drop v1


Andrew Musau

Join Date: Oct 2014
Posts: 10481
#4

18 Jul 2022, 09:20

The basic problem here is that you are looking at the overlap between 2 variables, and this can easily be accomplished by

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 var1 str13 var2 float ANSWER
"cadena uno" "la cadena uno" 10
"dos dos"    "dos dos dos"    7
"tres dos"   "dos dos dos"    5
end

gen wanted=length(ustrregexra(var1, "[^"+var2+"]", ""))

Note that Stata's Unicode regular expression functions were introduced in Stata 14, so previous versions do not support this.

Res.:

Code:

. gen wanted=length(ustrregexra(var1, "[^"+var2+"]", ""))

. l

     +-----------------------------------------------+
     |       var1            var2   ANSWER   wanted  |
     |-----------------------------------------------|
  1. | cadena uno   la cadena uno       10        10 |
  2. |    dos dos     dos dos dos        7         7 |
  3. |   tres dos     dos dos dos        5         5 |
     +-----------------------------------------------+


Fei Wang

Join Date: Oct 2021

Posts: 726
#5

18 Jul 2022, 09:51

Andrew, thanks for the information about when the regular expression functions were introduced. I think the OP wants to count the number of characters (including spaces) backward from the end until there is a difference. For example, if var1 in the 3rd row were "tdes dos", wanted should still be 5, and I think your code gives 6.
Andrew Musau

Join Date: Oct 2014

Posts: 10481
#6

18 Jul 2022, 09:59

Correct, #4 will count all common elements.
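
To make the difference concrete, here is a small sketch (the variable names kept and n_common are mine, not from the thread) that applies the expression from #4 to the original third row and to the "tdes dos" variant from #5. The character class removes every character of var1 that does not occur anywhere in var2, so position plays no role.

```stata
clear
input str10 var1 str13 var2
"tres dos" "dos dos dos"
"tdes dos" "dos dos dos"
end

* keep only the characters of var1 that occur somewhere in var2
gen kept = ustrregexra(var1, "[^" + var2 + "]", "")
gen n_common = length(kept)
list
```

By my reading, kept is "s dos" (5) in the first row and "ds dos" (6) in the second, because the "d" in "tdes" survives even though it comes before the first mismatch counted from the end.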

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786
#7

18 Jul 2022, 10:59

Code:

tempvar len rvar1 rvar2
gen `len' = max(ustrlen(var1), ustrlen(var2))
su `len' , meanonly

gen `rvar1' = ustrreverse(var1)
gen `rvar2' = ustrreverse(var2)
gen c = .

foreach i of numlist `r(max)'/1 {
    replace c = `i' if mi(c) & usubstr(`rvar1', 1, `i') == usubstr(`rvar2', 1, `i')
}

Last edited by Bjarte Aagnes; 18 Jul 2022, 11:22.


Bjarte Aagnes

Join Date: Apr 2014
Posts: 786
#8

19 Jul 2022, 05:21

A variant with fewer passes through the data:

Code:

tempvar len1 len2

gen `len1' = ustrlen(var1)
sum `len1', meanonly
loc  max1 = r(max)
gen `len2' = ustrlen(var2)
sum `len2', meanonly
loc  minmax = min(`max1', r(max))

forvalues i = 1/`minmax' {

    local exp `exp'`add' (usubstr(var1,-`i',1)==usubstr(var2,-`i',1))
    local add +
}

gen byte c = `exp'

Timings (10 reps) on 3 million observations:

#7 12.06
#8 05.51
#8 01.56 not using Unicode string functions
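
A caveat when checking these variants: summing position-wise matches, as the expression built in #8 does, also counts matches that occur after the first mismatch. For the "tdes dos" case from #5 that sum would be 6 rather than 5 by my reading. Below is a minimal sketch of a run-length version for comparison, in the spirit of #3 but with the Unicode functions. The helper variables shorter and alive and the three extra test rows are my own assumptions, not from the thread, and I have not timed it.

```stata
clear
input str10 var1 str13 var2 byte ANSWER
"cadena uno" "la cadena uno" 10
"dos dos"    "dos dos dos"    7
"tres dos"   "dos dos dos"    5
"tdes dos"   "dos dos dos"    5
"ab"         "ab"             2
"abc"        "xyz"            0
end

* no position beyond the shorter of the two strings can be part of the common ending
gen int shorter = min(ustrlen(var1), ustrlen(var2))
summarize shorter, meanonly
local maxlen = r(max)

* alive stays 1 only while every position checked so far has matched
gen byte alive = 1
gen int wanted = 0

forvalues i = 1/`maxlen' {
    replace alive = alive & (`i' <= shorter) & (usubstr(var1, -`i', 1) == usubstr(var2, -`i', 1))
    replace wanted = wanted + alive
}

drop shorter alive
assert wanted == ANSWER
list
```

The explicit length check avoids relying on what usubstr() returns for a position past the start of a string, at the cost of two passes through the data per character position.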
