Matching two string variables

Bader Bin Adwan

Join Date: Apr 2021
Posts: 91

Matching two string variables

08 Jun 2023, 14:10

Hi,
In the original dataset, var1 would have more than 200K values and var2 have more than 14K values (all strings).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 var1 str5 var2
"R560" "A00 " 
"T393" "A000 "
"J219" "A001 "
"P59"  "A009 "
"K092" "A01 " 
"R509" "A010 "
"G379" "A011 "
"A162" "A012 "
"P073" "A013 "
"F209" "A014 "
"Z523" "A02 " 
"C169" "K092 "
"H663" "A021 "
"S835" "A022 "
"Z501" "A028 "
"B209" "A029 "
"F192" "A03 " 
"D414" "A030 "
"C419" "A031 "
"J039" "A032 "
"U071" "A033 "
"Z000" "A038 "
"G129" "A039 "
"J189" "A04 " 
"E14"  "A040 "
"K810" "A041 "
"R560" "A042 "
"G060" "A043 "
"G473" "A044 "
"P073" "A045 "
"F209" "A046 "
"D70"  "A047 "
"C719" "A048 "
"L021" "A049 "
"M659" "A05 " 
"Z501" "A050 "
"A527" "A051 "
"F151" "A052 "
"N270" "A053 "
"C910" "A054 "
"U071" "A058 "
"U071" "A059 "
"K297" "A162 " 
"Z922" "A060 "
"J219" "A061 "
"J219" "A062 "
"S024" "A063 "
"N390" "A064 "
"Q039" "A065 "
"G473" "A066 "
end

I want Stata to remember the string values of var2 and check if there is any occasion where there is a match in var1 then create a dummy variable and assign value of 1 to it. I tried the following:

Code:

levelsof var2 , local (codes)
gen Flag =0
foreach n of local codes {
    replace Flag =1 if ( var1 == "`n'")
    }

But for some reason this does not work in terms of results.

Is there anyway around to achieve it? I also tried egen any match but it says in the error msg that var1 is a string variable and cannot be included in the varlist!

Tags: None

Andrew Musau

Join Date: Oct 2014

Posts: 10191
#2

08 Jun 2023, 14:19

You have trailing spaces in "var2". So eliminate these using

Code:

help strtrim

or include this in the command

Code:

replace Flag =1 if (var1 == strtrim("`n'"))
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#3

08 Jun 2023, 15:16

Thanks Andrew, we moving to right direction, but there are some instances where the command did not get the match string values especially those that starts with letter Y, V and W. Something odd and I cannot figure it out!
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10191
#4

08 Jun 2023, 15:25

Provide a dataex example of such cases.
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#5

08 Jun 2023, 15:27

What I am trying to find is if there is any string values in var1 that is outside the range of string values of var2.
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#6

08 Jun 2023, 15:31

var2 contains over 14K values and I copy and past it from excel file, and its not attached to any sort of id or obs unlike to var1.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10191
#7

08 Jun 2023, 15:43

Indeed, that is clear. You state in #3 that there are cases where your code in #1 with modifications for trailing blanks fails. Show us such cases. Another way to do this is:

Code:

replace var1=strtrim(var1) frame put var2, into(match) frame match{ replace var2= strtrim(var2) contract var2, nomiss } frlink m:1 var1, frame(match var2) *VAR1 MATCHED browse var1 match if !missing(match)

You may switch var1 and var2 if you wish to find matches in the opposite direction.
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#8

08 Jun 2023, 16:08

Thanks again, but I got same results as before, which I know is not correct. I wish I can share an example of the data but it is difficult to do so as it is not row-wise therefore it will not be good for comparison unless someone have the whole data which is not practical!
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10191
#9

08 Jun 2023, 16:50

You can select a sample of the data, e.g.,

Code:

dataex in 2700/2730

for observations between and including the observation numbers 2700 and 2730. A guess is that what you think are spaces are other characters, in which case substitute

Code:

strtrim()

with its Unicode counterpart

Code:

ustrtrim()

Last edited by Andrew Musau; 08 Jun 2023, 16:54.
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#10

09 Jun 2023, 06:56

I think you are right about the list itself, because I opened a new stata file and entered the string values manually and then run the command and it have been picked up. How can I make sure the string values are intact, how can I use the both codes that you suggest?
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10191
#11

09 Jun 2023, 08:41

I do not know specifically what solved your problem. The only way to find out is by presenting a reproducible example which will allow someone to investigate this.
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#12

11 Jun 2023, 16:14

I figured out the issue, it was var2 since it has all the list values. This codes would works

Code:

levelsof var2 , local (codes) gen Flag =0 foreach n of local codes { replace Flag =1 if ( var1 == "`n'") }

But how to exclude string values that starts with initials V*, W*, X*, Y* from local codes before looping through var2? I tried replace Flag =1 if ( var1 == "`n'") & !inlist(var2, "V*", "W*", "X*", "Y*") but it did not work
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#13

11 Jun 2023, 16:18

If I remove all values in var2 that starts with initials VWXY, the results will be correct, but I don't want to remove those values from var2, so is there any way to overcome this issue?

Thanks
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10191
#14

11 Jun 2023, 16:29

#12

Code:

& !ustrregexm(var2, "^V|^W|^X|^Y")

#13 - see #11.
Comment

Announcement

Matching two string variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment