compare substrings in same var - same same, but different

Franz Gerbig

Join Date: Jan 2017

Posts: 58
#1


23 May 2019, 04:48

Hi.

I have a problem with a large list of more than 19,000 towns. More than 1,200 towns have duplicates by name (dup != 0); the duplicates may lie in the same region ("AB") or in a different one.
The aim is to identify the codes of all duplicates outside region "AB" that share their name with at least one town located in region "AB".
In the following example, the desired codes would be:
12345
12346
(both lie outside region "AB", but share the name "Village" with the town with code 12344, which is located in region "AB").

Code:

input code str25 name str2 region
12344 "Village" "AB"
12345 "Village" "TB"
12346 "Village" "DC"
12347 "Brisbane City" "AB"
12348 "Torrento" "TB"
12349 "Torrento" "TB"
12350 "Brisbane City" "AB"
12351 "Swanlake" "DC"
12352 "Island" "WR"
end

duplicates tag name, gen(dup)

How may I do this with Stata commands instead of manually in Excel or something similar?
Is that clear?

Thank you for reading (and some reply)
Using Stata 16.1
Extractions (-dataex-) of the data I'm working with is impossible, sorry!
Tags: string, substring
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#2

23 May 2019, 06:02

The following works for your example; however, I can imagine more complicated cases where it might not work, so be careful.

Code:

sort name region
gen byte wanted = (name==name[_n-1] & region!=region[_n-1]) | (name==name[_n+1] & region!=region[_n+1])
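To make the caveat concrete, here is a minimal sketch of one case where the neighbour comparison misses observations. The codes 1 to 4 are made up for illustration, and code is added to the sort only to make the order reproducible. After sorting, three towns in region "TB" follow one town in "AB", so the later "TB" towns have only same-region neighbours.

```stata
clear
input long code str25 name str2 region
1 "Village" "AB"
2 "Village" "TB"
3 "Village" "TB"
4 "Village" "TB"
end

* same rule as above: compare each observation with its direct neighbours only
sort name region code
gen byte wanted = (name==name[_n-1] & region!=region[_n-1]) | (name==name[_n+1] & region!=region[_n+1])

* codes 1 and 2 get wanted = 1; codes 3 and 4 get wanted = 0,
* because neither of their direct neighbours has a different region
list, sepby(name)
```

So the rule appears to depend on how the regions happen to be ordered within a name, which matters once a name has many duplicates.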
Franz Gerbig

Join Date: Jan 2017

Posts: 58
#3

23 May 2019, 06:41

Yeah, looks appropriate.
Since the code uses both _n-1 and _n+1, it works for duplicates "before" and "after" the town in region "AB", right?
I ask because I don't know whether the duplicates follow or precede (or both) the one in region "AB", since the data are sorted by name first.
But it does not work for more than one duplicate before or after, does it? I don't know how many duplicates exist for a given name.

Last edited by Franz Gerbig; 23 May 2019, 06:59.

Franz Gerbig

Join Date: Jan 2017

Posts: 58
#4

23 May 2019, 08:13

If I type -20/+20 instead of -1/+1, it should work for up to 40 duplicates (20 before and 20 after the one in "AB"), right?

Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#5

23 May 2019, 08:27

Re #4: no, that is not how subscripting works in Stata. A subscript such as name[_n-20] refers to the single observation 20 rows earlier, not to a range of 20 rows. If you think you have more than 2 consecutive duplicates, you need to be clearer about what is going on. Note that if you have 3 repeats of a name with 3 different regions, the code I gave in #2 will mark all 3 with a "1". Please try the code, examine your results, and report back on anything that is not what you want (again showing data via -dataex- as you did above).
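As a small illustration of explicit subscripting, the sketch below builds a toy dataset of five observations. The variable names id, one_up and two_up are made up for this example.

```stata
clear
set obs 5
gen id = _n

* id[_n-1] is the value of id in the observation directly above
gen one_up = id[_n-1]

* id[_n-2] is the value exactly two observations above, not "one or two above"
gen two_up = id[_n-2]

* subscripts that point outside the data return missing,
* so one_up is missing in row 1 and two_up is missing in rows 1 and 2
list
```

A larger offset therefore moves the single observation being compared further away; it does not widen the comparison to every observation in between.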

Franz Gerbig

Join Date: Jan 2017
Posts: 58

23 May 2019, 09:13

It works fine with the example data:

Code:

clear
input code str25 name str2 region
12348 "Torrento" "TB"    
12345 "Village" "AB"    
12344 "Village" "TB"    
12346 "Village" "DC"    
12347 "Brisbane City" "AB"    
12349 "Torrento" "TB"    
12350 "Brisbane City" "AB"
12351 "Swanlake" "DC"
12352 "Island" "WR"
12359 "Village" "WR"
12366 "Village" "TB"
end

duplicates tag name, gen(dup)

sort name region
gen byte wanted = 1 if (name==name[_n-1] & region!=region[_n-1]) | (name==name[_n+1] & region!=region[_n+1])
fre wanted
fre code if wanted == 1

gen wanted2 = wanted
replace wanted2 = 0 if region == "AB" //only duplicates out of "AB" wanted
fre wanted2
fre code if wanted2 == 1 //fine

But it does not work with the (confidential) real data: more duplicates (there are towns with more than 90 duplicates ...) should be coded wanted = 1 (and wanted2 = 1, of course).
I don't know what is wrong, or how to illustrate it.
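One possible explanation is that the neighbour comparison only looks at the observations directly before and after, so in a long run of duplicates a town whose two neighbours share its region is never flagged. A minimal sketch of an alternative that works per name group rather than per neighbour is shown here on the example data. The variable names has_ab and wanted_ab are made up for this sketch, and it assumes that region is spelled exactly "AB" (no stray blanks or different case) in the real data.

```stata
clear
input long code str25 name str2 region
12348 "Torrento" "TB"
12345 "Village" "AB"
12344 "Village" "TB"
12346 "Village" "DC"
12347 "Brisbane City" "AB"
12349 "Torrento" "TB"
12350 "Brisbane City" "AB"
12351 "Swanlake" "DC"
12352 "Island" "WR"
12359 "Village" "WR"
12366 "Village" "TB"
end

* has_ab = 1 if at least one town with this name lies in region "AB", else 0
bysort name: egen byte has_ab = max(region == "AB")

* flag towns outside "AB" whose name also occurs in "AB"
gen byte wanted_ab = has_ab == 1 & region != "AB"

list code name region if wanted_ab == 1, sepby(name)
```

On this example the flagged codes are 12344, 12346, 12359 and 12366. Because the flag is computed once per name, it does not depend on the sort order or on how many duplicates a name has.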

Last edited by Franz Gerbig; 23 May 2019, 09:16.
