The intersection of two Chinese variables

River Huang

Join Date: Mar 2016

Posts: 1908
#1


27 Feb 2021, 20:24

Dear All, I found this question here (in Chinese). The data set is

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str20(var1 var2 newvar)
"中国人民大学" "中国农业大学" "中国大学"
"北京大学"     "清华大学"     "大学"
"北京科技大学" "北京大学"     "北京大学"
"北京师范大学" "上海师范大学" "师范大学"
end

Given "var1" and "var2", the desired result is "newvar". Basically, "newvar" consists of the words appearing in both "var1" and "var2". Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Joseph Coveney

Join Date: Apr 2014

Posts: 4405
#2

27 Feb 2021, 21:11

It doesn't do any error checking and it doesn't take advantage of the fact that, for example, "中国", "人民", "大学", "北京" are words, but the following will give you what you want for the specific example that you show.

Code:

version 16.1

clear *

* Example data from post #1
quietly input str20(var1 var2 newvar)
"中国人民大学" "中国农业大学" "中国大学"
"北京大学"     "清华大学"     "大学"
"北京科技大学" "北京大学"     "北京大学"
"北京师范大学" "上海师范大学" "师范大学"
end
quietly compress

* ict will hold the intersection of var1 and var2
quietly generate str ict = ""

forvalues i = 1/`=_N' {

    * Split var1 into a space-separated list of single characters
    local left = var1[`i']
    local parsed_left
    forvalues j = 1/`: ustrlen local left' {
        local parsed_left = "`parsed_left'" + " " + usubstr("`left'", `j', 1)
    }

    * Do the same for var2
    local right = var2[`i']
    local parsed_right
    forvalues j = 1/`: ustrlen local right' {
        local parsed_right = "`parsed_right'" + " " + usubstr("`right'", `j', 1)
    }

    * Intersect the two lists, then remove the separating spaces
    local intersection : list parsed_left & parsed_right
    local intersection : subinstr local intersection " " "", all
    quietly replace ict = "`intersection'" in `i'
}

list, noobs

  +---------------------------------------------------+
  |         var1           var2     newvar        ict |
  |---------------------------------------------------|
  | 中国人民大学   中国农业大学   中国大学   中国大学 |
  |     北京大学       清华大学       大学       大学 |
  | 北京科技大学       北京大学   北京大学   北京大学 |
  | 北京师范大学   上海师范大学   师范大学   师范大学 |
  +---------------------------------------------------+

You might want to look into what Python offers in that it probably has a richer set of string-handling functions. There are probably more elegant approaches even within Stata that others on the list could offer.

Andrew Musau

Join Date: Oct 2014
Posts: 10190
#3

28 Feb 2021, 05:29

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str20(var1 var2 newvar)
"中国人民大学" "中国农业大学" "中国大学"
"北京大学"       "清华大学"       "大学"      
"北京科技大学" "北京大学"       "北京大学"
"北京师范大学" "上海师范大学" "师范大学"
end

gen wanted=ustrregexra(var1, "[^"+var2+"]", "")
assert wanted==newvar

Res.:

Code:

. l

     +---------------------------------------------------+
     |         var1           var2     newvar     wanted |
     |---------------------------------------------------|
  1. | 中国人民大学   中国农业大学   中国大学   中国大学 |
  2. |     北京大学       清华大学       大学       大学 |
  3. | 北京科技大学       北京大学   北京大学   北京大学 |
  4. | 北京师范大学   上海师范大学   师范大学   师范大学 |
     +---------------------------------------------------+


William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

28 Feb 2021, 09:09

Andrew Musau has presented in post #3 a most persuasive example of the power that Stata's Unicode regular expression functions bring to dealing with string data in Stata. I'm jealous because I don't think I would have thought of the elegant solution he presented. Just "liking" it seemed inadequate.
River Huang

Join Date: Mar 2016

Posts: 1908
#5

28 Feb 2021, 17:43

Dear Andrew, Many thanks for this suggestion. (However, I always have trouble understanding the command. Is there a way to learn more about the -ustrregexra- command and the like?)

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

28 Feb 2021, 18:22

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.

A good reference site for regular expressions can be found at https://www.regular-expressions.info.
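As a small illustration of the difference, and assuming the ICU syntax described at the link above applies, the Unicode functions work on whole characters, whereas the byte-based functions see each Chinese character as several bytes. The sample string and the \p{Han} script property below are chosen purely for illustration and do not come from the earlier posts.

```stata
* Illustration only: the string and patterns are assumptions, not from the thread
local s "北京大学"

* Character count versus byte count
* (UTF-8 stores each of these characters in more than one byte)
display ustrlen("`s'")
display strlen("`s'")

* ICU script property: does the string consist only of Han characters?
display ustrregexm("`s'", "^\p{Han}+$")

* Negated character class, the construct used in post #3:
* remove every character except 大 and 学
display ustrregexra("`s'", "[^大学]", "")
```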

With all that said, what Andrew's code does is use the value of var2 to construct a different regular expression for each observation. In the first observation, the ustrregexra() function call (not command) becomes

Code:

ustrregexra(var1, "[^中国农业大学]", "")

where the regular expression (the second argument) was constructed by surrounding the value of var2 with [^ and ], creating a pattern that matches any character not among the characters following the ^, that is, the characters 中国农业大学 taken from the value of var2 in that observation. Every matched character is then replaced with the empty string (the third argument), leaving only those characters in var1 that were also in var2.
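To see this concretely, the pattern can be stored in a variable of its own before it is used. The sketch below reuses the example data from post #1; the names pattern, wanted2, wanted3 and len1 are made up for illustration. It also spells out the same filter as a character-by-character loop. The regular-expression version assumes that var2 is never empty and contains no characters with a special meaning inside a character class (such as ], \, ^ or -).

```stata
clear
input str20(var1 var2 newvar)
"中国人民大学" "中国农业大学" "中国大学"
"北京大学"     "清华大学"     "大学"
"北京科技大学" "北京大学"     "北京大学"
"北京师范大学" "上海师范大学" "师范大学"
end

* Step 1: build one regular expression per observation from var2
generate pattern = "[^" + var2 + "]"

* Step 2: delete every character of var1 that matches the pattern,
* that is, every character that does not occur in var2
generate wanted2 = ustrregexra(var1, pattern, "")

* The same filter without regular expressions: walk through var1
* one character at a time and keep the characters found in var2
generate wanted3 = ""
generate long len1 = ustrlen(var1)
summarize len1, meanonly
forvalues j = 1/`r(max)' {
    quietly replace wanted3 = wanted3 + usubstr(var1, `j', 1) ///
        if `j' <= len1 & ustrpos(var2, usubstr(var1, `j', 1)) > 0
}

* Both versions should reproduce the desired result
assert wanted2 == newvar
assert wanted3 == newvar
list var1 var2 pattern wanted2 wanted3, noobs
```

If var2 could contain characters that are special inside a character class, the loop version sidesteps the issue, because it never interprets var2 as a pattern.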

My admiration for Andrew's code comes from his understanding that the regular expression can be constructed using a string expression and thus can differ from observation to observation. That is just so elegant and so powerful.

Last edited by William Lisowski; 28 Feb 2021, 18:24.
River Huang

Join Date: Mar 2016

Posts: 1908
#7

01 Mar 2021, 00:30

Dear William, Thanks a lot for the useful information. I have no knowledge of Unicode regular expressions and need to invest time in understanding the command.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)