  • The intersection of two Chinese variables

    Dear All, I found this question here (in Chinese). The data set is
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str20(var1 var2 newvar)
    "中国人民大学" "中国农业大学" "中国大学"
    "北京大学"       "清华大学"       "大学"      
    "北京科技大学" "北京大学"       "北京大学"
    "北京师范大学" "上海师范大学" "师范大学"
    end
    Given "var1" and "var2", the desired result is "newvar". Basically, "newvar" contains the words that appear in both "var1" and "var2". Thanks.
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    The code below doesn't do any error checking, and it doesn't take advantage of the fact that, for example, "中国", "人民", "大学", "北京" are words, but it will give you what you want for the specific example that you show.

    version 16.1
    clear *

    quietly input str20(var1 var2 newvar)
    "中国人民大学" "中国农业大学" "中国大学"
    "北京大学"     "清华大学"     "大学"
    "北京科技大学" "北京大学"     "北京大学"
    "北京师范大学" "上海师范大学" "师范大学"
    end
    quietly compress

    * ict will hold the characters common to var1 and var2
    quietly generate str ict = ""

    forvalues i = 1/`=_N' {
        * Split var1 into a space-separated list of single characters
        local left = var1[`i']
        local parsed_left
        forvalues j = 1/`: ustrlen local left' {
            local parsed_left = "`parsed_left'" + " " + usubstr("`left'", `j', 1)
        }

        * Do the same for var2
        local right = var2[`i']
        local parsed_right
        forvalues j = 1/`: ustrlen local right' {
            local parsed_right = "`parsed_right'" + " " + usubstr("`right'", `j', 1)
        }

        * Intersect the two lists, then remove the separating spaces
        local intersection : list parsed_left & parsed_right
        local intersection : subinstr local intersection " " "", all
        quietly replace ict = "`intersection'" in `i'
    }

    list, noobs

      +---------------------------------------------------+
      |         var1           var2     newvar        ict |
      |---------------------------------------------------|
      | 中国人民大学   中国农业大学   中国大学   中国大学 |
      |     北京大学       清华大学       大学       大学 |
      | 北京科技大学       北京大学   北京大学   北京大学 |
      | 北京师范大学   上海师范大学   师范大学   师范大学 |
      +---------------------------------------------------+


    You might want to look into what Python offers, since it probably has a richer set of string-handling functions. There are probably more elegant approaches even within Stata that others on the list could offer.


    • #3
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str20(var1 var2 newvar)
      "中国人民大学" "中国农业大学" "中国大学"
      "北京大学"       "清华大学"       "大学"      
      "北京科技大学" "北京大学"       "北京大学"
      "北京师范大学" "上海师范大学" "师范大学"
      end
      
      gen wanted=ustrregexra(var1, "[^"+var2+"]", "")
      assert wanted==newvar
      Res.:

      Code:
      . l
      
           +---------------------------------------------------+
           |         var1           var2     newvar     wanted |
           |---------------------------------------------------|
        1. | 中国人民大学   中国农业大学   中国大学   中国大学 |
        2. |     北京大学       清华大学       大学       大学 |
        3. | 北京科技大学       北京大学   北京大学   北京大学 |
        4. | 北京师范大学   上海师范大学   师范大学   师范大学 |
           +---------------------------------------------------+


      • #4
        Andrew Musau has presented in post #3 a most persuasive example of the power that Stata's Unicode regular expression functions bring to dealing with string data in Stata. I'm jealous because I don't think I would have thought of the elegant solution he presented. Just "liking" it seemed inadequate.


        • #5
          Dear Andrew, Many thanks for this suggestion. (However, I always have trouble understanding this kind of command. Is there a way to learn more about -ustrregexra- and the like?)
          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)


          • #6
            The Unicode regular expression functions introduced in Stata 14 support a much more powerful regular expression syntax than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.

            A good reference site for regular expressions can be found at https://www.regular-expressions.info.

            With all that said, what Andrew's code does is use the value of var2 to construct a different regular expression for each observation. In the first observation, the ustrregexra() function call (not command) becomes
            Code:
            ustrregexra(var1, "[^中国农业大学]", "")
            where the regular expression (the second argument) was constructed by surrounding the value of var2 with [^ and ], creating a pattern that matches any character not among the characters following the ^ - the characters 中国农业大学 taken from the value of var2 in that observation. Every matched character is then replaced with the empty string (the third argument), leaving only those characters in var1 that were also in var2.
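
            To see this for the first observation in isolation, the call can be evaluated directly with display. This is only a minimal sketch using the literal values from the example data, and it assumes Stata 14 or later, where the Unicode functions are available.

            ```stata
            * [^...] matches any character NOT listed after the ^,
            * so 人 and 民 match and are replaced with the empty string
            display ustrregexra("中国人民大学", "[^中国农业大学]", "")

            * With the same pattern, ustrregexm() returns 1 if at least one
            * character of the first argument is outside the listed set
            display ustrregexm("中国人民大学", "[^中国农业大学]")
            ```

            The first line should display 中国大学, the value of newvar in the first observation.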

            My admiration for Andrew's code comes from his understanding that the regular expression can be constructed using a string expression and thus can differ from observation to observation - that's just so elegant and so powerful.
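
            One way to make the per-observation patterns visible is to store them in a variable before using them. The sketch below re-enters the example data from post #1 so that it runs on its own; the variable names pattern and wanted2 are only illustrative and do not appear in Andrew's code.

            ```stata
            clear
            input str20(var1 var2 newvar)
            "中国人民大学" "中国农业大学" "中国大学"
            "北京大学"     "清华大学"     "大学"
            "北京科技大学" "北京大学"     "北京大学"
            "北京师范大学" "上海师范大学" "师范大学"
            end

            * Build one negated character class per observation from var2
            generate pattern = "[^" + var2 + "]"

            * Remove from var1 every character matched by that observation's pattern
            generate wanted2 = ustrregexra(var1, pattern, "")

            list var1 pattern wanted2, noobs
            assert wanted2 == newvar
            ```

            Two caveats follow from how the pattern is built. The result keeps the characters in the order (and with any repetitions) they have in var1, so swapping var1 and var2 can give a different string. And if var2 could contain characters that have a special meaning inside a character class, such as ], \, ^ or -, they would need to be escaped before the pattern is assembled.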
            Last edited by William Lisowski; 28 Feb 2021, 18:24.


            • #7
              Dear William, Thanks a lot for the useful information. I have no knowledge of Unicode regular expressions and need to invest time in understanding the command.
              Ho-Chuan (River) Huang
              Stata 19.0, MP(4)
