  • How to extract the target string

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str252 A
    "王亚变 %A 薛丽洋 %A 刘佳 %A 王金相"
    "付丽亚 %A 宋玉栋 %A 王盼新 %A 赵檬 %A 黄琪 %A 席宏波 %A 于茵 %A 吴昌永"
    "苑鹏"
    "杨春月 %A 李莉 %A 马诗淇 %A 翟永杰 %A 周静逸 %A 陈徐彬"
    "陆曦 %A 梅凯"
    "柴大华 %A 金珊 %A 杨菲 %A 王晶"
    "匡翠萍 %A 邢飞 %A 刘曙光 %A 娄厦 %A 贺露露 %A 邓凌"
    "付俊娥 %A 田宏红 %A 李纪人"
    "梁建奎 %A 龙岩 %A 郭爽"
    "王福进"
    "郭爽 %A 龙岩 %A 李有明 %A 杨艺琳 %A 龙策"
    "陈娜日苏"
    "汪杰 %A 杨青 %A 黄艺 %A 蔡佳亮"
    "何进朝 %A 李嘉"
    "张雯欢 %A 吴彦"
    "孙东迁 %A 周孝德 %A 曹永中"
    "钟声 %A 郁建桥 %A 徐亮"
    "张佩 %A 贾振兴 %A 郑秀清 %A 张旭阳"
    "卢滨 %A 陈义中 %A 常文婷"
    "陈义中"
    "龙岩 %A 李有明 %A 孔令仲 %A 朱杰"
    "邓宇杰 %A 刘伟"
    "吴凤平 %A 朱晓娜 %A 程铁军"
    "方航 %A 胡玮 %A 胡琳"
    "张晶晶"
    end


    How can I extract the name before the first %A and the name after the last %A?

    I know of at least three possible approaches:
    Combining the ustrregexm() and ustrregexs() functions
    Using the egen function ends() with its punct() option
    Using the moss command
    So far I have only managed it with ends() and punct(). Can anyone provide guidance on the other methods?

  • #2
    The split command can also do this.
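
    A minimal sketch of how this could look, assuming the variable is named A as in #1. The names part, name_first, and name_last are illustrative only and do not come from the thread.

    ```stata
    * Split A on the literal delimiter "%A" into part1, part2, ...
    split A, parse("%A") generate(part)
    local n = r(nvars)
    local pieces `r(varlist)'

    * First name: the first piece, with surrounding spaces removed
    generate name_first = strtrim(part1)

    * Last name: overwrite with each later non-empty piece, so the final one remains
    generate name_last = ""
    forvalues j = 1/`n' {
        replace name_last = strtrim(part`j') if part`j' != ""
    }

    * Remove the temporary pieces
    drop `pieces'
    ```

    For rows with a single name and no %A, part1 holds the whole string, so both new variables contain that name.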


    • #3
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str252 A str12(B C)
      "王亚变 %A 薛丽洋 %A 刘佳 %A 王金相"                                               "王亚变 "   " 王金相"  
      "付丽亚 %A 宋玉栋 %A 王盼新 %A 赵檬 %A 黄琪 %A 席宏波 %A 于茵 %A 吴昌永" "付丽亚 "   " 吴昌永"  
      "苑鹏"                                                                                      "苑鹏"       "苑鹏"      
      "杨春月 %A 李莉 %A 马诗淇 %A 翟永杰 %A 周静逸 %A 陈徐彬"                     "杨春月 "   " 陈徐彬"  
      "陆曦 %A 梅凯"                                                                            "陆曦 "      " 梅凯"    
      "柴大华 %A 金珊 %A 杨菲 %A 王晶"                                                     "柴大华 "   " 王晶"    
      "匡翠萍 %A 邢飞 %A 刘曙光 %A 娄厦 %A 贺露露 %A 邓凌"                           "匡翠萍 "   " 邓凌"    
      "付俊娥 %A 田宏红 %A 李纪人"                                                         "付俊娥 "   " 李纪人"  
      "梁建奎 %A 龙岩 %A 郭爽"                                                               "梁建奎 "   " 郭爽"    
      "王福进"                                                                                   "王福进"    "王福进"  
      "郭爽 %A 龙岩 %A 李有明 %A 杨艺琳 %A 龙策"                                        "郭爽 "      " 龙策"    
      "陈娜日苏"                                                                                "陈娜日苏" "陈娜日苏"
      "汪杰 %A 杨青 %A 黄艺 %A 蔡佳亮"                                                     "汪杰 "      " 蔡佳亮"  
      "何进朝 %A 李嘉"                                                                         "何进朝 "   " 李嘉"    
      "张雯欢 %A 吴彦"                                                                         "张雯欢 "   " 吴彦"    
      "孙东迁 %A 周孝德 %A 曹永中"                                                         "孙东迁 "   " 曹永中"  
      "钟声 %A 郁建桥 %A 徐亮"                                                               "钟声 "      " 徐亮"    
      "张佩 %A 贾振兴 %A 郑秀清 %A 张旭阳"                                               "张佩 "      " 张旭阳"  
      "卢滨 %A 陈义中 %A 常文婷"                                                            "卢滨 "      " 常文婷"  
      "陈义中"                                                                                   "陈义中"    "陈义中"  
      "龙岩 %A 李有明 %A 孔令仲 %A 朱杰"                                                  "龙岩 "      " 朱杰"    
      "邓宇杰 %A 刘伟"                                                                         "邓宇杰 "   " 刘伟"    
      "吴凤平 %A 朱晓娜 %A 程铁军"                                                         "吴凤平 "   " 程铁军"  
      "方航 %A 胡玮 %A 胡琳"                                                                  "方航 "      " 胡琳"    
      "张晶晶"                                                                                   "张晶晶"    "张晶晶"  
      end
      Columns B and C above are my target columns. I created them (before running dataex) with:

      egen B = ends(A), punct("%A")
      egen C = ends(A), punct("%A") last


      Are there other good ways to achieve this?
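
      One further possibility, not proposed in the thread, is to use plain string functions. This is only a sketch under the assumption that the delimiter is always the two-character string "%A"; the names p_first, p_last, B2, and C2 are hypothetical.

      ```stata
      * Byte positions of the first and last "%A" (0 if the delimiter is absent)
      generate p_first = strpos(A, "%A")
      generate p_last  = strrpos(A, "%A")

      * Text before the first "%A", or the whole string when there is no delimiter
      generate B2 = cond(p_first > 0, strtrim(substr(A, 1, p_first - 1)), strtrim(A))

      * Text after the last "%A" (the delimiter itself is 2 bytes long)
      generate C2 = cond(p_last > 0, strtrim(substr(A, p_last + 2, .)), strtrim(A))

      drop p_first p_last
      ```

      Unlike B and C above, B2 and C2 carry no leading or trailing spaces because of strtrim().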
      Last edited by fu gang; 22 Jun 2022, 00:14.


      • #4
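        Below is a solution using regular expressions.

        Code:
        * First name: the leading run of non-space characters
        gen name_first = ustrregexs(1) if ustrregexm(A, "^(\S*)")
        * Last name: the trailing run of non-space characters
        gen name_last = ustrregexs(1) if ustrregexm(A, "(\S*)$")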


        • #5
          @Fei Wang, thank you very much for your kind help.

          The regex approach works well, thanks a lot! I had tried gen B = ustrregexs(1) if ustrregexm(A, " "), but I did not know how to write the pattern, so that attempt failed.


          • #6
            Thank you for your help; the idea is very neat. The regular expression above matches any run of non-space characters. But what if field A also contains English names, so that other letters appear? If I need to match %A exactly, or to match the content on either side of a %A that has a space on each side, how should the regular expression be written?
            It seems I should study regex properly.


            • #7
              Originally posted by fu gang
              Thank you for your help; the idea is very neat. The regular expression above matches any run of non-space characters. But what if field A also contains English names, so that other letters appear? If I need to match %A exactly, or to match the content on either side of a %A that has a space on each side, how should the regular expression be written?
              It seems I should study regex properly.
              Could you please give concrete examples?


              • #8
                First of all, thank you for your reply and for your help.

                clear
                input str252 A
                "Ater %A 薛丽洋 %A John %A 王金相"
                "Alice %A 宋玉栋 %A Lucy %A 赵檬 %A 黄琪 %A BAC %A 于茵 %A 吴昌永"
                "苑鹏"
                "Rose %A 李莉 %A 马诗淇 %A Elsa %A Charles %A 陈徐彬"
                "陆曦 %A 梅凯"
                "柴大华 %A 金珊 %A 杨菲 %A AB"
                "匡翠萍 %A 邢飞 %A 刘曙光 %A 娄厦 %A 贺露露 %A AA"
                "付俊娥 %A Bill %A Mark"
                "Betty %A Camille %A Sarah"
                "Sophia"
                "郭爽 %A 龙岩 %A 李有明 %A 杨艺琳 %A 龙策"
                "陈娜日苏"
                end


                Here is another example. The data actually contain both Chinese and English names, some of which include letters such as B. I still want to extract the string before the first %A and the string after the last %A. Can this be done with regular expressions?
                Thank you very much!


                • #9
                  Yes. The code would be

                  Code:
                  gen name_first = ustrregexs(0) if ustrregexm(A, "\w+")
                  gen name_last = ustrregexs(0) if ustrregexm(A, "\w+$")
                  Here is another example, where there is no space between "%A" and the names.

                  Code:
                  clear
                  input str252 A
                  "Ater%A薛丽洋%AJohn%A王金相"                              
                  "Alice%A宋玉栋%ALucy%A赵檬%A黄琪%ABAC%A于茵%A吴昌永"
                  "苑鹏"                                                        
                  "Rose%A李莉%A马诗淇%AElsa%ACharles%A陈徐彬"             
                  "陆曦%A梅凯"                                                
                  "柴大华%A金珊%A杨菲%AAB"                                 
                  "匡翠萍%A邢飞%A刘曙光%A娄厦%A贺露露%AAA"           
                  "付俊娥%ABill%AMark"                                         
                  "Betty%ACamille%ASarah"                                         
                  "Sophia"                                                        
                  "郭爽%A龙岩%A李有明%A杨艺琳%A龙策"                  
                  "陈娜日苏"                                                  
                  end
                  
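                  * First name: the first run of word characters ("%" is not a word character)
                  gen name_first = ustrregexs(0) if ustrregexm(A, "\w+")
                  * Last name: the optional (%A)? absorbs the delimiter so that its "A" is not read as part of the name
                  gen name_last = ustrregexs(2) if ustrregexm(A, "(%A)?(\w+)$")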
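
                  If names may contain characters that \w does not match (for example internal spaces or hyphens), a pattern that anchors on the delimiter itself might be more robust. The following is a sketch only, assuming "%A" never occurs inside a name; the variable names name_first2 and name_last2 are hypothetical.

                  ```stata
                  * First name: shortest text from the start up to the first "%A" (or the end of the string)
                  gen name_first2 = ustrregexs(1) if ustrregexm(A, "^\s*(.*?)\s*(?:%A.*)?$")

                  * Last name: the greedy .* runs to the last "%A"; the capture is whatever follows it
                  gen name_last2 = ustrregexs(1) if ustrregexm(A, "^(?:.*%A)?\s*(.*?)\s*$")
                  ```

                  The \s* parts strip any spaces around the captured name, so the same two lines should apply whether or not "%A" is surrounded by spaces. Rows without any "%A" return the whole (trimmed) string in both variables.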


                  • #10
                    This achieves the goal completely. What a magical regular expression! @Fei Wang, thank you very much.
