How to extract the target string

fu gang

Join Date: Jan 2021

Posts: 138
#1

21 Jun 2022, 21:04

* Example generated by -dataex-. For more info, type help dataex
clear
input str252 A
"王亚变 %A 薛丽洋 %A 刘佳 %A 王金相"
"付丽亚 %A 宋玉栋 %A 王盼新 %A 赵檬 %A 黄琪 %A 席宏波 %A 于茵 %A 吴昌永"
"苑鹏"
"杨春月 %A 李莉 %A 马诗淇 %A 翟永杰 %A 周静逸 %A 陈徐彬"
"陆曦 %A 梅凯"
"柴大华 %A 金珊 %A 杨菲 %A 王晶"
"匡翠萍 %A 邢飞 %A 刘曙光 %A 娄厦 %A 贺露露 %A 邓凌"
"付俊娥 %A 田宏红 %A 李纪人"
"梁建奎 %A 龙岩 %A 郭爽"
"王福进"
"郭爽 %A 龙岩 %A 李有明 %A 杨艺琳 %A 龙策"
"陈娜日苏"
"汪杰 %A 杨青 %A 黄艺 %A 蔡佳亮"
"何进朝 %A 李嘉"
"张雯欢 %A 吴彦"
"孙东迁 %A 周孝德 %A 曹永中"
"钟声 %A 郁建桥 %A 徐亮"
"张佩 %A 贾振兴 %A 郑秀清 %A 张旭阳"
"卢滨 %A 陈义中 %A 常文婷"
"陈义中"
"龙岩 %A 李有明 %A 孔令仲 %A 朱杰"
"邓宇杰 %A 刘伟"
"吴凤平 %A 朱晓娜 %A 程铁军"
"方航 %A 胡玮 %A 胡琳"
"张晶晶"
end

How can I extract the name before the first %A and the name after the last %A?

I know there should be at least three ways:
Combining the ustrregexm() and ustrregexs() functions
Combining the egen function ends() with its punct() option
The moss command
So far I have only managed it with ends() and punct(). Can anyone provide guidance on the other methods?
fu gang

Join Date: Jan 2021

Posts: 138
#2

21 Jun 2022, 21:09

The split command can also do this.
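
For readers who want to see the split route spelled out, here is a minimal sketch. It assumes Stata 14 or later and the variable A from the example above; the stub part and the variable names name_first and name_last are placeholders chosen for illustration, and the loop assumes no other variable in the dataset starts with "part".

```stata
clear
input str252 A
"王亚变 %A 薛丽洋 %A 刘佳 %A 王金相"
"苑鹏"
"陆曦 %A 梅凯"
end

* Break A at every literal "%A"; this creates part1, part2, ...
split A, parse("%A") generate(part)

* The first piece is always part1
generate name_first = strtrim(part1)

* Walk through the pieces left to right, keeping the last non-empty one
generate name_last = ""
foreach v of varlist part* {
    replace name_last = strtrim(`v') if `v' != ""
}
```

The strtrim() calls remove the blanks that surround each name when the separator is written as " %A ".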

fu gang

Join Date: Jan 2021

Posts: 138
#3

22 Jun 2022, 00:02

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str252 A str12(B C)
"王亚变 %A 薛丽洋 %A 刘佳 %A 王金相"                                               "王亚变 "   " 王金相"  
"付丽亚 %A 宋玉栋 %A 王盼新 %A 赵檬 %A 黄琪 %A 席宏波 %A 于茵 %A 吴昌永" "付丽亚 "   " 吴昌永"  
"苑鹏"                                                                                      "苑鹏"       "苑鹏"      
"杨春月 %A 李莉 %A 马诗淇 %A 翟永杰 %A 周静逸 %A 陈徐彬"                     "杨春月 "   " 陈徐彬"  
"陆曦 %A 梅凯"                                                                            "陆曦 "      " 梅凯"    
"柴大华 %A 金珊 %A 杨菲 %A 王晶"                                                     "柴大华 "   " 王晶"    
"匡翠萍 %A 邢飞 %A 刘曙光 %A 娄厦 %A 贺露露 %A 邓凌"                           "匡翠萍 "   " 邓凌"    
"付俊娥 %A 田宏红 %A 李纪人"                                                         "付俊娥 "   " 李纪人"  
"梁建奎 %A 龙岩 %A 郭爽"                                                               "梁建奎 "   " 郭爽"    
"王福进"                                                                                   "王福进"    "王福进"  
"郭爽 %A 龙岩 %A 李有明 %A 杨艺琳 %A 龙策"                                        "郭爽 "      " 龙策"    
"陈娜日苏"                                                                                "陈娜日苏" "陈娜日苏"
"汪杰 %A 杨青 %A 黄艺 %A 蔡佳亮"                                                     "汪杰 "      " 蔡佳亮"  
"何进朝 %A 李嘉"                                                                         "何进朝 "   " 李嘉"    
"张雯欢 %A 吴彦"                                                                         "张雯欢 "   " 吴彦"    
"孙东迁 %A 周孝德 %A 曹永中"                                                         "孙东迁 "   " 曹永中"  
"钟声 %A 郁建桥 %A 徐亮"                                                               "钟声 "      " 徐亮"    
"张佩 %A 贾振兴 %A 郑秀清 %A 张旭阳"                                               "张佩 "      " 张旭阳"  
"卢滨 %A 陈义中 %A 常文婷"                                                            "卢滨 "      " 常文婷"  
"陈义中"                                                                                   "陈义中"    "陈义中"  
"龙岩 %A 李有明 %A 孔令仲 %A 朱杰"                                                  "龙岩 "      " 朱杰"    
"邓宇杰 %A 刘伟"                                                                         "邓宇杰 "   " 刘伟"    
"吴凤平 %A 朱晓娜 %A 程铁军"                                                         "吴凤平 "   " 程铁军"  
"方航 %A 胡玮 %A 胡琳"                                                                  "方航 "      " 胡琳"    
"张晶晶"                                                                                   "张晶晶"    "张晶晶"  
end

Columns B and C are my target columns. I created them with egen:

egen B = ends(A), punct("%A")
egen C = ends(A), punct("%A") last

Are there other good ways to achieve this?

Last edited by fu gang; 22 Jun 2022, 00:14.
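
One further possibility, not raised in the thread, is to use ordinary string functions. The sketch below assumes Stata 14 or later (for strrpos()) and that "%A" only ever appears as the separator; p_first, p_last, name_first and name_last are illustrative names.

```stata
clear
input str252 A
"王亚变 %A 薛丽洋 %A 刘佳 %A 王金相"
"苑鹏"
"柴大华%A金珊%A杨菲%AAB"
end

* Byte position of the first and the last "%A" (0 when there is none)
generate p_first = strpos(A, "%A")
generate p_last  = strrpos(A, "%A")

* Everything before the first separator, or the whole string if none
generate name_first = strtrim(cond(p_first > 0, substr(A, 1, p_first - 1), A))

* Everything after the last separator ("%A" is 2 bytes long)
generate name_last = strtrim(cond(p_last > 0, substr(A, p_last + 2, .), A))
```

substr() counts bytes rather than characters, but the cuts here always fall next to the ASCII separator, so the Chinese names stay intact.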


Fei Wang

Join Date: Oct 2021

Posts: 726
#4

22 Jun 2022, 10:00

Below is a solution using regular expressions.

Code:

gen name_first = ustrregexs(1) if ustrregexm(A, "^(\S*)")
gen name_last = ustrregexs(1) if ustrregexm(A, "(\S*)$")
fu gang

Join Date: Jan 2021

Posts: 138
#5

22 Jun 2022, 15:04

@Fei Wang, thank you very much for your kind help.

The regex solution works well, thanks a lot! I had tried gen B = ustrregexs(1) if ustrregexm(A, " "), but I did not know how to write the pattern inside the quotes, so that attempt failed.
fu gang

Join Date: Jan 2021

Posts: 138
#6

22 Jun 2022, 15:13

Thank you for your help; the idea is very neat. The regular expression above matches any run of non-space characters. But what if field A also contains English names, so that other letters appear? If I need to match %A exactly, or if the content to extract is separated from %A by a space on each side, how should the regular expression be written?
It seems that I should study regex properly.
Fei Wang

Join Date: Oct 2021

Posts: 726
#7

23 Jun 2022, 02:45

Originally posted by fu gang

Thank you for your help; the idea is very neat. The regular expression above matches any run of non-space characters. But what if field A also contains English names, so that other letters appear? If I need to match %A exactly, or if the content to extract is separated from %A by a space on each side, how should the regular expression be written?
It seems that I should study regex properly.

Could you please give concrete examples?
fu gang

Join Date: Jan 2021

Posts: 138
#8

23 Jun 2022, 12:49

First of all, thank you for your reply and for your help.

clear
input str252 A
"Ater %A 薛丽洋 %A John %A 王金相"
"Alice %A 宋玉栋 %A Lucy %A 赵檬 %A 黄琪 %A BAC %A 于茵 %A 吴昌永"
"苑鹏"
"Rose %A 李莉 %A 马诗淇 %A Elsa %A Charles %A 陈徐彬"
"陆曦 %A 梅凯"
"柴大华 %A 金珊 %A 杨菲 %A AB"
"匡翠萍 %A 邢飞 %A 刘曙光 %A 娄厦 %A 贺露露 %A AA"
"付俊娥 %A Bill %A Mark"
"Betty %A Camille %A Sarah"
"Sophia"
"郭爽 %A 龙岩 %A 李有明 %A 杨艺琳 %A 龙策"
"陈娜日苏"
end

Here is another example. In practice the field holds both Chinese and English names, including names that contain the letter A or B themselves. I still want to extract the string before the first %A and the string after the last %A. Can regular expressions do this?
Thank you very much!

Fei Wang

Join Date: Oct 2021

Posts: 726
#9

23 Jun 2022, 19:04

Yes. The code would be

Code:

gen name_first = ustrregexs(0) if ustrregexm(A, "\w+")
gen name_last = ustrregexs(0) if ustrregexm(A, "\w+$")

Another example: there is no space between "%A" and names.

Code:

clear
input str252 A
"Ater%A薛丽洋%AJohn%A王金相"                              
"Alice%A宋玉栋%ALucy%A赵檬%A黄琪%ABAC%A于茵%A吴昌永"
"苑鹏"                                                        
"Rose%A李莉%A马诗淇%AElsa%ACharles%A陈徐彬"             
"陆曦%A梅凯"                                                
"柴大华%A金珊%A杨菲%AAB"                                 
"匡翠萍%A邢飞%A刘曙光%A娄厦%A贺露露%AAA"           
"付俊娥%ABill%AMark"                                         
"Betty%ACamille%ASarah"                                         
"Sophia"                                                        
"郭爽%A龙岩%A李有明%A杨艺琳%A龙策"                  
"陈娜日苏"                                                  
end

gen name_first = ustrregexs(0) if ustrregexm(A, "\w+")
gen name_last = ustrregexs(2) if ustrregexm(A, "(%A)?(\w+)$")
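
The patterns above rely on \w+, so they assume each name consists only of word characters. A possible variant, offered here as a sketch rather than as part of the original answer, anchors on the literal "%A" instead and therefore handles both the spaced and the unspaced layout. It assumes "%A" never occurs inside a name and that A has no trailing blanks.

```stata
clear
input str252 A
"Ater %A 薛丽洋 %A John %A 王金相"
"苑鹏"
"柴大华%A金珊%A杨菲%AAB"
end

* Lazily capture everything up to the first "%A" (or the end of the string)
generate name_first = ustrregexs(1) if ustrregexm(A, "^(.*?)\s*(?:%A|$)")

* Greedily skip past the last "%A", then capture the remainder
generate name_last = ustrregexs(1) if ustrregexm(A, "^(?:.*%A)?\s*(.*)$")
```

The lazy quantifier .*? stops at the first separator, while the greedy .* inside the optional group runs to the last one, which is what distinguishes the two expressions.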


fu gang

Join Date: Jan 2021

Posts: 138
#10

23 Jun 2022, 20:13

This achieves the goal completely. What a magical regular expression! @Fei Wang, thank you very much.