  • I want to extract the questioners' text and the respondents' text. How can I do this using regular expressions?

    The code below does not work well. How can I fix it?
    Code:
    gen QuestionerText = ustrregexs(0) if ustrregexm(QAText, "questioner[0-9]\s\w+:\w+(.*)respondent[0-9]\s\w+:\w+")
    
    gen RespondentText = ustrregexs(0) if ustrregexm(QAText, "respondent[0-9]\s\w+:\w+(.*)questioner[0-9]\s\w+:\w+")

  • #2
    Could you please paste a data extract using the dataex command? It might be enough to post just one observation of QAText.

    Also, if one observation has multiple questions and answers, do you want them all in one observation of QuestionerText and RespondentText, or should there be separate variables for each question and each answer?

    • #3
      I tried, but dataex cannot export the text correctly. Could I upload the file to OneDrive instead?
      I want them all in one observation of QuestionerText and RespondentText.

      • #4
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input strL QAText
        "评审者4 24:35我问一下你们pms这个都是自己做的平台,还是说没有这个咱们是用评审的吗?展示者1 24:39这个是西门子的工业平台,因为pm这一块我们知道�"
        "评审者5 06:35问一下你们现在的这4种盈利模式里面资金占比是多少?展示者1 06:40第一种设备租金占比最高的基本上能占我们80%,第二种基本能占15%,剩�"
        "评审者5 20:53我们想一想现在的用户是券商还是企业。展示者1 20:56我们的券我们现在考虑是挺大的,券商他们的企业全部成为我们的使用客户。评审者5 2"
        "评审者1 07:19你们这个软件平台没有申请?展示者1 07:19申请了正在申请了,而且国家高新我们现在正在申请商标什么的都申请了,但是现在还没下来,你"
        "评审者3 03:37市场推广他怎么做?展示者1 03:49市场推广的话我们分为几个端口,第一个就是传统的市场推广的,包含传统的电销试错觉得都在有。另外我"
        "评审者4 08:24你们这个技术是卖给哪些。展示者1 08:33目标的市场,我们这个技术主要是卖给目前的网络安全的用户,因为现在国家对网络安全的要求比较"
        "评审者5 25:23您这个引擎针对的用户是什么?展示者1 25:27游戏厂家做动画的,电视台,还有一些新闻媒体做平面的,都需要。评审者5 25:34你们能得节约�"
        "评审者1 32:57你轻资产的都是负的2900多万了,是后续资金来源怎么还是。展示者1 33:06资金来源你们。评审者1 33:07怎么运转的?都严的,有点事都资不抵�"
        "评审者3 19:09像知识全是归属虚拟场景社区这个是什么意思?评审者4 19:17对知识产权是归属的一个。评审者3 19:25对虚拟场景社区系统。展示人1 19:28那个�"
        "评审者4 12:30镜片的技术跟灵犀做过对标,没有?展示者1 12:40做过。评审者4 12:43坦率的说做过说一下吗?展示者1 12:47行,我可以调拨我刚才的那些PPT,�"
        end

        Code:
        gen QuestionerText = ustrregexs(0) if ustrregexm(QAText, "评审者[0-9]\s\w+:\w+(.*)展示者[0-9]\s\w+:\w+")
        
        gen RespondentText = ustrregexs(0) if ustrregexm(QAText, "展示者[0-9]\s\w+:\w+(.*)评审者[0-9]\s\w+:\w+")
        I want them all in one observation of QuestionerText and RespondentText.
        If possible, I would also like to know how to generate separate variables for each question and each answer.
        Last edited by Fred Lee; 05 Nov 2022, 07:59.

        • #5
          Okay, here is some code. I don't know the language, so there may be issues. Also, is the string in row 3 incomplete?

          My code can probably be improved upon by someone who knows regular expressions better, but this seems to work.

          Code:
          clonevar work_text = QAText
          clonevar work_text_init = QAText
          gen QuestionerText = ""
          gen RespondentText = ""
          gen q_addbit = ""
          gen r_addbit = ""
          
          local continue = 1
          pause on
          while `continue' {
              replace q_addbit = ""
              replace r_addbit = ""
              replace work_text_init = work_text
              replace q_addbit = ustrregexs(1) if ustrregexm(work_text,"(评审者[0-9]+ [0-9]{2}:[0-9]{2}.+)?展示者[0-9]+")
              replace q_addbit = ustrregexs(1) if ustrregexm(work_text,"(评审者[0-9]+ [0-9]{2}:[0-9]{2}.+)展示人[0-9]+") & q_addbit == ""
              replace q_addbit = ustrregexs(1) if ustrregexm(work_text,"(评审者[0-9]+ [0-9]{2}:[0-9]{2}.+)$") & q_addbit == ""    
              replace QuestionerText = QuestionerText + " " + q_addbit
              replace work_text = subinstr(work_text,q_addbit,"",1)
              replace r_addbit = ustrregexs(1) if ustrregexm(work_text,"(展示者[0-9]+ [0-9]{2}:[0-9]{2}.+)?评审者[0-9]+")
              replace r_addbit = ustrregexs(1) if ustrregexm(work_text,"(展示人[0-9]+ [0-9]{2}:[0-9]{2}.+)?评审者[0-9]+") & r_addbit == ""
              replace r_addbit = ustrregexs(1) if ustrregexm(work_text,"(展示者[0-9]+ [0-9]{2}:[0-9]{2}.+)$") & r_addbit == ""
              replace r_addbit = ustrregexs(1) if ustrregexm(work_text,"(展示人[0-9]+ [0-9]{2}:[0-9]{2}.+)$") & r_addbit == ""
              replace RespondentText = RespondentText + " " + r_addbit
              replace work_text = subinstr(work_text,r_addbit,"",1)
              capture assert trim(work_text) == "" | work_text == work_text_init
              if _rc == 0 local continue 0
                  else local continue 1
          *    pause
          }
          
          gen remainder = work_text
          drop work_text_init *_addbit
          This produces the following (strings are abbreviated to the first 80 characters):
          Code:
          . li QuestionerText RespondentText remainder, noobs sep(0) string(80)
          
            +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | QuestionerText                                                                       RespondentText                                                                       remainder |
            |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
            |  评审者4 24:35我问一下你们pms这个都是自己做的平台,还是说没有这个咱们是用评审的..     展示者1 24:39这个是西门子的工业平台,因为pm这一块我们知道�                                    |
            |  评审者5 06:35问一下你们现在的这4种盈利模式里面资金占比是多少?                       展示者1 06:40第一种设备租金占比最高的基本上能占我们80%,第二种基本能占15%,剩�                |
            |  评审者5 20:53我们想一想现在的用户是券商还是企业。                                    展示者1 20:56我们的券我们现在考虑是挺大的,券商他们的企业全部成为我们的使用客户..   评审者5 2 |
            |  评审者1 07:19你们这个软件平台没有申请?                                              展示者1 07:19申请了正在申请了,而且国家高新我们现在正在申请商标什么的都申请了,..             |
            |  评审者3 03:37市场推广他怎么做?                                                      展示者1 03:49市场推广的话我们分为几个端口,第一个就是传统的市场推广的,包含传统..             |
            |  评审者4 08:24你们这个技术是卖给哪些。                                                展示者1 08:33目标的市场,我们这个技术主要是卖给目前的网络安全的用户,因为现在国..             |
            |  评审者5 25:23您这个引擎针对的用户是什么? 评审者5 25:34你们能得节约�                 展示者1 25:27游戏厂家做动画的,电视台,还有一些新闻媒体做平面的,都需要。                     |
            |  评审者1 32:57你轻资产的都是负的2900多万了,是后续资金来源怎么还是。 评审者1 33:..    展示者1 33:06资金来源你们。                                                                   |
            |  评审者3 19:09像知识全是归属虚拟场景社区这个是什么意思?评审者4 19:17对知识产权..     展示人1 19:28那个�                                                                            |
            |  评审者4 12:30镜片的技术跟灵犀做过对标,没有?展示者1 12:40做过。评审者4 12:43坦..    展示者1 12:47行,我可以调拨我刚才的那些PPT,�                                                 |
            +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
          Edit: sorry, the Unicode seems to be messing up the display. Not sure what I can do about it.
          Last edited by Hemanshu Kumar; 05 Nov 2022, 10:49.
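
A note on the last row of the listing above: QuestionerText there also contains an answer, which appears consistent with `.+` being greedy, so that the capture group runs up to the last respondent label in the string. The sketch below is not part of the original posts. It uses made-up ASCII data in which `q` and `a` stand in for 评审者 and 展示者 (an assumption for readability), and compares a greedy quantifier with a lazy one (`.+?`). It is a minimal illustration under those assumptions, not a replacement for the loop above.

```stata
* Hypothetical toy data: "q"/"a" stand in for the questioner/respondent labels
clear
input str80 s
"q4 12:30 compared? a1 12:40 yes. q4 12:43 details? a1 12:47 sure."
end

* Greedy: .+ grabs as much as possible, so group 1 ends at the LAST "a<digit>"
gen greedy = ustrtrim(ustrregexs(1)) ///
    if ustrregexm(s, "(q[0-9]+ [0-9]{2}:[0-9]{2}.+)a[0-9]+")

* Lazy: .+? grabs as little as possible, so group 1 ends at the FIRST "a<digit>"
gen lazy = ustrtrim(ustrregexs(1)) ///
    if ustrregexm(s, "(q[0-9]+ [0-9]{2}:[0-9]{2}.+?)a[0-9]+")

* Self-checks on the toy string
assert greedy == "q4 12:30 compared? a1 12:40 yes. q4 12:43 details?"
assert lazy   == "q4 12:30 compared?"

list greedy lazy, noobs
```

Note that `ustrregexs(1)` returns the first capture group, whereas `ustrregexs(0)` (used in the original question) returns the whole match, including the label that follows the group.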

          • #6
            Thanks so much!

            • #7
              Alternatively, split the text and reshape long to get a structure that possibly represents the data better and may be easier to work with:

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input strL QAText
              "评审者4 24:35我问一下你们pms这个都是自己做的平台,还是说没有这个咱们是用评审的吗?展示者1 24:39这个是西门子的工业平台,因为pm这一块我们知道�"
              "评审者5 06:35问一下你们现在的这4种盈利模式里面资金占比是多少?展示者1 06:40第一种设备租金占比最高的基本上能占我们80%,第二种基本能占15%,剩�"
              "评审者5 20:53我们想一想现在的用户是券商还是企业。展示者1 20:56我们的券我们现在考虑是挺大的,券商他们的企业全部成为我们的使用客户。评审者5 2"
              "评审者1 07:19你们这个软件平台没有申请?展示者1 07:19申请了正在申请了,而且国家高新我们现在正在申请商标什么的都申请了,但是现在还没下来,你"
              "评审者3 03:37市场推广他怎么做?展示者1 03:49市场推广的话我们分为几个端口,第一个就是传统的市场推广的,包含传统的电销试错觉得都在有。另外我"
              "评审者4 08:24你们这个技术是卖给哪些。展示者1 08:33目标的市场,我们这个技术主要是卖给目前的网络安全的用户,因为现在国家对网络安全的要求比较"
              "评审者5 25:23您这个引擎针对的用户是什么?展示者1 25:27游戏厂家做动画的,电视台,还有一些新闻媒体做平面的,都需要。评审者5 25:34你们能得节约�"
              "评审者1 32:57你轻资产的都是负的2900多万了,是后续资金来源怎么还是。展示者1 33:06资金来源你们。评审者1 33:07怎么运转的?都严的,有点事都资不抵�"
              "评审者3 19:09像知识全是归属虚拟场景社区这个是什么意思?评审者4 19:17对知识产权是归属的一个。评审者3 19:25对虚拟场景社区系统。展示人1 19:28那个�"
              "评审者4 12:30镜片的技术跟灵犀做过对标,没有?展示者1 12:40做过。评审者4 12:43坦率的说做过说一下吗?展示者1 12:47行,我可以调拨我刚才的那些PPT,�"
              end
              
              gen row = _n
              replace QAText = ustrtrim(QAText)
              replace QAText= ustrregexra(QAText, "评审者([0-9])","ACTIONq$1")
              replace QAText= ustrregexra(QAText, "展示者([0-9])","ACTIONa$1")
              replace QAText= ustrregexra(QAText, "展示人([0-9])","ACTIONa$1")
                
              split QAText , parse("ACTION") generate(action)
              rename action# action#, renumber(0)
              drop action0 // empty
              drop QAText
              
              reshape long action, i(row) j(j)
              drop if mi(action)
              gen str2 actiontype = action
              gen time = usubstr(action, 4, 5)
              gen byte timefail = !ustrregexm(time,"\d\d:\d\d")
              replace action = usubstr(action, 9, .)
              gen byte actionfail = ustrlen(ustrtrim(action))==0
              
              
              * add some tests
              bysort row (j) : assert usubstr(actiontype[1],1,1) == "q"
              bysort row (j) : gen byte seqOK = ///
                  usubstr(actiontype[_n],1,1) != usubstr(actiontype[_n-1],1,1)  
                  
              gen lastcharnotHan = ustrregexs(0) if ustrregexm(action, "\P{sc=Han}$")
              
              chartab lastcharnotHan // ssc desc chartab
              
              gen lastcharnotHanChar = ""
              replace lastcharnotHanChar = "IDEOGRAPHIC FULL STOP" ///
                  if  lastcharnotHan==ustrunescape("\u3002")
              replace lastcharnotHanChar = "FULLWIDTH QUESTION MARK" ///
                  if  lastcharnotHan==ustrunescape("\uff1f")
              replace lastcharnotHanChar = "REPLACEMENT CHARACTER" ///
                  if  lastcharnotHan==ustrunescape("\ufffd")  
              format %-25s lastcharnotHanChar  
              
              order actiontype time,  after(j)
              format %-50s action
              list , sepby(row) abbrev(32)
              Last edited by Bjarte Aagnes; 06 Nov 2022, 09:39.
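
If one observation per original row is still wanted, as requested in #3, the long layout can be collapsed back. The following is a minimal sketch, assuming variables named as in the code above (row, j, actiontype, action) and using a small made-up ASCII dataset instead of the thread's data. QuestionerText and RespondentText are the names from the original question; isq is a hypothetical helper variable.

```stata
* Hypothetical toy data in the long layout: one record per question or answer
clear
input byte row byte j str2 actiontype str20 action
1 1 "q4" "first question"
1 2 "a1" "first answer"
1 3 "q4" "second question"
2 1 "q5" "only question"
2 2 "a1" "only answer"
end

* Flag questions: actiontype starts with "q" for questions, "a" for answers
gen byte isq = usubstr(actiontype, 1, 1) == "q"
gen strL QuestionerText = ""
gen strL RespondentText = ""

* Running concatenation within row, in turn order; x[_n-1] is "" when _n == 1
bysort row (j): replace QuestionerText = ///
    QuestionerText[_n-1] + cond(isq, " " + action, "")
bysort row (j): replace RespondentText = ///
    RespondentText[_n-1] + cond(!isq, " " + action, "")

* The last record of each row now holds the full accumulated text
bysort row (j): keep if _n == _N
replace QuestionerText = ustrtrim(QuestionerText)
replace RespondentText = ustrtrim(RespondentText)
keep row QuestionerText RespondentText

* Self-checks on the toy data
assert QuestionerText == "first question second question" if row == 1
assert RespondentText == "only answer" if row == 2

list, noobs
```

Keeping the long layout and collapsing only at the end means the per-turn checks (timefail, seqOK, and so on) can still be run before the texts are merged.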

              • #8
                Thanks a ton, Bjarte Aagnes! Nice idea!
                I will try to absorb it!
