  • Using moss to cut extremely long string variables in two (based on character position)

    Hello all,

    I am trying to import and clean a number of documents (imported into a dataset as a single variable) for later analysis.

    Each document consists of long dialogs between speakers, where each speaker is identified with parentheses. Some of the documents are very long, and include thousands of statements (which exceed the total number of variables I can add in my flavor of Stata).

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str13 filename str235 text
    "document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
    "document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
    ""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                             
    end


    To process the documents, I use split text, p("):"). Once I have split the document, I reshape it to long (so that each individual statement is a separate observation).

    However, as noted, some of the documents are so long that the split will generate too many variables. (The longest document has around 3,000 statements).

    I have several options I am thinking about conceptually, but I'm not sure how to properly execute them.

    -The easiest would be to simply cut the string in half and then run split in two different datasets. The problem, however, is that the string length varies dramatically (300,000 to 900,000 characters) and is not well correlated with the number of statements (some of the statements are very short interjections).

    -use moss text, match("):") to identify all the instances of "):" in the string. Using the string position identified by moss, I can then split each string based on roughly the 1,500th instance of "):", then run split on these separately (and avoid the variable addition limit).

    However, I'm not sure how to use the value of _pos1500 to cut each string individually into two parts:

    It would look something like this (not using code that works):

    Code:
    gen text1 = text
    
    replace text= strpos(text, 1, [value of _pos1500])
    replace text1= strpos(text1, [value of _pos1500], [end of the string])
    I'd appreciate any advice anyone had on this problem, or another way to think about it entirely!

    Thanks.
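
    The cut described above can be sketched with substr() rather than strpos(). This is only a minimal illustration on toy data: the variables cut, step, part1 and part2 and the local k are hypothetical names, and the position of the k-th "):" is found here with a plain strpos() loop instead of moss, so nothing needs to be installed. If moss has already produced _pos1500, that variable could presumably take the place of cut.

    ```stata
    clear
    input str13 filename str60 text
    "document1.txt" "(a): one. (b): two. (c): three. (d): four."
    "document2.txt" "(e): only one statement."
    end

    local k = 2                 // cut at the k-th "):" (about 1500 in the real data)
    gen long cut  = 0           // position of the most recent "):" found so far
    gen long step = .
    forvalues j = 1/`k' {
        * search only the part of the string after the previous match
        replace step = strpos(substr(text, cut + 1, .), "):")
        * cut becomes missing once a document runs out of matches
        replace cut  = cond(step > 0 & cut < ., cut + step, .)
    }

    * documents with fewer than k matches stay whole in part1
    gen part1 = cond(cut < ., substr(text, 1, cut - 1), text)
    gen part2 = cond(cut < ., substr(text, cut, .), "")
    assert text == part1 + part2
    ```

    Note that part2 begins with "):" itself, so a later split on "):" would give an empty first piece for that half.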



  • #2
    Nate, I have a clumsy way of handling your case as below. There must be better ways.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str13 filename str235 text
    "document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
    "document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
    ""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                            
    end
    
    gen dialog = ""        //the variable storing all dialogs
    gen dialog_file = ""    //which file is a dialog from
    local fileno = 1    //the text file no. 
    local line = 1        //the line no.
    
    replace text = subinstr(text, ":)", "):", .)
    
    while text[`fileno'] != "" {
        while text[`fileno'] != "" {
            set obs `=max(`line', _N)'
            replace dialog = regexs(1) if regexm(text[`fileno'], "(^\([ a-zA-Z]*\):[ a-zA-Z]*.?[ ]*).*") in `line'
            replace dialog_file = filename[`fileno'] in `line'
            replace text = subinstr(text[`fileno'], dialog[`line'], "", .) in `fileno'
            local ++line
        }
        local ++fileno
    }
    
    drop filename text
    replace dialog = strtrim(dialog)
    Code:
    . list
    
         +------------------------------------------------------------------------------+
         |                                                       dialog     dialog_file |
         |------------------------------------------------------------------------------|
      1. |                     (speaker a): Lorem ipsum dolor sit amet.   document1.txt |
      2. |                        (speaker b): Ut enim ad minim veniam.   document1.txt |
      3. | (speaker c): quis nostrud exercitation ullamco laboris nisi.   document1.txt |
      4. |                                (speaker x): Tincidunt vitae.   document1.txt |
      5. |       (speaker f): Tortor consequat id porta nibh venenatis.   document2.txt |
         |------------------------------------------------------------------------------|
      6. |                                       (speaker g): Enim sed.   document2.txt |
      7. |                         (speaker h): Tincidunt vitae semper.                 |
      8. |             (speaker i): quis lectus nulla at volutpat diam.                 |
      9. |                       (speaker j): Quis varius quam quisque.                 |
         +------------------------------------------------------------------------------+
    Last edited by Fei Wang; 09 Nov 2021, 03:53.



    • #3
      It seems to me that a period uniquely identifies the end of a conversation. You can modify this slightly if the end is determined by the sequence: period + space + opening parenthesis.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str13 filename str235 text
      "document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
      "document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
      ""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                            
      end
      
      split text, p(.) g(line)
      reshape long line, i(text) j(which)
      Res.:

      Code:
      . l filename line, sepby(filename)
      
           +------------------------------------------------------------------------------+
           |      filename                                                           line |
           |------------------------------------------------------------------------------|
        1. | document1.txt                        (speaker a): Lorem ipsum dolor sit amet |
        2. | document1.txt                           (speaker b): Ut enim ad minim veniam |
        3. | document1.txt    (speaker c): quis nostrud exercitation ullamco laboris nisi |
        4. | document1.txt                                   (speaker x): Tincidunt vitae |
           |------------------------------------------------------------------------------|
        5. | document2.txt          (speaker f): Tortor consequat id porta nibh venenatis |
        6. | document2.txt                                          (speaker g:) Enim sed |
        7. | document2.txt                                                                |
        8. | document2.txt                                                                |
           |------------------------------------------------------------------------------|
        9. |                                          (speaker h:) Tincidunt vitae semper |
       10. |                              (speaker i): quis lectus nulla at volutpat diam |
       11. |                                        (speaker j): Quis varius quam quisque |
       12. |                                                                              |
           +------------------------------------------------------------------------------+
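
      The modification mentioned at the top of this post (period + space + opening parenthesis as the boundary) could look roughly like the sketch below. It assumes the character "|" never occurs in the text, and the variable name marked is made up for the example. Like the code above, it still creates one variable per statement, so it does not get around a variable limit.

      ```stata
      clear
      input str13 filename str80 text
      "document1.txt" "(speaker a): Dr. Smith said hello. (speaker b): Yes. Indeed."
      end

      * "|" is assumed not to occur in the text; choose another marker if it does
      assert strpos(text, "|") == 0

      * mark only the periods that are followed by a new speaker tag
      gen marked = subinstr(text, ". (", ".|(", .)
      split marked, parse("|") generate(line)
      drop marked
      reshape long line, i(filename) j(which)
      ```

      Periods inside a statement (as in "Dr. Smith" or "Yes. Indeed.") no longer start a new line, and each line keeps its closing period.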



      • #4
        Many thanks Fei Wang and Andrew for your replies.

        -I ran Fei Wang's code, but received a message of "text not found" - reading through the code, I'm not sure where the error is coming from.

        -Andrew, your approach is what I want to do - but unfortunately, the problem of too many variables persists. If I split, it will try to generate too many variables, and I will receive the message "no room to add more variables because of width".

        I wonder if there's a split alternative that works to create new observations, rather than new variables? That way, I won't run into the problem of way too many variables being generated....












        • #5
          Nate, my code is based on your example data, where there are two variables: "filename", storing the txt file names, and "text", storing the dialogs from each txt file. I assume my code will work if your complete data have the same structure.



          • #6
            deleted.
            Last edited by Bjarte Aagnes; 10 Nov 2021, 02:55.



            • #7
              To use split + reshape, you could first cut text into two parts to stay under the variable limit, and then apply split + reshape to each part. One way to make the cut, at the first "(speaker" after the midpoint:
              Code:
              * use the Unicode functions throughout so that positions are counted consistently
              gen lastpart = usubstr(usubstr(text,int(ustrlen(text)/2),.),ustrpos(usubstr(text,int(ustrlen(text)/2),.),"(speaker"),.)
              gen firstpart = usubinstr(text, lastpart,"",1)
              assert text == firstpart + lastpart



              • #8
                Mata has a ustrsplit() function:
                Code:
                clear
                
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str13 filename str235 text
                "document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
                "document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
                "document3.txt" "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                            
                end
                
                mata : 
                
                outputname = "myres.csv"
                
                fh_out = fopen(outputname, "w" )
                
                // loop over every observation instead of a hard-coded 3
                for (i=1; i<=st_nobs(); i++) {
                    
                        statements = ustrsplit(st_sdata(i,"text"),"[(]speaker") 
                        
                        filename = st_sdata(i,"filename")
                        
                        for (j=1; j<=cols(statements); j++) {
                                     
                            if ( statements[j] != "" ) {
                               
                                fput(fh_out, filename + ";" + "(speaker" + statements[j] )               
                            }
                        }
                    }
                    
                fclose(fh_out)
                        
                end
                
                import delimited using "myres.csv" , delim(";") clear
                
                format %-100s v? 
                list, clean
                Code:
                       v1              v2                                                             
                  1.   document1.txt   (speaker a): Lorem ipsum dolor sit amet.                       
                  2.   document1.txt   (speaker b): Ut enim ad minim veniam.                          
                  3.   document1.txt   (speaker c): quis nostrud exercitation ullamco laboris nisi.   
                  4.   document1.txt   (speaker x): Tincidunt vitae.                                  
                  5.   document2.txt   (speaker f): Tortor consequat id porta nibh venenatis.         
                  6.   document2.txt   (speaker g:) Enim sed.                                         
                  7.   document3.txt   (speaker h:) Tincidunt vitae semper.                           
                  8.   document3.txt   (speaker i): quis lectus nulla at volutpat diam.               
                  9.   document3.txt   (speaker j): Quis varius quam quisque.



                • #9
                  Originally posted by Nate Tamment

                  -Andrew, your approach is what I want to do - but unfortunately, the problem of too many variables persists. If I split, it will try to generate too many variables, and I will receive the message "no room to add more variables because of width".

                  I wonder if there's a split alternative that works to create new observations, rather than new variables? That way, I won't run into the problem of way too many variables being generated....
                  If you have Stata 16+, you can make use of frames to expand the number of variables available to you. In any case, the native Stata string functions still work well for your problem without having to create extra variables.

                  Code:
                  clear
                  input str13 filename str235 text
                  "document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
                  "document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
                  ""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."
                  end
                  gen conversations= length(text) - length(subinstr(text, ".", "", .))
                  expand conversations
                  bys filename: gen which=_n
                  gen wanted= substr(text, 1, strpos(text, ".") + 1)
                  gen text2=text
                  qui sum which
                  forval i=2/`r(max)'{
                      replace text2= subinstr(text2, substr(text2, 1, strpos(text2, ".") + 1), "", 1) if which>=`i'
                      replace wanted= substr(text2, 1, strpos(text2, ".") + 1) if  which>=`i'
                  }
                  Res.:

                  Code:
                  . gsort -filename wanted
                  
                  . l filename wanted, sepby(filename)
                  
                       +-------------------------------------------------------------------------------+
                       |      filename                                                          wanted |
                       |-------------------------------------------------------------------------------|
                    1. | document2.txt         (speaker f): Tortor consequat id porta nibh venenatis.  |
                    2. | document2.txt                                          (speaker g:) Enim sed. |
                       |-------------------------------------------------------------------------------|
                    3. | document1.txt                       (speaker a): Lorem ipsum dolor sit amet.  |
                    4. | document1.txt                          (speaker b): Ut enim ad minim veniam.  |
                    5. | document1.txt   (speaker c): quis nostrud exercitation ullamco laboris nisi.  |
                    6. | document1.txt                                   (speaker x): Tincidunt vitae. |
                       |-------------------------------------------------------------------------------|
                    7. |                                         (speaker h:) Tincidunt vitae semper.  |
                    8. |                             (speaker i): quis lectus nulla at volutpat diam.  |
                    9. |                                        (speaker j): Quis varius quam quisque. |
                       +-------------------------------------------------------------------------------+



                  • #10
                    Fei Wang, Bjarte, and Andrew, thanks for your very helpful responses!

                    Bjarte and Andrew: both of your approaches seem to work well in solving this issue. Bjarte, thanks for the Mata code - I've never ventured in this direction, but it works well in splitting up the variable. Andrew, thanks for the code as well as for pointing me to frames - it's a Stata feature that I clearly need to investigate further.

                    Thanks again all, the quality of the responses to queries on this list never fails to amaze.

