Using moss to cut extremely long string variables in two (based on character position)

Nate Tamment

Join Date: Jun 2020

Posts: 19
#1

09 Nov 2021, 01:21

Hello all,

I am trying to import and clean a number of documents (imported into a dataset as a single variable) for later analysis.

Each document consists of long dialogs between speakers, where each speaker is identified with parentheses. Some of the documents are very long, and include thousands of statements (which exceed the total number of variables I can add in my flavor of Stata).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str13 filename str235 text
"document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
"document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."
""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."
end

In order to process the documents, I use split text, p("):"). Once I have split the document, I reshape it to long (so that each individual statement is a separate observation).

However, as noted, some of the documents are so long that the split will generate too many variables. (The longest document has around 3,000 statements).

I have several options I am thinking about conceptually, but I'm not sure how to properly execute them.

-The easiest would be to simply cut the string in half and then run split in two different datasets. The problem, however, is that the string length varies dramatically (300,000 to 900,000 characters) and is not well correlated with the number of statements (some of the statements are very short interjections).

-Use moss text, match("):") to identify all the instances of "):" in the string. Using the string positions identified by moss, I could then cut each string at roughly the 1,500th instance of "):" and run split on the two parts separately (and so avoid the limit on adding variables).

However, I'm not sure how to use the value of _pos1500 to cut each string individually into two parts:

It would look something like this (not using code that works):

Code:

gen text1 = text
replace text= strpos(text, 1, [value of _pos1500])
replace text1= strpos(text1, [value of _pos1500], [end of the string])

I'd appreciate any advice anyone had on this problem, or another way to think about it entirely!

Thanks.

Fei Wang

Join Date: Oct 2021
Posts: 726

09 Nov 2021, 03:00

Nate, I have a clumsy way of handling your case as below. There must be better ways.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str13 filename str235 text
"document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
"document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                            
end

gen dialog = ""        //the variable storing all dialogs
gen dialog_file = ""    //which file is a dialog from
local fileno = 1    //the text file no. 
local line = 1        //the line no.

replace text = subinstr(text, ":)", "):", .)

while text[`fileno'] != "" {
    while text[`fileno'] != "" {
        set obs `=max(`line', _N)'
        replace dialog = regexs(1) if regexm(text[`fileno'], "(^\([ a-zA-Z]*\):[ a-zA-Z]*.?[ ]*).*") in `line'
        replace dialog_file = filename[`fileno'] in `line'
        replace text = subinstr(text[`fileno'], dialog[`line'], "", .) in `fileno'
        local ++line
    }
    local ++fileno
}

drop filename text
replace dialog = strtrim(dialog)

Code:

. list

     +------------------------------------------------------------------------------+
     |                                                       dialog     dialog_file |
     |------------------------------------------------------------------------------|
  1. |                     (speaker a): Lorem ipsum dolor sit amet.   document1.txt |
  2. |                        (speaker b): Ut enim ad minim veniam.   document1.txt |
  3. | (speaker c): quis nostrud exercitation ullamco laboris nisi.   document1.txt |
  4. |                                (speaker x): Tincidunt vitae.   document1.txt |
  5. |       (speaker f): Tortor consequat id porta nibh venenatis.   document2.txt |
     |------------------------------------------------------------------------------|
  6. |                                       (speaker g): Enim sed.   document2.txt |
  7. |                         (speaker h): Tincidunt vitae semper.                 |
  8. |             (speaker i): quis lectus nulla at volutpat diam.                 |
  9. |                       (speaker j): Quis varius quam quisque.                 |
     +------------------------------------------------------------------------------+

Last edited by Fei Wang; 09 Nov 2021, 03:53.


Andrew Musau

Join Date: Oct 2014
Posts: 10141

09 Nov 2021, 04:53

It seems to me that a period uniquely identifies the end of a conversation. You can modify this slightly if the end is determined by the sequence: period + space + opening parenthesis.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str13 filename str235 text
"document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
"document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                            
end

split text, p(.) g(line)
reshape long line, i(text) j(which)

Res.:

Code:

. l filename line, sepby(filename)

     +------------------------------------------------------------------------------+
     |      filename                                                           line |
     |------------------------------------------------------------------------------|
  1. | document1.txt                        (speaker a): Lorem ipsum dolor sit amet |
  2. | document1.txt                           (speaker b): Ut enim ad minim veniam |
  3. | document1.txt    (speaker c): quis nostrud exercitation ullamco laboris nisi |
  4. | document1.txt                                   (speaker x): Tincidunt vitae |
     |------------------------------------------------------------------------------|
  5. | document2.txt          (speaker f): Tortor consequat id porta nibh venenatis |
  6. | document2.txt                                          (speaker g:) Enim sed |
  7. | document2.txt                                                                |
  8. | document2.txt                                                                |
     |------------------------------------------------------------------------------|
  9. |                                          (speaker h:) Tincidunt vitae semper |
 10. |                              (speaker i): quis lectus nulla at volutpat diam |
 11. |                                        (speaker j): Quis varius quam quisque |
 12. |                                                                              |
     +------------------------------------------------------------------------------+
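The modification mentioned above (ending a statement at period + space + opening parenthesis instead of at every period) might be sketched as follows. This is only an illustration on made-up data: it assumes the pipe character never occurs in the text, and marked is a hypothetical helper variable. Like the split above, it still creates one variable per statement, so by itself it does not get around the variable limit.

```stata
clear
input str13 filename str100 text
"document1.txt" "(speaker a): Lorem ipsum. Dolor sit amet. (speaker b): Ut enim ad minim veniam."
"document2.txt" "(speaker f): Tortor consequat. (speaker g): Enim sed."
end

* mark each statement boundary (". (") with "|", assumed absent from the text
gen marked = subinstr(text, ". (", ".|(", .)

* split on the marker only, so periods inside a statement are kept
split marked, parse("|") generate(line)
drop marked

* one observation per statement
reshape long line, i(filename) j(which)
drop if line == ""
list filename line, sepby(filename)
```

A statement that itself contains ". (" (for example, a parenthetical remark after a sentence) would still be cut in two, so the boundary rule may need tightening for real transcripts.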


Nate Tamment

Join Date: Jun 2020

Posts: 19
#4

09 Nov 2021, 18:01

Many thanks Fei Wang and Andrew for your replies.

-I ran Fei Wang's code, but received a message of "text not found" - reading through the code, I'm not sure where the error is coming from.

-Andrew, your approach is what I want to do - but unfortunately, the problem of too many variables persists. If I split, it will try to generate too many variables, and I will receive the message "no room to add more variables because of width".

I wonder if there's a split alternative that works to create new observations, rather than new variables? That way, I won't run into the problem of way too many variables being generated....
Fei Wang

Join Date: Oct 2021

Posts: 726
#5

09 Nov 2021, 19:33

Nate, my code is based on your example data, where there are two variables: "filename", storing the txt file names, and "text", storing the dialogs from each txt file. I assume my code will work if your complete data have the same structure.
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#6

10 Nov 2021, 02:33

deleted.

Last edited by Bjarte Aagnes; 10 Nov 2021, 02:55.
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#7

10 Nov 2021, 05:59

To use split + reshape, you might first cut your text variable into parts to stay under the maximum-variable limit, then apply split + reshape to each part. One way to cut the text in two:

Code:

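* second part: starts at the first "(speaker" at or after the midpoint of the text
gen lastpart = usubstr(usubstr(text,int(ustrlen(text)/2),.),ustrpos(usubstr(text,int(ustrlen(text)/2),.),"(speaker"),.)
* first part: whatever precedes the second part
gen firstpart = usubinstr(text, lastpart,"",1)
assert text == firstpart + lastpart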
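If the cut should instead fall at the N-th occurrence of "):", as the original post proposed, a rough sketch without moss could look like the one below. The variable names cutpos, nextpos, part1 and part2 and the local n are hypothetical, the data are made up, and the byte-based functions strpos() and substr() are assumed to be adequate (for non-ASCII text the ustr equivalents would be the safer choice).

```stata
clear
input str13 filename str100 text
"document1.txt" "(speaker a): Lorem ipsum. (speaker b): Ut enim. (speaker c): Quis nostrud."
"document2.txt" "(speaker f): Tortor."
end

* hypothetical cut point: 2 here, roughly 1500 in the original question
local n 2

* cutpos tracks the position of the last character consumed so far
gen long cutpos = 0
gen long nextpos = .
forvalues k = 1/`n' {
    * relative position of the next "):" after cutpos (0 if none is left)
    replace nextpos = strpos(substr(text, cutpos + 1, .), "):")
    * advance to the ":" of that match, or to the end of the text if none
    replace cutpos = cond(nextpos > 0, cutpos + nextpos + 1, length(text))
}

* part1 ends with the n-th "):"; part2 holds the remainder (possibly empty)
gen strL part1 = substr(text, 1, cutpos)
gen strL part2 = substr(text, cutpos + 1, .)
assert text == part1 + part2
list filename part1 part2
```

Because the cut falls immediately after a "):", running split with p("):") on each part should give the same pieces, in the same order, as one split over the whole string. Documents with fewer than n matches end up entirely in part1.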

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

10 Nov 2021, 08:01

Mata has a ustrsplit() function:

Code:

clear

* Example generated by -dataex-. For more info, type help dataex
clear
input str13 filename str235 text
"document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
"document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
"document3.txt" "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                            
end

mata : 

outputname = "myres.csv"

fh_out = fopen(outputname, "w" )

for (i=1; i<=st_nobs(); i++) {    // loop over all observations, not a fixed 3
    
        statements = ustrsplit(st_sdata(i,"text"),"[(]speaker") 
        
        filename = st_sdata(i,"filename")
        
        for (j=1; j<=cols(statements); j++) {
                     
            if ( statements[j] != "" ) {
               
                fput(fh_out, filename + ";" + "(speaker" + statements[j] )               
            }
        }
    }
    
fclose(fh_out)
        
end

import delimited using "myres.csv" , delim(";") clear

format %-100s v? 
list, clean

Code:

       v1              v2                                                             
  1.   document1.txt   (speaker a): Lorem ipsum dolor sit amet.                       
  2.   document1.txt   (speaker b): Ut enim ad minim veniam.                          
  3.   document1.txt   (speaker c): quis nostrud exercitation ullamco laboris nisi.   
  4.   document1.txt   (speaker x): Tincidunt vitae.                                  
  5.   document2.txt   (speaker f): Tortor consequat id porta nibh venenatis.         
  6.   document2.txt   (speaker g:) Enim sed.                                         
  7.   document3.txt   (speaker h:) Tincidunt vitae semper.                           
  8.   document3.txt   (speaker i): quis lectus nulla at volutpat diam.               
  9.   document3.txt   (speaker j): Quis varius quam quisque.
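A variant that skips the CSV round trip, so that a semicolon inside a statement cannot be mistaken for the delimiter, might look like the sketch below. It is an untested-on-real-data illustration: the data are made up, the names files, lines, parts and statement are hypothetical, and the str244 width for filename is an arbitrary assumption.

```stata
clear
input str13 filename str100 text
"document1.txt" "(speaker a): Lorem ipsum. (speaker b): Ut enim. (speaker c): Quis nostrud."
"document2.txt" "(speaker f): Tortor. (speaker g): Enim sed."
end

mata:
// one row per statement: source file and statement text
files = J(0, 1, "")
lines = J(0, 1, "")

for (i = 1; i <= st_nobs(); i++) {
    parts = ustrsplit(st_sdata(i, "text"), "[(]speaker")
    for (j = 1; j <= cols(parts); j++) {
        if (parts[j] != "") {
            files = files \ st_sdata(i, "filename")
            // restore the delimiter that ustrsplit() removed
            lines = lines \ strtrim("(speaker" + parts[j])
        }
    }
}

// replace the data in memory with one observation per statement
stata("clear")
st_addobs(rows(lines))
(void) st_addvar("str244", "filename")
(void) st_addvar("strL", "statement")
st_sstore(., "filename", files)
st_sstore(., "statement", lines)
end

list, clean
```

Growing the vectors row by row is simple but not fast; for documents with thousands of statements it should still be tolerable, though preallocating would be the more careful design.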


Andrew Musau

Join Date: Oct 2014
Posts: 10141

10 Nov 2021, 08:38

Originally posted by Nate Tamment

-Andrew, your approach is what I want to do - but unfortunately, the problem of too many variables persists. If I split, it will try to generate too many variables, and I will receive the message "no room to add more variables because of width".

I wonder if there's a split alternative that works to create new observations, rather than new variables? That way, I won't run into the problem of way too many variables being generated....

If you have Stata 16+, you can make use of frames to expand the number of variables available to you. In any case, the native Stata string functions still work well for your problem without having to create extra variables.

Code:

clear
input str13 filename str235 text
"document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
"document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."
end
gen conversations= length(text) - length(subinstr(text, ".", "", .))
expand conversations
bys filename: gen which=_n
gen wanted= substr(text, 1, strpos(text, ".") + 1)
gen text2=text
qui sum which
forval i=2/`r(max)'{
    replace text2= subinstr(text2, substr(text2, 1, strpos(text2, ".") + 1), "", 1) if which>=`i'
    replace wanted= substr(text2, 1, strpos(text2, ".") + 1) if  which>=`i'
}

Res.:

Code:

. gsort -filename wanted

. l filename wanted, sepby(filename)

     +-------------------------------------------------------------------------------+
     |      filename                                                          wanted |
     |-------------------------------------------------------------------------------|
  1. | document2.txt         (speaker f): Tortor consequat id porta nibh venenatis.  |
  2. | document2.txt                                          (speaker g:) Enim sed. |
     |-------------------------------------------------------------------------------|
  3. | document1.txt                       (speaker a): Lorem ipsum dolor sit amet.  |
  4. | document1.txt                          (speaker b): Ut enim ad minim veniam.  |
  5. | document1.txt   (speaker c): quis nostrud exercitation ullamco laboris nisi.  |
  6. | document1.txt                                   (speaker x): Tincidunt vitae. |
     |-------------------------------------------------------------------------------|
  7. |                                         (speaker h:) Tincidunt vitae semper.  |
  8. |                             (speaker i): quis lectus nulla at volutpat diam.  |
  9. |                                        (speaker j): Quis varius quam quisque. |
     +-------------------------------------------------------------------------------+
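The period count above treats every period as the end of a statement. If statements can contain several sentences, counting the speaker delimiter "):" may be the safer basis for expand. A small sketch on made-up data, with n_statements as a hypothetical variable name and the ":)" normalization borrowed from the earlier reply:

```stata
clear
input str13 filename str100 text
"document1.txt" "(speaker a): Lorem ipsum. Dolor sit amet. (speaker b): Ut enim."
"document2.txt" "(speaker f): Tortor. (speaker g:) Enim sed. (speaker h): Quis."
end

* normalize the malformed ":)" speaker tags to "):"
replace text = subinstr(text, ":)", "):", .)

* each "):" is 2 characters, so the length difference is twice the count
gen long n_statements = (length(text) - length(subinstr(text, "):", "", .))) / 2
list filename n_statements
```

Note that the normalization would also rewrite any ":)" that appears inside a statement, so it assumes the transcripts contain no such sequences outside speaker tags.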


Nate Tamment

Join Date: Jun 2020

Posts: 19
#10

10 Nov 2021, 18:44

Fei Wang, Bjarte, and Andrew, thanks for your very helpful responses!

Bjarte and Andrew: both of your approaches seem to work well in solving this issue. Bjarte, thanks for the Mata code - I've never ventured in this direction, but this works well in splitting up the variable. Andrew, thanks for the code as well as for pointing me to frames - it's a Stata feature that I clearly need to investigate further.

Thanks again all, the quality of the responses to queries on this list never fails to amaze.