Hi all, I have a couple of related questions about the txttool command, for which I cannot find answers on the journal page or in the help file.
-With regard to the stopwords option, both refer to a list of frequently used English words supplied with txttool, but neither makes any suggestion on how to find, access, or use this list. How can I find and use this packaged list of stop words? Just running the command as in the journal:
Code:
txttool txtexample, gen(stopped) subwords("subwordexample.txt") stopwords("stopwordexample.txt")
gives the expected error "Stopwords file stopwordexample.txt not found", as I never created or saved that file. (I did create the file subwordexample.txt as the tab-delimited text file shown on the journal page, so there is no error there.)
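For reference, a minimal sketch of one possible fallback for the "not found" error: writing a small stop word file by hand before calling txttool. It assumes the stop word file is plain text with one word per line, which is an assumption and not something confirmed by the help file. The words listed are placeholders, not the list packaged with txttool.

```stata
* Hypothetical workaround: create a small stop word file in the working directory.
* Assumption: txttool expects plain text with one stop word per line.
capture confirm file "stopwordexample.txt"
if _rc {
    tempname fh
    file open `fh' using "stopwordexample.txt", write text replace
    * Placeholder words only; this is not the list supplied with txttool.
    foreach w in the a an and or but {
        file write `fh' "`w'" _n
    }
    file close `fh'
}

* Same call as in the journal example, now that the file exists.
txttool txtexample, gen(stopped) subwords("subwordexample.txt") stopwords("stopwordexample.txt")
```

This only sidesteps the error; it does not answer where the packaged list itself is stored.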
-The command has the option "stem", which calls the Porter stemmer implementation to stem all words in the variable. Is there a way to change this Porter stemmer? That is, I am looking at specific data for which another stemmer (a .csv file) has been made, one that should be able to "be used in conjunction with another stemmer, such as the Porter algorithm". The suggested stemmer is published and explained here, in case I failed to include some important information about it.
In the spirit of the XY problem, some information about what I am doing: I want to analyze the effect of nonprofit mission statements on some organizational/financial aspects. To do so, I plan to "quantify" the mission statements in several ways, such as their strength, positive/negative emotions, and the presence of certain values. If I understand correctly, the first step is to clean the text data by removing misspellings and stopwords and then stemming it, so that I can apply some dictionaries to the text data.
Thanks in advance,
Johannes de Ruig
PS: I am using Stata 18.0 on Windows
PPS: I did read this Statalist post about "stopwords removal with txttool", but it seemed to me to be about a different issue, so I thought it better to start a new thread. Apologies if that was the wrong conclusion.