txttool stopwords and stemming

Johannes de Ruig

Join Date: Jun 2024

Posts: 9
#1

08 May 2025, 03:50

Hi all, I have a couple of related questions about the txttool command, for which I cannot find an answer in the journal page or in the help file.

- Regarding the stopwords option: both sources refer to a list of frequently used English words supplied with txttool, but neither suggests how to find, access, or use this list. How can I find and use this packaged list of stop words? Just running the command as in the journal:

Code:

txttool txtexample, gen(stopped) subwords("subwordexample.txt") stopwords("stopwordexample.txt")

gives the expected error "Stopwords file stopwordexample.txt not found", as I never created or saved that file. (I did create the file subwordexample.txt as the tab-delimited text file shown in the journal page, so no error there.)

- The command has the option "stem", which calls the Porter stemmer implementation to stem all words in the variable. Is there a way to change this Porter stemmer? I am working with data for which a dedicated stemmer (a .csv file) has been made, one that should be able to "be used in conjunction with another stemmer, such as the Porter algorithm". The suggested stemmer is published and explained here, in case I failed to include some important information about it.

In the spirit of the XY problem, some information about what I am doing: I want to analyze the effect of nonprofit mission statements on some organizational and financial aspects. To do so, I plan to "quantify" the mission statements in several ways, such as their strength, positive/negative emotions, and the presence of certain values. If I understand correctly, the first step is cleaning the text data by removing misspellings and stopwords, and then stemming it, so that I can apply some dictionaries to the text data.

Thanks in advance,
Johannes de Ruig

PS: I am using Stata 18.0 on Windows
PPS: I did read this Statalist post about "stopwords removal with txttool", but it seemed to me to be about a different issue, so I thought it better to start a new thread. Apologies if that was the wrong conclusion.

Chen Samulsion

Join Date: Jan 2018
Posts: 923
#2

08 May 2025, 05:58

I guess the problem is that you didn't download the ancillary files of txttool, or you didn't put them in the right directory. So for your first question, first download these files and place them where txttool can find them, as shown below.

Code:

. ssc describe txttool

-----------------------------------------------------------------------------------------------------------------
package txttool from http://fmwww.bc.edu/repec/bocode/t
-----------------------------------------------------------------------------------------------------------------

TITLE
      'TXTTOOL': module providing utilities for text analysis

DESCRIPTION/AUTHOR(S)
      
       txttool provides a set of tools for managing and analyzing
      free-form text. The program integrates    several built-in Stata
      functions with new text capabilities, including a utility to
      create a    bag-of-words representation of text and an
      implementation of Porter's word stemming algorithm.
      
      KW: text
      KW: free-form text
      KW: bag-of-words
      KW: word stemming
      
      Requires: Stata version 10
      
      Distribution-Date: 20150216
      
      Author: Unislawa Williams , Spelman College
      Support: email [email protected]
      

INSTALLATION FILES                              (type net install txttool)
      txttool.ado
      txttool.sthlp
      ../l/lmmtxttool.mlib
      ../l/lmmtxttool_source.sthlp

ANCILLARY FILES                                 (type net get txttool)
      ../s/stopwordexample.txt
      ../s/subwordexample.txt
-----------------------------------------------------------------------------------------------------------------
(type ssc install txttool to install)

Code:

*Suppose you have not downloaded stopwordexample.txt and subwordexample.txt, or have not put them in the right place
*then you will get an error message

sysuse auto, clear
replace make=make + " " + "and" in 1/20
replace make=make + " " + "because" in 21/40
txttool make, gen(wanted) stopword(stopwordexample.txt)
Stopwords file stopwordexample.txt not found
r(198);

*Suppose you have downloaded stopwordexample.txt and subwordexample.txt, and put them in your current working directory
*then you do NOT need to specify the path of these two files

pwd
sysuse auto, clear
replace make=make + " " + "and" in 1/20
replace make=make + " " + "because" in 21/40
txttool make, gen(wanted) stem stopwords(stopwordexample.txt) subwords(subwordexample.txt)

*Suppose you have downloaded stopwordexample.txt and subwordexample.txt, and put them in your PLUS system directory
*in this case you must specify the path of these two files, say C:/ado/plus/s/

sysuse auto, clear
replace make=make + " " + "and" in 1/20
replace make=make + " " + "because" in 21/40
txttool make, gen(wanted) stem stopwords(C:/ado/plus/s/stopwordexample.txt) subwords(C:/ado/plus/s/subwordexample.txt)
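
As an alternative to hardcoding C:/ado/plus/s/, the path could probably be looked up with Stata's findfile command, which searches the adopath (including its letter subdirectories) and returns the full path in r(fn). This is only a minimal sketch, assuming the two files really were copied to a directory on the adopath; the local macro names stopfile and subfile are made up for illustration.

```stata
* Sketch: look up the ancillary files on the adopath instead of typing the path
* Assumption: both files sit in a directory on the adopath, e.g. PLUS/s/
sysuse auto, clear
replace make = make + " " + "and" in 1/20
replace make = make + " " + "because" in 21/40

* findfile exits with an error if the file is not found; otherwise r(fn) holds the full path
findfile stopwordexample.txt
local stopfile "`r(fn)'"
findfile subwordexample.txt
local subfile "`r(fn)'"

txttool make, gen(wanted) stem stopwords("`stopfile'") subwords("`subfile'")
```

The quotes around the macros matter if the path contains spaces.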

Last edited by Chen Samulsion; 08 May 2025, 06:02.

Johannes de Ruig

Join Date: Jun 2024

Posts: 9
#3

08 May 2025, 06:46

Thanks a lot! I did not know about ancillary files and that they work like this.
Simply doing

Code:

net get txttool

easily solved my first question.

My second question still stands, so if anyone has any input there, that would be greatly appreciated!
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#4

08 May 2025, 08:10

As to your second question: not without Mata programming. The source of the Mata functions used by txttool is in help lmmtxttool_source. If you know how to use that to change the program, then by all means do so. If you don't, then this is a bigger project. If you want to do it, then you should expect to invest quite a bit of time in it. This book ( https://www.stata-press.com/books/mata-book/ ) will prove helpful.
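
A possible workaround that leaves the Porter stemmer untouched, offered only as a sketch: if the custom stemmer is essentially a two-column lookup table (word, replacement), it might be saved as a tab-delimited file and passed to the subwords() option, with stem applied as usual. The file name customstems.txt, the variable names, and the toy entries below are all hypothetical, and the sketch assumes that txttool applies the substitutions before it stems, which is worth checking against the help file.

```stata
* Hypothetical lookup table (word, replacement); in practice this would come
* from the published .csv, for example read in with import delimited
clear
input str20 word str20 stem
"charities" "charity"
"donations" "donation"
end
* Write it as a tab-delimited file without a header row, as subwords() expects
export delimited using "customstems.txt", delimiter(tab) novarnames replace

* Toy text variable to try it on
clear
input str60 mission
"we support charities through donations"
end

* Assumption: the substitutions from subwords() are made before Porter stemming
txttool mission, gen(cleaned) subwords("customstems.txt") stem
list cleaned
```

Whether this reproduces what the published stemmer intends depends on how its .csv is laid out, so the result should be inspected on a few known words first.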

If you are already playing around with the source code, you could rewrite the wordbag() function in terms of associative arrays, and speed up the txttool command for larger texts ( https://www.statalist.org/forums/for...nd-inefficient )
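
To give an idea of what "in terms of associative arrays" means, here is a minimal Mata sketch that counts word frequencies with asarray(). It is not the txttool source and not a drop-in replacement for wordbag(); the function name countwords() is made up for illustration.

```stata
* Drop an earlier copy of the (hypothetical) function so the block can be rerun
capture mata: mata drop countwords()

mata:
// Hypothetical helper, not part of txttool: count words with an associative array
void countwords(string scalar txt)
{
    transmorphic A
    string rowvector words
    string colvector keys
    real scalar i

    A = asarray_create("string", 1)    // string keys, one key dimension
    asarray_notfound(A, 0)             // a word not yet seen has count 0
    words = tokens(txt)                // split the text on whitespace

    // One pass over the words: look up the current count and add 1
    for (i = 1; i <= cols(words); i++) {
        asarray(A, words[i], asarray(A, words[i]) + 1)
    }

    // Print the counts in alphabetical order
    keys = sort(asarray_keys(A), 1)
    for (i = 1; i <= rows(keys); i++) {
        printf("%s: %g\n", keys[i], asarray(A, keys[i]))
    }
}

countwords("the cat and the hat and the bat")
end
```

The point of the design is that each word costs one hash lookup, instead of a search through the list of words seen so far.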

I realize that this is not the answer you were hoping for, but I hope it still helps.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Johannes de Ruig

Join Date: Jun 2024

Posts: 9
#5

08 May 2025, 08:42

Thanks for the quick response!

As you said, not the answer I was hoping for, but an answer either way.
I don't think I will have the time to dive into Mata, but you never know!