  • txttool stopwords and stemming

    Hi all, I have a couple of related questions about the txttool command that I can't find answered on the journal page or in the help file.

    - With regard to the stopwords option, both refer to a list of frequently used English words supplied with txttool, but neither makes any suggestion on how to find, access, or use this list. How can I find and use this packaged list of stop words? Just running the command as in the journal:
    Code:
    txttool txtexample, gen(stopped) subwords("subwordexample.txt") stopwords("stopwordexample.txt")
    gives the expected error "Stopwords file stopwordexample.txt not found", as I never created or saved that file (I did create the file subwordexample.txt as the tab-delimited text file shown on the journal page, so no error there).

    - The command has the option "stem", which calls the Porter stemmer implementation to stem all words in the variable. Is there a way to swap out this Porter stemmer? That is, I am working with specific data for which another stemmer (a .csv file) has been made, which should be able to "be used in conjunction with another stemmer, such as the Porter algorithm". The suggested stemmer is published and explained here, in case I failed to include some important information about it.

    In the spirit of the XY problem, some information about what I am doing: I want to analyze the effect of nonprofit mission statements on some organizational and financial aspects. To do so, I plan to "quantify" the mission statements in several ways, such as their strength, positive/negative emotions, and the presence of certain values. If I understand correctly, the first step is to clean the text data by removing misspellings and stopwords and then stemming it, so that I can apply some dictionaries to the text.

    Thanks in advance,
    Johannes de Ruig

    PS: I am using Stata 18.0 on Windows
    PPS: I did read this Statalist post about "stopwords removal with txttool", but it seemed to me to be about a different issue, so I thought it better to start a new thread, apologies if that was the wrong conclusion.

  • #2
    I guess the problem is that you haven't downloaded the ancillary files of txttool, or haven't put them in the right directory. So for your first question, first download these files and then place them as described below.

    Code:
    . ssc describe txttool
    
    -----------------------------------------------------------------------------------------------------------------
    package txttool from http://fmwww.bc.edu/repec/bocode/t
    -----------------------------------------------------------------------------------------------------------------
    
    TITLE
          'TXTTOOL': module providing utilities for text analysis
    
    DESCRIPTION/AUTHOR(S)
          
           txttool provides a set of tools for managing and analyzing
          free-form text. The program integrates    several built-in Stata
          functions with new text capabilities, including a utility to
          create a    bag-of-words representation of text and an
          implementation of Porter's word stemming algorithm.
          
          KW: text
          KW: free-form text
          KW: bag-of-words
          KW: word stemming
          
          Requires: Stata version 10
          
          Distribution-Date: 20150216
          
          Author: Unislawa Williams , Spelman College
          Support: email [email protected]
          
    
    INSTALLATION FILES                              (type net install txttool)
          txttool.ado
          txttool.sthlp
          ../l/lmmtxttool.mlib
          ../l/lmmtxttool_source.sthlp
    
    ANCILLARY FILES                                 (type net get txttool)
          ../s/stopwordexample.txt
          ../s/subwordexample.txt
    -----------------------------------------------------------------------------------------------------------------
    (type ssc install txttool to install)
    Code:
    * Suppose you have not downloaded stopwordexample.txt and subwordexample.txt, or have not put them in the right place.
    * Then you will get an error message:
    
    sysuse auto, clear
    replace make=make + " " + "and" in 1/20
    replace make=make + " " + "because" in 21/40
    txttool make, gen(wanted) stopword(stopwordexample.txt)
    Stopwords file stopwordexample.txt not found
    r(198);
    
    * Suppose you have downloaded stopwordexample.txt and subwordexample.txt, and put them in your current working directory.
    * Then you do not need to specify the path to these two files:
    
    pwd
    sysuse auto, clear
    replace make=make + " " + "and" in 1/20
    replace make=make + " " + "because" in 21/40
    txttool make, gen(wanted) stem stopwords(stopwordexample.txt) subwords(subwordexample.txt)
    
    * Suppose you have downloaded stopwordexample.txt and subwordexample.txt, and put them in your PLUS system directory.
    * In this case you must specify the path to these two files, say C:/ado/plus/s/
    
    sysuse auto, clear
    replace make=make + " " + "and" in 1/20
    replace make=make + " " + "because" in 21/40
    txttool make, gen(wanted) stem stopwords(C:/ado/plus/s/stopwordexample.txt) subwords(C:/ado/plus/s/subwordexample.txt)
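
    If it is unclear where the ancillary files ended up, a small sketch along these lines may help. It assumes the file sits somewhere on the ado-path (the current directory or the PLUS tree). findfile is an official Stata command; the local macro stopfile is only an illustrative name, and whether the resulting full path works on every setup is not something tested here.

    ```stata
    sysuse auto, clear
    replace make = make + " " + "and" in 1/20

    * findfile searches the ado-path (including the current directory and the
    * letter subdirectories of PLUS) and returns the full file name in r(fn)
    capture noisily findfile stopwordexample.txt
    if _rc == 0 {
        * stopfile is a hypothetical local macro name used for illustration
        local stopfile "`r(fn)'"
        txttool make, gen(wanted) stopwords("`stopfile'")
    }
    else {
        display as error "stopwordexample.txt is not on the ado-path; try net get txttool"
    }
    ```

    This avoids hard-coding C:/ado/plus/s/, which differs between installations.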
    Last edited by Chen Samulsion; 08 May 2025, 06:02.


    • #3
      Thanks a lot! I did not know about ancillary files and that they work like this.
      Simply doing
      Code:
      net get txttool
      easily solved my first question.
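
      For later readers: net get copies the ancillary files into the current working directory, so it is worth checking pwd first. A minimal sketch, using the package URL shown in the ssc describe output in #2. The from() option should only be needed if no net location has been set earlier in the session, and the replace option is added here on the assumption that overwriting older copies is acceptable.

      ```stata
      * check where the ancillary files will be written
      pwd

      * download stopwordexample.txt and subwordexample.txt to that directory
      net get txttool, from(http://fmwww.bc.edu/repec/bocode/t) replace

      * verify that both files arrived (confirm file stops with an error if one is missing)
      confirm file "stopwordexample.txt"
      confirm file "subwordexample.txt"

      * alternative: install the package and fetch the ancillary files in one step
      * ssc install txttool, all replace
      ```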

      My second question still stands, so if anyone has any input there, that would be greatly appreciated!


      • #4
        As to your second question: not without Mata programming. The source of the Mata functions used by txttool is in help lmmtxttool_source. If you know how to use that to change the program, then by all means do so. If you don't, then this is a bigger project. If you want to do it, then you should expect to invest quite a bit of time in it. This book ( https://www.stata-press.com/books/mata-book/ ) will prove helpful.
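
        Short of editing the Mata source, one possible workaround is sketched below. It is an untested assumption rather than a documented feature. If the custom stemmer is essentially a lookup table of words and their stems, it could be converted to a tab-delimited file and passed to subwords(), with stem still applying the Porter algorithm. The file name customstems.csv, its two-column layout, and the idea that subwords() accepts such a list in the same format as subwordexample.txt are all assumptions; help txttool should be checked for the exact file format and for the order in which substitution and stemming are applied.

        ```stata
        * customstems.csv is a hypothetical two-column file: word, replacement stem
        preserve
        import delimited using "customstems.csv", varnames(nonames) stringcols(_all) clear

        * write it back out as tab-delimited text without a header row
        export delimited using "customstems.txt", delimiter(tab) novarnames replace
        restore

        * apply the custom substitutions together with the built-in Porter stemmer
        * (txtexample is the text variable from #1; stemmed is an illustrative name)
        txttool txtexample, gen(stemmed) subwords("customstems.txt") stem
        ```

        This does not replace the Porter stemmer itself; it only layers a word list on top of it.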

        If you are already playing around with the source code, you could rewrite the wordbag() function in terms of associative arrays, and speed up the txttool command for larger texts ( https://www.statalist.org/forums/for...nd-inefficient )
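
        To give an idea of what counting words with an associative array looks like in Mata, here is a minimal sketch. It is not txttool's wordbag() function and does not reproduce its output; countwords() is a hypothetical helper that only illustrates the asarray() functions, using the auto data as a stand-in for real text.

        ```stata
        mata:
        // countwords() is a hypothetical helper, not part of txttool
        transmorphic countwords(string colvector docs)
        {
            transmorphic     A
            string rowvector words
            real scalar      i, j

            A = asarray_create()        // string keys, one dimension
            asarray_notfound(A, 0)      // words not seen yet count as 0
            for (i = 1; i <= rows(docs); i++) {
                words = tokens(docs[i]) // split the text on blanks
                for (j = 1; j <= cols(words); j++) {
                    // look up the current count and store it incremented by one
                    asarray(A, words[j], asarray(A, words[j]) + 1)
                }
            }
            return(A)
        }
        end

        sysuse auto, clear
        mata: counts = countwords(st_sdata(., "make"))
        mata: asarray(counts, "Buick")   // number of times "Buick" occurs in make
        ```

        Each lookup and update goes through the array's key index rather than a scan over a growing word list, which is where the speed gain for larger texts would come from.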

        I realize that this is not the answer you were hoping for, but I hope it still helps.
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------


        • #5
          Thanks for the quick response!

          As you said, not the answer I was hoping for, but an answer either way.
          I don't think I will have the time to dive into Mata, but you never know!
