  • Doing a classification using text analysis in Stata?

    Dear Statalist, I have a list of codes and descriptions (see below), and I would like to classify these codes as green or non-green according to whether their descriptions relate to the environment. However, since I do not know every possible combination of words linked to environmentally friendly technologies, I think it would be better to use an algorithm for this.

    The idea would be to start from a list of keywords (e.g., environmental, sustainable, renewable energy), and then tell Stata to look for other keywords related to these initial keywords in the whole list of codes and descriptions (e.g., eco-friendly, wind power,…).
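    As a rough illustration of this idea, here is a minimal Python sketch, not a Stata solution. It flags descriptions that contain a seed keyword and then counts which other words appear alongside the seeds, as candidates for review. The sample descriptions, the seed list, and the function names are all hypothetical, and only single-word seeds are handled.

```python
import re
from collections import Counter

# Hypothetical sample data; the real codes and descriptions are not reproduced here.
descriptions = [
    "renewable energy from wind power",
    "wind power turbines and solar panels",
    "telecommunications infrastructure services",
    "sustainable packaging, eco-friendly materials",
]

# Hypothetical single-word seed keywords.
seeds = {"renewable", "sustainable", "environmental"}


def tokenize(text):
    """Lowercase the text and return its words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def is_green(text, keywords):
    """Return True if any word of the description is a seed keyword."""
    return any(token in keywords for token in tokenize(text))


def cooccurring_words(texts, keywords):
    """Count non-seed words that appear in descriptions containing a seed."""
    counts = Counter()
    for text in texts:
        tokens = set(tokenize(text))
        if tokens & keywords:
            counts.update(tokens - keywords)
    return counts


flags = [is_green(d, seeds) for d in descriptions]
candidates = cooccurring_words(descriptions, seeds)
```

    The words in `candidates` (for example "wind" or "eco-friendly" in this toy data) would still need manual review before being added to the seed list, since common words such as "from" co-occur as well.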

    I have been looking for a way to do this in Stata but have not found anything. Maybe you can point me in the right direction or give me some feedback on how to proceed.

    Thanks in advance!

    [Image: Captura de pantalla 2023-12-07 134133.png]


  • #2
    You might want to have a look at "help txttool", and especially at the linked Stata Journal article, which is freely available via the Stata website.


    • #3
      Dear Rich Goldstein, thanks for your advice. I have looked at txttool, but I do not think it will do what I need. However, just to see what it offers, I tried the following, and it gave an error (see below). Could this be because of the length of the variable names?

      Given that I have a large number of words (see also below), I am not sure Stata can create a variable for each of them, which I believe is what this command does. I tried adding the "stopwords" option to the command, but that gave another error (option stopwords incorrectly specified). Splitting the dataset might be worth trying.

      I am open to other alternatives, thanks!

      Code:
      . txttool( Description ), replace
      Input:   9436 unique words, 31691 total words
      Output:  6229 unique words, 31682 total words
      Total time: 30.64 seconds
      Code:
      . txttool( Description ), stem generate(newtext) bagwords prefix("w_")
      w_telecommunicationsinfrastructur invalid name
                   st_addvar():  3300  argument out of range
                     wordbag():     -  function returned error
                  mm_txttool():     -  function returned error
                       <istmt>:     -  function returned error
      r(3300);
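      Stata variable names are limited to 32 characters, and the name in the error message, w_telecommunicationsinfrastructur, has 33, which is consistent with the "invalid name" error. Below is a minimal Python sketch for spotting stems that would exceed the limit once a prefix is added. It is not part of txttool, and the stem list and helper name are hypothetical.

```python
# Stata limits variable names to 32 characters.
STATA_MAX_NAME_LEN = 32


def too_long(stems, prefix="w_", limit=STATA_MAX_NAME_LEN):
    """Return the prefixed names that would exceed Stata's name-length limit."""
    return [prefix + stem for stem in stems if len(prefix + stem) > limit]


# Hypothetical stems; only the first one is taken from the error message above.
stems = ["telecommunicationsinfrastructur", "energi", "wind"]
offending = too_long(stems)
print(offending)
```

      A shorter prefix might avoid this particular name, though longer stems could still fail.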


      • #4
        I haven't worked with these programs myself, but there are a number of Stata Journal articles and programs for text analysis. In addition to txttool, search in Stata for lsemantica, ngram, and ldagibbs.


        • #5
          I have a strong sense that AI is on the verge of surpassing traditional text analysis tools, especially in classification tasks. Currently, AI excels at this. Why not get in early and ride the wave?


          • #6
            Dear Erik Ruzek and Andrew Musau, thanks for your answers. I am looking at "lsemantica", but I am not sure it will do what I need. It does have a word-similarity analysis, which cannot train the program to find new related words or phrases but can find similar words. Maybe that could help.

            With respect to the AI, I totally agree. In fact, I know researchers do this type of analysis using Python. However, as someone with no knowledge of Python, I thought of taking advantage of the only program I have more or less mastered (Stata), even though I am not an advanced Stata user.

            So, should I rule out Stata for this type of analysis? Can you recommend a Python tool to start with?

            Thanks again!


            • #7
              I was thinking more of a chatbot. You give it a list of classification instructions together with the text to be classified and ask it to create a table with the classifications. The table can be in CSV, TeX, or any other text format you like, which you can then import back into Stata. See, e.g., https://www.designmind.com/blog/clas...-chatgpt-part1.


              • #8
                Originally posted by Doris Rivera:

                With respect to the AI, I totally agree. In fact, I know researchers do this type of analysis using Python. However, as someone with no knowledge of Python, I thought of taking advantage of the only program I have more or less mastered (Stata), even though I am not an advanced Stata user.

                Thanks again!

                If you have the time to commit, I'd recommend this sequence of Coursera courses by DeepLearning.AI: https://www.coursera.org/specializations/deep-learning

                The course on sequence models is particularly relevant to what you want to do with your keywords. You'll learn about word embeddings and how to use them to identify keywords related to the ones you mentioned. The exercises also teach you the basics of Python and the libraries NumPy, pandas, TensorFlow, and Keras. The key trick for your problem is to use something like word2vec to get embeddings for the words in your Description column, then calculate their distances to the embeddings of your chosen keywords and keep the closest ones.
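                The distance step can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the 3-dimensional vectors are invented toy values rather than real word2vec output, and the function names and the 0.9 threshold are hypothetical.

```python
import math

# Toy 3-dimensional vectors invented for illustration; real embeddings
# would come from a trained model such as word2vec.
embeddings = {
    "renewable": [0.9, 0.1, 0.0],
    "wind": [0.8, 0.2, 0.1],
    "solar": [0.85, 0.15, 0.05],
    "telecom": [0.0, 0.1, 0.9],
    "banking": [0.1, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_words(seed, vectors, threshold=0.9):
    """Return (word, similarity) pairs at or above the threshold, most similar first."""
    target = vectors[seed]
    scored = [(word, cosine(target, vec)) for word, vec in vectors.items() if word != seed]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


related = closest_words("renewable", embeddings)
```

                With these toy vectors, "solar" and "wind" pass the threshold for the seed "renewable", while "telecom" and "banking" do not. In practice the threshold would need tuning against a hand-checked sample.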

                Lots of work, especially if you were expecting a quick solution, but well worth it. All the chatbots are based in large part on neural networks and, in particular, sequence models. You'll get at least a tiny peek into part of what those models actually do.

                Stata has Python integration now, so you can use Python tools for AI and NNs within the Stata environment. I haven't played with that yet, but I hope to start messing with it at the start of the new year.



                • #9
                  Thanks for your replies. I tried ChatGPT, copying and pasting the list in parts and then asking for the list with the classification, but the results were not as good as I expected. Also, I do not think the results can be reproduced exactly each time, which could be problematic for academic research.

                  About the Coursera courses: yes, that is a lot of work, but it is something that must be done in this new age of AI. Unfortunately, it means dedicating months to something without knowing whether it will be useful for the current problem. I thought the problem could be solved with one or two commands, but I can see that it involves much more. Thanks anyway for the advice and links!
