
  • Detecting language of string

    Hi all,

    I have a dataset with string variables that contain text in multiple languages (e.g., obs1 can be in English, obs2 in Spanish, obs3 in Chinese, etc.).
    Is it possible to detect the language used for each observation? There seem to be several Python packages (e.g., Polyglot, Langdetect) that can do so, but I have been unable to find anything equivalent for Stata.
    For my purposes, it would be sufficient to detect whether the language used is English or not.

    I'm using Stata version 16.

    Any help is much appreciated!

  • #2
    Hey Jurian,

    I am not aware of a similar package for Stata. In more recent versions of Stata, however, you can run Python code from within Stata, and you could then use the packages you mentioned.

    Code:
    help python
    Best,
    Sebastian



    • #3
      Just to see how it performed, I tried this crude heuristic on sentences.csv from tatoeba.org:

      Code:
      gen english = !ustrregexm(sentence,"[\u00c0-\u1fff\u2c00-\uffff]")
      Of 1,819,501 English sentences, this incorrectly categorized 1,638 (99.9% correct).

      Of 9,715,431 non-English sentences, this incorrectly categorized 2,946,576 (69.7% correct). Some troublesome languages were Indonesian, Tagalog, Dutch, and Italian.

      So I probably wouldn't rely on this, but the false-negative rate is nevertheless good for a one-liner.
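
For readers who want to try the same idea outside Stata, here is a minimal Python sketch of the character-range heuristic above. It reuses the character class from the one-liner; the function name `looks_english` and the sample sentences are made up for illustration and are not taken from sentences.csv. It is only a sketch of the heuristic, not a real language detector.

```python
import re

# Same character class as the Stata one-liner: any character in
# U+00C0-U+1FFF or U+2C00-U+FFFF counts as evidence of non-English text.
NON_ENGLISH_CHARS = re.compile("[\u00c0-\u1fff\u2c00-\uffff]")


def looks_english(sentence: str) -> bool:
    """Return True when the sentence contains no character from the ranges above."""
    return NON_ENGLISH_CHARS.search(sentence) is None


# Illustrative sentences (hypothetical examples, not from the Tatoeba file).
samples = {
    "The weather is nice today.": "English",
    "Ik ben een student.": "Dutch",
    "¿Dónde está la biblioteca?": "Spanish",
    "你好,世界": "Chinese",
}

for text, language in samples.items():
    print(f"{language:8} english={looks_english(text)}  {text}")
```

The Dutch sentence passes as "English" because it happens to use only unaccented Latin letters, which is the kind of case that would show up among the misclassified non-English sentences.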




      • #4
        Thanks a lot for both responses! I am aware that it is possible to run Python code within Stata. It seems there is no foolproof way of detecting the language of a string in Stata alone, so Python may be necessary, but I would still be interested in Stata-only solutions.

        Indeed a good false negative rate for just one line of code. The false positives are problematic though.
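
To make the two error rates concrete, the following short Python sketch recomputes them from the counts reported in #3. It assumes "English" is the positive class, so a false negative is an English sentence flagged as non-English and a false positive is a non-English sentence flagged as English; the variable names are chosen for illustration.

```python
# Counts as reported in post #3 for sentences.csv from tatoeba.org.
english_total = 1_819_501
english_wrong = 1_638            # English flagged as non-English
non_english_total = 9_715_431
non_english_wrong = 2_946_576    # non-English flagged as English

# Assumption: "English" is treated as the positive class.
false_negative_rate = english_wrong / english_total
false_positive_rate = non_english_wrong / non_english_total

print(f"false negative rate: {false_negative_rate:.2%}")
print(f"false positive rate: {false_positive_rate:.2%}")
```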

        Does anyone have any further thoughts or suggestions?

        Much appreciated.
