Detecting language of string

Jurian Hendrikse

Join Date: Apr 2021

Posts: 6
#1

06 Aug 2023, 20:34

Hi all,

I have a dataset with string variables that contain text in multiple languages (e.g., obs1 can be in English, obs2 in Spanish, obs3 in Chinese, etc.).
Is it possible to detect the language used for each observation? There seem to be several Python packages (e.g., Polyglot, Langdetect) that can do so, but I have been unable to find anything equivalent for Stata.
For my purposes, it would be sufficient to detect whether the language used is English or not.

I'm using Stata version 16.

Any help is much appreciated!
Sebastian Schirner

Join Date: Jan 2023

Posts: 53
#2

07 Aug 2023, 01:29

Hey Jurian,

I am not aware of a similar package in Stata. However, Stata 16 and later can run Python code from within Stata, so you could use the packages you mentioned.

Code:

help python
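As a rough illustration of the Python route, here is a minimal sketch built on the langdetect package mentioned in #1. The helper name `is_english` and its optional `detect` argument are hypothetical and exist only for this example; the sketch assumes langdetect has been installed (`pip install langdetect`) into the Python that Stata is configured to use.

```python
from typing import Callable, Optional


def is_english(text: str, detect: Optional[Callable[[str], str]] = None) -> bool:
    """Return True if the detector labels `text` as English ("en").

    `detect` is a hypothetical hook: any callable mapping a string to a
    language code. If omitted, langdetect's `detect` is used.
    """
    if detect is None:
        # Imported lazily so the function can be defined even when the
        # third-party package is not installed.
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # langdetect is randomised; fix the seed
    try:
        return detect(text) == "en"
    except Exception:
        # langdetect raises on input with no usable features
        # (e.g. an empty string); treat that as "not English" here.
        return False
```

Inside Stata, the strings would be read and the result written back through the `sfi` module that ships with the Python integration; that part is left out because it only runs within Stata.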

Best,
Sebastian
Nils Enevoldsen

Join Date: Oct 2014

Posts: 296
#3

07 Aug 2023, 15:30

Just to see how it performed, I tried this crude heuristic on sentences.csv from tatoeba.org:

Code:

gen english = !ustrregexm(sentence,"[\u00c0-\u1fff\u2c00-\uffff]")

Of 1,819,501 English sentences, this incorrectly categorized 1,638 (99.9% correct).

Of 9,715,431 non-English sentences, this incorrectly categorized 2,946,576 (69.7% correct). Some troublesome languages were Indonesian, Tagalog, Dutch, and Italian.

So you probably shouldn't rely on this, but it is nevertheless a good false-negative rate for a one-liner.
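To make the behaviour of the one-liner easier to see, here is a sketch of the same character-class test in Python. The function name `looks_english` and the sample sentences are made up for illustration. The class matches any character in U+00C0 to U+1FFF or U+2C00 to U+FFFF, so accented Latin letters and most non-Latin scripts flag a string as non-English, while the skipped U+2000 to U+2BFF block (curly quotes, dashes, arrows and similar symbols) does not. Languages written mostly in unaccented Latin letters slip through, which is consistent with the troublesome languages listed above.

```python
import re

# Same character class as the Stata ustrregexm() call above.
NON_ENGLISH_CHARS = re.compile(r"[\u00c0-\u1fff\u2c00-\uffff]")


def looks_english(text: str) -> bool:
    """Crude heuristic: True when no character falls in the flagged ranges."""
    return NON_ENGLISH_CHARS.search(text) is None


# Illustrative sentences (not taken from the Tatoeba file).
samples = [
    "The cat is asleep.",  # English
    "El niño duerme.",     # Spanish: ñ (U+00F1) is in the flagged range
    "你好",                # Chinese: CJK characters are in the flagged range
    "De kat slaapt.",      # Dutch: plain ASCII, so a false positive
    "Il gatto dorme.",     # Italian: plain ASCII, so a false positive
]
flags = [looks_english(s) for s in samples]
print(flags)  # [True, False, False, True, True]
```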
Jurian Hendrikse

Join Date: Apr 2021

Posts: 6
#4

07 Aug 2023, 20:14

Thanks a lot for both responses! I am aware that Python code can be run from within Stata. Even though there seems to be no foolproof way of detecting the language of a string in Stata itself, so that Python seems necessary, I would still be interested in Stata-only solutions.

Indeed a good false negative rate for just one line of code. The false positives are problematic though.
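To put numbers on that, here is a small sketch that derives the error rates from the counts reported in #3. The variable names are my own, and the precision figure reflects the language mix of the Tatoeba file, which may differ a lot from any other dataset.

```python
# Counts reported in #3 for the one-line heuristic on Tatoeba's sentences.csv.
english_total = 1_819_501      # truly English sentences
english_missed = 1_638         # English sentences flagged as non-English
non_english_total = 9_715_431  # truly non-English sentences
non_english_passed = 2_946_576 # non-English sentences flagged as English

true_pos = english_total - english_missed

# Share of English sentences that were missed (false negative rate).
fn_rate = english_missed / english_total
# Share of non-English sentences that were let through (false positive rate).
fp_rate = non_english_passed / non_english_total
# Share of sentences flagged "English" that really are English (precision).
precision = true_pos / (true_pos + non_english_passed)

print(f"false negative rate: {fn_rate:.2%}")
print(f"false positive rate: {fp_rate:.2%}")
print(f"precision:           {precision:.2%}")
```

With these counts the false negative rate is about 0.09% and the false positive rate about 30.3%, so only roughly 38% of the sentences flagged as English in that file are actually English.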

Does anyone have any further thoughts or suggestions?

Much appreciated.