tiny-lang-detect

Generate tiny models for language detection  https://p.ce9e.org/tiny-lang-detect/demo/
git clone https://git.ce9e.org/tiny-lang-detect.git

NameSize
.gitignore5B
LICENSE1072B
README.md2301B
demo/demo.js1014B
demo/index.html456B
demo/style.css364B
download_data.sh440B
gen_model.py1188B
test.py2235B

tiny language detection

Language detection libraries like langdetect usually come with large models. But if we just want to distinguish between a small set of languages, the size of the model can be reduce significantly.

This is an experiment to generate tiny models that only contain the most significant n-grams needed to distinguish between two languages.

Example usage:

$ ./download_data.sh
$ python gen_model.py en de -n 10 > en_de.json
$ python test.py en_de.json
981 out of 1000 samples were detected correctly (98.1%)

A model might look like this:

{
  "ngrams": ["o", "e", "a", "en ", "er", " th", "ch", " t", "en", "ei"],
  "freq": {
    "en": [0.0716, 0.1067, 0.0897, 0.0023, 0.0135, 0.0161, 0.0036, 0.0164, 0.0079, 0.0009],
    "de": [0.0311, 0.1466, 0.0574, 0.0202, 0.0299, 0.0002, 0.0195, 0.0006, 0.0233, 0.0159]
  }
}

You can use the model like this:

def probability(p, q):
    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, q))

def classify(model, text):
    n = len(text) + 1
    freq = [text.count(g) / (n - len(g)) for g in model['ngrams']]
    return max(model['freq'], key=lambda lang: probability(freq, model['freq'][lang]))

An even simpler classifier

To take this idea to the extreme, you could reduce the model to the single most significant n-gram:

def classify(text):
    freq = text.count('o') / len(text)
    return 'en' if freq > 0.05 else 'de'

This classifier still has an accuracy of 82.1% on the test data.

How does it work?

langdetect works by comparing n-gram frequencies. For example, the 3-gram " th" is much more common in English than in German.

Before counting n-grams, it does some pre-processing, e.g. removing punctuation, URLs, or Latin characters in non-Latin texts. Then it uses Bayesian methods to find the most likely language for those frequencies.

The examples in this repo are much simpler though. They do not do any pre-processing. This is ultimately a trade-off between accuracy and simplicity.

To simplify the model, gen_model.py filters out all but the most significant n-grams. N-grams are considered more significant if their frequencies have a large absolute difference between the candidate languages.