# tiny language detection

Language detection libraries like
[langdetect](https://github.com/DoodleBears/langdetect/) usually come with
large models. But if we just want to distinguish between a small set of
languages, the size of the model can be reduced significantly.

This is an experiment to generate tiny models that only contain the most
significant n-grams needed to distinguish between two languages.

Example usage:

```sh
$ ./download_data.sh
$ python gen_model.py en de -n 10 > en_de.json
$ python test.py en_de.json
981 out of 1000 samples were detected correctly (98.1%)
```

A model might look like this:

```json
{
  "ngrams": ["o", "e", "a", "en ", "er", " th", "ch", " t", "en", "ei"],
  "freq": {
    "en": [0.0716, 0.1067, 0.0897, 0.0023, 0.0135, 0.0161, 0.0036, 0.0164, 0.0079, 0.0009],
    "de": [0.0311, 0.1466, 0.0574, 0.0202, 0.0299, 0.0002, 0.0195, 0.0006, 0.0233, 0.0159]
  }
}
```

You can use the model like this:

```py
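import math  # math.prod requires Python 3.8+

def probability(p, q):
    # Bernoulli-style likelihood of observed frequencies p under model frequencies q
    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, q))

def classify(model, text):
    n = len(text) + 1
    # n - len(g) is the number of positions where n-gram g could start;
    # max(..., 1) avoids a division by zero for texts shorter than g
    freq = [text.count(g) / max(n - len(g), 1) for g in model['ngrams']]
    return max(model['freq'], key=lambda lang: probability(freq, model['freq'][lang]))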
```
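As a worked example, here is a minimal sketch that applies the same idea to the example model shown earlier, but compares log-likelihoods instead of products. The sum makes each n-gram's contribution additive, which is easier to inspect. The names `MODEL`, `log_probability`, and `classify_log` are made up for this illustration and are not part of the repo, and the sketch assumes that every model frequency lies strictly between 0 and 1.

```py
import math

# Example model copied from the JSON shown earlier in this README.
MODEL = {
    "ngrams": ["o", "e", "a", "en ", "er", " th", "ch", " t", "en", "ei"],
    "freq": {
        "en": [0.0716, 0.1067, 0.0897, 0.0023, 0.0135, 0.0161, 0.0036, 0.0164, 0.0079, 0.0009],
        "de": [0.0311, 0.1466, 0.0574, 0.0202, 0.0299, 0.0002, 0.0195, 0.0006, 0.0233, 0.0159],
    },
}

def log_probability(p, q):
    # Logarithm of prod(qi ** pi * (1 - qi) ** (1 - pi)).
    # Assumption: 0 < qi < 1 for every model frequency.
    return sum(pi * math.log(qi) + (1 - pi) * math.log(1 - qi) for pi, qi in zip(p, q))

def classify_log(model, text):
    n = len(text) + 1
    # Same frequency estimate as in classify(): count / number of positions.
    freq = [text.count(g) / max(n - len(g), 1) for g in model["ngrams"]]
    return max(model["freq"], key=lambda lang: log_probability(freq, model["freq"][lang]))

print(classify_log(MODEL, "that is not what those two thought about it"))  # en
print(classify_log(MODEL, "ich denke nicht dass wir heute noch einen kuchen essen"))  # de
```

Because the logarithm is monotonic, this picks the same language as comparing the products directly.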

## An even simpler classifier

To take this idea to the extreme, you could reduce the model to the single most
significant n-gram:

```py
def classify(text):
    freq = text.count('o') / len(text)
    return 'en' if freq > 0.05 else 'de'
```

This classifier still has an accuracy of 82.1% on the test data.

## How does it work?

`langdetect` works by comparing n-gram frequencies. For example, the 3-gram
" th" is much more common in English than in German.
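To make the counting step concrete, here is a rough sketch of how n-gram frequencies could be computed for a piece of text. It is not `langdetect`'s actual code. The function name `ngram_frequencies` and the `max_len` parameter are assumptions made for this illustration.

```py
from collections import Counter

def ngram_frequencies(text, max_len=3):
    """Relative frequency of every n-gram of length 1..max_len in text."""
    freq = {}
    for k in range(1, max_len + 1):
        total = len(text) - k + 1  # number of positions where a k-gram can start
        if total <= 0:
            break  # text is shorter than k characters
        counts = Counter(text[i:i + k] for i in range(total))
        for gram, count in counts.items():
            freq[gram] = count / total
    return freq

english = ngram_frequencies("that is not what those two thought about it")
german = ngram_frequencies("ich denke nicht dass wir heute noch einen kuchen essen")

print(round(english[" th"], 4))  # 0.0488 (2 occurrences in 41 positions)
print(german.get(" th", 0.0))    # 0.0
```

Note that this sketch counts overlapping occurrences, whereas `str.count` skips overlaps. The two only differ for n-grams that can overlap with themselves, such as "aa".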

Before counting n-grams, it does some pre-processing, e.g. removing
punctuation, URLs, or Latin characters in non-Latin texts. Then it uses
Bayesian methods to find the most likely language for those frequencies.

The examples in this repo are much simpler though. They do not do any
pre-processing. This is ultimately a trade-off between accuracy and simplicity.

To simplify the model, `gen_model.py` filters out all but the most significant
n-grams. N-grams are considered more significant if their frequencies have a
large absolute difference between the candidate languages.
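
The selection step could be sketched roughly as follows. This is an illustration of the ranking idea described above, not necessarily how `gen_model.py` implements it. The function name `select_ngrams` and its parameters are hypothetical.

```py
def select_ngrams(freq_a, freq_b, n):
    """Return the n n-grams whose frequencies differ most between two languages."""
    grams = set(freq_a) | set(freq_b)

    def difference(gram):
        # An n-gram missing from one language counts as frequency 0 there.
        return abs(freq_a.get(gram, 0.0) - freq_b.get(gram, 0.0))

    # Largest difference first; ties are broken alphabetically for determinism.
    return sorted(grams, key=lambda g: (-difference(g), g))[:n]


# Rounded frequencies taken from the example model earlier in this README.
ngrams = ["o", "e", "a", "en ", "er", " th", "ch", " t", "en", "ei"]
en = dict(zip(ngrams, [0.0716, 0.1067, 0.0897, 0.0023, 0.0135, 0.0161, 0.0036, 0.0164, 0.0079, 0.0009]))
de = dict(zip(ngrams, [0.0311, 0.1466, 0.0574, 0.0202, 0.0299, 0.0002, 0.0195, 0.0006, 0.0233, 0.0159]))

print(select_ngrams(en, de, 3))  # ['o', 'e', 'a']
```

Applied to the example model, this ranking yields the same first three n-grams as the `ngrams` list in the JSON.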