tiny-lang-detect: Generate tiny models for language detection

commit: 281071942a7647e36686aae2deb2511b2079b637
parent: 2479e1d6933082faf7d4fc5527d76f55c5087f3e
Author: Tobias Bengfort <tobias.bengfort@posteo.de>
Date: 2025-05-10 18:07

add some explanation

Diffstat

README.md

+++++++++++++++++

1 files changed, 17 insertions, 0 deletions

diff --git a/README.md b/README.md

@@ -31,3 +31,20 @@ A model might look like this:
   31    31 
   32    32 For examples of how to use a model to classify languages, see `test.py` and
   33    33 `demo/demo.js`.
   -1    34 
   -1    35 ## How does it work?
   -1    36 
   -1    37 `langdetect` works by comparing n-gram frequencies. For example, the 3-gram
   -1    38 " th" is much more common in English than in German.
   -1    39 
   -1    40 Before counting n-grams, it does some pre-processing, e.g. removing
   -1    41 punctuation, URLs, or Latin characters in non-Latin texts. Then it uses Bayesian
   -1    42 methods to find the most likely language for those frequencies.
   -1    43 
   -1    44 The examples in this repo are much simpler, though. They do not do any
   -1    45 pre-processing, and they use the Euclidean distance to find the best match.
   -1    46 This is ultimately a tradeoff between accuracy and complexity.
   -1    47 
   -1    48 To simplify the model, `gen_model.py` filters out all but the most significant
   -1    49 n-grams. An n-gram is considered more significant the larger the absolute
   -1    50 difference between its frequencies in the candidate languages.
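
The approach described in the added README section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `gen_model.py` or `test.py`: the function names (`ngram_freqs`, `select_ngrams`, `build_model`, `classify`), the `limit` parameter, and the use of "largest minus smallest frequency" as the significance measure for more than two languages are all hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def ngram_freqs(text, n=3):
    """Return the relative frequency of each character n-gram in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def select_ngrams(profiles, limit):
    """Keep the n-grams whose frequencies differ most between languages.

    Assumption: significance is the largest minus the smallest frequency
    across all languages, which equals the absolute difference when there
    are exactly two languages.
    """
    grams = set().union(*profiles.values())

    def spread(gram):
        values = [p.get(gram, 0.0) for p in profiles.values()]
        return max(values) - min(values)

    # Sort by descending spread; break ties alphabetically for determinism.
    return sorted(grams, key=lambda g: (-spread(g), g))[:limit]


def build_model(samples, limit=50):
    """Build one frequency vector per language over the selected n-grams."""
    profiles = {lang: ngram_freqs(text) for lang, text in samples.items()}
    grams = select_ngrams(profiles, limit)
    return {lang: {g: p.get(g, 0.0) for g in grams}
            for lang, p in profiles.items()}


def classify(model, text):
    """Return the language whose vector is closest in Euclidean distance."""
    freqs = ngram_freqs(text)

    def distance(vector):
        return sqrt(sum((freqs.get(g, 0.0) - v) ** 2
                        for g, v in vector.items()))

    return min(model, key=lambda lang: distance(model[lang]))
```

A caller would pass a mapping from language code to sample text into `build_model` and then call `classify(model, some_text)`. As the README notes, no pre-processing is applied here, so punctuation and casing affect the counts directly.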