tiny-lang-detect: Generate tiny models for language detection

commit: 430f0819379897f02e6e8af12bb744f17030b2e1
parent: e160d8cb04fb6d6e6c762edcb27cbc4f704fb1d1
Author: Tobias Bengfort <tobias.bengfort@posteo.de>
Date: 2025-05-10 18:34

tweak readme

Diffstat

README.md | 8 ++++----

1 file changed, 4 insertions, 4 deletions

diff --git a/README.md b/README.md

@@ -35,15 +35,15 @@ For examples how to use a model to classify languages, see `test.py` and
 ## How does it work?
 
 `langdetect` works by comparing n-gram frequencies. For example, the 3-gram
-" th" is much more common in english than in german.
+" th" is much more common in English than in German.
 
 Before counting n-grams, it does some pre-processing, e.g. removing
-punctuation, URLs, or latin characters in non-latin texts. The it uses bayesian
-methods to find the most likely language for those frequencies.
+punctuation, URLs, or Latin characters in non-Latin texts. Then it uses
+Bayesian methods to find the most likely language for those frequencies.
 
 The examples in this repo are much simpler though. They do not do any
 pre-processing, and they use the euclidean distance to find the best match.
-This is ultimately a tradeoff between accuracy and complexity.
+This is ultimately a trade-off between accuracy and simplicity.
 
 To simplify the model, `gen_model.py` filters out all but the most significant
 n-grams. N-grams are considered more significant if the absolute difference of
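
As an illustration of the simpler approach described in the README text above (character n-gram frequencies compared by Euclidean distance, with no pre-processing), here is a minimal Python sketch. It is not the code from `test.py` or `gen_model.py`; the function names `ngram_freqs`, `distance`, and `detect`, as well as the toy training sentences, are assumptions made for this example.

```python
import math
from collections import Counter


def ngram_freqs(text, n=3):
    """Relative frequency of each character n-gram in text (no pre-processing)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def distance(a, b):
    """Euclidean distance between two sparse frequency vectors (dicts)."""
    grams = set(a) | set(b)
    return math.sqrt(sum((a.get(g, 0.0) - b.get(g, 0.0)) ** 2 for g in grams))


def detect(text, models):
    """Return the language whose model is closest to the text's n-gram frequencies."""
    freqs = ngram_freqs(text)
    return min(models, key=lambda lang: distance(freqs, models[lang]))


# Hypothetical toy models built from one sentence each; a real model would be
# trained on far more text.
models = {
    "en": ngram_freqs("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": ngram_freqs("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

print(detect("the dog and the fox", models))
print(detect("der hund und die katze", models))
```

With models this small the result is only reliable for inputs that share many 3-grams with the training sentence; accuracy depends on the amount and variety of training text.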