- commit: 430f0819379897f02e6e8af12bb744f17030b2e1
- parent: e160d8cb04fb6d6e6c762edcb27cbc4f704fb1d1
- Author: Tobias Bengfort <tobias.bengfort@posteo.de>
- Date: 2025-05-10 18:34
tweak readme
Diffstat
| M | README.md | 8 | ++++---- |
1 files changed, 4 insertions, 4 deletions
diff --git a/README.md b/README.md
@@ -35,15 +35,15 @@ For examples how to use a model to classify languages, see `test.py` and
 ## How does it work?
 
 `langdetect` works by comparing n-gram frequencies. For example, the 3-gram
-" th" is much more common in english than in german.
+" th" is much more common in English than in German.
 
 Before counting n-grams, it does some pre-processing, e.g. removing
-punctuation, URLs, or latin characters in non-latin texts. The it uses bayesian
-methods to find the most likely language for those frequencies.
+punctuation, URLs, or Latin characters in non-Latin texts. Then it uses
+Bayesian methods to find the most likely language for those frequencies.
 
 The examples in this repo are much simpler though. They do not do any
 pre-processing, and they use the euclidean distance to find the best match.
-This is ultimately a tradeoff between accuracy and complexity.
+This is ultimately a trade-off between accuracy and simplicity.
 
 To simplify the model, `gen_model.py` filters out all but the most significant
 n-grams. N-grams are considered more significant if the absolute difference of
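
For readers of this diff who do not have the repository at hand, the simplified approach the README text describes (n-gram frequencies compared by Euclidean distance, with no pre-processing) might look roughly like the sketch below. This is not the code from `test.py` or `gen_model.py`. The function names, the two sample sentences, and the toy profiles are all assumptions made for illustration, and a real model would be built from far more text.

```python
from collections import Counter
from math import sqrt


def ngram_freqs(text, n=3):
    """Return the relative frequency of each character n-gram in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def euclidean(a, b):
    """Euclidean distance between two sparse frequency dicts."""
    keys = a.keys() | b.keys()
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def classify(text, profiles):
    """Return the language whose profile is closest to the text."""
    freqs = ngram_freqs(text)
    return min(profiles, key=lambda lang: euclidean(freqs, profiles[lang]))


# Hypothetical one-sentence "training data"; far too small for real use.
EN_SAMPLE = "the dog and the cat are in the house with the other things"
DE_SAMPLE = "der hund und die katze sind in dem haus mit den anderen dingen"

profiles = {
    "en": ngram_freqs(EN_SAMPLE),
    "de": ngram_freqs(DE_SAMPLE),
}
```

With these toy profiles, `classify(EN_SAMPLE, profiles)` returns `"en"`, because the distance from a text to its own profile is zero. Results on unseen text depend entirely on how much training data the profiles are built from.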