- commit: 281071942a7647e36686aae2deb2511b2079b637
- parent: 2479e1d6933082faf7d4fc5527d76f55c5087f3e
- Author: Tobias Bengfort <tobias.bengfort@posteo.de>
- Date: 2025-05-10 18:07
add some explanation
Diffstat
| M | README.md | 17 | +++++++++++++++++ |
1 file changed, 17 insertions, 0 deletions
diff --git a/README.md b/README.md
@@ -31,3 +31,20 @@ A model might look like this:
 
 For examples how to use a model to classify languages, see `test.py` and
 `demo/demo.js`.
+
+## How does it work?
+
+`langdetect` works by comparing n-gram frequencies. For example, the 3-gram
+" th" is much more common in English than in German.
+
+Before counting n-grams, it does some pre-processing, e.g. removing
+punctuation, URLs, or Latin characters in non-Latin texts. Then it uses Bayesian
+methods to find the most likely language for those frequencies.
+
+The examples in this repo are much simpler though. They do not do any
+pre-processing, and they use the Euclidean distance to find the best match.
+This is ultimately a tradeoff between accuracy and complexity.
+
+To simplify the model, `gen_model.py` filters out all but the most significant
+n-grams. N-grams are considered more significant if the absolute difference of
+their frequencies in the candidate languages is big.
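
The simplified approach described in the added README section (n-gram frequencies, Euclidean distance, keeping only the most significant n-grams) could be sketched roughly as follows. This is an illustration only, not the code in `test.py` or `gen_model.py`: the function names, the two tiny training sentences, and the choice of "largest frequency minus smallest frequency across languages" as the significance measure are all assumptions made for this sketch.

```python
import math
from collections import Counter


def ngram_frequencies(text, n=3):
    """Relative frequencies of character n-grams, without any pre-processing."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def euclidean_distance(freqs, profile):
    """Euclidean distance between two frequency dicts; missing n-grams count as 0."""
    grams = set(freqs) | set(profile)
    return math.sqrt(
        sum((freqs.get(g, 0.0) - profile.get(g, 0.0)) ** 2 for g in grams)
    )


def most_significant(profiles, keep):
    """Keep the `keep` n-grams whose frequency differs most between languages.

    Assumption: significance is max frequency minus min frequency across all
    candidate languages. The real `gen_model.py` may measure it differently.
    """
    grams = set().union(*profiles.values())

    def spread(gram):
        values = [profile.get(gram, 0.0) for profile in profiles.values()]
        return max(values) - min(values)

    selected = sorted(grams, key=spread, reverse=True)[:keep]
    return {
        lang: {g: profile.get(g, 0.0) for g in selected}
        for lang, profile in profiles.items()
    }


def classify(text, profiles):
    """Return the language whose profile is closest to the text's frequencies."""
    freqs = ngram_frequencies(text)
    return min(profiles, key=lambda lang: euclidean_distance(freqs, profiles[lang]))


# Hypothetical toy training data; a real model would use much larger corpora.
profiles = {
    "en": ngram_frequencies(
        "the quick brown fox jumps over the lazy dog and then the other thing"
    ),
    "de": ngram_frequencies(
        "der schnelle braune fuchs springt über den faulen hund und dann das andere ding"
    ),
}
small_model = most_significant(profiles, keep=20)

print(classify("the thing", profiles))
print(classify("the thing", small_model))
```

Filtering shrinks the model at the cost of discarding n-grams that separate the languages only weakly, which is the accuracy versus complexity tradeoff the README text mentions.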