tiny-lang-detect: Generate tiny models for language detection

commit: 60da7d4b74c54273652a4bbdb84408bb9f055886
parent: a3e48ff6c5886c46a18cbb9fa15646a3b1ce4041
Author: Tobias Bengfort <tobias.bengfort@posteo.de>
Date: 2025-05-06 06:22

add README

Diffstat

README.md

+++++++++++++++++++++

1 files changed, 21 insertions, 0 deletions

diff --git a/README.md b/README.md

@@ -0,0 +1,21 @@
# tiny language detection

Language detection libraries like
[langdetect](https://github.com/DoodleBears/langdetect/) usually come with
large models. But if we just want to distinguish between a small set of
languages, the size of the model can be reduced significantly.

This is an experiment to generate tiny models that only contain the most
significant n-grams needed to distinguish between two languages.

Example usage:

```sh
$ ./download_data.sh
$ python gen_model.py en de -n 10 > en_de.json
$ python test.py en_de.json
overall correctness 96.3% (1000)
```

For examples of how to use a model to classify languages, see `test.py` and
`demo/demo.js`.
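
To illustrate the general idea of keeping only the most significant n-grams, here is a minimal sketch. It is not the code from `gen_model.py` or `test.py`, and the model layout (a plain mapping from n-gram to weight) is an assumption rather than the actual JSON format. The function names `ngrams`, `build_model` and `classify`, the log-ratio scoring and the toy training strings are all hypothetical.

```python
import math
from collections import Counter


def ngrams(text, size=3):
    """Return the overlapping character n-grams of the padded, lowercased text."""
    text = f" {text.lower()} "
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def build_model(sample_a, sample_b, keep=10, size=3):
    """Keep only the `keep` n-grams that best separate the two samples.

    Assumption: "most significant" is taken to mean the largest absolute
    log-ratio of relative frequencies. The real project may rank differently.
    """
    count_a = Counter(ngrams(sample_a, size))
    count_b = Counter(ngrams(sample_b, size))
    total_a = sum(count_a.values())
    total_b = sum(count_b.values())
    weights = {}
    for gram in set(count_a) | set(count_b):
        # Add one to each count so n-grams unseen in one sample do not yield log(0).
        p_a = (count_a[gram] + 1) / (total_a + 1)
        p_b = (count_b[gram] + 1) / (total_b + 1)
        weights[gram] = math.log(p_a / p_b)
    # Sort by significance; break ties alphabetically so the result is stable.
    top = sorted(weights, key=lambda g: (-abs(weights[g]), g))[:keep]
    return {gram: weights[gram] for gram in top}


def classify(model, text, labels=("a", "b"), size=3):
    """Sum the weights of known n-grams; a non-negative score means the first label."""
    score = sum(model.get(gram, 0.0) for gram in ngrams(text, size))
    return labels[0] if score >= 0 else labels[1]


if __name__ == "__main__":
    # Toy training data, far too small for real use.
    model = build_model("the the the the", "der der der der", keep=6)
    print(classify(model, "the", labels=("en", "de")))
    print(classify(model, "der", labels=("en", "de")))
```

Because the model only stores `keep` entries, its size depends on that parameter and not on the size of the training data. A text that contains none of the stored n-grams scores zero and falls back to the first label.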