- commit: 60da7d4b74c54273652a4bbdb84408bb9f055886
- parent: a3e48ff6c5886c46a18cbb9fa15646a3b1ce4041
- Author: Tobias Bengfort <tobias.bengfort@posteo.de>
- Date: 2025-05-06 06:22
add README
Diffstat
| A | README.md | 21 | +++++++++++++++++++++ |
1 files changed, 21 insertions, 0 deletions
diff --git a/README.md b/README.md
@@ -0,0 +1,21 @@
+# tiny language detection
+
+Language detection libraries like
+[langdetect](https://github.com/DoodleBears/langdetect/) usually come with
+large models. But if we just want to distinguish between a small set of
+languages, the size of the model can be reduced significantly.
+
+This is an experiment to generate tiny models that only contain the most
+significant n-grams needed to distinguish between two languages.
+
+Example usage:
+
+```sh
+$ ./download_data.sh
+$ python gen_model.py en de -n 10 > en_de.json
+$ python test.py en_de.json
+overall correctness 96.3% (1000)
+```
+
+For examples of how to use a model to classify languages, see `test.py` and
+`demo/demo.js`.
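
The README describes the idea (keep only the most significant n-grams for a language pair) but this commit does not show how `gen_model.py` selects them. The following is a minimal sketch of one way this could work, not the repository's actual implementation. It assumes character trigrams, scores each n-gram by the difference in relative frequency between the two corpora, and keeps the `n` largest differences. The function names, the model layout (`size` plus `weights`), and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(text, size=3):
    """Return the character n-grams of text, padded with one space per side."""
    padded = f" {text.lower()} "
    return [padded[i:i + size] for i in range(len(padded) - size + 1)]


def relative_frequencies(samples, size=3):
    """Map each n-gram to its share of all n-grams seen in samples."""
    counts = Counter(g for s in samples for g in ngrams(s, size))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}


def gen_model(samples_a, samples_b, n=10, size=3):
    """Hypothetical model builder: keep the n most discriminating n-grams.

    A positive weight means the n-gram is more frequent in samples_a,
    a negative weight means it is more frequent in samples_b.
    """
    freq_a = relative_frequencies(samples_a, size)
    freq_b = relative_frequencies(samples_b, size)
    weights = {
        g: freq_a.get(g, 0.0) - freq_b.get(g, 0.0)
        for g in freq_a.keys() | freq_b.keys()
    }
    # Sort by absolute weight, breaking ties alphabetically so the
    # result does not depend on set iteration order.
    top = sorted(weights, key=lambda g: (-abs(weights[g]), g))[:n]
    return {"size": size, "weights": {g: weights[g] for g in top}}


def classify(model, text, labels=("en", "de")):
    """Sum the weights of the known n-grams in text and pick a label."""
    score = sum(
        model["weights"].get(g, 0.0) for g in ngrams(text, model["size"])
    )
    return labels[0] if score >= 0 else labels[1]
```

A model built this way is a plain dictionary, so it could be written out with `json.dumps(model)`, in the spirit of the `en_de.json` file in the usage example. Whether the real model file uses this layout is not confirmed by this commit.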