tiny-lang-detect

Generate tiny models for language detection  https://p.ce9e.org/tiny-lang-detect/demo/
git clone https://git.ce9e.org/tiny-lang-detect.git

commit
2f92e8aef5503af2ab121000b74fe1c1b3bf160a
parent
27981b2f5adbba86e975add9b04cc0df32c62c9e
Author
Tobias Bengfort <tobias.bengfort@posteo.de>
Date
2025-05-12 07:08
README: add single-ngram classifier

Diffstat

M README.md 13 +++++++++++++

1 files changed, 13 insertions, 0 deletions


diff --git a/README.md b/README.md

@@ -42,6 +42,19 @@ def classify(model, text):
   42    42     return min(model['freq'], key=lambda lang: dist(model['freq'][lang], freq))
   43    43 ```
   44    44 
   -1    45 ## An even simpler classifier
   -1    46 
   -1    47 To take this idea to the extreme, you could reduce the model to the single most
   -1    48 significant n-gram:
   -1    49 
   -1    50 ```py
   -1    51 def classify(text):
   -1    52     freq = text.count('o') / len(text)
   -1    53     return 'en' if freq > 0.05 else 'de'
   -1    54 ```
   -1    55 
   -1    56 This classifier still has an accuracy of 82.1% on the test data.
   -1    57 
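The single-n-gram classifier above divides by `len(text)`, so it would raise `ZeroDivisionError` on an empty string. Below is a minimal sketch of a guarded variant, together with a hypothetical `accuracy` helper that shows how a figure like the one quoted above could be measured. The `threshold` parameter, the fallback to `'de'` for empty input, the helper, and the sample sentences are all assumptions made for illustration; they are not part of this commit or the project's test data.

```py
def classify(text, threshold=0.05):
    # Assumption: empty input falls back to 'de', the branch the original
    # takes whenever the frequency is not above the threshold.
    if not text:
        return 'de'
    # Relative frequency of the 1-gram 'o' in the text.
    freq = text.count('o') / len(text)
    return 'en' if freq > threshold else 'de'


def accuracy(samples):
    # Hypothetical helper: samples is a list of (text, expected_lang) pairs.
    correct = sum(classify(text) == lang for text, lang in samples)
    return correct / len(samples)


# Made-up sentences for illustration only, not the project's test data.
samples = [
    ('Hello world, how are you today?', 'en'),
    ('Ich bin heute sehr müde.', 'de'),
]
print(accuracy(samples))
```

Running `accuracy` over a labelled English/German corpus instead of these two sentences would give a number comparable to the one reported above.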
   45    58 ## How does it work?
   46    59 
   47    60 `langdetect` works by comparing n-gram frequencies. For example, the 3-gram