tiny-lang-detect

Generate tiny models for language detection  https://p.ce9e.org/tiny-lang-detect/demo/
git clone https://git.ce9e.org/tiny-lang-detect.git

commit
fe18df939733a8d8e6286631930220e1ca5c1315
parent
9c128468b32a9d10a1ac0dea0908e38d036832df
Author
Tobias Bengfort <tobias.bengfort@posteo.de>
Date
2025-05-26 15:40
probability: assume frequencies to be independent

For dependent probabilities that sum to 1, we get the total probability
by calculating:

n! * \prod(q_i^{k_i} / k_i!)

If we want to compare two of these probabilities, we can leave out
anything that does not depend on q and take the n-th root, which does not
change the ordering. With $p_i = k_i/n$, this results in:

\prod(q_i^{p_i})
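
The reduction above can be checked numerically. The following is a minimal sketch, not code from this repository: the function names `multinomial` and `reduced` and the toy counts are made up for illustration. It shows that dropping the q-independent factors and taking the n-th root keeps the ranking of two candidate distributions.

```python
import math


def multinomial(k, q):
    """Full multinomial probability: n! * prod(q_i^k_i / k_i!)."""
    n = sum(k)
    return math.factorial(n) * math.prod(
        qi ** ki / math.factorial(ki) for ki, qi in zip(k, q)
    )


def reduced(k, q):
    """Reduced score prod(q_i^p_i) with p_i = k_i / n."""
    n = sum(k)
    return math.prod(qi ** (ki / n) for ki, qi in zip(k, q))


# Hypothetical toy data: observed counts and two candidate distributions.
k = [3, 1, 1]
q_a = [0.6, 0.2, 0.2]
q_b = [0.2, 0.4, 0.4]

# Both scores rank the two candidates the same way.
same_order = (multinomial(k, q_a) > multinomial(k, q_b)) == (
    reduced(k, q_a) > reduced(k, q_b)
)
print(same_order)
```

The reduced score is the n-th root of the full probability divided by the multinomial coefficient, so the two can only differ by a monotonic transformation.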

This is what we have done so far. However, our probabilities do not sum to
1. If we only look at 1-grams, we can add an "everything else" factor:

(1 - \sum q_i)^{1 - \sum p_i}

While 1-grams, 2-grams and 3-grams are not totally independent, they are
also not dependent in the sense that one implies not-the-other. So we
could calculate the probability of each group individually and multiply
the results.

In practice, just treating everything as independent seems to work just
as well and is much simpler.
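
As a rough illustration of the difference, the sketch below compares the previous formula with the independent one on made-up frequencies. The names `probability_old` and `probability_independent` and all numbers are hypothetical. The formulas themselves are the ones from the diff, without the smoothing used in test.py.

```python
import math


def probability_old(p, q):
    # previous formula: prod(q_i^p_i)
    return math.prod(qi ** pi for pi, qi in zip(p, q))


def probability_independent(p, q):
    # new formula: each n-gram is treated as an independent yes/no event
    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, q))


# Hypothetical toy frequencies, not taken from any real model.
observed = [0.10, 0.02]
model_a = [0.10, 0.02]  # matches the observed frequencies exactly
model_b = [0.50, 0.40]  # much larger frequencies for the same n-grams

# Without the (1 - q) factor, larger q always scores higher.
print(probability_old(observed, model_a) < probability_old(observed, model_b))

# With the (1 - q) factor, each term is largest at q_i == p_i.
print(
    probability_independent(observed, model_a)
    > probability_independent(observed, model_b)
)
```

In this toy case the old formula prefers `model_b`, while the independent formula prefers the exact match `model_a`. Real models may not show such an extreme difference.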

Diffstat

M README.md 4 ++--
M demo/demo.js 2 +-
M test.py 9 +++------

3 files changed, 6 insertions, 9 deletions


diff --git a/README.md b/README.md

@@ -14,7 +14,7 @@ Example usage:
 $ ./download_data.sh
 $ python gen_model.py en de -n 10 > en_de.json
 $ python test.py en_de.json
-984 out of 1000 samples were detected correctly (98.4%)
+981 out of 1000 samples were detected correctly (98.1%)
 ```
 
 A model might look like this:
@@ -33,7 +33,7 @@ You can use the model like this:
 
 ```py
 def probability(p, q):
-    return math.prod(qi ** pi for pi, qi in zip(p, q))
+    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, q))
 
 def classify(model, text):
     n = len(text) + 1

diff --git a/demo/demo.js b/demo/demo.js

@@ -14,7 +14,7 @@ var prod = a => a.reduce((s, v) => s * v, 1);
 var max = (a, key) => a.reduce((m, v) => !m || key(v) > key(m) ? v : m, null);
 
 var probability = (p, q) => {
-    return prod(p.map((pi, i) => Math.pow(q[i], pi)));
+    return prod(p.map((pi, i) => Math.pow(q[i], pi) * Math.pow(1 - q[i], 1 - pi)));
 };
 
 var classify = text => {

diff --git a/test.py b/test.py

@@ -62,13 +62,10 @@ LANG_MAP = {
 
 
 def probability(p, q):
-    if len(p) == 1:
-        p = [p[0], 1 - p[0]]
-        q = [q[0], 1 - q[0]]
-
     # 0 does not mean impossible, just very unlikely
-    qq = [qi + 0.0000001 for qi in q]
-    return math.prod(qi ** pi for pi, qi in zip(p, qq))
+    a = 0.0000001
+    qq = [qi * (1 - 2 * a) + a for qi in q]
+    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, qq))
 
 
 def classify(model, text):