- commit: fe18df939733a8d8e6286631930220e1ca5c1315
- parent: 9c128468b32a9d10a1ac0dea0908e38d036832df
- Author: Tobias Bengfort <tobias.bengfort@posteo.de>
- Date: 2025-05-26 15:40
probability: assume frequencies to be independent
For mutually exclusive outcomes whose probabilities q_i sum to 1, the
probability of seeing outcome i exactly k_i times in n observations is:
n! * \prod(q_i^{k_i} / k_i!)
If we want to compare two of these probabilities, we can leave out
anything that does not depend on q. With $p_i = k_i/n$, this results in:
\prod(q_i^{p_i})
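The step from the multinomial to this product can be spelled out as below. This is a reconstruction of the omitted reasoning, not text from the commit; it assumes two candidate distributions q and q' are compared on the same counts k_i.

```latex
% The factor n! / \prod_i k_i! is the same for q and q', so it cancels:
\frac{P(k \mid q)}{P(k \mid q')}
  = \frac{n! \prod_i q_i^{k_i} / k_i!}{n! \prod_i {q'_i}^{k_i} / k_i!}
  = \frac{\prod_i q_i^{k_i}}{\prod_i {q'_i}^{k_i}}

% x \mapsto x^{1/n} is monotonic for x > 0, so taking the n-th root
% keeps the ordering and replaces the counts by relative frequencies:
\left( \prod_i q_i^{k_i} \right)^{1/n}
  = \prod_i q_i^{k_i / n}
  = \prod_i q_i^{p_i}
```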
This is what we have done so far. However, our probabilities do not sum to
1. If we only look at 1-grams, we can add an "everything else" factor:
(1 - \sum q_i)^{1 - \sum p_i}
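A minimal sketch of the 1-gram formula with the "everything else" factor follows. The function name `probability_with_rest` is hypothetical and does not appear in the repository, and the sketch assumes that both `sum(p)` and `sum(q)` are below 1.

```python
import math


def probability_with_rest(p, q):
    """Product of q_i^p_i times one factor for all unlisted outcomes.

    Hypothetical helper: assumes sum(p) < 1 and sum(q) < 1.
    """
    listed = math.prod(qi ** pi for pi, qi in zip(p, q))
    # Probability mass and observed frequency of "everything else".
    rest_q = 1 - sum(q)
    rest_p = 1 - sum(p)
    return listed * rest_q ** rest_p


# With a single entry this reduces to q^p * (1 - q)^(1 - p).
print(probability_with_rest([0.4], [0.25]))
```

With a single entry, this matches the special case that the old `test.py` handled by expanding `p` and `q` to two elements.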
While 1-grams, 2-grams and 3-grams are not totally independent, they are
also not mutually exclusive: seeing one does not rule out the others. So
we could calculate the probability of each group individually and
multiply the results.
In practice, treating everything as independent seems to work just as
well and is much simpler.
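To illustrate what treating every frequency as independent changes, the sketch below compares the two scoring functions. The bodies mirror the README versions before and after this commit; the names `probability_old` and `probability_new` and the example frequencies are made up for illustration.

```python
import math


def probability_old(p, q):
    # Score before this commit: only the q_i^p_i factors.
    return math.prod(qi ** pi for pi, qi in zip(p, q))


def probability_new(p, q):
    # Score after this commit: each n-gram is an independent
    # "present / absent" event with its own (1 - q_i) factor.
    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, q))


observed = [0.3, 0.05]   # made-up observed frequencies p_i
matching = [0.3, 0.05]   # model frequencies equal to the observation
inflated = [0.9, 0.9]    # model that claims every n-gram is very common

for name, q in [("matching", matching), ("inflated", inflated)]:
    print(name, probability_old(observed, q), probability_new(observed, q))
```

Each factor `q ** p * (1 - q) ** (1 - p)` is largest at `q == p`, so the new score favors the matching model, whereas `q ** p` alone grows with `q` and favors the inflated one.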
Diffstat
| M | README.md | 4 | ++-- |
| M | demo/demo.js | 2 | +- |
| M | test.py | 9 | +++------ |
3 files changed, 6 insertions, 9 deletions
diff --git a/README.md b/README.md
@@ -14,7 +14,7 @@ Example usage:
 $ ./download_data.sh
 $ python gen_model.py en de -n 10 > en_de.json
 $ python test.py en_de.json
-984 out of 1000 samples were detected correctly (98.4%)
+981 out of 1000 samples were detected correctly (98.1%)
 ```
 
 A model might look like this:
@@ -33,7 +33,7 @@ You can use the model like this:
 
 ```py
 def probability(p, q):
-    return math.prod(qi ** pi for pi, qi in zip(p, q))
+    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, q))
 
 def classify(model, text):
     n = len(text) + 1
diff --git a/demo/demo.js b/demo/demo.js
@@ -14,7 +14,7 @@ var prod = a => a.reduce((s, v) => s * v, 1);
 var max = (a, key) => a.reduce((m, v) => !m || key(v) > key(m) ? v : m, null);
 
 var probability = (p, q) => {
-    return prod(p.map((pi, i) => Math.pow(q[i], pi)));
+    return prod(p.map((pi, i) => Math.pow(q[i], pi) * Math.pow(1 - q[i], 1 - pi)));
 };
 
 var classify = text => {
diff --git a/test.py b/test.py
@@ -62,13 +62,10 @@ LANG_MAP = {
 
 
 def probability(p, q):
-    if len(p) == 1:
-        p = [p[0], 1 - p[0]]
-        q = [q[0], 1 - q[0]]
-
     # 0 does not mean impossible, just very unlikely
-    qq = [qi + 0.0000001 for qi in q]
-    return math.prod(qi ** pi for pi, qi in zip(p, qq))
+    a = 0.0000001
+    qq = [qi * (1 - 2 * a) + a for qi in q]
+    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, qq))
 
 
 def classify(model, text):
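The commit does not say why the smoothing changed from `qi + a` to `qi * (1 - 2 * a) + a`. One plausible reading, sketched below, is that the new `(1 - qi)` factor needs `qi` to stay strictly below 1 as well as above 0. The helper names `smooth` and `shift` are hypothetical and only wrap the two expressions from the diff.

```python
import math

A = 0.0000001  # same constant as `a` in test.py


def smooth(q, a=A):
    # New smoothing: affine map from [0, 1] onto [a, 1 - a].
    return [qi * (1 - 2 * a) + a for qi in q]


def shift(q, a=A):
    # Previous smoothing: plain shift from [0, 1] to [a, 1 + a].
    return [qi + a for qi in q]


extremes = [0.0, 1.0]
smoothed = smooth(extremes)
shifted = shift(extremes)

# After the affine map, both qi and 1 - qi are strictly positive.
print(all(0 < qi < 1 for qi in smoothed))

# After the plain shift, q = 1 ends up above 1, so 1 - qi is negative and
# a fractional power of it is a complex number in Python 3.
print(type((1 - shifted[1]) ** 0.5))
```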