Came across Bloom filter during a presentation, and it seems that I am the only one who doesn’t know it. (Bummer.)
The above wiki page is pretty straightforward, and I could follow most of it except the derivation of optimal
k = m/n * ln(2), the number of hash
Let’s formulate it in pure mathematics:
f(k) = (1 - e^(-kn/m))^k, find
k to minimize
For the sake of easy derivation, let’s substitute
k = -m/n*ln(p).
ln f = e ^ [k * ln(1-p)] = e ^ [-m/n * ln(p) * ln(1-p)]
f is equivalent to maximizing
g = ln(p) * ln(1-p), due the negative sign before
m. According to the domain of
0 < p < 1. In other words, we would like to find
(0, 1), which maximizes
g = ln(p) * ln(1-p).
g is a composed function, and we could see how it looks like using wolframalpha*ln(1-x)). After
seeing its plot, we just need to solve
g'(p) == 0.
g'(p) = 1/p * ln(1-p) + ln(p) * (-1/1-p) = 0 =>
p*ln(p) = (1-p)*ln(1-p). I don’t know how to find all roots for this equation, but one simple root
p = 1-p, i.e.
p = 1/2. Then, we know
k = m/n *ln(2) using our original substitution. Q.E.D.