Bloom Filter
Came across Bloom filter during a presentation, and it seems that I am the only one who doesn’t know it. (Bummer.)
The above wiki page is pretty straightforward, and I could follow most of it except the derivation of optimal k = m/n * ln(2)
, the number of hash
functions.
Let’s formulate it in pure mathematics:
For f(k) = (1 - e^(-kn/m))^k
, find k
to minimize f(k)
.
For the sake of easy derivation, let’s substitute e^(-kn/m)
with p
, then k = -m/n*ln(p)
.
ln f = e ^ [k * ln(1-p)] = e ^ [-m/n * ln(p) * ln(1-p)]
Therefore, minimizing f
is equivalent to maximizing g = ln(p) * ln(1-p)
, due the negative sign before m
. According to the domain of ln
, we
know that 0 < p < 1
. In other words, we would like to find p
between (0, 1)
, which maximizes g = ln(p) * ln(1-p)
.
g
is a composed function, and we could see how it looks like using wolframalpha. After
seeing its plot, we just need to solve g'(p) == 0
.
g'(p) = 1/p * ln(1-p) + ln(p) * (-1/1-p) = 0
=> p*ln(p) = (1-p)*ln(1-p)
. I don’t know how to find all roots for this equation, but one simple root
is p = 1-p
, i.e. p = 1/2
. Then, we know k = m/n *ln(2)
using our original substitution. Q.E.D.
§Reference
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf