Bloom Filter

Came across Bloom filter during a presentation, and it seems that I am the only one who doesn’t know it. (Bummer.)

The above wiki page is pretty straightforward, and I could follow most of it except the derivation of optimal k = m/n * ln(2), the number of hash functions.

Let’s formulate it in pure mathematics:

For f(k) = (1 - e^(-kn/m))^k, find k to minimize f(k).

For the sake of easy derivation, let’s substitute e^(-kn/m) with p, then k = -m/n*ln(p).

ln f = e ^ [k * ln(1-p)] = e ^ [-m/n * ln(p) * ln(1-p)]

Therefore, minimizing f is equivalent to maximizing g = ln(p) * ln(1-p), due the negative sign before m. According to the domain of ln, we know that 0 < p < 1. In other words, we would like to find p between (0, 1), which maximizes g = ln(p) * ln(1-p).

g is a composed function, and we could see how it looks like using wolframalpha. After seeing its plot, we just need to solve g'(p) == 0.

g'(p) = 1/p * ln(1-p) + ln(p) * (-1/1-p) = 0 => p*ln(p) = (1-p)*ln(1-p). I don’t know how to find all roots for this equation, but one simple root is p = 1-p, i.e. p = 1/2. Then, we know k = m/n *ln(2) using our original substitution. Q.E.D.

§Reference

http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf