# Bloom Filter

Came across Bloom filter during a presentation, and it seems that I am the only one who doesn’t know it. (Bummer.)

The above wiki page is pretty straightforward, and I could follow most of it except the derivation of optimal `k = m/n * ln(2)`

, the number of hash
functions.

Let’s formulate it in pure mathematics:

For `f(k) = (1 - e^(-kn/m))^k`

, find `k`

to minimize `f(k)`

.

For the sake of easy derivation, let’s substitute `e^(-kn/m)`

with `p`

, then `k = -m/n*ln(p)`

.

`ln f = e ^ [k * ln(1-p)] = e ^ [-m/n * ln(p) * ln(1-p)]`

Therefore, minimizing `f`

is equivalent to maximizing `g = ln(p) * ln(1-p)`

, due the negative sign before `m`

. According to the domain of `ln`

, we
know that `0 < p < 1`

. In other words, we would like to find `p`

between `(0, 1)`

, which maximizes `g = ln(p) * ln(1-p)`

.

`g`

is a composed function, and we could see how it looks like using wolframalpha. After
seeing its plot, we just need to solve `g'(p) == 0`

.

`g'(p) = 1/p * ln(1-p) + ln(p) * (-1/1-p) = 0`

=> `p*ln(p) = (1-p)*ln(1-p)`

. I don’t know how to find all roots for this equation, but one simple root
is `p = 1-p`

, i.e. `p = 1/2`

. Then, we know `k = m/n *ln(2)`

using our original substitution. Q.E.D.

## §Reference

http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf