## Monday, March 17, 2014

### summary hilary mason machine learing intro - part 5

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

Suppose you have two sets, $A$ and $B$, and you would like to know how similar they are. First you might ask, how big is their intersection?

$\displaystyle |A\cap B|$

That’s nice, but isn’t comparable across different sizes of sets, so let’s normalize it by the union of the two sizes.

$\displaystyle \frac{|A\cap B|}{|A\cup B|}$

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.

SimHash
a hash function usually hashes different values to totally different hash values
simhash is one where similiar items are hashed to similiar hash values