A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
Suppose you have two sets,
and
, and you would like to know how similar they are. First you might ask, how big is their intersection?
data:image/s3,"s3://crabby-images/7ca94/7ca94440bb6d2a5698319f42c122f92dd9ec4d88" alt="\displaystyle |A\cap B| \displaystyle |A\cap B|"
That’s nice, but isn’t comparable across different sizes of sets, so let’s normalize it by the union of the two sizes.
data:image/s3,"s3://crabby-images/16b9e/16b9e1e485daf0b9e63b9b1e3ada712899b102b2" alt="\displaystyle \frac{|A\cap B|}{|A\cup B|} \displaystyle \frac{|A\cap B|}{|A\cup B|}"
This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.
SimHash
a hash function usually hashes different values to totally different hash values
simhash is one where similiar items are hashed to similiar hash values