A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
Suppose you have two sets, and , and you would like to know how similar they are. First you might ask, how big is their intersection?
That’s nice, but isn’t comparable across different sizes of sets, so let’s normalize it by the union of the two sizes.
This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.
SimHash
a hash function usually hashes different values to totally different hash values
simhash is one where similiar items are hashed to similiar hash values