Monday, March 17, 2014

summary hilary mason machine learing intro - part 5

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

Suppose you have two sets, A and B, and you would like to know how similar they are. First you might ask, how big is their intersection?

\displaystyle |A\cap B|

That’s nice, but isn’t comparable across different sizes of sets, so let’s normalize it by the union of the two sizes.

\displaystyle \frac{|A\cap B|}{|A\cup B|}

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.


SimHash
a hash function usually hashes different values to totally different hash values
simhash is one where similiar items are hashed to similiar hash values

No comments:

Blog Archive