Software Troubles and Troubleshooting: March 2014

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

Jaccard Index

Suppose you have two sets, $A$ and $B$ , and you would like to know how similar they are. First you might ask, how big is their intersection?

$\displaystyle |A\cap B|$

That’s nice, but isn’t comparable across different sizes of sets, so let’s normalize it by the union of the two sizes.

$\displaystyle \frac{|A\cap B|}{|A\cup B|}$

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.

SimHash
a hash function usually hashes different values to totally different hash values
simhash is one where similiar items are hashed to similiar hash values

Sunday, March 16, 2014

Python numpy install

yum install numpy

Installing PyCluster on centos

yum install python-devel
wget http://bonsai.hgc.jp/~mdehoon/software/cluster/Pycluster-1.52.tar.gz
tar -xvzf Pycluster-1.52.tar.gz
cd Pycluster-1.52
python setup.py install

Friday, March 14, 2014

installing nltk for python on centos

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
-O - | python
easy_install pip
pip install -U pyyaml nltk

Monday, March 31, 2014

Monday, March 17, 2014

Sunday, March 16, 2014

Friday, March 14, 2014

Blog Archive