Tuesday, January 8, 2019

Categorical and ordinal features

First, an ordinal feature is a special case of a categorical feature whose values have a meaningful order.
 - e.g. 1st class and 2nd class on railways.

Second, label encoding replaces the unique values of a categorical feature with numbers.
 - either by sorting the values alphabetically or by assigning codes in order of appearance.
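
The two ways of assigning codes can be sketched in plain Python as follows. This is only an illustration: the function names and the sample `embarked` values are made up for this example, not taken from any library or dataset.

```python
def label_encode_sorted(values):
    """Assign integer codes to unique values in sorted (alphabetical) order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def label_encode_by_appearance(values):
    """Assign integer codes to unique values in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next unused code
    return [mapping[v] for v in values], mapping


# Hypothetical sample column (assumption, for illustration only).
embarked = ["S", "C", "S", "Q", "C"]

sorted_codes, sorted_map = label_encode_sorted(embarked)
appearance_codes, appearance_map = label_encode_by_appearance(embarked)

print(sorted_codes)      # [2, 0, 2, 1, 0]  (C=0, Q=1, S=2)
print(appearance_codes)  # [0, 1, 0, 2, 1]  (S=0, C=1, Q=2)
```

Both variants give each unique value exactly one code; they differ only in which value gets which number.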

Third, frequency encoding maps each unique value to its frequency in the data.
 - e.g. how many times 1st class occurs.
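
A minimal sketch of frequency encoding in plain Python, assuming a made-up `pclass` column. The helper name `frequency_encode` and its `normalize` flag are hypothetical choices for this example.

```python
from collections import Counter


def frequency_encode(values, normalize=False):
    """Replace each value with how often it occurs in the column.

    With normalize=True the count is divided by the column length,
    giving a fraction between 0 and 1 instead of a raw count.
    """
    counts = Counter(values)
    total = len(values)
    if normalize:
        return [counts[v] / total for v in values]
    return [counts[v] for v in values]


# Hypothetical sample column (assumption, for illustration only).
pclass = [1, 3, 3, 2, 3, 1, 3, 2]

freq_counts = frequency_encode(pclass)
freq_share = frequency_encode(pclass, normalize=True)

print(freq_counts)  # [2, 4, 4, 2, 4, 2, 4, 2]
print(freq_share)   # [0.25, 0.5, 0.5, 0.25, 0.5, 0.25, 0.5, 0.25]
```

Note that in this sample 1st and 2nd class both occur twice, so they receive the same encoded value and can no longer be told apart by this feature alone.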

Fourth, label encoding and frequency encoding are often used for tree-based methods.

Fifth, one-hot encoding is often used for non-tree-based methods.

And finally, applying one-hot encoding to combinations of categorical features allows non-tree-based models to take interactions between features into account, which can improve their performance.
 - e.g. in the Titanic dataset, you could create a new categorical feature by combining sex and pclass.

If pclass = 1,2,3 and sex = M,F
then features could be:
1M, 1F, 2M, 2F, 3M, 3F and we could use one-hot encoding here.
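
The pclass and sex combination above can be sketched in plain Python like this. The four sample rows are made up for illustration, and the variable names are arbitrary.

```python
# Hypothetical sample rows (assumption, for illustration only).
pclass = [1, 3, 2, 3]
sex = ["M", "F", "F", "M"]

# Step 1: build the combined categorical feature, e.g. 1 and "M" -> "1M".
combined = [f"{p}{s}" for p, s in zip(pclass, sex)]

# Step 2: list all possible combined categories in a fixed column order:
# 1M, 1F, 2M, 2F, 3M, 3F
categories = [f"{p}{s}" for p in (1, 2, 3) for s in ("M", "F")]
column_of = {c: j for j, c in enumerate(categories)}

# Step 3: one-hot encode: one column per category, a single 1 per row.
one_hot = [
    [1 if column_of[value] == j else 0 for j in range(len(categories))]
    for value in combined
]

print(combined)  # ['1M', '3F', '2F', '3M']
for row in one_hot:
    print(row)
# [1, 0, 0, 0, 0, 0]
# [0, 0, 0, 0, 0, 1]
# [0, 0, 0, 1, 0, 0]
# [0, 0, 0, 0, 1, 0]
```

Each of the six columns now marks one specific pclass and sex pair, so a linear model can learn a separate weight for, say, 3rd-class males.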

One-hot encodings can be stored as sparse matrices, which keep only the non-zero values (this uses storage efficiently when fewer than half of all values are non-zero).
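
To see why this saves space, here is a rough sketch in plain Python that keeps only the positions of the non-zero entries. The coordinate-list layout and the helper names are assumptions made for this example. In practice you would more likely use a library sparse type, such as those in SciPy's `scipy.sparse` module.

```python
def to_sparse(dense_rows):
    """Keep only the (row, column) positions of non-zero entries."""
    return [
        (i, j)
        for i, row in enumerate(dense_rows)
        for j, x in enumerate(row)
        if x != 0
    ]


def to_dense(coords, n_rows, n_cols):
    """Rebuild the full 0/1 matrix from the stored positions."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j in coords:
        dense[i][j] = 1
    return dense


# A small one-hot matrix: 3 rows, 6 categories, one 1 per row.
dense = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
]

coords = to_sparse(dense)
dense_cells = sum(len(row) for row in dense)

print(coords)       # [(0, 0), (1, 5), (2, 3)]
print(dense_cells)  # 18 stored values in the dense form
print(len(coords))  # 3 stored positions in the sparse form
```

Because a one-hot row has exactly one non-zero entry, the saving grows with the number of categories.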
