Sunday, March 24, 2019

stratified cross validation

Advantages:
When building each fold, the data are split so that every class is represented, in roughly its overall proportion, in both the training and validation sets.

Good for small and imbalanced datasets.
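
As a minimal sketch (assuming scikit-learn is available; the tiny dataset below is made up purely for illustration), StratifiedKFold preserves the class ratio in every fold:

    # Synthetic, imbalanced toy data: 80% class 0, 20% class 1.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.arange(20).reshape(-1, 1)      # 20 samples, 1 feature
    y = np.array([0] * 16 + [1] * 4)      # imbalanced labels

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        # Each validation fold keeps the 4:1 class ratio of the full data.
        print(fold, np.bincount(y[val_idx]))

Every validation fold here ends up with 4 samples of class 0 and 1 of class 1, mirroring the overall distribution; a plain (unstratified) KFold on the same data can easily produce folds with no class-1 samples at all.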

Disadvantages:
1. One issue that matters even for unbiased or balanced algorithms: a class that isn't represented at all in a fold can be neither learned nor tested, and even a fold containing only a single instance of a class does not allow generalization to be learned or, respectively, evaluated (see the sketch after this list).

2. Supervised stratification also compromises the technical purity of the evaluation: the labels of the test data should not affect training, yet stratification uses them when selecting the training instances.
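
A minimal sketch of point 1, again assuming scikit-learn and a made-up toy dataset: when a class has fewer samples than the number of splits, stratification cannot place it in every fold, and scikit-learn warns about the least populated class.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.arange(10).reshape(-1, 1)
    y = np.array([0] * 9 + [1])           # class 1 has a single sample

    # With 5 splits, only one validation fold can ever contain class 1;
    # the other four never see it, so it can't be evaluated there.
    skf = StratifiedKFold(n_splits=5)
    for fold, (_, val_idx) in enumerate(skf.split(X, y)):
        print(fold, np.bincount(y[val_idx], minlength=2))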

tl;dr
Stratification is recommended when very few samples are available. For large datasets the law of large numbers kicks in, i.e. the train/validation splits are themselves large and representative of the actual data distribution.
