fa.bianp.net

Holdout cross-validation generator

Category: misc
#Python #scikit-learn #machine learning #model selection

Cross-validation iterators in scikit-learn are simply iterable objects, that is, Python objects that implement the __iter__ method, which yields the indices (or a boolean mask) of the train and test set for each split. With this in mind, implementing new cross-validation iterators that behave like the ones in scikit-learn is easy. Here is a small code snippet that implements a holdout cross-validation generator following the scikit-learn API.

import numpy as np
from sklearn.utils import check_random_state


class HoldOut:
    """
    Hold-out cross-validator generator. In the hold-out, the
    data is split only once into a train set and a test set.
    Unlike in other cross-validation schemes, the hold-out
    consists of only one iteration.

    Parameters
    ----------
    n : int
        Total number of samples.
    test_size : float, default=0.2
        Fraction of samples to use as test set. Must be a
        number strictly between 0 and 1.
    random_state : int, RandomState instance or None, default=0
        Seed (or generator instance) for the random number generator.
    """
    def __init__(self, n, test_size=0.2, random_state=0):
        self.n = n
        self.test_size = test_size
        self.random_state = random_state

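    def __iter__(self):
        # Number of test samples, rounded up to the next integer
        n_test = int(np.ceil(self.test_size * self.n))
        rng = check_random_state(self.random_state)
        # Shuffle all sample indices once, then cut the permutation in two
        permutation = rng.permutation(self.n)
        ind_test = permutation[:n_test]
        ind_train = permutation[n_test:]
        # A single (train, test) pair: the hold-out has only one iteration
        yield ind_train, ind_test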
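
To see what the iterator produces, here is a minimal usage sketch. It repeats a condensed copy of the class so that it runs on its own, and it makes one assumption that is not in the snippet above: np.random.RandomState stands in for check_random_state so that only NumPy is needed, which means an integer seed is expected. The sample count and test fraction are arbitrary values picked for illustration.

```python
import numpy as np


class HoldOut:
    """Condensed copy of the hold-out generator (NumPy-only stand-in)."""

    def __init__(self, n, test_size=0.2, random_state=0):
        self.n = n
        self.test_size = test_size
        self.random_state = random_state

    def __iter__(self):
        n_test = int(np.ceil(self.test_size * self.n))
        # Assumption: RandomState replaces check_random_state here so the
        # sketch depends on NumPy alone; pass an integer seed.
        rng = np.random.RandomState(self.random_state)
        permutation = rng.permutation(self.n)
        yield permutation[n_test:], permutation[:n_test]


cv = HoldOut(n=20, test_size=0.25, random_state=0)
splits = list(cv)                 # exhaust the iterator
train, test = splits[0]

print(len(splits))                # 1: the hold-out yields a single split
print(len(train), len(test))      # 15 5

# Iterating again with the same seed reproduces the same split
train_again, test_again = next(iter(cv))
print(np.array_equal(test, test_again))  # True
```

Since recent scikit-learn versions accept any iterable of (train, test) index pairs as the cv argument of helpers such as cross_val_score, an object like this can probably be passed there directly, although that is not checked in this sketch.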

Contrary to other cross-validation schemes, holdout relies on a single split of the data. It is well known that in practice holdout performs much worse than KFold or LeaveOneOut schemes. However, holdout has the advantage that its theoretical properties are easier to derive. For examples of this, see Section 8.7 of "Theory of classification: a survey of some recent advances" and the very recent "The reusable holdout".
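
One way to get a feel for the practical gap is to look at how much the estimated error moves when only the random split changes. The sketch below is an illustration under stated assumptions, not a benchmark: it reads "performs worse" as "gives a noisier error estimate", which is only one possible interpretation, and the synthetic data, the toy model that predicts the training mean, and the helper names holdout_split, kfold_splits and estimated_mse are all hypothetical.

```python
import numpy as np


def holdout_split(n, test_size, rng):
    """Single random train/test split, as in the hold-out scheme."""
    n_test = int(np.ceil(test_size * n))
    perm = rng.permutation(n)
    return [(perm[n_test:], perm[:n_test])]


def kfold_splits(n, n_folds, rng):
    """Shuffled k-fold: every sample is used for testing exactly once."""
    folds = np.array_split(rng.permutation(n), n_folds)
    return [
        (np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
        for i in range(n_folds)
    ]


def estimated_mse(y, splits):
    """Average test error of a toy model that predicts the training mean."""
    errors = [np.mean((y[test] - y[train].mean()) ** 2)
              for train, test in splits]
    return float(np.mean(errors))


rng = np.random.RandomState(0)
y = rng.normal(size=100)  # synthetic targets, for illustration only

# Repeat each scheme 200 times on the same data, changing only the split
holdout = [estimated_mse(y, holdout_split(len(y), 0.2, rng))
           for _ in range(200)]
kfold = [estimated_mse(y, kfold_splits(len(y), 5, rng))
         for _ in range(200)]

print("holdout std: %.3f" % np.std(holdout))
print("5-fold std:  %.3f" % np.std(kfold))
```

On this toy problem the spread of the holdout estimates comes out larger than that of the 5-fold estimates, because each holdout estimate is computed on 20 test samples while each 5-fold estimate averages over all 100.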