Holdout cross-validation generator
Cross-validation iterators in scikit-learn are simply iterable objects, that is, Python objects that implement the __iter__ method, typically as a generator that yields the indices or a boolean mask for the train and test sets of each split. With this in mind, it is easy to implement new cross-validation iterators that behave like the ones in scikit-learn. Here is a small code snippet that implements a holdout cross-validation generator following the scikit-learn API.
import numpy as np
from sklearn.utils import check_random_state


class HoldOut:
    """
    Hold-out cross-validator generator. In the hold-out, the
    data is split only once into a train set and a test set.
    Unlike in other cross-validation schemes, the hold-out
    consists of only one iteration.

    Parameters
    ----------
    n : int
        Total number of samples.
    test_size : float, 0 < test_size < 1
        Fraction of samples to use as test set. Must be a
        number between 0 and 1.
    random_state : int
        Seed for the random number generator.
    """

    def __init__(self, n, test_size=0.2, random_state=0):
        self.n = n
        self.test_size = test_size
        self.random_state = random_state

    def __iter__(self):
        n_test = int(np.ceil(self.test_size * self.n))
        n_train = self.n - n_test
        rng = check_random_state(self.random_state)
        permutation = rng.permutation(self.n)
        ind_test = permutation[:n_test]
        ind_train = permutation[n_test:n_test + n_train]
        yield ind_train, ind_test
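The class above takes the number of samples in its constructor and is iterated directly. Recent scikit-learn releases instead expect splitters to expose split and get_n_splits methods. As a minimal sketch, assuming that interface, the same hold-out could be written roughly as follows. The name HoldOutSplit is hypothetical, and np.random.RandomState stands in for check_random_state (an integer seed is assumed) so that the snippet depends only on NumPy.

```python
import numpy as np


class HoldOutSplit:
    """Hypothetical hold-out splitter exposing split / get_n_splits."""

    def __init__(self, test_size=0.2, random_state=0):
        self.test_size = test_size
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        # The hold-out always produces exactly one split.
        return 1

    def split(self, X, y=None, groups=None):
        # The number of samples is read from the data, not the constructor.
        n = len(X)
        n_test = int(np.ceil(self.test_size * n))
        # Assumption: random_state is an integer seed.
        rng = np.random.RandomState(self.random_state)
        permutation = rng.permutation(n)
        # Yield (train indices, test indices), as the original class does.
        yield permutation[n_test:], permutation[:n_test]


# Usage: 8 samples with test_size=0.25 gives 6 train and 2 test indices.
X = np.arange(16).reshape(8, 2)
cv = HoldOutSplit(test_size=0.25, random_state=0)
train, test = next(cv.split(X))
print(len(train), len(test))
```

The train and test indices are disjoint and together cover every sample, since both are slices of the same permutation.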
Contrary to other cross-validation schemes, holdout relies on a single split of the data. It is well known that in practice holdout tends to give less reliable estimates than KFold or LeaveOneOut schemes, because the estimate depends on one random split and is computed on only a fraction of the samples. However, holdout has the advantage that its theoretical properties are easier to derive. For examples of this, see Section 8.7 of "Theory of classification: a survey of some recent advances" and the very recent "The reusable holdout".
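To get a feeling for the practical difference, one can compare how much the hold-out estimate and a 5-fold estimate fluctuate across random seeds. The following is only a rough sketch under my own assumptions: the data are synthetic (two Gaussian classes), the classifier is a hand-written nearest-centroid rule, and the helper name nearest_centroid_accuracy is hypothetical. One would typically expect the hold-out scores to show the larger spread, but the exact numbers depend on the data and the seeds.

```python
import numpy as np


def nearest_centroid_accuracy(X, y, train, test):
    # Fit one centroid per class using the training rows only.
    c0 = X[train][y[train] == 0].mean(axis=0)
    c1 = X[train][y[train] == 1].mean(axis=0)
    # Predict the class whose centroid is closer to each test point.
    d0 = np.linalg.norm(X[test] - c0, axis=1)
    d1 = np.linalg.norm(X[test] - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return np.mean(pred == y[test])


# Synthetic data: two balanced classes whose means differ by 1 per feature.
n = 100
y = np.repeat([0, 1], n // 2)
X = np.random.RandomState(0).randn(n, 2) + y[:, None]

holdout_scores, kfold_scores = [], []
for seed in range(50):
    perm = np.random.RandomState(seed).permutation(n)
    # Hold-out: a single 80/20 split of the permuted indices.
    holdout_scores.append(
        nearest_centroid_accuracy(X, y, perm[20:], perm[:20])
    )
    # 5-fold: average the score over five disjoint test folds.
    folds = np.array_split(perm, 5)
    fold_scores = []
    for k in range(5):
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        fold_scores.append(nearest_centroid_accuracy(X, y, train, folds[k]))
    kfold_scores.append(np.mean(fold_scores))

print("holdout std:", np.std(holdout_scores))
print("5-fold std: ", np.std(kfold_scores))
```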