Cross-validation

Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held out for use as validation sets; a model is fit to the remaining data (the training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy. Cross-validation is used repeatedly when building decision trees, and more generally as a validation step in machine learning, in order to avoid overfitting.

The main types of cross-validation are:

  • Leave-one-out cross-validation (similar to the jackknife)
  • Leave-p-out cross-validation
  • K-fold cross-validation (the data are split into K subsets, each of which is held out in turn as the validation set; see the sketch after this list)
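
As a rough illustration of the K-fold variant, here is a minimal self-contained sketch in Python with NumPy. The function name `k_fold_cv` and the `fit`/`predict` callables are hypothetical conveniences for this example, not part of any particular library:

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, k=5, seed=0):
    """Sketch of K-fold cross-validation.

    fit(X, y) -> model; predict(model, X) -> predictions.
    Returns the mean squared prediction error averaged over the K folds.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))           # shuffle before splitting
    folds = np.array_split(idx, k)          # K roughly equal subsets
    errors = []
    for i in range(k):
        val = folds[i]                                   # held-out validation fold
        train = np.concatenate(folds[:i] + folds[i+1:])  # remaining K-1 folds
        model = fit(X[train], y[train])                  # fit on training data only
        pred = predict(model, X[val])                    # predict the held-out fold
        errors.append(np.mean((y[val] - pred) ** 2))     # fold-level MSE
    return np.mean(errors)                               # average across folds
```

For ordinary least squares, `fit` could be `lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]` and `predict` could be `lambda beta, X: X @ beta`. Leave-one-out cross-validation is the special case k = len(y).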

Cross-validation's main goal is to avoid "self-influence": a model should not be evaluated on the same observations used to fit it, since doing so gives an optimistically biased estimate of its accuracy.

Cross-validation is often used to decide how many predictor variables to include in a regression. Without cross-validation, adding predictors always reduces the residual sum of squares (or possibly leaves it unchanged). In contrast, the cross-validated mean-square error will tend to decrease if valuable predictors are added, but increase if worthless predictors are added.
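
To make this concrete, here is a small simulation under assumed conditions (the data, the coefficients, and the helper names `rss` and `cv_mse` are all invented for illustration). Least-squares models are fit with an increasing number of predictors: the training RSS keeps falling as noise columns are added, while the 5-fold cross-validated MSE typically stops improving and starts to rise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_good = rng.normal(size=(n, 2))    # two genuinely predictive columns
X_junk = rng.normal(size=(n, 8))    # eight pure-noise columns
y = X_good @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)
X_all = np.hstack([X_good, X_junk])

def rss(X, y):
    """Residual sum of squares of a least-squares fit on the full data."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def cv_mse(X, y, k=5):
    """5-fold cross-validated mean squared prediction error."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate(folds[:i] + folds[i+1:])
        beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        errs.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return np.mean(errs)

for p in range(1, X_all.shape[1] + 1):
    Xp = X_all[:, :p]                # first p predictors
    print(f"p={p:2d}  RSS={rss(Xp, y):8.2f}  CV-MSE={cv_mse(Xp, y):.3f}")
```

The printed RSS column is non-increasing in p by construction, whereas the CV-MSE column should reach its minimum near p = 2, the true number of useful predictors.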

See also

Resampling methods