machine learning - Should I split my data into training/testing/validation sets with k-fold cross-validation?
When evaluating a recommender system, one can split the data into three pieces: training, validation, and testing sets. In that case, the training set is used to learn the recommendation model from the data, and the validation set is used to choose the best model or parameters. Then, with the chosen model, one evaluates the performance of the algorithm on the testing set.
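A minimal sketch of the three-way split described above, using scikit-learn's `train_test_split` twice (the sizes and random seed are illustrative assumptions, not from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples, 2 features each.
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First carve off a held-out test set (20% of the data)...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into training (60% overall)
# and validation (20% overall): 0.25 of the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```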
However, I have found that the scikit-learn documentation page on cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html) says it is not necessary to split the data into three pieces when using k-fold cross-validation, only two: training and testing:
"A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles)."
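The two-way approach the docs describe can be sketched as follows: hold out a test set, run k-fold CV on the training portion in place of a separate validation set, and touch the test set only once at the end (the dataset, model, and `C` value here are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 5-fold CV on the training set stands in for a validation set:
# use the mean score to compare models or parameter settings.
cv_scores = cross_val_score(SVC(C=1.0), X_train, y_train, cv=5)
print("CV estimate:", cv_scores.mean())

# Final evaluation uses the untouched test set exactly once.
final_score = SVC(C=1.0).fit(X_train, y_train).score(X_test, y_test)
print("Test score:", final_score)
```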
I am wondering whether that approach is correct. And if so, could someone point me to a reference (article or book) backing that theory up?
Cross-validation does not avoid the validation set; it uses many of them. In other words, instead of one split into three parts, you have one split into two, and what you now call "training" is what used to be training plus validation; CV then does repeated splits of it (in a smarter manner than purely at random) into train and test, and averages the results.

The theory backing this is available in pretty much any ML book. The crucial bit is "should you use it", and the answer is surprisingly simple: only if you do not have enough data for a single split. CV is used when you do not have enough data for each of the splits to be representative of the distribution you are interested in, so doing repeated splits reduces the variance. Furthermore, for small datasets one does nested CV: one outer [train+val][test] split and an internal [train][val] one, so that the variance of both model selection and the final evaluation is reduced.
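The nested CV mentioned above can be sketched in scikit-learn by putting a grid search (the inner [train][val] loop) inside an outer cross-validation loop; the parameter grid and fold counts here are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: 3-fold CV selects the best C on each outer training fold.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold CV evaluates the model chosen by the inner loop,
# so the score is not biased by the hyperparameter selection.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Nested CV estimate:", outer_scores.mean())
```

Because the outer folds never participate in choosing `C`, the averaged outer score is an (approximately) unbiased estimate of generalization performance, which is exactly why nested CV helps on small datasets.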