I want to generate unique train and test sets to run 5-fold cross-validation. In each fold, 80% of the data should be selected as a train set and the remaining 20% as a test set. Each fold should have different data in the test set. How can I do this?

1 Answer

Best answer

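You can use scikit-learn's StratifiedKFold class to split the data into train and test sets for 5-fold cross-validation. With n_splits=5, each fold holds out 20% of the samples as the test set and trains on the remaining 80%, and every sample appears in exactly one test set, so no two folds share test data. StratifiedKFold also keeps the class proportions of y roughly equal across folds; if you do not need that, KFold from the same module is used the same way.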

Here is an example:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# sample data
X = np.array([[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12], [5, 10, 15],
              [6, 12, 18], [7, 14, 21], [8, 16, 24], [9, 18, 27], [10, 20, 30]])
y = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 0])

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

for k, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # print('X_train:', X_train)
    # print('y_train:', y_train)
    print('fold: {0}, X_test: \n{1}'.format(k, X_test))
    print('fold: {0}, y_test: {1}'.format(k, y_test))

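The code above prints the output below. Each fold has a different test set of two rows (20% of the data), one from each class. The exact rows assigned to each fold depend on random_state and may differ between scikit-learn versions.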

fold: 0, X_test:
[[ 5 10 15]
 [ 6 12 18]]
fold: 0, y_test: [1 0]
fold: 1, X_test:
[[ 2  4  6]
 [10 20 30]]
fold: 1, y_test: [1 0]
fold: 2, X_test:
[[ 3  6  9]
 [ 7 14 21]]
fold: 2, y_test: [1 0]
fold: 3, X_test:
[[ 1  2  3]
 [ 9 18 27]]
fold: 3, y_test: [0 1]
fold: 4, X_test:
[[ 4  8 12]
 [ 8 16 24]]
fold: 4, y_test: [1 0]
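
If you would rather check the 80/20 split and the uniqueness of the test sets programmatically than read them off the printed output, something like the following sketch should work. It rebuilds the same sample data and uses the same splitter settings as above; the names skf, seen_test_idx and all_covered are only illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same values as the sample data above: rows [i, 2i, 3i] for i = 1..10
X = np.arange(1, 11).reshape(-1, 1) * np.array([1, 2, 3])
y = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 0])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

seen_test_idx = []
for train_idx, test_idx in skf.split(X, y):
    # 10 samples and 5 folds: 8 rows (80%) for training, 2 rows (20%) for testing
    assert len(train_idx) == 8 and len(test_idx) == 2
    # Within a fold, train and test indices never overlap
    assert np.intersect1d(train_idx, test_idx).size == 0
    seen_test_idx.extend(test_idx.tolist())

# Across the 5 folds, every sample index shows up in exactly one test set
all_covered = sorted(seen_test_idx) == list(range(len(X)))
print('each sample tested exactly once:', all_covered)
```

These checks hold for any random_state, because the folds always partition the sample indices; only which rows end up in which fold changes.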


...