I want to split a data set into train and test sets for training and testing a classification model. The train set should contain 75% of the data, and the test set should contain the remaining 25%. Is there a function that does this split?

1 Answer


You can use the train_test_split() function from the scikit-learn library to create train and test sets for your classification model. Pass test_size=0.25 to hold out 25% of the data for testing; the remaining 75% becomes the train set.

Here is an example. I generate random data and labels, then apply the function to produce the train and test sets.

import numpy as np
from sklearn.model_selection import train_test_split

# generate random data
n_samples = 25
n_features = 4
np.random.seed(1234)
X = np.random.random((n_samples, n_features))            # feature matrix
y = [np.random.randint(0, 2) for _ in range(n_samples)]  # binary labels (0 or 1)
print("data shape: {0}".format(X.shape))

# split data into train (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1234)
print("train set shape: {0}".format(X_train.shape))
print("test set shape: {0}".format(X_test.shape))

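The above code prints the following output. Because 0.25 × 25 = 6.25 is not a whole number, the test set size is rounded up to 7 samples, which leaves 18 samples for the train set: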

data shape: (25, 4)
train set shape: (18, 4)
test set shape: (7, 4)
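
Since the goal is a classification model, it may also be worth keeping the class balance the same in both subsets. The sketch below is not part of the example above: it assumes a hypothetical imbalanced data set (100 samples, 20% positive labels) and uses the stratify parameter of train_test_split() to preserve that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data set (an assumption for illustration):
# 100 samples, 4 features, 20 positive and 80 negative labels
rng = np.random.RandomState(1234)
X = rng.random_sample((100, 4))
y = np.array([1] * 20 + [0] * 80)

# stratify=y asks for the class proportions of y to be preserved
# in both the train and the test subset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

print("train set shape: {0}, positive fraction: {1:.2f}".format(X_train.shape, y_train.mean()))
print("test set shape: {0}, positive fraction: {1:.2f}".format(X_test.shape, y_test.mean()))
```

With these sizes the split is exact: 75 train samples and 25 test samples, each with 20% positive labels. Without stratify, the positive fraction in a small test set can drift noticeably from the overall ratio.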


...