+2 votes
in Machine Learning by (73.8k points)
In my training data, all class "1" records come after all class "0" records. I want to randomize their order. Is there a way to shuffle the records and the labels with the same permutation so that each record keeps its label?

1 Answer

+1 vote
by (349k points)
Best answer

You can try one of the following two approaches to shuffle both data and labels in the same order.

Approach 1: Generate a random permutation of the row indices with np.random.permutation(), passing it the number of records. Then index both the data and the labels with that same permutation, so each record stays paired with its label.

>>> import numpy as np
>>> X=np.array([[1.1,2.2,3.3,4.4],[1.2,2.3,3.4,4.5],[2.1,2.2,2.3,2.4],[3.1,3.2,3.3,3.4],[4.1,4.2,4.3,4.4]])
>>> X
array([[1.1, 2.2, 3.3, 4.4],
       [1.2, 2.3, 3.4, 4.5],
       [2.1, 2.2, 2.3, 2.4],
       [3.1, 3.2, 3.3, 3.4],
       [4.1, 4.2, 4.3, 4.4]])
>>> y=np.array([0,1,2,3,4])

>>> p = np.random.permutation(len(y))
>>> p
array([1, 0, 3, 4, 2])

>>> X_shuffled=X[p]
>>> X_shuffled
array([[1.2, 2.3, 3.4, 4.5],
       [1.1, 2.2, 3.3, 4.4],
       [3.1, 3.2, 3.3, 3.4],
       [4.1, 4.2, 4.3, 4.4],
       [2.1, 2.2, 2.3, 2.4]])

>>> y_shuffled=y[p]
>>> y_shuffled
array([1, 0, 3, 4, 2])
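
If you need the shuffle to be reproducible, the same index-permutation idea can be sketched with NumPy's Generator API. The helper name shuffle_together, the seed value 42, and the toy arrays are illustrative assumptions, not part of the approach above.

```python
import numpy as np


def shuffle_together(X, y, seed=None):
    """Return copies of X and y reordered by one shared random permutation.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of records")
    rng = np.random.default_rng(seed)  # seeded generator gives repeatable results
    p = rng.permutation(len(y))        # one permutation shared by both arrays
    return X[p], y[p]


# Mimic the question: all class 0 records come before all class 1 records.
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
X_shuffled, y_shuffled = shuffle_together(X, y, seed=42)
```

Fancy indexing returns new arrays, so the original X and y are left untouched.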

Approach 2: You can also use the shuffle() function from sklearn.utils, which shuffles the data and labels in the same order. Passing random_state makes the result reproducible.

>>> from sklearn.utils import shuffle
>>> X_shuffled,y_shuffled = shuffle(X, y, random_state=0)
>>> X_shuffled
array([[2.1, 2.2, 2.3, 2.4],
       [1.1, 2.2, 3.3, 4.4],
       [1.2, 2.3, 3.4, 4.5],
       [3.1, 3.2, 3.3, 3.4],
       [4.1, 4.2, 4.3, 4.4]])
>>> y_shuffled
array([2, 0, 1, 3, 4])
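
One way to convince yourself that the pairing survives is to shuffle an extra array of original row positions alongside the data and labels. This is a minimal sketch that assumes scikit-learn is installed; the toy arrays and the names ids and pairs_intact are illustrative.

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data shaped like the question: class 0 records first, then class 1.
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
ids = np.arange(len(y))  # original position of each record

# shuffle() accepts any number of arrays and applies one ordering to all of them.
X_s, y_s, ids_s = shuffle(X, y, ids, random_state=0)

# Each shuffled row and label should match the original at the recorded position.
pairs_intact = np.array_equal(X_s, X[ids_s]) and np.array_equal(y_s, y[ids_s])
```

If pairs_intact is True, every record still carries the label it started with.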


...