I have a training data set with two classes, 0 and 1. It is highly imbalanced: there are roughly 12 class 0 records for every class 1 record. I want to build a balanced training set by keeping all class 1 records and selecting the same number of class 0 records. How can I do this in Python?

1 Answer

Best answer

Let's say your imbalanced dataset is X_all and its labels are Y_all, both NumPy arrays of the same length. You can write the following function to generate a balanced data set. It keeps all class 1 records and randomly selects the same number of class 0 records, without replacement.

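import random

import numpy as np

def generate_balanced_data(X_all, Y_all):
    """
    Return a balanced subset of (X_all, Y_all): every class 1 record
    plus an equal number of randomly selected class 0 records.
    """
    idx_1 = np.where(Y_all == 1)[0].tolist()  # positions of class 1 records
    idx_0 = np.where(Y_all == 0)[0].tolist()  # positions of class 0 records
    # random.sample() needs a sequence; passing a set raises TypeError on Python 3.11+.
    idx_0_sel = random.sample(idx_0, len(idx_1))

    bal_idx = sorted(idx_1 + idx_0_sel)  # indices for the balanced data, in original order
    return X_all[bal_idx], Y_all[bal_idx]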
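
If you want the selection to be reproducible, the same undersampling idea can be sketched with NumPy's own random generator. This is only an illustration: the function name balance_by_undersampling, the seed parameter, and the synthetic data are assumptions made for the example and are not part of the question.

```python
import numpy as np


def balance_by_undersampling(X, y, seed=None):
    """Keep all class 1 rows and an equal-sized random sample of class 0 rows."""
    rng = np.random.default_rng(seed)   # seeded generator for reproducible sampling
    idx_1 = np.flatnonzero(y == 1)      # positions of class 1 records
    idx_0 = np.flatnonzero(y == 0)      # positions of class 0 records
    # Draw as many class 0 positions as there are class 1 records, without repeats.
    idx_0_sel = rng.choice(idx_0, size=idx_1.size, replace=False)
    bal_idx = np.sort(np.concatenate([idx_1, idx_0_sel]))
    return X[bal_idx], y[bal_idx]


# Synthetic data (made up for illustration): 1200 class 0 rows, 100 class 1 rows.
y_all = np.array([0] * 1200 + [1] * 100)
X_all = np.arange(y_all.size * 3).reshape(-1, 3)

X_bal, y_bal = balance_by_undersampling(X_all, y_all, seed=0)
print(X_bal.shape, np.bincount(y_bal))  # (200, 3) [100 100]
```

Note that this sketch assumes class 0 is the larger class; if it has fewer records than class 1, rng.choice raises a ValueError because it cannot sample that many records without replacement.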


...