I want to use some sample data to test my machine learning code. I have used some toy datasets available on the UCI website, but is there a way to generate sample data for ML?

1 Answer


You can use the make_classification() function from sklearn.datasets to generate sample data. It lets you specify the number of samples, the number of features, the number of classes, and so on. It can also generate imbalanced data (through the weights parameter) and noisy data (through the flip_y parameter).

Here is an example showing how to use this function. See the scikit-learn documentation for make_classification to learn more about it.

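from sklearn.datasets import make_classification


def generate_sample_data(sample_count, feature_count, noise_fraction):
    """
    Generate a synthetic binary classification dataset using sklearn.

    Returns X with shape (sample_count, feature_count) and y with shape (sample_count,).
    """
    print("Generate sample ML data")
    # flip_y is the fraction of samples whose label is assigned at random (label noise).
    # A smaller class_sep makes the two classes harder to separate.
    X, y = make_classification(n_samples=sample_count, n_features=feature_count,
                               n_informative=2, n_redundant=0, n_classes=2,
                               flip_y=noise_fraction, class_sep=0.5,
                               n_clusters_per_class=1, random_state=4)
    return X, y


if __name__ == '__main__':
    noise_fraction_in_data = 0
    # 100,000 x 1,000 float64 values take roughly 800 MB of memory;
    # reduce these numbers for a quick test.
    sample_count = 100000
    feature_count = 1000
    X_all, y_all = generate_sample_data(sample_count, feature_count, noise_fraction_in_data)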
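
The options for imbalanced and noisy data mentioned above could be used roughly as follows. This is only a minimal sketch: the sample count, feature count, class weights, and noise fraction are illustrative values I picked, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification

# Small, imbalanced, slightly noisy dataset (all values here are illustrative).
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=2,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # roughly 90% of samples in class 0, 10% in class 1
    flip_y=0.01,         # labels of about 1% of samples are assigned at random
    random_state=4,
)

print(X.shape, y.shape)  # feature matrix and label vector shapes
print(np.bincount(y))    # number of samples per class
```

Because of flip_y, the class counts will not match the requested weights exactly. Keeping random_state fixed makes the generated data reproducible between runs.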

...