+4 votes
in Programming Languages by (73.8k points)
I have about 100 .npy files that contain CSR matrix data. The data was saved across multiple files because np.save would not let me write a file larger than 4 GB. To train a classifier, I need to load the data from all the .npy files into RAM. I am using vstack to build a combined CSR matrix, but it is slow. What is an efficient way to use vstack, or any other method, to create a combined CSR matrix from the data in all 100 .npy files?

1 Answer

+1 vote
by (348k points)
Best answer

There are two ways to use vstack here. If you append one file at a time, vstack will be very slow. The recommended way is to load all the .npy files first and then call vstack once. Compare the following implementations.

import numpy as np
from scipy import sparse
import os

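#Approach 1: merge one file at a time (slow)
npy_dir = 'path/to/npy_files'  #placeholder for the directory that holds the .npy files
file_cnt = 0
filelist = next(os.walk(npy_dir))[2]
for filename in filelist:
 filename = os.path.join(npy_dir, filename)
 if (file_cnt == 0):
  #each file holds a pickled sparse matrix, so newer NumPy versions need allow_pickle=True;
  #.tolist() unwraps the matrix from the 0-d object array that np.load returns
  data1 = np.load(filename, allow_pickle=True).tolist()
  print ('Number of records in file {0} is {1}'.format(filename, data1.shape[0]))
 else:
  data2 = np.load(filename, allow_pickle=True).tolist()
  print ('Number of records in file {0} is {1}'.format(filename, data2.shape[0]))
  data1 = sparse.vstack((data1, data2), format='csr')  #merge the data (copies everything merged so far)
  print ('Total number of records in the merged file is {0}'.format(data1.shape[0]))
 file_cnt+=1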


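#Approach 2: load everything first, then merge with a single vstack call (fast)
npy_dir = 'path/to/npy_files'  #placeholder for the directory that holds the .npy files
filelist = next(os.walk(npy_dir))[2]
print ('Loading all .npy files into RAM....')
#each file holds a pickled sparse matrix, so newer NumPy versions need allow_pickle=True
npylist = [np.load(os.path.join(npy_dir, e), allow_pickle=True).tolist() for e in filelist]
print ('Merging npy files....')
data1 = sparse.vstack(npylist, format='csr')  #ask for CSR explicitly, as in approach 1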

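Approach 2 is faster than approach 1. In approach 1, every vstack call builds a new matrix and copies everything merged so far, so the total amount of copying grows roughly quadratically with the number of files. Approach 2 stacks all the matrices in a single call. If you have been using approach 1, switch to approach 2; it should help.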
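
If you want to check the difference on your own machine without touching the real files, here is a minimal sketch that times both approaches on synthetic data. The random blocks, their sizes, and the helper names merge_incrementally and merge_at_once are made up for illustration and are not part of the original code; actual timings will depend on your data and hardware.

```python
import time

import numpy as np
from scipy import sparse


def merge_incrementally(blocks):
    """Approach 1: stack one block at a time onto a growing matrix."""
    merged = blocks[0]
    for block in blocks[1:]:
        # Each call allocates a new matrix and copies everything merged so far.
        merged = sparse.vstack((merged, block), format='csr')
    return merged


def merge_at_once(blocks):
    """Approach 2: stack every block in a single vstack call."""
    return sparse.vstack(blocks, format='csr')


# Hypothetical stand-ins for the loaded .npy contents: 50 random CSR blocks.
blocks = [
    sparse.random(2000, 500, density=0.01, format='csr', random_state=seed)
    for seed in range(50)
]

start = time.perf_counter()
incremental = merge_incrementally(blocks)
incremental_seconds = time.perf_counter() - start

start = time.perf_counter()
at_once = merge_at_once(blocks)
at_once_seconds = time.perf_counter() - start

# Both approaches should produce the same matrix; only the cost differs.
same_result = (incremental != at_once).nnz == 0

print('one at a time: {0:.3f}s, single call: {1:.3f}s'.format(
    incremental_seconds, at_once_seconds))
print('same result:', same_result)
```

The gap between the two timings should widen as the number of blocks grows, because the one-at-a-time version keeps re-copying the rows it has already merged.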

