Python : Merging large .npy files to create a combined CSR matrix (compressed sparse)

Question

Python : Merging large .npy files to create a combined CSR matrix (compressed sparse)

1 Answer

answered Mar 7, 2018 by pkumar81 (351k points)
selected Apr 21, 2023 by pkumar81

Best answer

There are two ways to use scipy.sparse.vstack here. If you append one file at a time, vstack will be very slow, because every call builds a new merged matrix and copies all the data accumulated so far. The recommended way is to load all the .npy files first and then call vstack once. Compare the following implementations.

import os

import numpy as np
from scipy import sparse

# Placeholder: directory that holds the .npy files
npy_dir = 'path/to/npy_files'

# Approach 1: merge one file at a time (slow)
# Each file is assumed to hold one pickled sparse matrix in a 0-d object
# array; .tolist() unwraps it. Recent NumPy versions need allow_pickle=True,
# so only load files you trust.
filelist = next(os.walk(npy_dir))[2]
data1 = None
for filename in filelist:
    filename = os.path.join(npy_dir, filename)
    data2 = np.load(filename, allow_pickle=True).tolist()
    print('Number of records in file {0} is {1}'.format(filename, data2.shape[0]))
    if data1 is None:
        data1 = data2  # first file: nothing to merge yet
    else:
        data1 = sparse.vstack((data1, data2), format='csr')  # merge the data
        print('Total number of records in the merged file is {0}'.format(data1.shape[0]))

# Approach 2: load everything first, then merge once (fast)
filelist = next(os.walk(npy_dir))[2]
print('Loading all .npy files into RAM....')
npylist = [np.load(os.path.join(npy_dir, e), allow_pickle=True).tolist() for e in filelist]
print('Merging npy files....')
data1 = sparse.vstack(npylist, format='csr')  # request CSR explicitly

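Approach 2 is faster than approach 1 because the data is merged in a single vstack call instead of being copied again for every file. If you have been using approach 1, switch to approach 2. Keep in mind that it holds all the loaded matrices in RAM at the same time as the merged result.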
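If you want to try the load-then-merge idea on small data first, here is a minimal sketch. The helper name merge_npy_sparse, the part0.npy style file names, and the assumption that each file was written by calling np.save directly on a SciPy sparse matrix are illustrative choices of mine, not something taken from the question.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def merge_npy_sparse(npy_dir):
    """Load every .npy file in npy_dir and stack the matrices once, as CSR.

    Hypothetical helper: assumes each file holds one pickled sparse matrix
    wrapped in a 0-d object array, which is what np.save produces when it
    is given a sparse matrix directly.
    """
    names = sorted(f for f in os.listdir(npy_dir) if f.endswith('.npy'))
    # allow_pickle=True is required to load object arrays; only use it on
    # files you trust. .item() unwraps the 0-d array into the matrix.
    parts = [np.load(os.path.join(npy_dir, f), allow_pickle=True).item()
             for f in names]
    # A single vstack call, with the output format requested explicitly.
    return sparse.vstack(parts, format='csr')


with tempfile.TemporaryDirectory() as tmp:
    # Write three small CSR matrices with 2, 3 and 4 rows and 5 columns.
    for i, rows in enumerate((2, 3, 4)):
        m = sparse.csr_matrix(np.eye(rows, 5))
        np.save(os.path.join(tmp, 'part{0}.npy'.format(i)), m)
    merged = merge_npy_sparse(tmp)

print(merged.shape, merged.format)
```

Sorting the file names only makes the row order of the merged matrix reproducible; drop it if the order does not matter to you.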
