+2 votes
in Programming Languages by (7.8k points)

I have a very large CSR matrix that I want to save to disk so that I can load it later for processing. I used numpy.save to save the matrix, but it raised the error below. How can I save a large CSR matrix?

 OverflowError: cannot serialize a string larger than 4GiB

I also tried allow_pickle=False, since numpy.save uses pickle protocol 2, but that raised a different error:

ValueError: Object arrays cannot be saved when allow_pickle=False

1 Answer

0 votes
by (11k points)

numpy.save has to pickle a CSR matrix, because it first wraps the matrix in an object array, so you need to keep the default allow_pickle=True. This is a real limitation of numpy.save: pickle protocols below 4 cannot serialize a single object larger than 4 GiB, and numpy.save has no option to choose the protocol. Protocol 4, which is pickle.HIGHEST_PROTOCOL on Python 3.4 to 3.7 (Python 3.8 and later raise it to 5), supports larger objects, so it should let you save a larger CSR matrix. h5py can handle very large data, but it has no built-in support for CSR matrices. So I would recommend using pickle.dump() to save your CSR matrix and pickle.load() to load the saved file. Note that Python 2 only understands pickle protocols up to 2, so a file written by Python 3 with protocol 3 or higher cannot be loaded in Python 2. The example below shows how to use pickle.

>>> X
<7x5 sparse matrix of type '<class 'numpy.int8'>'
        with 13 stored elements in Compressed Sparse Row format>
>>> import pickle

To save
>>> with open('save_csr_data.dat', 'wb') as outfile:
...     pickle.dump(X, outfile, pickle.HIGHEST_PROTOCOL)
...

To read
>>> with open('save_csr_data.dat', 'rb') as infile:
...     x1 = pickle.load(infile)
...
>>> x1
<7x5 sparse matrix of type '<class 'numpy.int8'>'
        with 13 stored elements in Compressed Sparse Row format>
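
As an alternative to pickle, SciPy 0.19 and later provides scipy.sparse.save_npz() and scipy.sparse.load_npz(), which write the matrix's underlying arrays to a .npz archive without pickling. The following is only a minimal sketch: the small int8 matrix and the file name are placeholders, and it has not been checked here against a multi-gigabyte matrix like the one in the question.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for the large matrix (values are arbitrary placeholders).
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 6]], dtype=np.int8)
X = sparse.csr_matrix(dense)

# Placeholder file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'save_csr_data.npz')

# Writes the data, indices, indptr, and shape arrays to a .npz archive.
sparse.save_npz(path, X)

# Rebuilds a sparse matrix in the same format from the archive.
x1 = sparse.load_npz(path)

# True when no element differs between the original and the reloaded matrix.
same = (X != x1).nnz == 0
```

Because the file holds plain NumPy arrays instead of a pickle stream, it does not depend on the pickle protocol at all.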
 

...