# Slicing a Compressed Sparse Row (CSR) matrix in SciPy: how to select some rows from a CSR matrix?

I want to select a subset of rows from a CSR matrix to run XGBoost on, because the matrix contains millions of rows. How can I do that?


SciPy CSR matrices support NumPy-style indexing, so selecting rows is straightforward: define a list of the row numbers you want and index the matrix with that list. Look at the following example:

x1 is the original CSR matrix, from which I selected rows 0, 2, and 4. x2 is the new CSR matrix containing only those 3 rows, which are renumbered 0, 1, and 2 in the result.

>>> print(x1)
(0, 0)        1
(0, 2)        1
(1, 2)        1
(2, 0)        1
(2, 1)        1
(2, 2)        1
(3, 1)        1
(4, 0)        1
(4, 2)        1
>>> x1.toarray()
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1],
       [0, 1, 0],
       [1, 0, 1]])
>>> idx=[0,2,4]
>>> x2=x1[idx,:]

>>> print(x2)
(0, 2)        1
(0, 0)        1
(1, 2)        1
(1, 1)        1
(1, 0)        1
(2, 2)        1
(2, 0)        1
>>> x2.toarray()
array([[1, 0, 1],
       [1, 1, 1],
       [1, 0, 1]], dtype=int64)
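
For anyone who wants to reproduce the session above, here is a minimal self-contained sketch. It rebuilds x1 from the dense array shown in the output, which is an assumption about how the original matrix was created. The boolean-mask variant and the random subsample (with a made-up sample size of 3 and a fixed seed) are additions for illustration and were not part of the session above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rebuild the 5x3 example matrix from the dense array printed above
# (assumption: the original x1 held exactly these values).
dense = np.array([[1, 0, 1],
                  [0, 0, 1],
                  [1, 1, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
x1 = csr_matrix(dense)

# Select rows by position; the result is a new CSR matrix
# whose rows are renumbered starting from 0.
idx = [0, 2, 4]
x2 = x1[idx, :]

# A boolean mask with one entry per row selects the same rows.
mask = np.zeros(x1.shape[0], dtype=bool)
mask[idx] = True
x3 = x1[mask, :]

# Draw a random subset of rows without replacement
# (sample size 3 and seed 0 are illustrative values only).
rng = np.random.default_rng(0)
sample_idx = np.sort(rng.choice(x1.shape[0], size=3, replace=False))
x_sample = x1[sample_idx, :]

print(x2.shape)  # (3, 3)
print(x2.toarray())
```

If you have a label vector for training, index it with the same idx (or the same mask) so that rows and labels stay aligned.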