I want to select a subset of rows from a CSR matrix to run XGBoost on, because the matrix contains millions of rows. How can I do that?

1 Answer


SciPy CSR matrices support NumPy-style indexing, which gives a concise way to select a subset of rows. Define a list of the row numbers you want and index the matrix with that list. Look at the following example:

x1 is the original CSR matrix, from which I select rows 0, 2, and 4. x2 is the new CSR matrix containing only those 3 rows, renumbered 0 to 2.

>>> print(x1)
  (0, 0)        1
  (0, 2)        1
  (1, 2)        1
  (2, 0)        1
  (2, 1)        1
  (2, 2)        1
  (3, 1)        1
  (4, 0)        1
  (4, 2)        1
>>> x1.toarray()
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1],
       [0, 1, 0],
       [1, 0, 1]])
>>> idx = [0, 2, 4]
>>> x2 = x1[idx, :]

>>> print(x2)
  (0, 2)        1
  (0, 0)        1
  (1, 2)        1
  (1, 1)        1
  (1, 0)        1
  (2, 2)        1
  (2, 0)        1
>>> x2.toarray()
array([[1, 0, 1],
       [1, 1, 1],
       [1, 0, 1]], dtype=int64)
>>>
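
The session above does not show how x1 was built, so here is a minimal self-contained sketch that reproduces it and also draws a random subset of rows, which may be closer to what you need before training. The label array `y1`, the mask, the sample size, and the seed are assumptions made up for illustration; they are not part of the original example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rebuild the same 5x3 matrix that the session above calls x1.
dense = np.array([[1, 0, 1],
                  [0, 0, 1],
                  [1, 1, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
x1 = csr_matrix(dense)

# Hypothetical labels, one per row (assumption, for illustration only).
y1 = np.array([0, 1, 0, 1, 1])

# 1) Select rows with an explicit list of row numbers, as in the answer.
idx = [0, 2, 4]
x2 = x1[idx, :]

# 2) Select rows from a boolean mask by converting it to row numbers first.
mask = np.array([True, False, True, False, True])
x3 = x1[np.flatnonzero(mask), :]

# 3) Draw a random subset of rows without replacement.
#    Sorting keeps the sampled rows in their original order.
rng = np.random.default_rng(0)
sample_idx = np.sort(rng.choice(x1.shape[0], size=3, replace=False))
x_sample = x1[sample_idx, :]

# Index the labels with the same row numbers so they stay aligned.
y_sample = y1[sample_idx]

print(x2.toarray())
print(x_sample.shape, y_sample.shape)
```

The selected matrix is still a CSR matrix, so it is never densified along the way. XGBoost accepts SciPy sparse matrices as input, so `x_sample` and `y_sample` could then be passed on for training.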

...