Support for sparse matrices in scikits.learn
I recently added support for sparse matrices (as defined in scipy.sparse) to some classifiers in scikits.learn. In those classes, the fit method runs the algorithm without converting the input to a dense representation and also stores the learned parameters in an efficient sparse format. Right now, the only classes that implement this are SVC and LinearSVC in scikits.learn.svm.sparse, although the plan is to add more in the future. Both accept sparse matrices in fit() and store their support vectors as sparse matrices. Here is an example. We first import the relevant modules and create a toy dataset:
[cc lang="python"]
In [1]: import scipy.sparse

In [2]: from scikits.learn import svm

In [3]: X, Y = scipy.sparse.csr_matrix([[0, 0], [0, 1]]), [0, 1]

In [4]: clf = svm.sparse.SVC(kernel='linear')
[/cc]
Now we fit the model and query some of its parameters:
[cc lang="python"]
In [5]: clf.fit(X, Y)
Out[5]: SVC(kernel='linear', C=1.0, probability=0, shrinking=1, eps=0.001, cache_size=100.0, coef0=0.0, gamma=0.0)

In [6]: clf.support_
Out[6]: <2x2 sparse matrix of type '' with 1 stored elements in Compressed Sparse Row format>

In [7]: clf.coef_
Out[7]: <1x2 sparse matrix of type '' with 1 stored elements in Compressed Sparse Row format>
[/cc]
For a more complete example, see "Classification of text documents using sparse features", contributed by Olivier Grisel.
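
To give a rough idea of what "Compressed Sparse Row format" means in the output above, here is a minimal pure-Python sketch of a CSR layout and of a matrix-vector product that only touches the stored entries. This is an illustration, not the code used by scipy.sparse or scikits.learn: the helper names `to_csr` and `csr_dot` and the weight vector are made up for the example.

```python
# Illustrative only: a hand-rolled CSR layout, not scipy or scikits.learn code.

def to_csr(dense_rows):
    """Convert a list of dense rows into the three CSR arrays.

    data    -- the non-zero values, row by row
    indices -- the column index of each stored value
    indptr  -- row i occupies data[indptr[i]:indptr[i + 1]]
    """
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


def csr_dot(csr, weights):
    """Multiply a CSR matrix by a dense vector, visiting only stored entries."""
    data, indices, indptr = csr
    result = []
    for i in range(len(indptr) - 1):
        start, end = indptr[i], indptr[i + 1]
        result.append(sum(data[k] * weights[indices[k]] for k in range(start, end)))
    return result


# Same toy matrix as in the post: one non-zero entry out of four.
X_csr = to_csr([[0, 0], [0, 1]])

# Hypothetical weight vector, chosen only to show the product.
scores = csr_dot(X_csr, [0.5, 2.0])
```

The cost of `csr_dot` grows with the number of stored entries rather than with the full size of the matrix, which is why working directly on the sparse representation pays off for data such as text features, where most entries are zero.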