Author: Fabian Pedregosa <fabian.pedregosa@inria.fr>

FOSDEM 2011, Data Analytics Devroom

Outline

What is scikits.learn?
Supervised, unsupervised learning
Model selection
Future directions

Introduction

scikits.learn is:

General-purpose Python package for machine learning

Easy to install: easy_install -U scikits.learn

Consistent API, well documented

Open source, BSD-licensed, community-driven project

Support Vector Machines

LibSVM on steroids

Efficient on both dense and sparse data: faster, with lower memory usage, on dense data.

Weights on classes and samples

Different flavors: SVC, NuSVC, SVR, NuSVR, OneClass

LibLinear for large-scale learning: LinearSVC

Different kernels: Linear, Gaussian, Polynomial and custom

Custom kernels:

>>> import numpy as np
>>> from scikits.learn import svm
>>> def my_kernel(x, y):
...     return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
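
A kernel callable like my_kernel receives two arrays of samples and returns the matrix of pairwise kernel values. The following is a minimal pure-Python sketch of that computation for the linear case, assuming samples are given as lists of rows; the name linear_kernel is hypothetical and not part of scikits.learn.

```python
def linear_kernel(X, Y):
    """Matrix of dot products between every row of X and every row of Y."""
    return [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Entry K[i][j] is the kernel value between sample i and sample j.
K = linear_kernel(X, X)
```

The result has one row per sample of the first argument and one column per sample of the second, which is the same shape np.dot(x, y.T) produces for two-dimensional arrays.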

Access to all parameters and to the indices of the support vectors.

Linear Models

Lasso and ElasticNet

Lasso and ElasticNet are linear models with sparse (L1 and L1 + L2) regularization, and have become widely used in domains such as document classification, image deblurring, neuroimaging and genomics.

Two implementations for Lasso: by coordinate descent and by LARS, both state-of-the-art.

LARS: gives the exact Lasso solution at the computational cost of a least-squares fit.

Coordinate descent: approximate method, extremely efficient in high-dimensional settings.
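
The idea behind coordinate descent for the Lasso can be sketched in a few lines of pure Python: each coefficient is updated in turn with a closed-form soft-thresholding step while the others are held fixed. This is an illustrative sketch, not the scikits.learn implementation; the objective scaling (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1 and the names soft_threshold and lasso_coordinate_descent are assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimise (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1 over w."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    for _ in range(n_sweeps):
        for j in range(n_features):
            rho = 0.0      # correlation of feature j with the partial residual
            norm_sq = 0.0  # mean squared value of feature j
            for i in range(n_samples):
                # Prediction that leaves out the contribution of feature j.
                partial = sum(X[i][k] * w[k] for k in range(n_features) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm_sq += X[i][j] ** 2
            rho /= n_samples
            norm_sq /= n_samples
            # Closed-form solution of the one-dimensional Lasso problem.
            w[j] = soft_threshold(rho, alpha) / norm_sq if norm_sq > 0 else 0.0
    return w


# Toy data: feature 0 explains y strongly, feature 1 only weakly.
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y = [2.0, 2.0, 0.1, 0.1]
w = lasso_coordinate_descent(X, y, alpha=0.1)
```

On this toy problem the weak feature's coefficient is set exactly to zero, which illustrates the sparsity that L1 regularization induces.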

Large-scale learning

Stochastic Gradient Descent

LogisticRegression and LinearSVC using LibLinear

Benchmarks on a 500,000-sample dataset

Classifier	train-time	test-time
SVM (libsvm bindings)	>20min
LinearSVC (liblinear bindings)	9.4471s	0.0184s
Stochastic Gradient Descent	0.2137s	0.0047s
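
Stochastic gradient descent scales well because each update touches a single sample. A minimal pure-Python sketch for a linear classifier trained on the hinge loss follows; it is not the scikits.learn implementation, it leaves out the regularization term for brevity, and the names sgd_hinge and predict are hypothetical.

```python
def sgd_hinge(X, y, learning_rate=0.1, n_epochs=10):
    """Fit weights w and bias b of a linear classifier by SGD on the hinge loss.

    Labels must be +1 or -1. No regularisation term, for brevity.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
            if margin < 1.0:
                # Subgradient step on max(0, 1 - margin) for this one sample.
                w = [w_j + learning_rate * y_i * x_j for w_j, x_j in zip(w, x_i)]
                b += learning_rate * y_i
    return w, b


def predict(w, b, x):
    """Return the predicted label, +1 or -1, for a single sample x."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Linearly separable toy data.
X = [[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -2.0]]
y = [1, 1, -1, -1]
w, b = sgd_hinge(X, y)
predictions = [predict(w, b, x) for x in X]
```

Samples that already satisfy the margin cost only one dot product per pass, which is part of why training time grows gently with the number of samples.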

Unsupervised learning

RandomizedPCA: a randomized (approximate) version of PCA with better asymptotic complexity.

Clustering, GMM, etc.

Model Selection

GridSearchCV: search for the optimal parameter value by cross-validation.

>>> import numpy as np
>>> from scikits.learn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> param = {'C': np.arange(0.1, 2, 0.1)}
>>> svr = svm.SVR()
>>> clf = grid_search.GridSearchCV(svr, param)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(n_jobs=1, fit_params={}, loss_func=None, iid=True,
       estimator=SVR(kernel='rbf', C=1.0, probability=False, ...

The search can also run in parallel (n_jobs)!

However, grid search is a naive, brute-force method that ignores all model-specific information. Some classes can therefore tune their own parameters automatically: LassoCV, ElasticNetCV, RidgeCV.
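
The exhaustive search performed by grid search with cross-validation can be sketched in pure Python. The example below is an illustration only and does not reproduce GridSearchCV: a one-dimensional ridge regression without intercept stands in for the estimator, folds are formed by taking every third sample, and the names fit_ridge_1d, cv_error and grid_search are hypothetical.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form slope of 1-D ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_error(xs, ys, alpha, n_folds=3):
    """Mean squared error on held-out folds, averaged over the folds."""
    total = 0.0
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at index `fold`, is held out.
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        slope = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        total += sum((ys[i] - slope * xs[i]) ** 2 for i in test) / len(test)
    return total / n_folds


def grid_search(xs, ys, alphas):
    """Return the candidate alpha with the lowest cross-validated error."""
    return min(alphas, key=lambda alpha: cv_error(xs, ys, alpha))


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # noise-free toy data, so no shrinkage is optimal
best_alpha = grid_search(xs, ys, alphas=[0.0, 0.1, 1.0, 10.0])
```

Every candidate is refit once per fold, so the cost grows with the size of the grid times the number of folds. The evaluations are independent of one another, which is what makes the search easy to parallelize.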

Statistics

A release every 2-3 months.

30 contributors (22 in the last release).

Shipped with: Ubuntu, Debian, MacPorts, NetBSD, Mandriva, Enthought Python Distribution. Also available via easy_install and as Windows binaries.

Future directions

Short term

Manifold learning.

Hierarchical clustering + agglomeration.

More variants of Lasso: fused Lasso, grouped Lasso, etc.

More parallel: SVMs.

Long term

Model Selection.

Online methods.

Dictionary learning.

Funding

http://scikit-learn.sourceforge.net

scikits.learn: machine learning in Python