scikits.learn: machine learning in Python

Author: Fabian Pedregosa

FOSDEM 2011, Data Analytics Devroom


  1. What is scikits.learn?
  2. Supervised, unsupervised learning
  3. Model selection
  4. Future directions


scikits.learn is:

  • General-purpose Python package for machine learning
  • Easy to install: easy_install -U scikits.learn
  • Consistent API, well documented
  • Open source, BSD-licensed, community-driven project

Support Vector Machines

LibSVM on steroids

Efficient on both dense and sparse data: faster, with lower memory usage, on dense data.


Weights on classes and samples

[Figures: weighted_samples.png, weights2.png]
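To make the effect of weights concrete, here is a minimal NumPy sketch of a soft-margin objective in which each sample's hinge loss is scaled by a weight. The function name weighted_hinge_objective and its arguments are hypothetical and chosen for illustration; this is not the library's implementation, only one standard way to write a weighted SVM objective.

```python
import numpy as np

def weighted_hinge_objective(w, b, X, y, C=1.0,
                             sample_weight=None, class_weight=None):
    """Soft-margin objective 0.5 * ||w||^2 + C * sum_i s_i * hinge_i.

    y holds labels in {-1, +1}; class_weight maps a label to a multiplier.
    (Hypothetical helper, for illustration only.)
    """
    weights = np.ones(X.shape[0])
    if sample_weight is not None:
        weights = weights * np.asarray(sample_weight, dtype=float)
    if class_weight is not None:
        weights = weights * np.array(
            [class_weight.get(int(label), 1.0) for label in y])
    # Hinge loss is zero for samples outside the margin.
    hinge = np.maximum(0.0, 1.0 - y * (X.dot(w) + b))
    return 0.5 * w.dot(w) + C * np.sum(weights * hinge)

# Toy data: the third sample lies on the wrong side of the boundary.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

plain = weighted_hinge_objective(w, b, X, y)
heavy_sample = weighted_hinge_objective(w, b, X, y, sample_weight=[1, 1, 10])
heavy_class = weighted_hinge_objective(w, b, X, y, class_weight={-1: 2.0})
```

Raising the weight of the misclassified sample (or of its whole class) makes its error more expensive, which pushes the fitted boundary toward classifying it correctly.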

Different flavors: SVC, NuSVC, SVR, NuSVR, OneClassSVM

LibLinear for large-scale learning: LinearSVC

Different kernels: Linear, Gaussian, Polynomial and custom

Custom kernels:

>>> import numpy as np
>>> from scikits.learn import svm
>>> def my_kernel(x, y):
...     # linear kernel: matrix of pairwise dot products
...     return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
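As a rough illustration of what such a kernel function computes, the sketch below evaluates a linear and a Gaussian kernel on a tiny array, using NumPy only. It assumes the kernel is called with two 2-D arrays of samples and returns the matrix of pairwise kernel values; gaussian_kernel and its gamma parameter are illustrative names, not part of the talk.

```python
import numpy as np

def linear_kernel(x, y):
    # Entry (i, j) is the dot product of x[i] and y[j].
    return np.dot(x, y.T)

def gaussian_kernel(x, y, gamma=0.5):
    # Squared distances via ||a||^2 + ||b||^2 - 2 * a.b
    sq_dist = ((x ** 2).sum(axis=1)[:, np.newaxis]
               + (y ** 2).sum(axis=1)[np.newaxis, :]
               - 2.0 * np.dot(x, y.T))
    # Clip tiny negative values caused by rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K_lin = linear_kernel(X, X)    # shape (3, 3)
K_rbf = gaussian_kernel(X, X)  # ones on the diagonal
```

Both results are symmetric matrices with one row and one column per sample, which is the shape a kernel-based learner needs at training time.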

Access to all parameters and to the indices of the support vectors.

Linear Models

Lasso and ElasticNet

Lasso and ElasticNet are linear models with sparse (L1 and L1 + L2) regularization, and have become widely used in domains such as document classification, image deblurring, neuroimaging and genomics.

Two implementations of Lasso, by coordinate descent and by LARS, both state of the art.

  • LARS: gives the exact Lasso solution at the cost of a least-squares fit.
  • Coordinate descent: approximate method, extremely efficient in high-dimensional settings.
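The coordinate descent idea can be sketched in a few lines of NumPy: cycle over the features and update one coefficient at a time with a soft-thresholding step. This is a minimal sketch under stated assumptions (objective scaled as (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1, a fixed number of sweeps, no intercept); the function names are hypothetical and this is not the library's optimized implementation.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrink value toward zero by threshold; exactly zero if |value| is small.
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq_norms = (X ** 2).sum(axis=0) / n_samples
    residual = y.astype(float).copy()  # equals y - X w for w = 0
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq_norms[j] == 0.0:
                continue
            # Residual with feature j's current contribution removed.
            residual += X[:, j] * w[j]
            rho = X[:, j].dot(residual) / n_samples
            w[j] = soft_threshold(rho, alpha) / col_sq_norms[j]
            residual -= X[:, j] * w[j]
    return w

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X.dot(true_w)

w_dense = lasso_coordinate_descent(X, y, alpha=0.0)   # plain least squares
w_sparse = lasso_coordinate_descent(X, y, alpha=0.1)  # shrunk coefficients
alpha_max = np.abs(X.T.dot(y)).max() / X.shape[0]
w_zero = lasso_coordinate_descent(X, y, alpha=1.1 * alpha_max)  # all zeros
```

Each update touches a single column of X, which is why the method stays cheap when the number of features is large.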

Large-scale learning

  • Stochastic Gradient Descent
  • LogisticRegression and LinearSVC using LibLinear
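To show why stochastic gradient descent scales well, here is a minimal NumPy sketch of a linear classifier trained one sample at a time on the L2-regularized hinge loss. The function name sgd_hinge, the learning-rate schedule and all default values are assumptions made for this example; it is not the library's SGD module.

```python
import numpy as np

def sgd_hinge(X, y, alpha=0.01, eta0=0.1, n_epochs=20, seed=0):
    """SGD on (alpha / 2) * ||w||^2 + mean hinge loss; y in {-1, +1}."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    w, b, t = np.zeros(n_features), 0.0, 0
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):
            t += 1
            eta = eta0 / (1.0 + alpha * eta0 * t)  # decaying step size
            margin = y[i] * (X[i].dot(w) + b)
            w *= 1.0 - eta * alpha                 # step on the L2 penalty
            if margin < 1.0:                       # sample violates the margin
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

# Two well-separated blobs as toy data.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2) * 0.3 + 2.0, rng.randn(50, 2) * 0.3 - 2.0])
y = np.hstack([np.ones(50), -np.ones(50)])

w, b = sgd_hinge(X, y)
accuracy = np.mean(np.sign(X.dot(w) + b) == y)
```

Every update costs time proportional to the number of features only, independent of the dataset size, so a pass over the data is linear in the number of samples.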

Benchmarks on a 500,000-sample dataset

Classifier                       Train time   Test time
SVM (libsvm bindings)            >20 min      -
LinearSVC (liblinear bindings)   9.4471 s     0.0184 s
Stochastic Gradient Descent      0.2137 s     0.0047 s

Unsupervised learning

RandomizedPCA, probabilistic version of PCA with better asymptotic properties.
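One common way to randomize PCA is to project the data onto a small random subspace, orthonormalize, and run an exact SVD on the much smaller projected matrix. The NumPy sketch below follows that scheme; whether it matches what RandomizedPCA does internally is an assumption, and the function name randomized_pca and the n_oversamples parameter are illustrative.

```python
import numpy as np

def randomized_pca(X, n_components, n_oversamples=5, seed=0):
    """Approximate the top principal components with a random projection."""
    rng = np.random.RandomState(seed)
    X_centered = X - X.mean(axis=0)
    n_features = X_centered.shape[1]
    # Random test matrix with a few extra columns for accuracy.
    omega = rng.randn(n_features, n_components + n_oversamples)
    # Orthonormal basis for the range of the projected data.
    Q, _ = np.linalg.qr(X_centered.dot(omega))
    # Exact SVD of the small projected matrix.
    B = Q.T.dot(X_centered)
    _, singular_values, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:n_components], singular_values[:n_components]

rng = np.random.RandomState(0)
X = rng.randn(200, 3).dot(rng.randn(3, 20))  # data of rank 3
components, singular_values = randomized_pca(X, n_components=3)
exact = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:3]
```

The expensive SVD is applied to a matrix with only n_components + n_oversamples rows, which is where the savings on large datasets come from.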


Clustering, GMM, etc.

Model Selection

GridSearchCV: searches for the optimal parameter values by cross-validation.

>>> import numpy as np
>>> from scikits.learn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> param = {'C': np.arange(0.1, 2, 0.1)}
>>> svr = svm.SVR()
>>> clf = grid_search.GridSearchCV(svr, param)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(n_jobs=1, fit_params={}, loss_func=None, iid=True,
       estimator=SVR(kernel='rbf', C=1.0, probability=False, ...

The search over the grid can run in parallel (n_jobs)!
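Conceptually, a grid search by cross-validation fits the model once per fold for every candidate value and keeps the value with the best average score. The NumPy sketch below does this for a closed-form ridge regression, used here only as a stand-in estimator; ridge_fit, cv_mse and grid_search are hypothetical helpers, not the GridSearchCV implementation.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: (X'X + alpha * I)^-1 X'y
    n_features = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + alpha * np.eye(n_features), X.T.dot(y))

def cv_mse(X, y, alpha, n_folds=3):
    """Mean squared error averaged over contiguous folds."""
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(X.shape[0], dtype=bool)
        train_mask[test_idx] = False
        w = ridge_fit(X[train_mask], y[train_mask], alpha)
        errors.append(np.mean((X[test_idx].dot(w) - y[test_idx]) ** 2))
    return float(np.mean(errors))

def grid_search(X, y, alphas, n_folds=3):
    # Each candidate is scored independently of the others.
    scores = {alpha: cv_mse(X, y, alpha, n_folds) for alpha in alphas}
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = X.dot(np.array([1.0, -2.0, 0.0, 0.5])) + 0.1 * rng.randn(60)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha, scores = grid_search(X, y, alphas)
```

Because the candidates are scored independently, the loop over the grid is easy to distribute across several processes.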

However, grid search is a brute-force method that ignores all model-specific information, so some classes are able to tune their parameters automatically: LassoCV, ElasticNetCV, RidgeCV.


  • A release every 2-3 months.
  • 30 contributors (22 in the last release).
  • Shipped with: Ubuntu, Debian, MacPorts, NetBSD, Mandriva, Enthought Python Distribution. Also available through easy_install and as Windows binaries.

Future directions

Short term

  • Manifold learning.
  • Hierarchical clustering + agglomeration.
  • More variants of Lasso: fused Lasso, grouped Lasso, etc.
  • More parallelism: SVMs.

Long term

  • Model Selection.
  • Online methods.
  • Dictionary learning.