scikits.learn: machine learning in Python


Outline

  1. What is scikits.learn?
  2. Supervised, unsupervised learning
  3. Model selection
  4. Future directions

Introduction

scikits.learn is a Python library for machine learning.

  • Easy to install: easy_install -U scikits.learn
  • Clean API, well documented
  • Domain-agnostic
  • Fast ...
  • Bring machine learning to the non-specialist

Must-have features:

  • Support Vector Machines
  • Generalized Linear Models
  • Model selection
  • All this in a consistent API (sketched below)
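
Concretely, the consistent API means every estimator is driven through the same few methods. A minimal sketch of that shared interface (standard scikits.learn calls; the iris data is just an example):

>>> from scikits.learn import svm, datasets
>>> iris = datasets.load_iris()
>>> clf = svm.SVC()                     # any estimator follows this pattern
>>> clf.fit(iris.data, iris.target)     # learn from training data
>>> clf.predict(iris.data[:2])          # predict for new samples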

Support Vector Machines

Problems with plain libsvm bindings: lack of transparency and call overhead.

[Figure: call overhead]
[Figure: SVM benchmark]

Killer Features

Different flavors: SVC, NuSVC, LinearSVC, SVR, NuSVR, OneClassSVM

LibSVM on steroids

Rebalance unbalanced classes with class weights

[Figure: unbalanced classes]
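
A minimal sketch of the idea; the class_weight argument below reflects current releases (older versions passed the weights to fit instead):

>>> clf = svm.SVC(class_weight={1: 10})  # errors on class 1 cost 10x more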

Different kernels: what is a kernel? Informally, a kernel is a function k(x, y) that measures the similarity of two samples by computing an inner product in some (possibly implicit) feature space.

Custom kernels:

>>> import numpy as np
>>> from scikits.learn import svm
>>> def my_kernel(x, y):
...     return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
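
The classifier with a custom kernel then trains and predicts like any other estimator; the toy data below is purely illustrative:

>>> X = np.array([[0., 0.], [1., 1.]])   # two toy training samples
>>> y = np.array([0, 1])                 # their class labels
>>> clf.fit(X, y)
>>> clf.predict(np.array([[2., 2.]]))    # classify a new point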

Weights on individual samples

[Figures: weighted samples]
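
A minimal sketch, assuming fit accepts a sample_weight array with one nonnegative weight per training sample (reusing the toy X and y from above):

>>> w = np.array([1., 5.])               # hypothetical per-sample weights
>>> clf = svm.SVC()
>>> clf.fit(X, y, sample_weight=w)       # the second sample counts 5x as much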

Access to all estimated parameters, and to the indices of the support vectors.

Efficient on both dense and sparse data: 16x less memory usage than the plain libsvm bindings on dense data.

Generalized Linear Models

Lasso

The Lasso (least absolute shrinkage and selection operator) finds a least-squares solution under the constraint that ‖β‖₁, the L1 norm of the coefficient vector, is no greater than a given value.
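
In symbols, a standard formulation (given here for reference, not taken from the slide) is the constrained problem and its equivalent penalized form:

\hat{\beta} = \operatorname*{arg\,min}_{\beta} \|y - X\beta\|_2^2
\quad \text{s.t.} \quad \|\beta\|_1 \le t
\;\Longleftrightarrow\;
\hat{\beta} = \operatorname*{arg\,min}_{\beta} \tfrac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1

for a value of \alpha that corresponds to t; this \alpha is the alpha parameter used in the code below.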

[Figure: Lasso regularization path]

Two implementations of the Lasso: coordinate descent and LARS.

Parsimony: controlled by the parameter alpha and by max_features:

>>> from scikits.learn import linear_model, datasets
>>> diabetes = datasets.load_diabetes()
>>> clf = linear_model.LassoLARS(alpha=0.)
>>> clf.fit(diabetes.data, diabetes.target, max_features=4)
>>> clf.coef_
array([   0.        ,    0.        ,  505.65955847,  191.26988358,
          0.        ,    0.        , -114.10097989,    0.        ,
          439.66494176,    0.        ])

The LARS version is 2x to 10x faster than the R version (lars). The coordinate descent version is on par with the R version (glmnet) in low dimensions, but still slower in high dimensions (work in progress).

And more

Unsupervised learning

Some unsupervised learning methods can also transform data; RandomizedPCA scales to huge datasets.
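
A minimal sketch of the transform idiom (the import path for RandomizedPCA is an assumption; it has moved between modules across versions):

>>> from scikits.learn.decomposition import RandomizedPCA  # path may vary by version
>>> pca = RandomizedPCA(n_components=2)   # PCA via a randomized, approximate SVD
>>> X_reduced = pca.fit(X).transform(X)   # X: any (n_samples, n_features) matrix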

Clustering, GMM, etc.

Model Selection

GridSearchCV: search for the optimal parameter value by cross-validation:

>>> import numpy as np
>>> from scikits.learn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> param = {'C': np.arange(0.1, 2, 0.1)}
>>> svr = svm.SVR()
>>> clf = grid_search.GridSearchCV(svr, param)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(n_jobs=1, fit_params={}, loss_func=None, iid=True,
       estimator=SVR(kernel='rbf', C=1.0, probability=False, ...

And it runs in parallel (set n_jobs > 1)!
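
Once fitted, the grid-search object exposes the winning model. A hedged sketch; the attribute is named best_estimator in older releases and best_estimator_ in later ones:

>>> clf.best_estimator      # the SVR refit with the best C
>>> clf.best_estimator.C    # the selected value of C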

Pipeline

>>> from scikits.learn.feature_selection import SelectKBest, f_regression
>>> from scikits.learn.pipeline import Pipeline
>>> anova_filter = SelectKBest(f_regression, k=5)   # keep the 5 best features
>>> clf = svm.SVC(kernel='linear')
>>> anova_svm = Pipeline([('anova', anova_filter), ('svm', clf)])
>>> anova_svm.fit(X, y)                             # X, y: a labeled training set
>>> anova_svm.predict(X)

However, this method is brute force: it ignores all model-specific information.

To avoid framework-heavy code, some classes can tune their own parameters automatically: LassoCV, ElasticNetCV, RidgeCV.
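
A minimal sketch with LassoCV, reusing the diabetes data loaded earlier:

>>> clf = linear_model.LassoCV()             # alpha is tuned internally by cross-validation
>>> clf.fit(diabetes.data, diabetes.target)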

Statistics

  • A release every 2-3 months.
  • 30 contributors (22 in the last release).
  • Shipped with: Ubuntu, Debian, Macports, NetBSD, Mandriva, Enthought Python Distribution. Also available via easy_install and as Windows binaries.

Future directions

Short term

  • Manifold learning.
  • Hierarchical clustering + agglomeration.
  • More variants of Lasso: fused Lasso, grouped Lasso, etc.
  • More parallelism: SVMs.

Long term

  • Model Selection.
  • Online methods.
  • Dictionary learning.