scikits.learn: machine learning in Python


Outline

  1. What is scikits.learn?
  2. Supervised, unsupervised learning
  3. Model selection
  4. Future directions

Introduction

scikits.learn is a Python library for machine learning.

  • Easy to install: easy_install -U scikits.learn
  • Clean API, well documented
  • Domain-agnostic
  • Fast ...
  • Bring machine learning to the non-specialist

Must-have features:

  • Support Vector Machines
  • Generalized Linear Models
  • Model selection
  • All this in a consistent API (sketched below)
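
Concretely, the consistent API means every estimator is driven through the same few methods. A minimal sketch of that shared interface (standard scikits.learn calls; the iris data is just an example):

>>> from scikits.learn import svm, datasets
>>> iris = datasets.load_iris()
>>> clf = svm.SVC()                     # any estimator follows this pattern
>>> clf.fit(iris.data, iris.target)     # learn from training data
>>> clf.predict(iris.data[:2])          # predict for new samples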

Support Vector Machines

Problems with plain libsvm bindings: lack of transparency and call overhead.

[Figure: call overhead]
[Figure: SVM benchmark]

Killer Features

Different flavors: SVC, NuSVC, LinearSVC, SVR, NuSVR, OneClassSVM

LibSVM on steroids

Rebalance unbalanced classes with class weights

[Figure: unbalanced classes]
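
A minimal sketch of the idea; the class_weight argument below reflects current releases (older versions passed the weights to fit instead):

>>> clf = svm.SVC(class_weight={1: 10})  # errors on class 1 cost 10x more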

Different kernels: what is a kernel? Informally, a kernel is a function k(x, y) that measures the similarity of two samples by computing an inner product in some (possibly implicit) feature space.

Custom kernels:

>>> import numpy as np
>>> from scikits.learn import svm
>>> def my_kernel(x, y):
...     return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
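
The classifier with a custom kernel then trains and predicts like any other estimator; the toy data below is purely illustrative:

>>> X = np.array([[0., 0.], [1., 1.]])   # two toy training samples
>>> y = np.array([0, 1])                 # their class labels
>>> clf.fit(X, y)
>>> clf.predict(np.array([[2., 2.]]))    # classify a new point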

Weights on individual samples

[Figures: weighted samples]
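
A minimal sketch, assuming fit accepts a sample_weight array with one nonnegative weight per training sample (reusing the toy X and y from above):

>>> w = np.array([1., 5.])               # hypothetical per-sample weights
>>> clf = svm.SVC()
>>> clf.fit(X, y, sample_weight=w)       # the second sample counts 5x as much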

Access to all estimated parameters, and to the indices of the support vectors.

Efficient on both dense and sparse data: 16x less memory usage than the plain libsvm bindings on dense data.

Generalized Linear Models

Lasso

The Lasso (least absolute shrinkage and selection operator) finds a least-squares solution under the constraint that ‖β‖₁, the L1 norm of the coefficient vector, is no greater than a given value.
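
In symbols, a standard formulation (given here for reference, not taken from the slide) is the constrained problem and its equivalent penalized form:

\hat{\beta} = \operatorname*{arg\,min}_{\beta} \|y - X\beta\|_2^2
\quad \text{s.t.} \quad \|\beta\|_1 \le t
\;\Longleftrightarrow\;
\hat{\beta} = \operatorname*{arg\,min}_{\beta} \tfrac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1

for a value of \alpha that corresponds to t; this \alpha is the alpha parameter used in the code below.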

[Figure: Lasso regularization path]

Two implementations of the Lasso: coordinate descent and LARS.

Parsimony: controlled by the parameter alpha and by max_features:

>>> from scikits.learn import linear_model, datasets
>>> diabetes = datasets.load_diabetes()
>>> clf = linear_model.LassoLARS(alpha=0.)
>>> clf.fit(diabetes.data, diabetes.target, max_features=4)
>>> clf.coef_
array([   0.        ,    0.        ,  505.65955847,  191.26988358,
          0.        ,    0.        , -114.10097989,    0.        ,
          439.66494176,    0.        ])

The LARS version is 2x to 10x faster than the R version (lars). The coordinate descent version is on par with the R version (glmnet) in low dimensions, but still slower in high dimensions (work in progress).

And more

Unsupervised learning

Some unsupervised learning methods can also transform data; RandomizedPCA scales to huge datasets.
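
A minimal sketch of the transform idiom (the import path for RandomizedPCA is an assumption; it has moved between modules across versions):

>>> from scikits.learn.decomposition import RandomizedPCA  # path may vary by version
>>> pca = RandomizedPCA(n_components=2)   # PCA via a randomized, approximate SVD
>>> X_reduced = pca.fit(X).transform(X)   # X: any (n_samples, n_features) matrix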

Clustering, GMM, etc.

Model Selection

GridSearchCV: search for the optimal parameter value by cross-validation:

>>> import numpy as np
>>> from scikits.learn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> param = {'C': np.arange(0.1, 2, 0.1)}
>>> svr = svm.SVR()
>>> clf = grid_search.GridSearchCV(svr, param)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(n_jobs=1, fit_params={}, loss_func=None, iid=True,
       estimator=SVR(kernel='rbf', C=1.0, probability=False, ...

And it runs in parallel (set n_jobs > 1)!
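
Once fitted, the grid-search object exposes the winning model. A hedged sketch; the attribute is named best_estimator in older releases and best_estimator_ in later ones:

>>> clf.best_estimator      # the SVR refit with the best C
>>> clf.best_estimator.C    # the selected value of C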

Pipeline

>>> from scikits.learn.feature_selection import SelectKBest, f_regression
>>> from scikits.learn.pipeline import Pipeline
>>> anova_filter = SelectKBest(f_regression, k=5)   # keep the 5 best features
>>> clf = svm.SVC(kernel='linear')
>>> anova_svm = Pipeline([('anova', anova_filter), ('svm', clf)])
>>> anova_svm.fit(X, y)                             # X, y: a labeled training set
>>> anova_svm.predict(X)

However, this method is brute force: it ignores all model-specific information.

To avoid framework-heavy code, some classes can tune their own parameters automatically: LassoCV, ElasticNetCV, RidgeCV.
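
A minimal sketch with LassoCV, reusing the diabetes data loaded earlier:

>>> clf = linear_model.LassoCV()             # alpha is tuned internally by cross-validation
>>> clf.fit(diabetes.data, diabetes.target)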

Statistics

  • A release every 2-3 months.
  • 30 contributors (22 in the last release).
  • Shipped with: Ubuntu, Debian, Macports, NetBSD, Mandriva, Enthought Python Distribution. Also available via easy_install and as Windows binaries.

Future directions

Short term

  • Manifold learning.
  • Hierarchical clustering + agglomeration.
  • More variants of Lasso: fused Lasso, grouped Lasso, etc.
  • More parallelism: SVMs.

Long term

  • Model Selection.
  • Online methods.
  • Dictionary learning.