Outline
- What is scikits.learn?
- Supervised, unsupervised learning
- Model selection
- Future directions
Introduction¶
scikits.learn is a Python library for machine learning.
- Easy to install: easy_install -U scikits.learn
- Clean API, well documented
- Domain-agnostic
- Fast ...
- Bring machine learning to the non-specialist
Must-have features:
- Support Vector Machines
- Generalized Linear Models
- Model selection
- All this in a consistent API
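Concretely, every estimator follows the same construct/fit/predict pattern; a minimal sketch on the iris data:
>>> from scikits.learn import svm, datasets
>>> iris = datasets.load_iris()
>>> clf = svm.SVC()                        # construct with hyper-parameters
>>> clf = clf.fit(iris.data, iris.target)  # learn from data
>>> pred = clf.predict(iris.data)          # predict labels for samples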
Support Vector Machines¶
Problems with the stock libsvm bindings: lack of transparency and overhead.


Killer Features¶
Different flavors: SVC, NuSVC, LinearSVC, SVR, NuSVR, OneClass
LibSVM on steroids
Compensate for unbalanced classes via per-class weights, as sketched below.
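A minimal sketch; class_weight maps a label to its weight (assumed here as a constructor parameter; some early releases passed it to fit instead):
>>> import numpy as np
>>> from scikits.learn import svm
>>> X = np.r_[np.random.randn(90, 2), np.random.randn(10, 2) + 2]
>>> y = np.r_[np.zeros(90), np.ones(10)]    # 90 vs. 10 samples: unbalanced
>>> clf = svm.SVC(class_weight={1: 9.})     # up-weight the rare class
>>> clf = clf.fit(X, y)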

Different kernels: a kernel k(x, x′) is a similarity function between samples; built-in choices include linear, polynomial, RBF and sigmoid.
Custom kernels:
>>> import numpy as np
>>> from scikits.learn import svm
>>> def my_kernel(x, y):
...     return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
Weights on individual samples
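A sketch, assuming fit accepts a per-sample sample_weight array (reusing X, y from the sketch above):
>>> weights = np.ones(len(y))
>>> weights[:10] *= 5.                      # trust the first ten samples more
>>> clf = svm.SVC().fit(X, y, sample_weight=weights)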


Access to all estimated parameters, and to the indices of the support vectors, as shown below.
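For example (attribute names assume the trailing-underscore convention for estimated parameters; reusing X, y from above):
>>> clf = svm.SVC(kernel='linear').fit(X, y)
>>> clf.support_            # indices of the support vectors in X
>>> clf.support_vectors_    # the support vectors themselves
>>> clf.dual_coef_          # their dual coefficients
>>> clf.intercept_          # bias term of the decision function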
Efficient on both dense and sparse data: 16x less memory usage than the stock libsvm bindings on dense data.
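A sketch of the sparse path, assuming the dedicated svm.sparse estimators shipped in early releases:
>>> from scipy import sparse
>>> X_sp = sparse.csr_matrix(X)    # same data in a scipy.sparse container
>>> clf = svm.sparse.SVC()         # sparse counterpart of svm.SVC (assumed location)
>>> clf = clf.fit(X_sp, y)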
Generalized Linear Models¶
Lasso¶
The LASSO (least absolute shrinkage and selection operator) finds a least-squares solution under the constraint that the L1 norm of the coefficient vector, ‖β‖₁, is no greater than a given value.
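In symbols, with design matrix X, target vector y and bound t:

    \hat{\beta} = \underset{\beta}{\arg\min} \; \| y - X\beta \|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \le t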

Two implementations of Lasso: coordinate descent and LARS.
Parsimony is controlled by the alpha parameter and by max_features:
>>> from scikits.learn import linear_model, datasets
>>> diabetes = datasets.load_diabetes()
>>> clf = linear_model.LassoLARS(alpha=0.)
>>> clf.fit(diabetes.data, diabetes.target, max_features=4)
>>> clf.coef_
array([   0.        ,    0.        ,  505.65955847,  191.26988358,
          0.        ,    0.        , -114.10097989,    0.        ,
        439.66494176,    0.        ])
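For comparison, the coordinate descent variant follows the same estimator pattern (the alpha value here is arbitrary):
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf = clf.fit(diabetes.data, diabetes.target)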
The LARS version is 2x to 10x faster than the R version (lars). The coordinate descent version matches the R version (glmnet) in low dimension, but is still slower in high dimension (work in progress).
And more
Unsupervised learning¶
Some unsupervised learning methods can also transform data; RandomizedPCA scales to huge datasets.
Clustering, GMM, etc.
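A sketch of the transformer API with RandomizedPCA (the module path is an assumption; it moved between early releases):
>>> import numpy as np
>>> from scikits.learn.decomposition import RandomizedPCA
>>> X = np.random.randn(1000, 50)
>>> pca = RandomizedPCA(n_components=2)
>>> X_2d = pca.fit(X).transform(X)   # fit, then project onto 2 components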
Model Selection¶
GridSearchCV: search for the optimal parameter value by cross-validation:
>>> import numpy as np
>>> from scikits.learn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> param = {'C': np.arange(0.1, 2, 0.1)}
>>> svr = svm.SVR()
>>> clf = grid_search.GridSearchCV(svr, param)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(n_jobs=1, fit_params={}, loss_func=None, iid=True,
estimator=SVR(kernel='rbf', C=1.0, probability=False, ...
in parallel!
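Concretely, via the n_jobs parameter visible in the repr above (reusing svr, param and iris):
>>> clf = grid_search.GridSearchCV(svr, param, n_jobs=2)   # two worker processes
>>> clf = clf.fit(iris.data, iris.target)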
Pipeline: chain feature selection (or any transformer) with an estimator:
>>> from scikits.learn import svm, datasets
>>> from scikits.learn.pipeline import Pipeline
>>> from scikits.learn.feature_selection import SelectKBest, f_regression
>>> digits = datasets.load_digits()
>>> X, y = digits.data, digits.target
>>> anova_filter = SelectKBest(f_regression, k=5)
>>> clf = svm.SVC(kernel='linear')
>>> anova_svm = Pipeline([('anova', anova_filter), ('svm', clf)])
>>> anova_svm.fit(X, y)
>>> anova_svm.predict(X)
However, grid search is brute force: it ignores all model-specific information.
To avoid such framework-heavy code, some estimators can tune their own parameters automatically: LassoCV, ElasticNetCV, RidgeCV.
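A sketch with LassoCV; the alpha_ attribute name for the selected regularization strength is an assumption (naming varied across early releases):
>>> from scikits.learn import linear_model, datasets
>>> diabetes = datasets.load_diabetes()
>>> clf = linear_model.LassoCV().fit(diabetes.data, diabetes.target)
>>> clf.alpha_    # value chosen by cross-validation (assumed attribute name)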
Statistics¶
- A release every 2-3 months.
- 30 contributors (22 in the last release).
- Shipped with: Ubuntu, Debian, Macports, NetBSD, Mandriva, Enthought Python Distribution. Also available via easy_install and as Windows binaries.