Author: Fabian Pedregosa <fabian.pedregosa@inria.fr>

FOSDEM 2011, Data Analytics Devroom

Outline

- What is scikits.learn?
- Supervised, unsupervised learning
- Model selection
- Future directions

# Introduction

scikits.learn is:

- General-purpose Python package for machine learning
- Easy to install: `easy_install -U scikits.learn`
- Consistent API, well documented
- Open source, BSD-licensed, community-driven project

## Support Vector Machines

### LibSVM on steroids

**Efficient on both dense and sparse data**:
Faster and uses less memory on dense data.

**Weights on classes and samples**

**Different flavors**:
`SVC`, `NuSVC`, `SVR`, `NuSVR`, `OneClass`

**LibLinear for large-scale learning**: `LinearSVC`

**Different kernels: Linear, Gaussian, Polynomial and custom**

**Custom kernels**:

```
>>> import numpy as np
>>> from scikits.learn import svm
>>> def my_kernel(x, y):
...     return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
```
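As a rough illustration of what a custom kernel is expected to return, the sketch below computes a Gaussian kernel in plain Python. It assumes, as `my_kernel` above suggests, that the callable receives two collections of samples and returns the matrix of pairwise kernel values. The names `rbf_kernel` and `gamma` are hypothetical and not part of scikits.learn; to be passed to `svm.SVC`, such a function would presumably need to operate on NumPy arrays rather than lists.

```python
import math


def rbf_kernel(X, Y, gamma=0.5):
    """Return the Gram matrix K with K[i][j] = exp(-gamma * ||X[i] - Y[j]||^2).

    Hypothetical helper for illustration only (not a scikits.learn function).
    """
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in Y]
        for x in X
    ]


# Three toy samples with two features each.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

# Gram matrix of the samples against themselves: one row and one column per sample.
K = rbf_kernel(X, X)
```

The result has one row per sample in the first argument and one column per sample in the second, which is the same shape that `np.dot(x, y.T)` produces for the linear kernel.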

**Access to all parameters**, and to the indices of the support vectors.

## Linear Models

### Lasso and ElasticNet

Lasso and ElasticNet are linear models with sparse (L1 and L1 + L2) regularization, and have become widely used in domains such as document classification, image deblurring, neuroimaging and genomics.

Two implementations for Lasso: by coordinate descent and by LARS, both state-of-the-art.

- LARS: gives the exact Lasso solution at the cost of a least-squares fit.
- Coordinate descent: approximate method, extremely efficient in high-dimensional settings.
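To make the coordinate descent idea concrete, here is a minimal plain-Python sketch. It assumes the objective `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1` and updates one coefficient at a time with a soft-thresholding step. The function names are hypothetical, and this is not the scikits.learn implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Illustrative Lasso solver (hypothetical helper, not a library function)."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    residual = list(y)  # equals y - Xw, since w starts at zero
    for _ in range(n_iter):
        for j in range(n_features):
            column = [row[j] for row in X]
            norm_sq = sum(v * v for v in column) / n_samples
            if norm_sq == 0.0:
                continue  # an all-zero feature cannot contribute
            # Correlation of feature j with the residual, with feature j's
            # own current contribution added back in.
            rho = sum(v * (r + v * w[j]) for v, r in zip(column, residual)) / n_samples
            new_wj = soft_threshold(rho, alpha) / norm_sq
            delta = new_wj - w[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                residual = [r - v * delta for r, v in zip(residual, column)]
                w[j] = new_wj
    return w


# Toy data: y depends strongly on the first feature and only weakly on the second.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.2, 1.8, -1.8, -2.2]
w = lasso_coordinate_descent(X, y, alpha=0.5)
```

On this toy problem the weak second coefficient is set exactly to zero while the first is shrunk, which is the sparsity effect of the L1 penalty.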

### Large-scale learning

- Stochastic Gradient Descent
- LogisticRegression and LinearSVC using LibLinear

Benchmarks on a 500,000-sample dataset:

Classifier | train-time | test-time |
---|---|---|
SVM (libsvm bindings) | >20min | |
LinearSVC (liblinear bindings) | 9.4471s | 0.0184s |
Stochastic Gradient Descent | 0.2137s | 0.0047s |
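The sketch below hints at why stochastic gradient descent trains so quickly: each step looks at a single sample. It is a plain-Python illustration that assumes an L2-regularized hinge loss (a linear SVM objective) and a constant learning rate. All names and default values are hypothetical, and this is not the scikits.learn implementation.

```python
import random


def sgd_hinge(X, y, n_epochs=20, learning_rate=0.1, alpha=0.01, seed=0):
    """Illustrative SGD for a linear classifier with hinge loss (labels in {-1, +1})."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Gradient step on the L2 penalty: shrink the weights slightly.
            w = [wj * (1.0 - learning_rate * alpha) for wj in w]
            if margin < 1.0:
                # The sample violates the margin: step along the hinge subgradient.
                w = [wj + learning_rate * y[i] * xj for wj, xj in zip(w, X[i])]
                b += learning_rate * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Linearly separable toy data.
X = [[2.0, 1.0], [3.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-3.0, -2.0], [-2.5, -1.5]]
y = [1, 1, 1, -1, -1, -1]
w, b = sgd_hinge(X, y)
predictions = [predict(w, b, x) for x in X]
```

The cost of one update is proportional to the number of features, not to the number of samples, so a pass over the data stays cheap even for very large datasets.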

## Unsupervised learning

**RandomizedPCA**: a randomized version of PCA with better asymptotic complexity.

**Clustering, GMM, etc.**

## Model Selection

**GridSearchCV**: searches for the optimal parameter value by cross-validation.

```
>>> import numpy as np
>>> from scikits.learn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> param = {'C': np.arange(0.1, 2, 0.1)}
>>> svr = svm.SVR()
>>> clf = grid_search.GridSearchCV(svr, param)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(n_jobs=1, fit_params={}, loss_func=None, iid=True,
estimator=SVR(kernel='rbf', C=1.0, probability=False, ...
```

And it can run in parallel!

However, this method is brute-force and ignores all model-specific information. Some classes are therefore able to tune their parameters automatically: LassoCV, ElasticNetCV, RidgeCV.
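What an exhaustive search with cross-validation does can be sketched in a few lines of plain Python. The example assumes a one-dimensional ridge regression without intercept as the model and mean squared error as the score. Every function name here is hypothetical; this shows the idea only, not the `GridSearchCV` implementation.

```python
def ridge_fit_1d(xs, ys, alpha):
    """Closed-form 1-D ridge regression without intercept: w = <x, y> / (<x, x> + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mean_squared_error(w, xs, ys):
    """Average squared prediction error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; fold k holds out every n_folds-th sample."""
    for k in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == k]
        train = [i for i in range(n_samples) if i % n_folds != k]
        yield train, test


def grid_search_cv(xs, ys, alphas, n_folds=3):
    """Try every alpha, score it by cross-validation, and return the best one."""
    scores = {}
    for alpha in alphas:
        fold_errors = []
        for train, test in k_fold_indices(len(xs), n_folds):
            # Fit on the training folds only, then score on the held-out fold.
            w = ridge_fit_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
            fold_errors.append(
                mean_squared_error(w, [xs[i] for i in test], [ys[i] for i in test])
            )
        scores[alpha] = sum(fold_errors) / n_folds
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores


# Noise-free toy data following y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_alpha, scores = grid_search_cv(xs, ys, alphas=[0.01, 1.0, 100.0])
```

Every candidate value costs one model fit per fold, and each fit is independent of the others. That is what makes the search easy to parallelize, and also what makes it expensive compared with estimators that exploit the structure of their own model.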

## Statistics

- A release every 2-3 months.
- 30 contributors (22 in the last release).
- Shipped with: Ubuntu, Debian, MacPorts, NetBSD, Mandriva, Enthought Python Distribution. Also available through easy_install and as Windows binaries.