scikit-learn tutorial
=====================

Loading an example dataset
--------------------------

.. figure:: images/Virginia_Iris.png
   :scale: 100 %
   :align: center
   :alt: Photo of Iris Virginica

   The iris dataset

:Features in the Iris dataset (floating-point values):
    0. sepal length in cm
    1. sepal width in cm
    2. petal length in cm
    3. petal width in cm

:Target classes to predict (integer values):
    0. Iris Setosa
    1. Iris Versicolour
    2. Iris Virginica

Example: features = (0.5, 0.2, 0.8, 0.9), target = 0

To load the dataset into a Python object::

    >>> from scikits.learn import datasets
    >>> iris = datasets.load_iris()

The data is stored in the ``.data`` member, which is an
``(n_samples, n_features)`` array::

    >>> iris.data.shape
    (150, 4)

The class of each sample is stored in the ``.target`` attribute of the
dataset::

    >>> iris.target
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Learning and Predicting
-----------------------

In ``scikits.learn``, an *estimator* is just a plain Python class that
implements the methods ``fit(X, Y)`` and ``predict(T)``::

    >>> from scikits.learn import svm
    >>> clf = svm.LinearSVC()
    >>> clf.fit(iris.data, iris.target)  # learn from the data

Once the model is trained, it can be used to predict the most likely outcome
on unseen data::

    >>> X_new = [[5.0, 3.6, 1.3, 0.25]]
    >>> clf.predict(X_new)
    >>> clf.score(iris.data, iris.target)  # fraction of correctly classified samples

.. topic:: Exercise

    Train a logistic regression model
    (``scikits.learn.linear_model.LogisticRegression``) on the iris dataset
    and compute the fraction of misclassified samples.

.. topic:: Question

    Does logistic regression perform better than ``LinearSVC``?

All classifiers implement ``fit`` and ``predict``, but only some can also
predict the probability of each outcome. This is the case for logistic
regression models. Assuming ``clf2`` is the ``LogisticRegression`` model
fitted in the exercise above::

    >>> clf2.predict_proba(X_new)
    array([[ 9.07512928e-01, 9.24770379e-02, 1.00343962e-05]])

This means that the model estimates that the sample in ``X_new`` has:

- about a 91% likelihood of belonging to the 'setosa' class
- about a 9% likelihood of belonging to the 'versicolor' class
- a negligible likelihood (about 0.001%) of belonging to the 'virginica' class

Setting parameters by cross-validation
--------------------------------------

``svm.LinearSVC`` has some model parameters. We can find the best value of
the parameter ``C`` by cross-validation::

    >>> from scikits.learn import grid_search
    >>> parameters = {'C': (.1, .5, 1.0)}
    >>> gs = grid_search.GridSearchCV(svm.LinearSVC(), parameters)
    >>> gs.fit(iris.data, iris.target)
    >>> gs.best_estimator
    LinearSVC(C=0.5, probability=False, degree=3, coef0=0.0, tol=0.001,
              shrinking=True, gamma=0.01)
    >>> gs.grid_scores_

.. topic:: Exercise

    Find the optimal kernel (``'linear'`` or ``'rbf'``) for an ``svm.SVC()``
    classifier by cross-validation, as sketched below.
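As a starting point for this exercise, here is a minimal sketch that reuses
the ``GridSearchCV`` pattern from above, with the ``C`` grid replaced by a
``kernel`` grid; attribute names such as ``best_estimator`` may vary across
versions of the library::

    >>> from scikits.learn import svm, grid_search
    >>> parameters = {'kernel': ('linear', 'rbf')}  # candidate kernels to compare
    >>> gs = grid_search.GridSearchCV(svm.SVC(), parameters)
    >>> gs.fit(iris.data, iris.target)  # cross-validate each candidate kernel
    >>> gs.best_estimator               # the SVC refitted with the winning kernel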
Notable implementations of classifiers
++++++++++++++++++++++++++++++++++++++

:``scikits.learn.svm.SVC`` (and ``SVR``, ``NuSVC``, ``NuSVR``):
    Support Vector Machines with kernels.

:``scikits.learn.linear_model.SGDClassifier``:
    Regularized linear models (SVM or logistic regression) fitted with a
    Stochastic Gradient Descent algorithm written in Cython.

:``scikits.learn.neighbors.NeighborsClassifier``:
    k-Nearest Neighbors classifier.

Unsupervised Learning
---------------------

An unsupervised learning algorithm uses only a single set of observations
``X`` and no labels of any kind.

Dimensionality reduction with Principal Component Analysis (PCA)
----------------------------------------------------------------

.. warning::

    Depending on your version of scikit-learn, PCA lives in the
    ``decomposition`` module or in the ``pca`` module.

::

    >>> from scikits.learn import decomposition
    >>> pca = decomposition.PCA(n_components=2)
    >>> pca.fit(iris.data)
    >>> X = pca.transform(iris.data)

Now we can visualize the (transformed) iris dataset::

    >>> import pylab as pl
    >>> pl.scatter(X[:, 0], X[:, 1], c=iris.target)
    >>> pl.show()

PCA is not only useful for visualizing high-dimensional datasets. It can
also be used as a preprocessing step to speed up supervised methods that
are not computationally efficient in high dimensions, such as SVM
classifiers with Gaussian kernels.

.. topic:: Exercise

    Visualize the iris dataset using Independent Component Analysis instead
    of PCA. Hint: use ``scikits.learn.decomposition.FastICA``
    (``scikits.learn.ica.FastICA`` in older versions of scikit-learn).

Clustering
----------

Clustering is the task of gathering samples into groups of similar samples
according to some predefined similarity or dissimilarity measure (such as
the Euclidean distance).

For instance, let us reuse the output of the 2D PCA of the iris dataset and
try to find 3 groups of samples using the simplest clustering algorithm
(KMeans)::

    >>> from scikits.learn.cluster import KMeans
    >>> km = KMeans(3)
    >>> km.fit(X)
    >>> km.cluster_centers_
    array([[ 1.01505989, -0.70632886],
           [ 0.33475124,  0.89126382],
           [-1.287003  , -0.43512572]])

We can plot the cluster centers that were found::

    >>> pl.scatter(X[:, 0], X[:, 1], c=iris.target)
    >>> pl.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    ...            marker='o', s=100)
    >>> pl.show()

Notable implementations of clustering models
++++++++++++++++++++++++++++++++++++++++++++

The following are two well-known clustering algorithms:

:``scikits.learn.cluster.KMeans``:
    The simplest yet effective clustering algorithm. It must be given the
    number of clusters in advance and assumes that the input data is
    normalized (a PCA model can be used as a preprocessing step).

:``scikits.learn.cluster.MeanShift``:
    Can find better-looking clusters than KMeans, but does not scale to a
    large number of samples.

Applications of clustering
++++++++++++++++++++++++++

Here are some common applications of clustering algorithms:

- Building customer profiles for market analysis
- Grouping related web news (e.g. Google News) and web search results
- Grouping related stock quotes for investment portfolio management
- Serving as a preprocessing step for recommender systems
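To tie the steps of this section together, here is a minimal end-to-end
sketch that colors the PCA-projected samples by their *predicted* cluster
rather than by their true class. It assumes that your version of ``KMeans``
exposes a ``labels_`` attribute holding the cluster assignment of each
training sample (recent versions do)::

    >>> from scikits.learn import datasets, decomposition
    >>> from scikits.learn.cluster import KMeans
    >>> import pylab as pl
    >>> iris = datasets.load_iris()
    >>> pca = decomposition.PCA(n_components=2)     # project the 4D data to 2D
    >>> pca.fit(iris.data)
    >>> X = pca.transform(iris.data)
    >>> km = KMeans(3)                              # look for 3 clusters
    >>> km.fit(X)
    >>> pl.scatter(X[:, 0], X[:, 1], c=km.labels_)  # color by cluster assignment
    >>> pl.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    ...            marker='o', s=100)               # overlay the cluster centers
    >>> pl.show()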