scikit-learn tutorial
=====================

Loading an example dataset
--------------------------

.. figure:: images/Virginia_Iris.png
   :scale: 100 %
   :align: center
   :alt: Photo of Iris Virginica

   The iris dataset

:Features in the Iris dataset (floating-point values):
    0. sepal length in cm
    1. sepal width in cm
    2. petal length in cm
    3. petal width in cm

:Target classes to predict (integer values):
    0. Iris Setosa
    1. Iris Versicolour
    2. Iris Virginica

Example: features = (0.5, 0.2, 0.8, 0.9), target = 0

To load the dataset into a Python object::

    >>> from scikits.learn import datasets
    >>> iris = datasets.load_iris()

The data is stored in the ``.data`` member, which is an
``(n_samples, n_features)`` array::

    >>> iris.data.shape
    (150, 4)

The class of each sample is stored in the ``.target`` attribute of the
dataset::

    >>> iris.target
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Learning and Predicting
-----------------------

In ``scikits.learn``, an *estimator* is just a plain Python class that
implements the methods ``fit(X, Y)`` and ``predict(T)``::

    >>> from scikits.learn import svm
    >>> clf = svm.LinearSVC()
    >>> clf.fit(iris.data, iris.target)  # learn from the data

Once the model is trained, it can be used to predict the most likely outcome
on unseen data::

    >>> X_new = [[5.0, 3.6, 1.3, 0.25]]
    >>> clf.predict(X_new)
    >>> clf.score(iris.data, iris.target)  # fraction of correctly classified samples

.. topic:: Exercise

    Train a logistic regression model
    (``scikits.learn.linear_model.LogisticRegression``) on the iris dataset
    and compute the fraction of misclassified samples.

.. topic:: Question

    Does logistic regression perform better than ``LinearSVC``?

All classifiers implement ``fit`` and ``predict``, but only some can also
predict the probability of each outcome. This is the case for logistic
regression models. Assuming ``clf2`` is the ``LogisticRegression`` model
fitted in the exercise above::

    >>> clf2.predict_proba(X_new)
    array([[ 9.07512928e-01, 9.24770379e-02, 1.00343962e-05]])

This means that the model estimates that the sample in ``X_new`` has:

- about a 91% likelihood of belonging to the 'setosa' class
- about a 9% likelihood of belonging to the 'versicolor' class
- a negligible likelihood (about 0.001%) of belonging to the 'virginica' class

Setting parameters by cross-validation
--------------------------------------

``svm.LinearSVC`` has some model parameters. We can find the best value of
the parameter ``C`` by cross-validation::

    >>> from scikits.learn import grid_search
    >>> parameters = {'C': (.1, .5, 1.0)}
    >>> gs = grid_search.GridSearchCV(svm.LinearSVC(), parameters)
    >>> gs.fit(iris.data, iris.target)
    >>> gs.best_estimator
    LinearSVC(C=0.5, probability=False, degree=3, coef0=0.0, tol=0.001,
              shrinking=True, gamma=0.01)
    >>> gs.grid_scores_

.. topic:: Exercise

    Find the optimal kernel (``'linear'`` or ``'rbf'``) for an ``svm.SVC()``
    classifier by cross-validation, as sketched below.
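As a starting point for this exercise, here is a minimal sketch that reuses
the ``GridSearchCV`` pattern from above, with the ``C`` grid replaced by a
``kernel`` grid; attribute names such as ``best_estimator`` may vary across
versions of the library::

    >>> from scikits.learn import svm, grid_search
    >>> parameters = {'kernel': ('linear', 'rbf')}  # candidate kernels to compare
    >>> gs = grid_search.GridSearchCV(svm.SVC(), parameters)
    >>> gs.fit(iris.data, iris.target)  # cross-validate each candidate kernel
    >>> gs.best_estimator               # the SVC refitted with the winning kernel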
Notable implementations of classifiers
++++++++++++++++++++++++++++++++++++++

:``scikits.learn.svm.SVC`` (and ``SVR``, ``NuSVC``, ``NuSVR``):
    Support Vector Machines with kernels.

:``scikits.learn.linear_model.SGDClassifier``:
    Regularized linear models (SVM or logistic regression) fitted with a
    Stochastic Gradient Descent algorithm written in Cython.

:``scikits.learn.neighbors.NeighborsClassifier``:
    k-Nearest Neighbors classifier.

Unsupervised Learning
---------------------

An unsupervised learning algorithm uses only a single set of observations
``X`` and no labels of any kind.

Dimensionality reduction with Principal Component Analysis (PCA)
----------------------------------------------------------------

.. warning::

    Depending on your version of scikit-learn, PCA lives in the
    ``decomposition`` module or in the ``pca`` module.

::

    >>> from scikits.learn import decomposition
    >>> pca = decomposition.PCA(n_components=2)
    >>> pca.fit(iris.data)
    >>> X = pca.transform(iris.data)

Now we can visualize the (transformed) iris dataset::

    >>> import pylab as pl
    >>> pl.scatter(X[:, 0], X[:, 1], c=iris.target)
    >>> pl.show()

PCA is not only useful for visualizing high-dimensional datasets. It can
also be used as a preprocessing step to speed up supervised methods that
are not computationally efficient in high dimensions, such as SVM
classifiers with Gaussian kernels.

.. topic:: Exercise

    Visualize the iris dataset using Independent Component Analysis instead
    of PCA. Hint: use ``scikits.learn.decomposition.FastICA``
    (``scikits.learn.ica.FastICA`` in older versions of scikit-learn).

Clustering
----------

Clustering is the task of gathering samples into groups of similar samples
according to some predefined similarity or dissimilarity measure (such as
the Euclidean distance).

For instance, let us reuse the output of the 2D PCA of the iris dataset and
try to find 3 groups of samples using the simplest clustering algorithm
(KMeans)::

    >>> from scikits.learn.cluster import KMeans
    >>> km = KMeans(3)
    >>> km.fit(X)
    >>> km.cluster_centers_
    array([[ 1.01505989, -0.70632886],
           [ 0.33475124,  0.89126382],
           [-1.287003  , -0.43512572]])

We can plot the cluster centers that were found::

    >>> pl.scatter(X[:, 0], X[:, 1], c=iris.target)
    >>> pl.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    ...            marker='o', s=100)
    >>> pl.show()

Notable implementations of clustering models
++++++++++++++++++++++++++++++++++++++++++++

The following are two well-known clustering algorithms:

:``scikits.learn.cluster.KMeans``:
    The simplest yet effective clustering algorithm. It must be given the
    number of clusters in advance and assumes that the input data is
    normalized (a PCA model can be used as a preprocessing step).

:``scikits.learn.cluster.MeanShift``:
    Can find better-looking clusters than KMeans, but does not scale to a
    large number of samples.

Applications of clustering
++++++++++++++++++++++++++

Here are some common applications of clustering algorithms:

- Building customer profiles for market analysis
- Grouping related web news (e.g. Google News) and web search results
- Grouping related stock quotes for investment portfolio management
- Serving as a preprocessing step for recommender systems
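To tie the steps of this section together, here is a minimal end-to-end
sketch that colors the PCA-projected samples by their *predicted* cluster
rather than by their true class. It assumes that your version of ``KMeans``
exposes a ``labels_`` attribute holding the cluster assignment of each
training sample (recent versions do)::

    >>> from scikits.learn import datasets, decomposition
    >>> from scikits.learn.cluster import KMeans
    >>> import pylab as pl
    >>> iris = datasets.load_iris()
    >>> pca = decomposition.PCA(n_components=2)     # project the 4D data to 2D
    >>> pca.fit(iris.data)
    >>> X = pca.transform(iris.data)
    >>> km = KMeans(3)                              # look for 3 clusters
    >>> km.fit(X)
    >>> pl.scatter(X[:, 0], X[:, 1], c=km.labels_)  # color by cluster assignment
    >>> pl.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    ...            marker='o', s=100)               # overlay the cluster centers
    >>> pl.show()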