scikit-learn tutorial¶
Loading an example dataset¶
The iris dataset
Features in the Iris dataset (floating-point values):
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
Target classes to predict (integer values):
- Iris Setosa
- Iris Versicolour
- Iris Virginica
Example: features = (0.5, 0.2, 0.8, 0.9), target = 0
To load the dataset into a Python object:
>>> from scikits.learn import datasets
>>> iris = datasets.load_iris()
This data is stored in the .data member, which is an (n_samples, n_features) array:
>>> iris.data.shape
(150, 4)
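Each row of .data holds the four measurements of one flower, in the order listed above. As a quick, purely illustrative check, the first sample can be inspected by indexing into the array:
>>> iris.data[0]  # sepal length, sepal width, petal length, petal width of the first flower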
The information about the class of each sample is stored in the target attribute of the dataset:
>>> iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
Learning and Predicting¶
In scikits.learn, an estimator is just a plain Python object that implements the methods fit(X, Y) and predict(T).
>>> from scikits.learn import svm
>>> clf = svm.LinearSVC()
>>> clf.fit(iris.data, iris.target) # learn from the data
Once the model is trained, it can be used to predict the most likely outcome on unseen data:
>>> X_new = [[ 5.0, 3.6, 1.3, 0.25]]
>>> clf.predict(X_new)
>>> clf.score(iris.data, iris.target) # mean accuracy, i.e. the fraction of correctly classified samples
Exercise:
Train a logistic regression model (scikits.learn.linear_model.LogisticRegression) on the iris dataset and compute the fraction of misclassified samples.
Question:
Is logistic regression performing better than LinearSVC?
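A possible sketch of the exercise (using the scikits.learn namespace of this tutorial; the fitted model is reused as clf2 below):
>>> from scikits.learn import linear_model
>>> clf2 = linear_model.LogisticRegression()
>>> clf2.fit(iris.data, iris.target)
>>> 1 - clf2.score(iris.data, iris.target)  # fraction of misclassified samples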
All classifiers implement fit and predict, but only some can also estimate the probability that a sample belongs to each class. This is the case for logistic regression models, such as the clf2 model trained above:
>>> clf2.predict_proba(X_new)
array([[ 9.07512928e-01, 9.24770379e-02, 1.00343962e-05]])
This means that the model estimates that the sample in X_new has:
- a 90% likelihood of belonging to the ‘setosa’ class
- a 9% likelihood of belonging to the ‘versicolor’ class
- a negligible likelihood (about 0.001%) of belonging to the ‘virginica’ class
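The predict method simply returns the class with the highest estimated probability, so on this sample it is consistent with the probabilities above:
>>> clf2.predict(X_new)  # expected to return class 0 ('setosa')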
Set parameters by cross-validation¶
svm.LinearSVC has several model parameters. We want to find the best value of the regularization parameter C by cross-validation:
>>> from scikits.learn import grid_search
>>> parameters = {'C' : (.1, .5, 1.0)}
>>> gs = grid_search.GridSearchCV(svm.LinearSVC(), parameters)
>>> gs.fit(iris.data, iris.target)
>>> gs.best_estimator
LinearSVC(C=0.5, probability=False, degree=3, coef0=0.0, tol=0.001,
shrinking=True, gamma=0.01)
>>> gs.grid_scores_
Exercise:
Find the optimal kernel (‘linear’ or ‘rbf’) for an svm.SVC() classifier by cross-validation.
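One possible sketch, reusing the grid_search module imported above (the kernel argument of svm.SVC accepts strings such as 'linear' and 'rbf'):
>>> parameters = {'kernel': ('linear', 'rbf')}
>>> gs2 = grid_search.GridSearchCV(svm.SVC(), parameters)
>>> gs2.fit(iris.data, iris.target)
>>> gs2.best_estimator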
Notable implementations of classifiers¶
- scikits.learn.svm.SVC and {SVR, NuSVC, NuSVR}: Support Vector Machines with kernels
- scikits.learn.linear_model.SGDClassifier: regularized linear models (SVM or logistic regression) trained with a Stochastic Gradient Descent algorithm written in Cython
- scikits.learn.neighbors.NeighborsClassifier: k-Nearest Neighbors classifier
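Any of these estimators can be used in place of LinearSVC in the classification example above. For instance, a quick sketch with the k-nearest neighbors classifier listed here (default parameters assumed):
>>> from scikits.learn import neighbors
>>> knn = neighbors.NeighborsClassifier()
>>> knn.fit(iris.data, iris.target)
>>> knn.predict(X_new)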
Unsupervised Learning¶
An unsupervised learning algorithm only uses a single set of observations X and does not use any kind of labels.
Dimensionality reduction with Principal Component Analysis (PCA)¶
Warning
Depending on your version of scikit-learn, PCA will be found in either the decomposition or the pca module.
>>> from scikits.learn import decomposition
>>> pca = decomposition.PCA(n_components=2)
>>> pca.fit(iris.data)
>>> X = pca.transform(iris.data)
Now we can visualize the (transformed) iris dataset!
>>> import pylab as pl
>>> pl.scatter(X[:, 0], X[:, 1], c=iris.target)
>>> pl.show()
PCA is not just useful for visualizing high-dimensional datasets. It can also be used as a preprocessing step to speed up supervised methods that do not scale well with the number of dimensions, such as SVM classifiers with Gaussian kernels.
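As a sketch of that idea (illustrative only: the iris data has just four features, so the speed-up only matters on much higher-dimensional datasets), the PCA projection computed above can be fed directly to a kernel SVM:
>>> svc = svm.SVC(kernel='rbf')
>>> svc.fit(X, iris.target)  # X is the 2D PCA projection of iris.data computed above
>>> svc.predict(pca.transform(X_new))  # new samples must be projected with the same PCA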
Exercise:
Visualize the iris dataset using Independent Component Analysis instead of PCA. Hint: use scikits.learn.decomposition.FastICA (scikits.learn.ica.FastICA on older versions of scikit-learn)
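One way to approach this exercise (a sketch assuming decomposition.FastICA with an n_components keyword; the parameter name may differ in older releases):
>>> ica = decomposition.FastICA(n_components=2)
>>> ica.fit(iris.data)
>>> X_ica = ica.transform(iris.data)
>>> pl.scatter(X_ica[:, 0], X_ica[:, 1], c=iris.target)
>>> pl.show()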
Clustering¶
Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure (such as the Euclidean distance).
For instance let us reuse the output of the 2D PCA of the iris dataset and try to find 3 groups of samples using the simplest clustering algorithm (KMeans):
>>> from scikits.learn.cluster import KMeans
>>> km = KMeans(3)
>>> km.fit(X)
>>> km.cluster_centers_
array([[ 1.01505989, -0.70632886],
[ 0.33475124, 0.89126382],
[-1.287003 , -0.43512572]])
We can plot the found cluster centers:
>>> pl.scatter(X[:, 0], X[:, 1], c=iris.target)
>>> pl.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker='o', s=100)
>>> pl.show()
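The cluster assigned to each sample can also be inspected. Assuming the fitted KMeans object exposes a labels_ attribute (as in recent versions), the assignments can be compared to the true classes:
>>> km.labels_[:10]
>>> iris.target[:10]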
Notable implementations of clustering models¶
The following are two well-known clustering algorithms:
- scikits.learn.cluster.KMeans: the simplest yet effective clustering algorithm. It needs the number of clusters in advance and assumes that the input data is normalized (a PCA model can be used as a preprocessing step).
- scikits.learn.cluster.MeanShift: can find better-looking clusters than KMeans, but does not scale to a large number of samples.
Applications of clustering¶
Here are some common applications of clustering algorithms:
- Building customer profiles for market analysis
- Grouping related web news (e.g. Google News) and web search results
- Grouping related stock quotes for investment portfolio management
- Serving as a preprocessing step for recommender systems