fa.bianp.net

SciPy contains two methods to compute the singular value decomposition (SVD) of a matrix: scipy.linalg.svd and scipy.sparse.linalg.svds. In this post I'll compare both methods for the task of computing the full SVD of a large dense matrix.

The first method, scipy.linalg.svd, is perhaps the best known and uses the linear algebra library LAPACK to handle the computations. This implements the Golub-Kahan-Reisch algorithm 1, which is accurate and highly efficient with a cost of O(n^3) floating-point operations 2.

The second method is scipy.sparse.linalg.svds and despite it's name it works fine also for dense arrays. This implementation is based on the ARPACK library and consists of an iterative procedure that finds the SVD decomposition by reducing the problem to an eigendecomposition on the array X -> dot(X.T, X). This method is usually very effective when the input matrix X is sparse or only the largest singular values are required. There are other SVD solvers that I did not consider, such as sparsesvd or pysparse.jdsym, but my points for the sparse solve probably hold for those packages too since they both implement iterative algorithms based on the same principles.

When the input matrix is dense and all the singular values are required, the first method is usually more efficient. To support this statement I've created a little benchmark: timings for both methods as a function of the size of the matrices. Notice that we are in a case that is clearly favorable to the linalg.svd: after all sparse.linalg.svds was not created with this setting in mind, it was created for sparse matrices or dense matrices with some special structure. We will see however that even in this setting it has interesting advantages.

I'll create random square matrices with different sizes and plot the timings for both methods. For the benchmarks I used SciPy v0.12 linked against Intel Math Kernel Library v11. Both methods are single-threaded (I had to set OMP_NUM_THREADS=1 so that MKL does not try to parallelize the computations). [code]

Lower timings are better, so this gives scipy.linalg.svd as clear winner. However, this is just part of the story. What this graph doesn't show is that this method is winning at the price of allocating a huge amount of memory for temporary computations. If we now plot the memory consumption for both methods under the same settings, the story is completely different. [code]

The memory requirements of scipy.linalg.svd scale with the number of dimensions, while for the sparse version the amount of allocated memory is constant. Notice that we are measuring the amount of total memory used, it is thus natural to see a slight increase in memory consumption since the input matrix is bigger on each iteration.

For example, in my applications, I need to compute the SVD of a matrix for whom the needed workspace does not fit in memory. In cases like this, the sparse algorithm (sparse.linalg.svds) can come in handy: the timing is just a factor worse (but I can easily parallelize jobs) and the memory requirements for this method is peanuts compared to the dense version.

1. Calculating the singular values and pseudo-inverse of a matrix, Golub, Gene H., Kahan, William, 1965, JSTOR

2. A Survey of Singular Value Decomposition Methods and Performance Comparison of Some Available Serial Codes, Plassman, Gerald E. 2005 PDF

This tutorial introduces the concept of pairwise preference used in most ranking problems. I'll use scikit-learn and for learning and matplotlib for visualization.

In the ranking setting, training data consists of lists of items with some order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item, so that for any two samples a and b, either a < b, b > a or b and a are not comparable.

For example, in the case of a search engine, our dataset consists of results that belong to different queries and we would like to only compare the relevance for results coming from the same query.

This order relation is usually domain-specific. For instance, in information retrieval the set of comparable samples is referred to as a "query id". The goal behind this is to compare only documents that belong to the same query (Joachims 2002). In medical imaging on the other hand, the order of the labels usually depend on the subject so the comparable samples is given by the different subjects in the study (Pedregosa et al 2012).

import itertools
import numpy as np
from scipy import stats
import pylab as pl
from sklearn import svm, linear_model, cross_validation


To start with, we'll create a dataset in which the target values consists of three graded measurements Y = {0, 1, 2} and the input data is a collection of 30 samples, each one with two features.

The set of comparable elements (queries in information retrieval) will consist of two equally sized blocks, $X = X_1 \cup X_2$, where each block is generated using a normal distribution with different mean and covariance. In the pictures, we represent $X_1$ with round markers and $X_2$ with triangular markers.

np.random.seed(0)
w = np.array([np.sin(theta), np.cos(theta)])
K = 20
X = np.random.randn(K, 2)
y = [0] * K
for i in range(1, 3):
X = np.concatenate((X, np.random.randn(K, 2) + i * 4 * w))
y = np.concatenate((y, [i] * K))

# slightly displace data corresponding to our second partition
X[::2] -= np.array([3, 7])
blocks = np.array([0, 1] * (X.shape[0] / 2))

# split into train and test set
cv = cross_validation.StratifiedShuffleSplit(y, test_size=.5)
train, test = iter(cv).next()
X_train, y_train, b_train = X[train], y[train], blocks[train]
X_test, y_test, b_test = X[test], y[test], blocks[test]

# plot the result
idx = (b_train == 0)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
pl.text(0, 1, '$w$', fontsize=20)
pl.arrow(-3, -8, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
pl.text(-2.6, -7, '$w$', fontsize=20)
pl.axis('equal')
pl.show()


In the plot we clearly see that for both blocks there's a common vector w such that the projection onto w gives a list with the correct ordering.

However, because linear considers that output labels live in a metric space it will consider that all pairs are comparable. Thus if we fit this model to the problem above it will fit both blocks at the same time, yielding a result that is clearly not optimal. In the following plot we estimate $\hat{w}$ using an l2-regularized linear model.

ridge = linear_model.Ridge(1.)
ridge.fit(X_train, y_train)
coef = ridge.coef_ / linalg.norm(ridge.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
pl.text(2, 0, '$\hat{w}$', fontsize=20)
pl.axis('equal')
pl.title('Estimation by Ridge regression')
pl.show()


To assess the quality of our model we need to define a ranking score. Since we are interesting in a model that orders the data, it is natural to look for a metric that compares the ordering of our model to the given ordering. For this, we use Kendall's tau correlation coefficient, which is defined as (P - Q)/(P + Q), being P the number of concordant pairs and Q is the number of discordant pairs. This measure is used extensively in the ranking literature (e.g Optimizing Search Engines using Clickthrough Data).

We thus evaluate this metric on the test set for each block separately.

for i in range(2):
tau, _ = stats.kendalltau(
ridge.predict(X_test[b_test == i]), y_test[b_test == i])
print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))

Kendall correlation coefficient for block 0: 0.71122
Kendall correlation coefficient for block 1: 0.84387


The pairwise transform

As proved in (Herbrich 1999), if we consider linear ranking functions, the ranking problem can be transformed into a two-class classification problem. For this, we form the difference of all comparable elements such that our data is transformed into $(x'_k, y'_k) = (x_i - x_j, sign(y_i - y_j))$ for all comparable pairs.

This way we transformed our ranking problem into a two-class classification problem. The following plot shows this transformed dataset, and color reflects the difference in labels, and our task is to separate positive samples from negative ones. The hyperplane {x^T w = 0} separates these two classes.

# form all pairwise combinations
comb = itertools.combinations(range(X_train.shape[0]), 2)
k = 0
Xp, yp, diff = [], [], []
for (i, j) in comb:
if y_train[i] == y_train[j] \
or blocks[train][i] != blocks[train][j]:
# skip if same target or different group
continue
Xp.append(X_train[i] - X_train[j])
diff.append(y_train[i] - y_train[j])
yp.append(np.sign(diff[-1]))
# output balanced classes
if yp[-1] != (-1) ** k:
yp[-1] *= -1
Xp[-1] *= -1
diff[-1] *= -1
k += 1
Xp, yp, diff = map(np.asanyarray, (Xp, yp, diff))
pl.scatter(Xp[:, 0], Xp[:, 1], c=diff, s=60, marker='o', cmap=pl.cm.Blues)
x_space = np.linspace(-10, 10)
pl.plot(x_space * w[1], - x_space * w[0], color='gray')
pl.text(3, -4, '$\{x^T w = 0\}$', fontsize=17)
pl.axis('equal')
pl.show()


As we see in the previous plot, this classification is separable. This will not always be the case, however, in our training set there are no order inversions, thus the respective classification problem is separable.

We will now finally train an Support Vector Machine model on the transformed data. This model is known as RankSVM, although we note that the pairwise transform is more general and can be used together with any linear model. We will then plot the training data together with the estimated coefficient $\hat{w}$ by RankSVM.

clf = svm.SVC(kernel='linear', C=.1)
clf.fit(Xp, yp)
coef = clf.coef_.ravel() / linalg.norm(clf.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
pl.arrow(-3, -8, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
pl.text(1, .7, '$\hat{w}$', fontsize=20)
pl.text(-2.6, -7, '$\hat{w}$', fontsize=20)
pl.axis('equal')
pl.show()


Finally we will check that as expected, the ranking score (Kendall tau) increases with the RankSVM model respect to linear regression.

for i in range(2):
tau, _ = stats.kendalltau(
np.dot(X_test[b_test == i], coef), y_test[b_test == i])
print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))

Kendall correlation coefficient for block 0: 0.83627
Kendall correlation coefficient for block 1: 0.84387


This is indeed higher than the values (0.71122, 0.84387) obtained in the case of linear regression.

Original ipython notebook for this blog post can be found here

1. "Large Margin Rank Boundaries for Ordinal Regression", R. Herbrich, T. Graepel, and K. Obermayer. Advances in Large Margin Classifiers, 115-132, Liu Press, 2000

2. "Optimizing Search Engines Using Clickthrough Data", T. Joachims. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.

3. "Learning to rank from medical imaging data", Pedregosa et al. [arXiv

4. "Efficient algorithms for ranking with SVMs", O. Chapelle and S. S. Keerthi, Information Retrieval Journal, Special Issue on Learning to Rank, 2009

My newest project is a Python library for monitoring memory consumption of arbitrary process, and one of its most useful features is the line-by-line analysis of memory usage for Python code. I wrote a basic prototype six months ago after being surprised by the lack of related tools. I wanted to plot memory consumption of a couple of Python functions but did not find a python module to do the job. I came to the conclusion that there is no standard way to get the memory usage of the Python interpreter from within Python, so I resorted to reading for from /proc/$PID/statm. From there on I realized that one the fetching of memory is done, making a line-by-line report wouldn't be hard. Back to today. I've been using the line-by-line memory monitoring to diagnose poor memory management (hidden temporaries, unused allocation, etc.) for some time. It seems to work on two different computers, so full of confidence as I am, I'll write a blog post about it ... How to use it? The easiest way to get it is to install from the Python Package Index: $ easy_install -U memory_profiler # pip install -U memory_profiler


but other options include fetching the latests from github or dropping it on your current working directory or somewhere else on your PYTHONPATH since it consist of a single file. Then next step is to write some python code to profile. It can be just about any function, but for the purpose of this blog post I'll create a function my_func() with mostly memory allocations and save it to a file named example.py:

@profile
def my_func():
a = [1] * (10 ** 6)
b = [2] * (2 * 10 ** 7)
del b
return a

if __name__ == '__main__':
my_func()


Note that I've decorated the function with @profile. This tells the profiler to look into function my_func and gather the memory consumption for each line.

A little experiment to see what low rank approximation looks like. These are the best rank-k approximations (in the Frobenius norm) to the a natural image for increasing values of k and an original image of rank 512.

Python code can be found here. GIF animation made using ImageMagic's convert script.

In scipy's development version there's a new function closely related to the QR-decomposition of a matrix and to the least-squares solution of a linear system. What this function does is to compute the QR-decomposition of a matrix and then multiply the resulting orthogonal factor by another arbitrary matrix. In pseudocode:

def qr_multiply(X, Y):
Q, R = qr(X)
return dot(Q.T, Y)


but unlike this naive implementation, qr_multiply is able to do all this without explicitly computing the orthogonal Q matrix, resulting both in memory and time saving. In the following picture I measured the memory consumption as a function of time of running this computation on a 1.000 x 1.000 matrix X and a vector Y (full code can be found here):

It can be seen that not only qr_multiply is almost twice as fast as the naive approach, but also that the memory consumption is significantly reduced, since the orthogonal factor is never explicitly computed. Credit for implementing the qr_multiply function goes to Martin Teichmann.

Last week we released a new version of scikit-learn. The Changelog is particularly impressive, yet personally this release is important for other reasons. This will probably be my last release as a paid engineer. I'm starting a PhD next month, and although I plan to continue contributing to the project and make a few more releases, I will certainly have less time to devote to it. Luckily, I received a lot of help from the community while preparing the release, from Changelog writing to build of Windows binaries, thus I expect the transition to go smoothly. Almost two years have elapsed since the first 0.1 release. During this time, we did a lot of refactoring and broke the API several times. However, I've seen some concerns about API stability both at the EuroScipy conference and in the mailing list where I’ve realized we need to provide an API that does not break in every release, and do this in a way that the project remains fun for developers. That's why I'm extremely glad to see that although this release is big in changes, these have been made in a more organized manner. Yes, we've broken the API once again, but now there's a compatibility layer that ensures that code written for 0.8 will continue working with the new release.

I've been working lately in improving the scikit-learn example gallery to show also a small thumbnail of the plotted result. Here is what the gallery looks like now:

And the real thing should be already displayed in the development-documentation. The next thing is to add a static image to those that don't generate any result, examples such as the SVM-GUI should have an image to display.

Today's coding sprint was a bit more crowded, with some notable scipy hackers such as Ralph Gommers, Stefan van der Walt, David Cournapeau or Fernando Perez from Ipython joining in. On what got done: - We merged Jake's new BallTree code. This is a pure Cython implementation of a nearest-neighbor search similar to the KDTree class in scipy.spatial, but much faster. The code looks awesome and it's a big speedup compared to the older code. - Vlad is ready to merge hisdictionary learning code, something that should happen in the upcoming days. - Initial support for Python 3. scikit-learn should now at least build and import cleanly under Python 3. - some bugfixes in the Pipeline object and in docstrings. So this was the end of the scikit-learn sprint, but EuroScipy has just begun. See you tomorrow at the conference (follow the signs)!

As a warm-up for the upcoming EuroScipy-conference, some of the scikit-learn developers decided to gather and work together for a couple of days. Today was the first day and there was only a handfull of us, as the real kickoff is expected tomorrow. Some interesting coding happened, although most of us where still preparing material for the EuroScipy tutorials ... - API changes: remove of keyword parameters to fit method, added method set_params (pull request 1). - Some bugfixing in NuSVR (pull request 2) - Review of Vlad's code, developed during his Summer of Code program. - A lot of discussion about algorithm, code, APIs and buildbot dance !

Ridge coefficients for multiple values of the regularization parameter can be elegantly computed by updating the thin SVD decomposition of the design matrix:

import numpy as np
from scipy import linalg
def ridge(A, b, alphas):
"""
Return coefficients for regularized least squares

min ||A x - b||^2 + alpha ||x||^2

Parameters
----------
A : array, shape (n, p)
b : array, shape (n,)
alphas : array, shape (k,)

Returns
-------
coef: array, shape (p, k)
"""
U, s, Vt = linalg.svd(X, full_matrices=False)
d = s / (s[:, np.newaxis].T ** 2 + alphas[:, np.newaxis])
return np.dot(d * U.T.dot(y), Vt).T


This can be used to efficiently compute what it's regularization path, that is, to plot the coefficients as a function of the regularization parameter. Since the bottleneck of the algorithm is the singular value decomposition, computing the coefficients for other values of the regularization parameter basically comes for free.

A variant of this algorithm can then be used to compute the optimal regularization parameter in the sense of leave-one-out cross-validation and is implemented in scikit-learn's RidgeCV (for which Mathieu Blondel has an excelent post by ). This optimal parameter is denoted with a vertical dotted line in the following picture, full code can be found here.