Isotonic Regression

Category: misc
#isotonic regression #machine learning #Python #scikit-learn

My latest contribution to scikit-learn is an implementation of the isotonic regression model, which I coded together with Nelle Varoquaux and Alexandre Gramfort. This model finds the best least-squares fit to a set of points under the constraint that the fit must be a non-decreasing function. The example on the scikit-learn website gives an intuition for this model.

[Figure: isotonic regression fit]

The original points are in red, and the estimated ones are in green. As you can see, there is one estimate (green point) for each data sample (red point). Calling $y \in \mathbb{R}^n$ the input data, the model can be written concisely as an optimization problem over $x$:

$$ \text{argmin}_x \|y - x\|^2 \\ \text{subject to } x_1 \leq x_2 \leq \cdots \leq x_n $$
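
As a quick illustration, here is a minimal sketch (with made-up toy data) of solving this problem with the function that ships in sklearn.isotonic:

import numpy as np
from sklearn.isotonic import isotonic_regression

# toy data: a noisy, roughly increasing sequence (values are made up)
y = np.array([1., 2., 1.5, 3., 2.5, 4., 5., 4.5, 6., 7.])
y_fit = isotonic_regression(y)  # non-decreasing least-squares fit to y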

The algorithm implemented in scikit-learn [3] is the pool adjacent violators algorithm [1], an efficient linear-time ($\mathcal{O}(n)$) algorithm. The algorithm sweeps through the data looking for violations of the monotonicity constraint. When it finds one, it adjusts the estimate to the best possible fit under the constraints. Sometimes it also needs to modify previous points to make sure the new estimate does not violate the constraints. The following picture shows how it proceeds at each iteration:

[Figure: pool adjacent violators algorithm, iteration by iteration]
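
To make the pooling step concrete, here is a toy, unweighted sketch of the idea (not the library implementation, which also handles sample weights):

def pava(y):
    # pool adjacent violators: merge neighboring blocks until their means
    # are non-decreasing, then expand each block back to its mean
    blocks = [[float(v), 1] for v in y]  # [sum, size] of each block
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] / blocks[i][1] > blocks[i + 1][0] / blocks[i + 1][1]:
            # violation: pool the two blocks and step back to re-check
            blocks[i][0] += blocks[i + 1][0]
            blocks[i][1] += blocks[i + 1][1]
            del blocks[i + 1]
            i = max(i - 1, 0)
        else:
            i += 1
    return [s / n for s, n in blocks for _ in range(n)]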

  1. "Active set algorithms for isotonic regression; A unifying framework", Michael J. Best, Nilotpal Chakravarti 

  2. Python notebook to generate the figures: ipynb and web version 

  3. The algorithm is used through the sklearn.isotonic.IsotonicRegression object (doc) or the function sklearn.isotonic.isotonic_regression (doc).

Memory plots with memory_profiler

Category: misc
#Python #memory #memory_profiler

Besides performing a line-by-line analysis of memory consumption, memory_profiler exposes some functions that retrieve the memory consumption of a function in real time, which makes it possible, for example, to visualize the memory consumption of a given function over time.

The function to use is memory_usage. Its first argument specifies what code is to be monitored; this can be either an external process or a Python function. In the case of an external process, the first argument is an integer representing its process identifier (PID). In the case of a Python function, we need to pass the function and its arguments to memory_usage. We do this by passing the tuple (f, args, kw), which specifies the function, its positional arguments as a tuple and its keyword arguments as a dictionary, respectively. This will then be executed by memory_usage as f(*args, **kw).

Let's see this with an example. Take as function NumPy's pseudo-inverse function. Thus f = numpy.linalg.pinv, and f takes one positional argument (the matrix to be inverted), so args = (a,), where a is the matrix to be inverted. Note that args must be a tuple consisting of the different arguments, hence the parentheses around a. The third item is a dictionary kw specifying the keyword arguments; here kw is optional and is omitted.

>>> from memory_profiler import memory_usage
>>> import numpy as np
# create a random matrix
>>> a = np.random.randn(500, 500)
>>> mem_usage = memory_usage((np.linalg.pinv, (a,)), interval=.01)
>>> print(mem_usage)
[57.02734375, 55.0234375, 57.078125, ...]

This returns a list of memory measurements, taken at regular time intervals (t0, t0 + .01, t0 + .02, ...). I can now use that to, for example, plot the memory consumption as a function of time:

>>> import pylab as pl
>>> pl.plot(np.arange(len(mem_usage)) * .01, mem_usage, label='linalg.pinv')
>>> pl.xlabel('Time (in seconds)')
>>> pl.ylabel('Memory consumption (in MB)')
>>> pl.show()

[Figure: memory consumption of np.linalg.pinv as a function of time]

This gives the memory usage of a single function across time, which can be interesting, for example, to detect temporaries created during execution.

Another use case for memory_usage is to see how memory behaves as the input data gets bigger. In this case we are interested in memory as a function of the input data. One obvious way to do this is to call the same function with inputs of different sizes and take the maximum consumption over time as the memory usage for each input.

>>> for i in range(1, 5):
...    A = np.random.randn(100 * i, 100 * i)
...    mem_usage = memory_usage((np.linalg.pinv, (A,)))
...    print(max(mem_usage))


It is now possible to plot these results as a function of the dimensions.

import numpy as np
import pylab as pl
from memory_profiler import memory_usage

dims = np.linspace(100, 1000, 10)
pinv_mem = np.zeros(dims.size)

for i_dim, k in enumerate(dims):
    x = np.random.randn(int(k), int(k))  # k comes out of linspace as a float
    tmp = memory_usage((np.linalg.pinv, (x,)), interval=.01)
    pinv_mem[i_dim] = np.max(tmp)

pl.plot(dims, pinv_mem, label='np.linalg.pinv')
pl.ylabel('Memory (in MB)')
pl.xlabel('Dimension of the square matrix')
pl.legend(loc='upper left')

[Figure: memory usage of np.linalg.pinv as a function of matrix dimension]

Singular Value Decomposition in SciPy

Category: misc
#python #scipy #svd

SciPy contains two methods to compute the singular value decomposition (SVD) of a matrix: scipy.linalg.svd and scipy.sparse.linalg.svds. In this post I'll compare both methods for the task of computing the full SVD of a large dense matrix.

The first method, scipy.linalg.svd, is perhaps the best known and uses the linear algebra library LAPACK to handle the computations. It implements the Golub-Kahan-Reinsch algorithm [1], which is accurate and highly efficient, with a cost of O(n^3) floating-point operations [2].

The second method is scipy.sparse.linalg.svds and, despite its name, it also works fine for dense arrays. This implementation is based on the ARPACK library and consists of an iterative procedure that finds the SVD by reducing the problem to an eigendecomposition of dot(X.T, X). This method is usually very effective when the input matrix X is sparse or when only the largest singular values are required. There are other SVD solvers that I did not consider, such as sparsesvd or pysparse.jdsym, but my points about the sparse solver probably hold for those packages too, since they both implement iterative algorithms based on the same principles.
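
For reference, here is a minimal sketch of how both routines are called on a dense matrix (the matrix size is arbitrary; note that svds can only return k < min(X.shape) singular values):

import numpy as np
from scipy.linalg import svd
from scipy.sparse.linalg import svds

X = np.random.randn(200, 200)

# dense LAPACK-based SVD: all singular values at once
U, s, Vt = svd(X, full_matrices=False)

# ARPACK-based iterative SVD: only the k largest singular values
# (the ordering of the returned values may differ from linalg.svd)
U2, s2, Vt2 = svds(X, k=10)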

When the input matrix is dense and all the singular values are required, the first method is usually more efficient. To support this claim I've created a little benchmark: timings for both methods as a function of the size of the matrices. Notice that this setting clearly favors linalg.svd: after all, sparse.linalg.svds was not created with it in mind, but for sparse matrices or dense matrices with some special structure. We will see, however, that even in this setting it has interesting advantages.

I'll create random square matrices with different sizes and plot the timings for both methods. For the benchmarks I used SciPy v0.12 linked against Intel Math Kernel Library v11. Both methods are single-threaded (I had to set OMP_NUM_THREADS=1 so that MKL does not try to parallelize the computations). [code]

[Figure: timings of scipy.linalg.svd vs. scipy.sparse.linalg.svds]

Lower timings are better, so this makes scipy.linalg.svd the clear winner. However, that is only part of the story. What this graph doesn't show is that this method wins at the price of allocating a huge amount of memory for temporary computations. If we now plot the memory consumption of both methods under the same settings, the story is completely different. [code]

[Figure: memory consumption of scipy.linalg.svd vs. scipy.sparse.linalg.svds]

The memory requirements of scipy.linalg.svd scale with the dimension of the matrix, while for the sparse version the amount of allocated memory stays roughly constant. Note that we are measuring total memory used, so it is natural to see a slight increase in memory consumption, since the input matrix is bigger at each iteration.

For example, in my applications I need to compute the SVD of a matrix for which the required workspace does not fit in memory. In cases like this, the sparse algorithm (sparse.linalg.svds) can come in handy: the timing is just a factor worse (and I can easily parallelize jobs), while its memory requirements are peanuts compared to the dense version.

  1. "Calculating the Singular Values and Pseudo-Inverse of a Matrix", Gene H. Golub and William Kahan, 1965. JSTOR

  2. "A Survey of Singular Value Decomposition Methods and Performance Comparison of Some Available Serial Codes", Gerald E. Plassman, 2005. PDF

Learning to rank with scikit-learn: the pairwise transform

Category: misc
#python #scikit-learn #ranking

This tutorial introduces the concept of pairwise preference used in most ranking problems. I'll use scikit-learn for learning and matplotlib for visualization.

In the ranking setting, training data consists of lists of items with some order specified between the items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") to each item, so that for any two samples a and b, either a < b, a > b, or a and b are not comparable.

For example, in the case of a search engine, our dataset consists of results that belong to different queries and we would like to only compare the relevance for results coming from the same query.

This order relation is usually domain-specific. For instance, in information retrieval the set of comparable samples is referred to as a "query id". The goal is to compare only documents that belong to the same query (Joachims 2002). In medical imaging, on the other hand, the order of the labels usually depends on the subject, so the set of comparable samples is given by the different subjects in the study (Pedregosa et al 2012).

import itertools
import numpy as np
from scipy import stats, linalg
import pylab as pl
from sklearn import svm, linear_model, cross_validation

To start with, we'll create a dataset in which the target values consist of three graded measurements Y = {0, 1, 2} and the input data is a collection of 30 samples, each one with two features.

The set of comparable elements (queries in information retrieval) will consist of two equally sized blocks, $X = X_1 \cup X_2$, where each block is generated using a normal distribution with different mean and covariance. In the pictures, we represent $X_1$ with round markers and $X_2$ with triangular markers.

theta = np.deg2rad(60)
w = np.array([np.sin(theta), np.cos(theta)])
K = 20
X = np.random.randn(K, 2)
y = [0] * K
for i in range(1, 3):
    X = np.concatenate((X, np.random.randn(K, 2) + i * 4 * w))
    y = np.concatenate((y, [i] * K))

# slightly displace data corresponding to our second partition
X[::2] -= np.array([3, 7])
blocks = np.array([0, 1] * (X.shape[0] // 2))

# split into train and test set
cv = cross_validation.StratifiedShuffleSplit(y, test_size=.5)
train, test = next(iter(cv))
X_train, y_train, b_train = X[train], y[train], blocks[train]
X_test, y_test, b_test = X[test], y[test], blocks[test]

# plot the result
idx = (b_train == 0)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
    marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
    marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
    head_width=0.5, head_length=0.5)
pl.text(0, 1, '$w$', fontsize=20)
pl.arrow(-3, -8, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
    head_width=0.5, head_length=0.5)
pl.text(-2.6, -7, '$w$', fontsize=20)

In the plot we clearly see that for both blocks there's a common vector w such that the projection onto w gives a list with the correct ordering.

However, because linear regression assumes that the output labels live in a metric space, it will consider all pairs comparable. Thus if we fit this model to the problem above it will fit both blocks at the same time, yielding a result that is clearly not optimal. In the following plot we estimate $\hat{w}$ using an l2-regularized linear model.

ridge = linear_model.Ridge(1.)
ridge.fit(X_train, y_train)
coef = ridge.coef_ / linalg.norm(ridge.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
    marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
    marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
    head_width=0.5, head_length=0.5)
pl.text(2, 0, '$\hat{w}$', fontsize=20)
pl.title('Estimation by Ridge regression')

To assess the quality of our model we need to define a ranking score. Since we are interested in a model that orders the data, it is natural to look for a metric that compares the ordering produced by our model to the given ordering. For this, we use Kendall's tau correlation coefficient, defined as (P - Q)/(P + Q), where P is the number of concordant pairs and Q the number of discordant pairs. This measure is used extensively in the ranking literature (e.g. Optimizing Search Engines using Clickthrough Data).

We thus evaluate this metric on the test set for each block separately.

for i in range(2):
    tau, _ = stats.kendalltau(
        ridge.predict(X_test[b_test == i]), y_test[b_test == i])
    print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))
Kendall correlation coefficient for block 0: 0.71122
Kendall correlation coefficient for block 1: 0.84387

The pairwise transform

As proved in (Herbrich 1999), if we consider linear ranking functions, the ranking problem can be transformed into a two-class classification problem. To do so, we form the difference of all comparable elements, so that our data is transformed into $(x'_k, y'_k) = (x_i - x_j, \operatorname{sign}(y_i - y_j))$ for all comparable pairs.

This way we have transformed our ranking problem into a two-class classification problem. The following plot shows this transformed dataset; color reflects the difference in labels, and our task is to separate positive samples from negative ones. The hyperplane $\{x^T w = 0\}$ separates these two classes.

# form all pairwise combinations
comb = itertools.combinations(range(X_train.shape[0]), 2)
k = 0
Xp, yp, diff = [], [], []
for (i, j) in comb:
    if y_train[i] == y_train[j] \
        or blocks[train][i] != blocks[train][j]:
        # skip if same target or different group
        continue
    Xp.append(X_train[i] - X_train[j])
    diff.append(y_train[i] - y_train[j])
    yp.append(np.sign(diff[-1]))
    # output balanced classes
    if yp[-1] != (-1) ** k:
        yp[-1] *= -1
        Xp[-1] *= -1
        diff[-1] *= -1
    k += 1
Xp, yp, diff = map(np.asanyarray, (Xp, yp, diff))
pl.scatter(Xp[:, 0], Xp[:, 1], c=diff, s=60, marker='o', cmap=pl.cm.Blues)
x_space = np.linspace(-10, 10)
pl.plot(x_space * w[1], - x_space * w[0], color='gray')
pl.text(3, -4, '$\{x^T w = 0\}$', fontsize=17)

As we can see in the previous plot, this classification problem is separable. This will not always be the case; here it is because our training set contains no order inversions, so the resulting classification problem is separable.

We will now finally train a Support Vector Machine on the transformed data. This model is known as RankSVM, although we note that the pairwise transform is more general and can be used together with any linear model. We will then plot the training data together with the coefficient $\hat{w}$ estimated by RankSVM.

clf = svm.SVC(kernel='linear', C=.1)
clf.fit(Xp, yp)
coef = clf.coef_.ravel() / linalg.norm(clf.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
    marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
    marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
    head_width=0.5, head_length=0.5)
pl.arrow(-3, -8, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
    head_width=0.5, head_length=0.5)
pl.text(1, .7, '$\hat{w}$', fontsize=20)
pl.text(-2.6, -7, '$\hat{w}$', fontsize=20)

Finally, we check that, as expected, the ranking score (Kendall tau) of the RankSVM model increases with respect to linear regression.

for i in range(2):
    tau, _ = stats.kendalltau(
        np.dot(X_test[b_test == i], coef), y_test[b_test == i])
    print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))
Kendall correlation coefficient for block 0: 0.83627
Kendall correlation coefficient for block 1: 0.84387

This is indeed higher than the values (0.71122, 0.84387) obtained in the case of linear regression.

The original IPython notebook for this blog post can be found here.

  1. "Large Margin Rank Boundaries for Ordinal Regression", R. Herbrich, T. Graepel, and K. Obermayer. Advances in Large Margin Classifiers, 115-132, Liu Press, 2000 

  2. "Optimizing Search Engines Using Clickthrough Data", T. Joachims. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002. 

  3. "Learning to rank from medical imaging data", Pedregosa et al. [arXiv

  4. "Efficient algorithms for ranking with SVMs", O. Chapelle and S. S. Keerthi, Information Retrieval Journal, Special Issue on Learning to Rank, 2009 

  5. Doug Turnbull's blog post on learning to rank 

Line-by-line memory usage of a Python program

Category: misc
#python #memory_profiler

My newest project is a Python library for monitoring the memory consumption of arbitrary processes, and one of its most useful features is the line-by-line analysis of memory usage for Python code. I wrote a basic prototype six months ago after being surprised by the lack of related tools. I wanted to plot the memory consumption of a couple of Python functions but did not find a Python module to do the job. I came to the conclusion that there is no standard way to get the memory usage of the Python interpreter from within Python, so I resorted to reading from /proc/$PID/statm. From there on I realized that once the fetching of memory is done, making a line-by-line report wouldn't be hard.

Back to today. I've been using the line-by-line memory monitoring to diagnose poor memory management (hidden temporaries, unused allocations, etc.) for some time. It seems to work on two different computers, so, full of confidence as I am, I'll write a blog post about it ...

How to use it?

The easiest way to get it is to install from the Python Package Index:

$ easy_install -U memory_profiler # pip install -U memory_profiler

but other options include fetching the latest version from github or dropping it into your current working directory or somewhere else on your PYTHONPATH, since it consists of a single file. The next step is to write some Python code to profile. It can be just about any function, but for the purpose of this blog post I'll create a function `my_func()` that mostly performs memory allocations, and save it to a file named example.py:

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()

Note that I've decorated the function with @profile. This tells the profiler to look into function my_func and gather the memory consumption for each line.
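
To actually produce the line-by-line report, one way to run the script (the invocation documented by memory_profiler) is to load the module from the command line:

$ python -m memory_profiler example.py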

Low rank approximation

Category: misc
#machine learning #python

A little experiment to see what low-rank approximation looks like. These are the best rank-k approximations (in the Frobenius norm) to a natural image, for increasing values of k, the original image having rank 512.

Python code can be found here. The GIF animation was made using ImageMagick's convert tool.
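
For the curious, the core of such an approximation is just a truncated SVD. A minimal sketch, assuming a grayscale image already loaded as a 2-D array img:

import numpy as np

def low_rank_approx(img, k):
    # truncated SVD: keeping the k largest singular values/vectors gives
    # the best rank-k approximation in the Frobenius norm
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return np.dot(U[:, :k] * s[:k], Vt[:k])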

qr_multiply function in scipy.linalg

Category: misc
#python #scipy

In SciPy's development version there's a new function closely related to the QR decomposition of a matrix and to the least-squares solution of a linear system. What this function does is compute the QR decomposition of a matrix and then multiply the resulting orthogonal factor by another arbitrary matrix. In pseudocode:

def qr_multiply(X, Y):
    Q, R = qr(X)
    return dot(Q.T, Y)

but unlike this naive implementation, qr_multiply is able to do all of this without explicitly computing the orthogonal matrix Q, resulting in both memory and time savings. In the following picture I measured the memory consumption as a function of time when running this computation on a 1000 x 1000 matrix X and a vector Y (full code can be found here):

It can be seen that not only is qr_multiply almost twice as fast as the naive approach, but its memory consumption is also significantly reduced, since the orthogonal factor is never explicitly computed. Credit for implementing the qr_multiply function goes to Martin Teichmann.

scikit-learn EuroScipy 2011 coding sprint -- day one

Category: misc
#scikit-learn #python

As a warm-up for the upcoming EuroScipy conference, some of the scikit-learn developers decided to gather and work together for a couple of days. Today was the first day and there was only a handful of us, as the real kickoff is expected tomorrow. Some interesting coding happened, although most of us were still preparing material for the EuroScipy tutorials ...

- API changes: removal of keyword parameters from the fit method, added method set_params (pull request 1).
- Some bugfixing in NuSVR (pull request 2).
- Review of Vlad's code, developed during his Summer of Code program.
- A lot of discussion about algorithms, code, APIs and the buildbot dance!

Smells like hacker spirit

Category: misc
#python #sklearn

Last weekend I was at FOSDEM presenting scikits.learn (here are the slides I used at the Data Analytics Devroom). Kudos to Olivier Grisel and all the people who organized such a fun and authentic meeting!